
Summary

This section describes the components that make up Kubernetes (K8s) and explains how Kubernetes works.

Basic Configuration of K8s

The official documentation includes the following diagram.

[Figure: Components of Kubernetes (from the official documentation)]

When simply written "Node", it refers to a Worker Node. The special nodes that run the critical API server at the core of the K8s system are sometimes called Control Plane Nodes to distinguish them. Usually, some of the worker nodes are used as control plane nodes.

In the Kubernetes cluster used in this SCCP, the Control Plane functionality runs on some of the Worker Nodes, so our Control Plane Nodes are not independent machines.

You can check the nodes and their roles with the following command.

$ kubectl get node

The Control Plane node, where the api-server that is the heart of the system runs, has the role control-plane (since v1.20.x).

It used to be called the Master node; following the Black Lives Matter (BLM) movement, terms such as master and slave are gradually being replaced in the documentation of many projects, not only Kubernetes.

NAME       STATUS   ROLES           AGE      VERSION
u109ls01   Ready    control-plane   2y192d   v1.28.6
u109ls02   Ready    control-plane   2y192d   v1.28.6
u109ls03   Ready    <none>          2y192d   v1.28.6
u109ls04   Ready    <none>          2y192d   v1.28.6

Node (Worker Node) configuration

To run containers on the Kubernetes system, each node runs a Container Runtime Interface (CRI) compliant container engine such as Docker, containerd, or CRI-O. In addition, Kubernetes-specific components such as kubelet and kube-proxy are running.

  • Virtualized networking

  • CRI-compliant container engines (Docker, containerd, CRI-O, etc.)

  • kubelet process

  • kube-proxy process

This container engine, kubelet, and kube-proxy are always running on all nodes and cooperate with the api-server (kube-apiserver, described later) to run user pods and other workloads.
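
For example, you can check which container engine (runtime) and kubelet version each node reports with the wide output of kubectl get node; the exact columns vary slightly between Kubernetes versions.

## The CONTAINER-RUNTIME column shows e.g. containerd://<version>
$ kubectl get node -o wide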

Virtualized network

Although not explicitly shown in the diagrams in the official guide, the heart of Kubernetes is network virtualization.

The initial design is documented on GitHub.

The document lists four basic features.

  • Communication between containers (inside pods)

  • Communication between Pods

  • Communication between Service and Pod

  • Communication between the outside world and the inside of the cluster

In the environment of Seminar Room 10, Calico is used to build a virtualized network.

Calico operates at Layer 3 and uses BGP to advertise the routes to the /32 IP addresses assigned on each node. Other solutions, such as Flannel, use VXLAN technology to provide virtualization at Layer 2, but there are performance concerns. Calico, which works at Layer 3, is considered a better choice if Layer-2 functionality is not required.

The default Calico network backend mode has changed from ipip to vxlan. Please check the GitHub issues page for details.
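
One way to check which encapsulation mode the cluster's Calico IPPool is actually using is to look at the IPPool resources registered as CRDs; this is a hedged example and requires permission to read the Calico resources in the cluster.

## ipipMode / vxlanMode show the encapsulation settings of each IPPool
$ kubectl get ippools.crd.projectcalico.org -o yaml | grep -E 'ipipMode|vxlanMode'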

CRI compliant container engine (Runtime)

The official Kubernetes documentation lists the following three CRI runtimes.

  • containerd

  • CRI-O

  • Docker

As described in CRI: the Container Runtime Interface, the Protobuf API lists the services that CRI-compliant container engines implement.

The basic role is to provide the following functions described in the service definition:

  • Runtime Service: starting, stopping, and deleting containers (pod operations)

  • Image Service: retrieving, storing, and deleting container images (container-image management)

Container engines provide this defined set of services, but the following behaviors may differ depending on the container engine you choose.

  • Where to pull container images from (docker.io is not necessarily assumed when the repository name is omitted)

  • Where to store container images (images are not necessarily stored under /var/lib/docker/)

  • How to set up a connection to your own registry that uses your own TLS/SSL CA certificate file (the CA certificate file goes in different locations)

The Kubernetes project initially used Docker as the container runtime. However, dockershim (the built-in Docker support) was deprecated starting with the v1.20 release and has since been removed. We now use containerd as the container runtime. For more details, please see the official blog.
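
On a node itself, you can talk to the CRI runtime directly with crictl. This is only a sketch: it assumes crictl is installed and that containerd listens on its default socket path; adjust the endpoint for your environment.

## Run on a node: list containers and images through the CRI endpoint
$ sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
$ sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock images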

Essential Processes of Kubernetes Node

kubelet

kubelet works with Control-Plane’s api-server (kube-apiserver) to manage pod (container) activities.

On each node, it exists as the following process:

## ps auxww | grep kubelet output
root 151957 7.4 0.6 3357840 158592 ?      Ssl May31 5958:14 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=192.168.100.51 --hostname-override=u109ls01 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=k8s.gcr.io/pause:3.3 --runtime-cgroups=/systemd/system.slice --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
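
Because kubelet is managed by the OS rather than running as a pod, its status and logs can also be checked on the node itself. The commands below assume a systemd-based setup like this cluster.

## Run on a node
$ sudo systemctl status kubelet
$ sudo journalctl -u kubelet --no-pager -n 20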

kube-proxy

kube-proxy controls pod communication (network).

It runs as a Docker container on each node.

## docker ps | grep kube-proxy output
39ff1f5995bd 9d368f4517bb "/usr/local/bin/kube..." 5 days ago Up 5 days k8s_kube-proxy_kube-proxy-pg7d4_kube-system_f95cad6f-482b-4c52-91f1-a6759cbe7a0b_2
622fa3ac83bd k8s.gcr.io/pause:3.3 "/pause" 5 days ago Up 5 days k8s_POD_kube-proxy-pg7d4_kube-system_f95cad6f-482b-4c52-91f1-a6759cbe7a0b_2

Since kube-proxy is running as a pod, you can see how it works from kubectl.

$ kubectl -n kube-system get pod -l k8s-app=kube-proxy -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP               NODE       NOMINATED NODE   READINESS GATES
kube-proxy-56j9f   1/1     Running   3          55d   192.168.100.54   u109ls04   <none>           <none>
kube-proxy-gt7gg   1/1     Running   2          55d   192.168.100.52   u109ls02   <none>           <none>
kube-proxy-hlkn8   1/1     Running   2          55d   192.168.100.53   u109ls03   <none>           <none>
kube-proxy-pg7d4   1/1     Running   2          55d   192.168.100.51   u109ls01   <none>           <none>
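
kube-proxy can operate in different proxy modes (iptables, ipvs, etc.). One hedged way to check which mode was selected is to look at the startup log of one of the pods listed above; the exact log wording differs between versions.

$ kubectl -n kube-system logs kube-proxy-pg7d4 | grep -i proxier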

In communication, the IP address used by the Load-Balancer is associated with the MAC address of one of the nodes.
Other IP addresses used within K8s, such as ClusterIP, are associated with locally administered MAC addresses (x[26ae]:xx:xx:xx:xx:xx) and assigned to one of the nodes.

## K8s internal network as observed from the outside
$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.100.160          ether   00:1b:21:bc:0c:3a   C                     ens1
192.168.100.52           ether   00:1b:21:bc:0c:89   C                     ens1
192.168.100.53           ether   00:1b:21:bc:0c:3a   C                     ens1
192.168.100.54           ether   00:1b:21:bc:0c:3b   C                     ens1
...

## K8s internal network that can be observed from the inside
$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.100.52           ether   00:1b:21:bc:0c:89   C                     enp1s0
192.168.100.53           ether   00:1b:21:bc:0c:3a   C                     enp1s0
192.168.100.54           ether   00:1b:21:bc:0c:3b   C                     enp1s0
10.233.113.131           ether   1a:73:20:a4:cd:b3   C                     cali1223486b3a4
10.233.113.140           ether   72:e9:69:66:14:dc   C                     cali79c5fc4a9e9
...

K8s System Components (Control Plane)

The main body of the Kubernetes system runs as a Control Plane.

  • api-server

  • kube-scheduler

  • etcd

  • kube-controller-manager (Controller Manager)

  • (optional) Cloud Controller Manager

All of these components except etcd are running under the control of the container engine as pods on the Control Plane Node.

For processes that run as pods, you can use kubectl to check which server they are running on.

$ kubectl -n kube-system get pod -o wide
NAME                                      READY   STATUS    RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
calico-kube-controllers-8b5ff5d58-rr4jp   1/1     Running   1          6d1h   192.168.100.54   u109ls04   <none>           <none>
calico-node-2cm5r                         1/1     Running   8          171d   192.168.100.53   u109ls03   <none>           <none>
calico-node-5x5pr                         1/1     Running   10         171d   192.168.100.51   u109ls01   <none>           <none>
calico-node-7v65s                         1/1     Running   10         171d   192.168.100.52   u109ls02   <none>           <none>
calico-node-l7hqn                         1/1     Running   7          171d   192.168.100.54   u109ls04   <none>           <none>
coredns-85967d65-7g7fb                    1/1     Running   0          6d1h   10.233.112.58    u109ls03   <none>           <none>
coredns-85967d65-hbtjj                    1/1     Running   3          55d    10.233.105.203   u109ls04   <none>           <none>
dns-autoscaler-5b7b5c9b6f-44jh8           1/1     Running   0          6d1h   10.233.105.9     u109ls04   <none>           <none>
kube-apiserver-u109ls01                   1/1     Running   20         110d   192.168.100.51   u109ls01   <none>           <none>
kube-apiserver-u109ls02                   1/1     Running   17         109d   192.168.100.52   u109ls02   <none>           <none>
kube-controller-manager-u109ls01          1/1     Running   8          171d   192.168.100.51   u109ls01   <none>           <none>
kube-controller-manager-u109ls02          1/1     Running   7          171d   192.168.100.52   u109ls02   <none>           <none>
kube-proxy-56j9f                          1/1     Running   3          55d    192.168.100.54   u109ls04   <none>           <none>
kube-proxy-gt7gg                          1/1     Running   2          55d    192.168.100.52   u109ls02   <none>           <none>
kube-proxy-hlkn8                          1/1     Running   2          55d    192.168.100.53   u109ls03   <none>           <none>
kube-proxy-pg7d4                          1/1     Running   2          55d    192.168.100.51   u109ls01   <none>           <none>
kube-scheduler-u109ls01                   1/1     Running   8          171d   192.168.100.51   u109ls01   <none>           <none>
kube-scheduler-u109ls02                   1/1     Running   7          171d   192.168.100.52   u109ls02   <none>           <none>
metrics-server-7c5f68c54d-zrtgl           2/2     Running   1          6d1h   10.233.105.248   u109ls04   <none>           <none>
nodelocaldns-9pz2w                        1/1     Running   7          171d   192.168.100.54   u109ls04   <none>           <none>
nodelocaldns-bzhwn                        1/1     Running   11         171d   192.168.100.51   u109ls01   <none>           <none>
nodelocaldns-nsgk7                        1/1     Running   12         171d   192.168.100.53   u109ls03   <none>           <none>
nodelocaldns-z44sj                        1/1     Running   12         171d   192.168.100.52   u109ls02   <none>           <none>

etcd and kubelet, which run directly under OS management, do not appear here.
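
Since etcd runs under systemd, it can be checked on a control plane node itself; this assumes the unit name used in this cluster is simply etcd.

## Run on a control plane node
$ sudo systemctl status etcd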

api-server

This is the server component that provides the Kubernetes API and communicates with the kubelets running on each Node and with the kubectl client.

The received content is stored in etcd.

The component we communicate with directly is always the api-server; if this function is lost, the cluster can no longer be operated.
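
You can also talk to the Kubernetes API directly over HTTP; kubectl proxy conveniently handles authentication for you. A hedged example using the default proxy port 8001:

## In one terminal, start an authenticated local proxy to the api-server
$ kubectl proxy
Starting to serve on 127.0.0.1:8001

## In another terminal, query the API directly
$ curl -s http://127.0.0.1:8001/version
$ curl -s http://127.0.0.1:8001/api/v1/namespaces/$(id -un)/pods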

kube-scheduler

This is the component that determines which Node the Pod object will run on when it is registered.
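
You can observe the scheduler's decisions as events attached to pods. Events are only retained for a short time, so this is a best-effort check.

## Recent scheduling decisions made for pods in your namespace
$ kubectl -n $(id -un) get events --field-selector reason=Scheduled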

etcd

etcd is an open-source, distributed key-value (NoSQL) database. It is used in many projects besides Kubernetes.

etcd is not a pod, but in Ubuntu it is under the management of systemd and runs as a server process of the OS.

The etcdctl command can be used to check what data is stored in etcd. Although it must be run on one of the servers, information on the namespace metallb-system can be checked as follows.

## Running on 192.168.100.51-54
$ sudo etcdctl --endpoints https://192.168.100.51:2379 --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/member-u109ls01.pem --key=/etc/ssl/etcd/ssl/member-u109ls01-key.pem get /registry/namespaces/metallb-system
/registry/namespaces/metallb-system
k8s

v1 Namespace

metallb-system "*$50ded0f3-600b-4433-a4ac-0adc17a50f192ZB
appmetallbb
0kubectl.kubernetes.io/last-applied-configurationx{"apiVersion": "v1", "kind": "Namespace", "metadata":{"annotatio
ns":{}, "labels":{"app": "metallb"}, "name": "metallb-system"}}
z
kubectl-client-side-applyUpdatevFieldsV1:.
{"f:metadata":{"f:annotations":{"." :{}, "f:kubectl.kubernetes.io/last-applied-configuration":{}}, "f:labels":{".
":{}, "f:app":{}}, "f:status":{"f:phase":{}}


kubernetes
Active"

kube-controller-manager

A pod called kube-controller-manager is running, which works with a mechanism called "Controller".

You can also create Controller objects of your own, but this component implements the several Controller objects that are required to run Kubernetes.

The official guide lists the following four, but it also controls Deployment and StatefulSet objects.

  • Node Controller - Controller that detects and responds when Nodes go up or down.

  • Replication Controller - Controller that maintains the appropriate number of Pods.

  • Endpoints Controller - Controller that connects Pods to Services.

  • ServiceAccount/Token Controller - Controller that creates the default ServiceAccount and API access tokens when a new Namespace is created.

You are usually not aware of the last one, the ServiceAccount/Token Controller, but it stores the necessary information as a Secret object.

$ kubectl -n $(id -un) get secret

The default-token-xxxxx at the top of the output is the Secret object created by the ServiceAccount/Token Controller.

NAME                  TYPE                                  DATA   AGE
default-token-4pmsl   kubernetes.io/service-account-token   3      109d
objectstore           Opaque                                4      109d
ssh-auhorized-keys    Opaque                                1      13d
ssh-host-keys         Opaque                                3      13d
$ kubectl -n $(id -un) get secret default-token-4pmsl -o yaml

As shown below, the token: ... value, kubernetes.io/service-account.name: default, and related information are registered.

apiVersion: v1
data:
  ca.crt: ....
  namespace: eWFzdS1hYmU=
  token: ....
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: default
    kubernetes.io/service-account.uid: a37638b6-917e-41d0-b14b-7ba4eac7889c
  creationTimestamp: "2021-04-07T02:53:50Z"
  ....
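
The values under data: are base64-encoded and can be decoded with standard tools. For example, the namespace field of the Secret shown above decodes back to the user's namespace:

$ kubectl -n $(id -un) get secret default-token-4pmsl -o jsonpath='{.data.namespace}' | base64 -d
yasu-abe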

The pod of kube-controller-manager can be checked with the following operation.

$ kubectl -n kube-system get pod -l component=kube-controller-manager -o wide
NAME                               READY   STATUS    RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
kube-controller-manager-u109ls01   1/1     Running   8          170d   192.168.100.51   u109ls01   <none>           <none>
kube-controller-manager-u109ls02   1/1     Running   7          170d   192.168.100.52   u109ls02   <none>           <none>
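
A simple way to watch these controllers at work is to delete one pod that belongs to a Deployment and see the ReplicaSet controller recreate it. This is only a sketch: app=example is a hypothetical label for a Deployment you already have in your namespace.

## The ReplicaSet controller restores the desired replica count
$ kubectl -n $(id -un) get pod -l app=example
$ kubectl -n $(id -un) delete pod <one-of-the-pod-names>
$ kubectl -n $(id -un) get pod -l app=example   ## a replacement pod appears shortly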

Other K8s components

Other components running include a DNS server and the network controller required to run Calico.

Customizing K8s

In the Kubernetes system, the api-server is the central component that communicates with kubelet, the clients, and kube-controller-manager, which implements the Controller objects. If you can communicate with this api-server, you can customize the system's behavior.

Kubernetes has flexible mechanisms for extensions, and we will discuss the following mechanisms here.

  • Controller

  • Custom Resource Definition (CRD)

CRD and Controller are separate mechanisms, but they are usually used together to extend the system.

Examples of customization

For example, projects such as the RabbitMQ Cluster Operator used later in this section introduce their own Operator mechanism to simplify system implementation.

As you can see on GitHub, the programming language used by these projects is Go. Kubernetes provides client-go as a library to communicate with the API, and other libraries are also based on the Go language.

If you are creating a systems-management application, not just for Kubernetes, you should aim to learn not only scripting languages but also C and Go. It is useful to be able to read code, make the necessary modifications, compile it, and so on.

The Go language is likely to be needed more often for this kind of customization in the future. In addition, it is less prone to the problems seen with C or Perl, where different authors have different styles, the OS differs, or the required shared libraries do not work when the program is copied to the production environment. For this reason, we expect to see more applications written in Go, and it is well suited to slightly more complex utility programming.

client-go includes samples for building external commands like kubectl that communicate with the api-server.

To actually create your own Operator, client-go alone is not enough, so frameworks such as Kubebuilder (used by the RabbitMQ operator below) are provided as tools to assist you.

Choose one of these frameworks to build your own Operator using a CRD and a Controller.
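
As a rough idea of how such a framework is used, Kubebuilder scaffolds an Operator project with a couple of commands. The group/version/kind names below are made-up examples, and the exact flags differ between Kubebuilder versions.

## Scaffold an Operator project and a new API (CRD + Controller skeleton)
$ kubebuilder init --domain example.com --repo example.com/sample-operator
$ kubebuilder create api --group apps --version v1alpha1 --kind SampleApp
$ make manifests    ## generates the CRD YAML with controller-gen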

Controller

The Controller object is a program that manages a Resource and communicates with the api-server.

Already registered resources have corresponding controllers, and the kube-controller-manager manages basic objects such as Deployment, ReplicaSet, etc.

The end result is a Docker container, but if you check the Dockerfile of the RabbitMQ Operator and others, as excerpted below, you will see that it is a simple container image that just copies the compiled (built) executable file and runs it.

## Excerpt of the main body of the Dockerfile: it just copies the manager command and runs it
FROM scratch

ARG GIT_COMMIT
LABEL GitCommit=$GIT_COMMIT

WORKDIR /
COPY --from=builder /workspace/manager .
COPY --from=etc-builder /etc/passwd /etc/group /etc/
COPY --from=etc-builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt

USER 1000:1000

ENTRYPOINT ["/manager"]

Basic controller behavior

The Controller object is simply a command that communicates with the api-server, but it has a mechanism for constantly monitoring the system state and acting quickly when the state changes.

This mechanism is implemented by each controller separately from Kubernetes itself, and various frameworks exist to support writing this common machinery. The frameworks also include various optimizations to reduce the load on the api-server, such as incremental updates backed by an in-memory cache.

This behavior, called the Reconciliation Loop, is described in the reference material "Kubernetes Controller starting from scratch" (https://speakerdeck.com/govargo/under-the-kubernetes-controller-36f9b71b-9781-4846-9625-23c31da93014?slide=5).

For this to work, the basic rule is that one controller is responsible for one resource.
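
To get a feel for the idea, here is a minimal shell sketch of a reconciliation loop. It is not a real controller (a real one watches the api-server instead of polling), and app=example and the desired count are hypothetical.

## Naive reconciliation loop: compare desired state with observed state and react
DESIRED=3
while true; do
  OBSERVED=$(kubectl -n $(id -un) get pod -l app=example --no-headers 2>/dev/null | wc -l)
  if [ "${OBSERVED}" -ne "${DESIRED}" ]; then
    echo "drift detected: desired=${DESIRED}, observed=${OBSERVED} -> reconcile"
    ## a real controller would now create or delete pods via the api-server
  fi
  sleep 5
done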

Custom Resource Definition (CRD)

Any CRD can be registered with Kubernetes via kubectl. As a sample, let’s check what kind of CRDs RabbitMQ’s Operator has registered.

The definition is long, so you can pipe it to less, save it to a file and check it in an editor, or view it in a web browser.

$ curl -L "https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml" | less
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.6.0
  labels:
    app.kubernetes.io/component: rabbitmq-operator
    app.kubernetes.io/name: rabbitmq-cluster-operator
    app.kubernetes.io/part-of: rabbitmq
  name: rabbitmqclusters.rabbitmq.com

From this, you can see that the manifest was generated by the controller-gen command from Kubebuilder (controller-gen.kubebuilder.io/version: v0.6.0).

Resource definitions follow from here.

spec:
  group: rabbitmq.com
  names:
    categories:
    - all
    kind: RabbitmqCluster
    listKind: RabbitmqClusterList
    plural: rabbitmqclusters
    shortNames:
    - rmq
    singular: rabbitmqcluster
  scope: Namespaced
  versions:
  - additionalPrinterColumns:
    - jsonPath: .status.conditions[?(@.type == 'AllReplicasReady')].status
      name: AllReplicasReady
      type: string
    - jsonPath: .status.conditions[?(@.type == 'ReconcileSuccess')].status
      name: ReconcileSuccess
      type: string
    - jsonPath: .metadata.creationTimestamp
      name: Age
      type: date
    name: v1beta1
    schema:
      openAPIV3Schema:
       ....

Here, in the kind: RabbitmqCluster section, we see that it is the definition of a new RabbitmqCluster resource.

It is quite a challenge to read a file generated by a tool, but if you look at the --- separators, you can see that this file contains the following definitions.

  • Namespace

  • CustomResourceDefinition

  • ServiceAccount

  • Role

  • ClusterRole

  • RoleBinding

  • ClusterRoleBinding

  • Deployment

This file is applied with kubectl apply -f and the results are checked.
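
The manifest inspected above can be applied directly from the release URL. Note that this is a hedged example and requires administrator (cluster-admin) rights, because it creates a Namespace, a CRD, and RBAC objects.

## Install the RabbitMQ cluster operator (administrator privileges required)
$ kubectl apply -f "https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml"

After applying it, check the state of the rabbitmq-system namespace.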

$ kubectl -n rabbitmq-system get all

The result of this command is as follows

NAME                                             READY   STATUS    RESTARTS   AGE
pod/rabbitmq-cluster-operator-5b4b795998-sfxmm   1/1     Running   0          9m32s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rabbitmq-cluster-operator   1/1     1            1           9m32s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/rabbitmq-cluster-operator-5b4b795998   1         1         1       9m32s
Error from server (Forbidden): rabbitmqclusters.rabbitmq.com is forbidden: User "yasu-abe@u-aizu.ac.jp" cannot list resource "rabbitmqclusters" in API group "rabbitmq.com" in the namespace "rabbitmq-system"

The error remains, but we will ignore it for now, since what is displayed would not change even with administrator privileges.

To create a RabbitmqCluster object corresponding to the registered CRD, a YAML file like the following is accepted.

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: definition
spec:
  replicas: 3
  persistence:
    storageClassName: rook-ceph-block
    storage: 20Gi
  service:
    type: LoadBalancer
    annotations:
      metallb.universe.tf/address-pool: rabbitmq-pool

After running kubectl -n rabbitmq-system apply -f on this file and checking the status with administrator privileges, you should see something like this

$ kubectl -n rabbitmq-system get all
NAME                                             READY   STATUS     RESTARTS   AGE
pod/definition-server-0                          0/1     Init:0/1   0          30s
pod/definition-server-1                          0/1     Init:0/1   0          30s
pod/definition-server-2                          0/1     Init:0/1   0          29s
pod/rabbitmq-cluster-operator-5b4b795998-sfxmm   1/1     Running    0          29m

NAME                       TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                                          AGE
service/definition         LoadBalancer   10.233.53.4   <pending>     5672:32533/TCP,15672:30224/TCP,15692:32648/TCP   31s
service/definition-nodes   ClusterIP      None          <none>        4369/TCP,25672/TCP                               32s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rabbitmq-cluster-operator   1/1     1            1           29m

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/rabbitmq-cluster-operator-5b4b795998   1         1         1       29m

NAME                                 READY   AGE
statefulset.apps/definition-server   0/3     30s

NAME                                      ALLREPLICASREADY   RECONCILESUCCESS   AGE
rabbitmqcluster.rabbitmq.com/definition   False              Unknown            32s

This shows how the Operator registers the necessary StatefulSet and Service objects when given a registered RabbitmqCluster object.
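
The Operator also generates supporting objects; for example, the access credentials are stored in a Secret, which by the operator's naming convention should be called definition-default-user here (a hedged example).

$ kubectl -n rabbitmq-system get secret definition-default-user -o yaml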

The best way to see how it actually works is to check the code on GitHub.

That concludes this section.