Deep Dive into K8s Notes (How to Ensure Pod Health)

0x00 Preface

Q: Besides monitoring Pod/Container status, does the health check offer a more fine-grained way to monitor the state of the service running inside a Container?

A: Yes. You can define a "probe" (Probe) for a Container; kubelet then judges the Container's state from the probe's result, rather than only checking whether the Container is running. There are three kinds of probes (a minimal sketch of the latter two follows the list):

  • livenessProbe (liveness probe)
  • readinessProbe (readiness probe)
  • startupProbe (startup probe)
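
The demo in the next section only exercises livenessProbe, so here is a minimal sketch of the other two, assuming a hypothetical container that serves HTTP on port 8080 (the path, port, and thresholds are illustrative, not from this post):

# Illustrative probe stanzas for a hypothetical container listening on port 8080.
readinessProbe:              # gates Service traffic; a failure does NOT restart the container
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
startupProbe:                # suppresses the other probes until the app finishes booting
  tcpSocket:
    port: 8080
  failureThreshold: 30       # allow up to 30 * 10s = 300s for a slow start
  periodSeconds: 10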

0x01 Health Checks: Probes

livenessProbe (liveness probe)

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: test-liveness-exec
spec:
  containers:
  - name: liveness
    image: busybox
    args:
    - /bin/sh
    - -c
    # create /tmp/healthy, delete it after 30 seconds, then keep the container alive
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    # define the check command with livenessProbe
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      # start checking 5 seconds after the container starts
      initialDelaySeconds: 5
      # check every 5 seconds
      periodSeconds: 5

Watch the Pod for a while: kubelet's recovery mechanism (pod.spec.restartPolicy) keeps pulling the container back up to keep it alive.

$ kubectl get pod test-liveness-exec -o wide --watch
NAME                 READY   STATUS             RESTARTS             AGE   IP            NODE                        NOMINATED NODE   READINESS GATES
test-liveness-exec   0/1     CrashLoopBackOff   13 (80s ago)         35m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            14 (5m7s ago)        39m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            15 (<invalid> ago)   40m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   0/1     CrashLoopBackOff   15 (<invalid> ago)   42m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            16 (5m1s ago)        47m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            17 (<invalid> ago)   48m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   0/1     CrashLoopBackOff   17 (<invalid> ago)   49m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            18 (5m ago)          54m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            19 (0s ago)          56m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   0/1     CrashLoopBackOff   19 (<invalid> ago)   57m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            20 (4m57s ago)       62m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   1/1     Running            21 (<invalid> ago)   63m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>
test-liveness-exec   0/1     CrashLoopBackOff   21 (<invalid> ago)   64m   10.42.2.178   storage-k3sw2.m4d3bug.com   <none>           <none>

$ kubectl describe pod test-liveness-exec
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  16m                default-scheduler  Successfully assigned default/test-liveness-exec to storage-k3sw2.m4d3bug.com
  Normal   Pulled     16m                kubelet            Successfully pulled image "busybox" in 2.340333537s
  Normal   Pulled     15m                kubelet            Successfully pulled image "busybox" in 2.281807839s
  Normal   Pulled     14m                kubelet            Successfully pulled image "busybox" in 2.366382447s
  Normal   Created    14m (x3 over 16m)  kubelet            Created container liveness
  Normal   Started    14m (x3 over 16m)  kubelet            Started container liveness
  Warning  Unhealthy  13m (x9 over 16m)  kubelet            Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
  Normal   Killing    13m (x3 over 15m)  kubelet            Container liveness failed liveness probe, will be restarted

This is K8s's Pod recovery mechanism. But why does it always recover on the same node? That is the difference between kind: Pod and kind: Deployment: a bare Pod stays bound to the node it was scheduled onto, so kubelet can only restart its containers in place, whereas a Deployment's controller can recreate the Pod on another node, as sketched below.
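
As a hedged sketch (not from the original post), wrapping the same container in a Deployment hands recovery to the ReplicaSet controller:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-liveness-deploy   # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      test: liveness
  template:
    metadata:
      labels:
        test: liveness
    spec:
      # same container and probe as above, now managed by a controller:
      # if the node goes down, a replacement Pod is scheduled elsewhere
      containers:
      - name: liveness
        image: busybox
        args: ["/bin/sh", "-c", "touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600"]
        livenessProbe:
          exec:
            command: ["cat", "/tmp/healthy"]
          initialDelaySeconds: 5
          periodSeconds: 5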

0x02 Recovery Mechanism: pod.spec.restartPolicy

Speaking of the recovery mechanism: pod.spec.restartPolicy takes one of the following three values.

  • Always: automatically restart the Container whenever it is not in a running state;
  • OnFailure: automatically restart the Container only when it exits abnormally;
  • Never: never restart the Container.

Also note (see the sketch after this list):

  • When a Container is automatically recreated, data written inside it may be lost.
  • The default restart policy is Always.
  • A Pod only shows Failed once all of its containers have failed.
  • More
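
A minimal sketch of where the field sits, assuming a hypothetical one-shot Pod (the name and command are illustrative, not from this post):

apiVersion: v1
kind: Pod
metadata:
  name: one-shot   # hypothetical
spec:
  # applies to every container in the Pod; restart only on non-zero exit,
  # so a clean exit leaves the Pod in the Succeeded phase
  restartPolicy: OnFailure
  containers:
  - name: task
    image: busybox
    command: ["/bin/sh", "-c", "echo done"]

Note that Pods managed by a Deployment only accept restartPolicy: Always.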

0x03 An Extra Safety Net: Preset Fields

Q: Is there a way, from an operations perspective, to preset fields like these for Pods?

A: Yes: PodPreset. (Note: PodPreset is an alpha feature behind the settings.k8s.io/v1alpha1 API group, and it was removed in Kubernetes 1.20.)

$ kubectl apply -f preset.yaml
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: allow-database
spec:
  selector:
    matchLabels:
      role: frontend
  env:
  - name: DB_PORT
    value: "6379"
  volumeMounts:
  - mountPath: /cache
    name: cache-volume
  volumes:
  - name: cache-volume
    emptyDir: {}

Apply the two objects one after the other:

$ kubectl apply -f pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: website
  labels:
    app: website
    role: frontend
spec:
  containers:
  - name: website
    image: nginx
    ports:
    - containerPort: 80

Observe:

$ kubectl get pod website -o yaml
apiVersion: v1
kind: Pod
metadata:
  name: website
  labels:
    app: website
    role: frontend
  annotations:
    podpreset.admission.kubernetes.io/podpreset-allow-database: "resource version"
spec:
  containers:
  - name: website
    image: nginx
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
    ports:
    - containerPort: 80
    env:
    - name: DB_PORT
      value: "6379"
  volumes:
  - name: cache-volume
    emptyDir: {}

What a PodPreset defines is appended to the Pod API object itself just before it is created; it never modifies the definition held by any controller of that Pod. When multiple PodPresets select the same Pod, all non-conflicting fields are merged; a preset that does conflict is skipped for that Pod and an event is recorded. A sketch follows.
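
As an illustrative sketch (the preset name and variable below are hypothetical, not from this post), a second PodPreset matching the same label merges cleanly as long as its fields don't collide with allow-database:

apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: allow-cache            # hypothetical second preset
spec:
  selector:
    matchLabels:
      role: frontend           # selects the same website Pod
  env:
  - name: CACHE_HOST           # no overlap with DB_PORT, so both get injected
    value: "memcached.default.svc"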

0x04 Summary

From health checks, to probes, to the recovery (keep-alive) mechanism, and finally to preset fields with PodPreset: plenty of pits filled, and a few new ones dug. To be continued.
