Kubernetes Pod Lifecycle and Restart Policies: A Practical Guide from Creation to Termination

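Use Cases & Prerequisites

Use cases: troubleshooting Pod failures, configuring graceful termination, setting up health checks, and managing task-style (Job) Pods.

Prerequisites:

Kubernetes 1.20+
kubectl access to the cluster
Familiarity with the concepts of Pods and containers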

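Environment and Version Matrix

Component	Version	Notes
Kubernetes	1.20-1.30	Core lifecycle management features are stable
Container Runtime	containerd/Docker	Container runtime (from 1.24 onward, Docker Engine needs the cri-dockerd adapter)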

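The Complete Pod Lifecycle

1. Pod Lifecycle Phases

Phase:

Pending: the Pod has been accepted by the cluster, but one or more containers are not yet running (waiting for scheduling or an image pull)
Running: the Pod is bound to a node and at least one container is running, starting, or restarting
Succeeded: all containers terminated successfully and will not be restarted (typical for Job/CronJob Pods)
Failed: all containers have terminated, and at least one terminated in failure
Unknown: the Pod state cannot be obtained (usually because the node is unreachable)

Check the Pod status: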

kubectl get pod mypod
# NAME    READY   STATUS    RESTARTS   AGE
# mypod   1/1     Running   0          5m

kubectl get pod mypod -o jsonpath='{.status.phase}'
# Running
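
When a script needs to react to the phase, a small polling loop is often enough. The sketch below is a hypothetical helper: the name wait_for_phase and its 60-second default timeout are assumptions for illustration, not part of kubectl. It reads .status.phase once per second until the expected value appears.

```shell
# wait_for_phase POD PHASE [TIMEOUT_SECONDS]
# Hypothetical helper: polls .status.phase once per second.
wait_for_phase() {
  local pod="$1" want="$2" timeout="${3:-60}"
  local phase="" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Lookup errors (for example, the Pod does not exist yet) are ignored and retried.
    phase="$(kubectl get pod "$pod" -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    if [ "$phase" = "$want" ]; then
      echo "$pod reached phase $want"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timed out after ${timeout}s; last phase: ${phase:-unknown}" >&2
  return 1
}
```

For example, wait_for_phase mypod Running 120 returns 0 as soon as the Pod reports Running and 1 if it has not done so within two minutes.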

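2. Container States

A container is in one of three states:

Waiting: not yet started (pulling the image, waiting for storage, and so on)
Running: executing normally
Terminated: finished, either successfully or with a failure

Check the container state: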

kubectl describe pod mypod
# State:          Running
#   Started:      2025-10-24 10:00:00 +0800 CST

kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].state}'
# {"running":{"startedAt":"2025-10-24T02:00:00Z"}}

Restart Policy (restartPolicy)

1. The Three Policies

Defining the policy:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  restartPolicy: Always
  # Always | OnFailure | Never
  containers:
    - name: myapp
      image: nginx:1.21

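Policy summary:

Policy	Behavior	Typical use
`Always`	Always restart the container after it terminates (default)	Deployment, StatefulSet, DaemonSet
`OnFailure`	Restart only when the exit code is non-zero	Job
`Never`	Never restart	One-off tasks, debug Pods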
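
The table can be restated as a tiny decision function. This is only an illustrative model of the rule, not kubelet code, and the function name should_restart is made up for the example.

```shell
# should_restart POLICY EXIT_CODE
# Prints "yes" or "no" according to the restartPolicy table above.
should_restart() {
  local policy="$1" exit_code="$2"
  case "$policy" in
    Always)
      echo yes ;;                      # restart regardless of exit code
    OnFailure)
      if [ "$exit_code" -ne 0 ]; then  # restart only on failure
        echo yes
      else
        echo no
      fi ;;
    Never)
      echo no ;;                       # never restart
    *)
      echo "unknown policy: $policy" >&2
      return 2 ;;
  esac
}

should_restart Always 0      # yes
should_restart OnFailure 0   # no
should_restart OnFailure 1   # yes
should_restart Never 137     # no
```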

2. Verifying the Restart Policy

Always policy:

apiVersion: v1
kind: Pod
metadata:
  name: test-always
spec:
  restartPolicy: Always
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "echo hello && sleep 10 && exit 1"]

# Create the Pod
kubectl apply -f test-always.yaml

# Watch the restarts
kubectl get pod test-always -w
# NAME          READY   STATUS    RESTARTS   AGE
# test-always   1/1     Running   0          5s
# test-always   0/1     Error     0          15s
# test-always   1/1     Running   1          16s  # restarted automatically
# test-always   0/1     Error     1          26s
# test-always   0/1     CrashLoopBackOff   1   30s  # restart delay grows
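
The growing delay follows the back-off described in the Kubernetes documentation: 10s, 20s, 40s and so on, capped at 300 seconds, and reset once a container has run for 10 minutes without problems. The sketch below only reproduces that documented sequence; the function name backoff_delay is an assumption, and the intervals you observe with kubectl get -w will not match it to the second.

```shell
# backoff_delay N
# Prints the documented delay in seconds for the N-th back-off (N >= 1):
# 10s doubled each time, capped at 300s.
backoff_delay() {
  local n="$1" delay=10 i=1
  while [ "$i" -lt "$n" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 300 ]; then
      delay=300   # cap at five minutes
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

for n in 1 2 3 4 5 6 7; do
  echo "back-off #$n: $(backoff_delay "$n")s"
done
# back-off #1: 10s
# back-off #2: 20s
# back-off #3: 40s
# back-off #4: 80s
# back-off #5: 160s
# back-off #6: 300s
# back-off #7: 300s
```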

OnFailure policy:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: test
          image: busybox
          command: ["sh", "-c", "echo job running && exit 0"]

# Create the Job
kubectl apply -f test-job.yaml
kubectl get pod -l job-name=test-job
# NAME             READY   STATUS      RESTARTS   AGE
# test-job-xxxxx   0/1     Completed   0          10s  # completed successfully, not restarted

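Health Checks (Probes)

1. The Three Probes

Probe types:

livenessProbe: liveness probe; on failure the kubelet restarts the container
readinessProbe: readiness probe; on failure the Pod is removed from Service endpoints
startupProbe: startup probe for slow-starting applications (K8s 1.18+, GA in 1.20); liveness and readiness probes are held back until it succeeds

2. Probe Mechanisms

HTTP GET: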

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: Custom-Header
        value: Awesome
  initialDelaySeconds: 3
  periodSeconds: 10
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 3

TCP Socket:

livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 15
  periodSeconds: 20

Exec:

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

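3. Parameter Reference

Parameter	Description	Default	Recommended
`initialDelaySeconds`	Delay after container start before the first probe	0	Set according to application startup time
`periodSeconds`	Interval between probes	10	10-30
`timeoutSeconds`	Probe timeout	1	1-5
`successThreshold`	Consecutive successes required to count as healthy	1	1 (liveness and startup, where it must be 1); 1-3 (readiness)
`failureThreshold`	Consecutive failures required to count as failed	3	3-5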
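
These parameters combine into two numbers worth estimating before deployment: how long a startup probe lets the application take to start, and roughly how long a hung container survives before the liveness probe gives up on it. The helpers below are a back-of-the-envelope sketch with made-up names; they ignore timeoutSeconds and probe scheduling jitter, so treat the results as approximations.

```shell
# startup_budget INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Approximate maximum startup time tolerated by a startupProbe.
startup_budget() {
  echo $(( $1 + $2 * $3 ))
}

# detection_window PERIOD FAILURE_THRESHOLD
# Approximate time from the first failed probe to the failure verdict.
detection_window() {
  echo $(( $1 * $2 ))
}

startup_budget 0 5 30     # 150 (seconds allowed for startup)
detection_window 10 3     # 30  (seconds until liveness failure is declared)
```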

4. Complete Example

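apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: nginx
      image: nginx:1.21
      ports:
        - containerPort: 80

      # Probes are container fields, so they must be indented under the container.
      # Note: stock nginx does not serve /healthz or /ready; these paths
      # assume the image or its configuration exposes them.

      # Startup probe (slow-starting applications)
      startupProbe:
        httpGet:
          path: /healthz
          port: 80
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30  # 30*5 = up to 150 seconds to start

      # Liveness probe
      livenessProbe:
        httpGet:
          path: /healthz
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3

      # Readiness probe
      readinessProbe:
        httpGet:
          path: /ready
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3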

Verify the probes:

# Inspect the probe settings
kubectl describe pod web-app | grep -A 10 Liveness
kubectl describe pod web-app | grep -A 10 Readiness

# View events
kubectl get events --field-selector involvedObject.name=web-app
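
Failed probes are recorded as events with the reason Unhealthy, so the event query can be narrowed to just those. The wrapper below is a minimal sketch; the function name probe_failures and the default namespace argument are assumptions for illustration.

```shell
# probe_failures POD [NAMESPACE]
# Lists only the probe-failure events (reason=Unhealthy) for one Pod,
# oldest first.
probe_failures() {
  local pod="$1" ns="${2:-default}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=Unhealthy" \
    --sort-by=.lastTimestamp
}
```

Calling probe_failures web-app shows messages such as which probe failed and the HTTP status or error it saw, which is usually enough to tell a wrong path from a slow application.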

Lifecycle Hooks

1. PostStart Hook

Runs immediately after the container is created:

lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "echo 'Container started' > /tmp/start.log"]

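Notes:

• postStart runs asynchronously with the container ENTRYPOINT, so there is no guarantee it runs before the ENTRYPOINT
• If postStart fails, the container is killed and then restarted according to restartPolicy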

2. PreStop Hook

Runs before the container is terminated (graceful shutdown):

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "nginx -s quit; while killall -0 nginx; do sleep 1; done"]

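Execution flow:

Kubernetes runs preStop first, before sending the TERM signal
Once preStop finishes, the TERM signal is sent to the container
The container has until terminationGracePeriodSeconds (default 30 seconds) expires; the countdown starts when termination begins, so it includes the time spent in preStop
On timeout, SIGKILL is sent to force termination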
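
Because the grace period has to cover the preStop hook as well as the application's own SIGTERM handling, it helps to check the budget explicitly. The helper below is a hypothetical sanity check (the name grace_period_ok and its arguments are assumptions); all values are in seconds.

```shell
# grace_period_ok GRACE PRESTOP SHUTDOWN
# GRACE:    terminationGracePeriodSeconds
# PRESTOP:  expected duration of the preStop hook
# SHUTDOWN: expected time the app needs after receiving SIGTERM
grace_period_ok() {
  local grace="$1" prestop="$2" shutdown="$3"
  local needed=$((prestop + shutdown))
  if [ "$needed" -le "$grace" ]; then
    echo "ok: need ${needed}s of ${grace}s"
  else
    echo "too short: need ${needed}s but grace period is ${grace}s"
    return 1
  fi
}

grace_period_ok 60 10 20         # ok: need 30s of 60s
grace_period_ok 30 10 25 || true # too short: need 35s but grace period is 30s
```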

3. Complete Graceful Shutdown Example

apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown
spec:
  terminationGracePeriodSeconds: 60  # longer grace period
  containers:
    - name: nginx
      image: nginx:1.21
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10 && nginx -s quit"]
      ports:
        - containerPort: 80

Verify graceful shutdown:

# Delete the Pod
kubectl delete pod graceful-shutdown

# Watch from another terminal
kubectl get pod graceful-shutdown -w
# NAME                 READY   STATUS        RESTARTS   AGE
# graceful-shutdown    1/1     Running       0          1m
# graceful-shutdown    1/1     Terminating   0          1m10s  # preStop running
# graceful-shutdown    0/1     Terminating   0          1m20s  # container stopped

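Init Containers

1. Basic Concepts

Characteristics:

They run sequentially before the main containers start
Each init container must complete successfully before the next one runs
If one fails, the kubelet re-runs that init container until it succeeds; with restartPolicy: Never the Pod is marked Failed instead

2. Common Scenarios

Waiting for a dependent service: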

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.34
      command: ['sh', '-c', 'until nslookup mysql; do echo waiting for mysql; sleep 2; done']
  containers:
    - name: myapp
      image: myapp:1.0

Initializing configuration files:

initContainers:
  - name: init-config
    image: busybox:1.34
    command: ['sh', '-c', 'cp /config/*.conf /app/config/']
    volumeMounts:
      - name: config
        mountPath: /config
      - name: app-config
        mountPath: /app/config

Setting permissions:

initContainers:
  - name: fix-permissions
    image: busybox:1.34
    command: ['sh', '-c', 'chown -R 1000:1000 /data']
    volumeMounts:
      - name: data
        mountPath: /data
    securityContext:
      runAsUser: 0
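
When a Pod sits in an Init:N/M status, the usual next step is to find which init container is blocking and read its logs. The two helpers below are a small sketch around standard kubectl calls; the function names init_status and init_logs are made up for this example.

```shell
# init_status POD
# Prints one line per init container: name, ready flag, restart count.
init_status() {
  local pod="$1"
  kubectl get pod "$pod" -o \
    jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'
}

# init_logs POD CONTAINER
# Shows the logs of a single init container.
init_logs() {
  kubectl logs "$1" -c "$2"
}
```

For the earlier example, init_status myapp followed by init_logs myapp wait-for-db would show whether the DNS lookup for mysql is still failing.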

Troubleshooting

Scenario 1: CrashLoopBackOff

Symptoms:

kubectl get pod
# NAME    READY   STATUS             RESTARTS   AGE
# mypod   0/1     CrashLoopBackOff   5          5m

Troubleshooting steps:

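# 1. Check the logs
kubectl logs mypod
kubectl logs mypod --previous  # logs from the previous run

# 2. Check the events
kubectl describe pod mypod | grep -A 20 Events

# 3. Check the container exit code
#    (in CrashLoopBackOff the current state is "waiting", so read lastState)
kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# 4. Attach a debug container (useful when the container exits quickly;
#    needs ephemeral containers, beta in 1.23 and GA in 1.25)
kubectl debug mypod -it --image=busybox --target=myapp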

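Common causes:

Application fails to start (bad configuration, missing dependencies)
Health checks are too strict
Insufficient resources (OOMKilled)
Wrong image ENTRYPOINT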
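
The exit code often points at one of these causes directly. The lookup below is a rough guide based on common shell and signal conventions (128 + signal number), not an official Kubernetes mapping, and the function name explain_exit_code is an assumption.

```shell
# explain_exit_code CODE
# Prints a likely interpretation of a container exit code.
explain_exit_code() {
  case "$1" in
    0)   echo "exited normally" ;;
    1)   echo "application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found (check ENTRYPOINT/CMD)" ;;
    137) echo "killed by SIGKILL (128+9), often OOMKilled" ;;
    139) echo "segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application-specific exit code"
      fi ;;
  esac
}

explain_exit_code 137   # killed by SIGKILL (128+9), often OOMKilled
explain_exit_code 127   # command not found (check ENTRYPOINT/CMD)
```

An exit code of 137 is worth cross-checking against the terminated reason in kubectl describe pod, since SIGKILL can also come from an expired grace period rather than the OOM killer.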

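Scenario 2: Pod Pending

Troubleshooting:

# View scheduling events
kubectl describe pod mypod | grep -i "fail\|warn"

# Common causes:
# - Insufficient resources (CPU/memory)
# - PVC not bound
# - Node affinity mismatch
# - Taint/toleration problems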
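
To survey Pending Pods across a namespace rather than one at a time, field selectors can do the filtering. The helpers below are a minimal sketch with made-up names (pending_pods, scheduling_failures). They only surface scheduler rejections, so a Pod that is Pending because of a slow image pull will not show a FailedScheduling event.

```shell
# pending_pods [NAMESPACE]
# Lists every Pod currently in the Pending phase.
pending_pods() {
  kubectl get pods -n "${1:-default}" --field-selector=status.phase=Pending
}

# scheduling_failures POD [NAMESPACE]
# Shows the scheduler's FailedScheduling events for one Pod.
scheduling_failures() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}
```

The message of a FailedScheduling event typically names the unmet condition, such as insufficient CPU, an unbound PVC, or an untolerated taint.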

Scenario 3: Pod Stuck in Terminating

Troubleshooting:

# Check finalizers
kubectl get pod mypod -o jsonpath='{.metadata.finalizers}'

# Force delete (dangerous: the container may keep running on the node)
kubectl delete pod mypod --force --grace-period=0

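Best Practices

1. Configure probes sensibly:

startupProbe: set it for slow-starting applications
livenessProbe: avoid overly strict settings that kill healthy containers
readinessProbe: keeps traffic away from a new version until it is ready during rollouts

2. Graceful shutdown:

Use a preStop hook to finish in-flight requests
terminationGracePeriodSeconds >= longest request duration + 10 seconds

3. Init containers:

Wait for dependent services so the main container does not fail at startup
Keep init images lightweight (for example, busybox)

4. Restart policy:

Deployment: Always (default)
Job: OnFailure or Never
Debug Pods: Never

5. Resource limits: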

   resources:
     requests:
       memory: "256Mi"
       cpu: "500m"
     limits:
       memory: "512Mi"
       cpu: "1000m"

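6. Logging and monitoring:

Monitor Pod restart counts with Prometheus
Alert rule: increase(kube_pod_container_status_restarts_total[1h]) > 5 (the metric is a cumulative counter, so alert on its growth over a window rather than on the raw value; the 1h window is an example)

7. Test the lifecycle:

Delete a Pod to verify graceful shutdown
Deliberately trigger health check failures

8. Version compatibility:

startupProbe requires K8s 1.18+ (beta, enabled by default; GA in 1.20)
On older versions, use a larger initialDelaySeconds instead

9. Avoid zombie processes:

Use tini or dumb-init as PID 1

10. Documentation:

 - Record probe endpoints and their expected responses
 - Record the graceful shutdown procedure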
