The Complete Guide to Kubernetes Production Best Practices


categories: Kubernetes Operations
tags: Kubernetes, Best Practices, High Availability, Production Environment, Performance Optimization


Cluster Architecture

Highly Available Master Configuration


┌─────────────────────────────────────────────────────────────┐
│                    Kubernetes HA Cluster                    │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Master 1  │  │   Master 2  │  │   Master 3  │          │
│  │  etcd Leader│  │   etcd      │  │   etcd      │          │
│  │  API Server │  │  API Server │  │  API Server │          │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘          │
│         │                │                │                 │
│         └────────────────┼────────────────┘                 │
│                          │                                  │
│             ┌────────────┴────────────┐                     │
│             │    Load Balancer        │                     │
│             │    (HAProxy/Keepalived) │                     │
│             └────────────┬────────────┘                     │
└──────────────────────────┼──────────────────────────────────┘
                           │
              ┌────────────┴────────────┐
              │      Worker Nodes       │
              │   (Auto-scaling Group)  │
              └─────────────────────────┘

Resource Planning

Recommended Node Configurations

Node type       CPU       Memory  Disk        Count
Master          4 cores   16 GB   100 GB SSD  3
Worker-General  8 cores   32 GB   200 GB SSD  N
Worker-Compute  16 cores  64 GB   500 GB SSD  N
Worker-Memory   8 cores   64 GB   200 GB SSD  N

Naming Conventions


# Environment labels
labels:
  environment: production    # production/staging/development
  team: platform              # team name
  project: order-system       # project name
  tier: frontend              # tier: frontend or backend
  version: v1.2.3             # version
  cost-center: CC001          # cost center

# Resource naming
# Deployment: --
# Service: -
# ConfigMap: -config
# Secret: -secret

Pod Best Practices

Security Configuration


apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  labels:
    app: secure-app
    environment: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
        environment: production
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10000
        runAsGroup: 10000
        fsGroup: 10000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: myapp:v1.0.0
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
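
Because the container above sets readOnlyRootFilesystem: true, any path the application writes to needs an explicit writable volume. A minimal sketch, assuming the app only needs scratch space under /tmp (the volume name tmp, the mount path, and the size limit are illustrative assumptions, not part of the original manifest):

```yaml
# Fragment of spec.template.spec for the secure-app Deployment (sketch only)
containers:
- name: app
  image: myapp:v1.0.0
  securityContext:
    readOnlyRootFilesystem: true
  volumeMounts:
  - name: tmp              # hypothetical volume name
    mountPath: /tmp        # assumes the app writes scratch files here
volumes:
- name: tmp
  emptyDir:
    sizeLimit: 64Mi        # illustrative cap; tune per workload
```

These keys would be merged into the Pod template of the Deployment above; the rest of the manifest stays unchanged.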

Deployment Strategies

Canary Release


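# v1 Deployment (current version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-v1
spec:
  replicas: 10
  selector:                # required by apps/v1; must match the Pod template labels
    matchLabels:
      app: myapp           # label values here are illustrative assumptions
      version: v1
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      containers:
      - name: app
        image: myapp:v1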
---
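# v2 Deployment (canary)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-v2
spec:
  replicas: 1
  selector:                # required by apps/v1; must match the Pod template labels
    matchLabels:
      app: myapp           # label values here are illustrative assumptions
      version: v2
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
      - name: app
        image: myapp:v2
        env:
        - name: CANARY
          value: "true"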
---
# HPA for v2 (Object metrics are served by a custom metrics adapter, which must be installed separately)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-v2-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-v2
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      describedObject:
        kind: Service
        name: app-v2
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
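
The two Deployments only act as a canary pair if a single Service routes to both. A minimal sketch, assuming both Pod templates carry a shared label app: myapp and that the containers listen on port 8080 (the Service name app, the label, and both port numbers are assumptions for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app                # hypothetical Service name
spec:
  selector:
    app: myapp             # shared label; matches Pods from both app-v1 and app-v2
  ports:
  - port: 80
    targetPort: 8080       # assumed container port
```

With 10 v1 replicas and 1 v2 replica, roughly one request in eleven would reach the canary, since traffic is spread approximately evenly across ready endpoints. If the HPA scales app-v2 up, the canary's share grows accordingly.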

Monitoring and Alerting


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
spec:
  groups:
  - name: production.rules
    rules:
    # Abnormal Pod alert (Succeeded is excluded so completed Jobs do not fire it)
    - alert: PodNotReady
      expr: kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} > 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
        description: "Pod has been in a non-Running phase for more than 2 minutes"

    # Insufficient resources alert (CPU throttling)
    - alert: PodResourceLimit
      expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU throttled"

    # Node memory pressure alert (metric names as exposed by node_exporter 0.16+)
    - alert: NodePressure
      expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} memory pressure"

Disaster Recovery

etcd Backup


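#!/bin/bash
# Periodic etcd backup script
set -euo pipefail
DATE=$(date +%Y%m%d_%H%M%S)
SNAPSHOT="/backup/etcd-backup-${DATE}.db"
# On a TLS-secured cluster, also pass --endpoints, --cacert, --cert, and --key
ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT}"
aws s3 cp "${SNAPSHOT}" s3://my-backup-bucket/etcd/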

Cluster Recovery


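# Restore from a snapshot
# (on etcd 3.5+, etcdutl snapshot restore is the preferred equivalent)
export ETCDCTL_API=3
etcdctl snapshot restore /backup/etcd-backup.db \
  --data-dir=/var/lib/etcd-restore

# Point etcd (not the kubelet) at the restored data directory.
# The kubelet's --root-dir is its own state directory and must not be changed here.
# On kubeadm clusters etcd runs as a static Pod: change the etcd-data hostPath in
# /etc/kubernetes/manifests/etcd.yaml to /var/lib/etcd-restore; the kubelet then restarts etcd.
# For a systemd-managed etcd, set --data-dir=/var/lib/etcd-restore and restart etcd.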

Cost Optimization

Resource Quotas


apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    cpu: "100"
    memory: 200Gi
    pods: "200"
    requests.nvidia.com/gpu: "10"
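
Once a quota constrains cpu and memory in a namespace, Pods that omit requests for those resources can be rejected at admission. A LimitRange can supply defaults so such Pods are still admitted. A minimal sketch, assuming the same production namespace (the object name and all default values are illustrative assumptions):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults   # hypothetical name
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:           # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                  # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
```

The defaults only apply to containers that leave the fields unset; explicit requests and limits in a manifest always take precedence.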

Pod Disruption Budget


apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: production-app

Production Checklist

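  • [ ] Run the latest stable Kubernetes release
  • [ ] Enable RBAC access control
  • [ ] Configure NetworkPolicy
  • [ ] Enable encryption at rest for data stored in etcd
  • [ ] Set resource requests and limits
  • [ ] Configure health checks (liveness/readiness)
  • [ ] Enforce Pod Security Standards via Pod Security Admission (PodSecurityPolicy was removed in Kubernetes 1.25)
  • [ ] Configure monitoring and alerting
  • [ ] Set Pod disruption budgets
  • [ ] Back up etcd regularly
  • [ ] Pin image version tags instead of using latest
  • [ ] Configure resource quotas
  • [ ] Enable audit logging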
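
For the NetworkPolicy item in the checklist, a common starting point is a default-deny ingress policy per namespace, with explicit allow rules added on top. A minimal sketch, assuming the production namespace used elsewhere in this article and a CNI plugin that enforces NetworkPolicy (the policy name is an illustrative assumption):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # hypothetical name
  namespace: production
spec:
  podSelector: {}              # empty selector matches every Pod in the namespace
  policyTypes:
  - Ingress                    # no ingress rules listed, so all inbound traffic is denied
```

On a cluster whose network plugin does not implement NetworkPolicy, this object is accepted by the API server but has no effect.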
