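Kubernetes Production Best Practices: A Complete Guide
categories: Kubernetes Operations
tags: Kubernetes, Best Practices, High Availability, Production, Performance Tuning
Cluster Architecture
Highly Available Master Configuration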
┌─────────────────────────────────────────────────────────────┐
│                    Kubernetes HA Cluster                    │
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  Master 1   │    │  Master 2   │    │  Master 3   │      │
│  │ etcd Leader │    │    etcd     │    │    etcd     │      │
│  │ API Server  │    │ API Server  │    │ API Server  │      │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘      │
│         │                  │                  │             │
│         └──────────────────┼──────────────────┘             │
│                            │                                │
│               ┌────────────┴────────────┐                   │
│               │      Load Balancer      │                   │
│               │  (HAProxy/Keepalived)   │                   │
│               └────────────┬────────────┘                   │
└────────────────────────────┼────────────────────────────────┘
                             │
                ┌────────────┴────────────┐
                │      Worker Nodes       │
                │  (Auto-scaling Group)   │
                └─────────────────────────┘
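Three masters is the smallest layout that survives the loss of one node, because etcd commits a write only when a majority of members agree. A minimal sketch of that arithmetic follows; the function names are illustrative and are not part of etcd or kubectl.

```shell
#!/bin/bash
# Majority quorum arithmetic for an etcd cluster of N members.
# Function names are illustrative helpers, not part of any tool.

# quorum N: members that must be healthy for writes to succeed
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while quorum is still held
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A fourth member raises the quorum to 3 without tolerating more failures than three members do, which is why odd member counts are the usual choice.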
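Resource Planning
Recommended Node Sizing
| Node type | CPU | Memory | Disk | Count |
|---|---|---|---|---|
| Master | 4 cores | 16GB | 100GB SSD | 3 |
| Worker (general purpose) | 8 cores | 32GB | 200GB SSD | N |
| Worker (compute-optimized) | 16 cores | 64GB | 500GB SSD | N |
| Worker (memory-optimized) | 8 cores | 64GB | 200GB SSD | N |
Naming Conventions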
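# Environment labels
labels:
  environment: production    # production/staging/development
  team: platform             # team name
  project: order-system      # project name
  tier: backend              # frontend or backend ("/" is not allowed in a label value)
  version: v1.2.3            # release version
  cost-center: CC001         # cost center
# Resource naming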
# Deployment: --
# Service: -
# ConfigMap: -config
# Secret: -secret
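Names and label values that break Kubernetes syntax rules are rejected by the API server, so it can help to check them before a manifest is applied. The sketch below uses two hypothetical helper functions. It checks the stricter DNS label form (at most 63 characters), which is a safe subset for Deployments, ConfigMaps, and Secrets; Service names have traditionally also been required to start with a letter, which this sketch does not enforce.

```shell
#!/bin/bash
# Illustrative validators; both function names are hypothetical helpers.

# DNS-1123 label: at most 63 characters, lowercase alphanumerics or '-',
# starting and ending with an alphanumeric character.
is_dns1123_label() {
  local LC_ALL=C   # keep [a-z] ranges ASCII-only regardless of the caller's locale
  local re='^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
  [[ ${#1} -le 63 && $1 =~ $re ]]
}

# Label value: empty, or at most 63 characters of alphanumerics, '-', '_' or '.',
# starting and ending with an alphanumeric character.
is_label_value() {
  local LC_ALL=C
  local re='^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$'
  [[ ${#1} -le 63 && $1 =~ $re ]]
}

for name in order-system-config Order_System; do
  if is_dns1123_label "$name"; then echo "$name: valid name"; else echo "$name: invalid name"; fi
done
for value in v1.2.3 frontend/backend; do
  if is_label_value "$value"; then echo "$value: valid label value"; else echo "$value: invalid label value"; fi
done
```

A value such as frontend/backend fails the second check because "/" is only permitted in the prefix part of a label key, never in a value.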
Pod Best Practices
Security Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  labels:
    app: secure-app
    environment: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
        environment: production
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10000
        runAsGroup: 10000
        fsGroup: 10000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: myapp:v1.0.0
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
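The probe settings above determine how quickly the kubelet reacts to an unhealthy container. As a rough upper bound, and only as an approximation of kubelet behavior, the reaction time is failureThreshold × periodSeconds plus one timeoutSeconds. The helper name below is hypothetical.

```shell
#!/bin/bash
# Rough upper bound, in seconds, between a container starting to fail its
# probe and the kubelet acting on it. An approximation, not kubelet code.
# Usage: probe_reaction_seconds PERIOD FAILURE_THRESHOLD TIMEOUT
probe_reaction_seconds() {
  echo $(( $1 * $2 + $3 ))
}

# Values taken from the liveness and readiness probes in the manifest above
echo "liveness:  restart after up to $(probe_reaction_seconds 10 3 5)s"
echo "readiness: removed from Service endpoints after up to $(probe_reaction_seconds 5 3 3)s"
```

Readiness reacts faster than liveness here, so a struggling Pod stops receiving traffic before it is restarted.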
Deployment Strategies
Canary Release
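# v1 Deployment (current version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-v1
spec:
  replicas: 10
  selector:              # required by apps/v1; must match the template labels
    matchLabels:
      app: myapp
      version: v1
  template:
    metadata:
      labels:
        app: myapp       # illustrative shared label so one Service can select both versions
        version: v1
    spec:
      containers:
        - name: app
          image: myapp:v1
---
# v2 Deployment (canary)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      version: v2
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: app
          image: myapp:v2
          env:
            - name: CANARY
              value: "true"
---
# HPA for v2 (Object metrics require a custom metrics adapter that serves http_requests_per_second)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-v2-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-v2
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: v1
          kind: Service
          name: app-v2
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"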
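Two numbers are worth estimating before a canary rollout: the share of traffic the canary receives and how far the HPA will scale it. The sketch below assumes a single Service that spreads requests evenly over the Pods of both Deployments, and it ignores the HPA's tolerance band and stabilization window. Both function names are hypothetical.

```shell
#!/bin/bash
# canary_share_percent V1_REPLICAS V2_REPLICAS
# Approximate share of requests reaching v2 (rounded down), assuming one Service
# spreads traffic evenly over all ready Pods of both Deployments.
canary_share_percent() {
  echo $(( $2 * 100 / ($1 + $2) ))
}

# hpa_desired_replicas TOTAL_RPS TARGET_AVERAGE MIN MAX
# With an AverageValue target the metric is divided by the Pod count, so the
# desired count reduces to ceil(TOTAL_RPS / TARGET_AVERAGE), clamped to MIN..MAX.
hpa_desired_replicas() {
  local desired=$(( ($1 + $2 - 1) / $2 ))
  if (( desired < $3 )); then desired=$3; fi
  if (( desired > $4 )); then desired=$4; fi
  echo "$desired"
}

echo "canary share: $(canary_share_percent 10 1)%"
echo "replicas at 250 rps: $(hpa_desired_replicas 250 100 1 10)"
```

With 10 v1 replicas and 1 v2 replica the canary sees roughly 9% of requests; as the HPA adds v2 Pods that share grows, so v1 usually needs to be scaled down in step.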
Monitoring and Alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
spec:
  groups:
    - name: production.rules
      rules:
        # Pod health alert (Succeeded is excluded so completed Jobs do not fire)
        - alert: PodNotReady
          expr: kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
            description: "Pod has been in a non-running phase for more than 2 minutes"
        # CPU throttling alert
        - alert: PodResourceLimit
          expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU throttled"
        # Node memory pressure alert (metric names as exposed by node_exporter 0.16+)
        - alert: NodePressure
          expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} memory pressure"
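The NodePressure rule above compares available memory with total memory. To sanity-check the 10% threshold on a single Linux node, the same ratio can be computed from /proc/meminfo, which is where node_exporter reads these values. The function name below is a hypothetical helper.

```shell
#!/bin/bash
# mem_available_percent: reads /proc/meminfo-formatted text on stdin and prints
# MemAvailable as a whole-number percentage of MemTotal (rounded down).
mem_available_percent() {
  awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
       END { if (total > 0) printf "%d\n", int(avail * 100 / total) }'
}

# On a Linux node:
#   pct=$(mem_available_percent < /proc/meminfo)
#   if [ "$pct" -lt 10 ]; then echo "below the 10% alert threshold"; fi
```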
Disaster Recovery
etcd Backup
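#!/bin/bash
# Periodic etcd backup script (endpoint and certificate paths assume a kubeadm-built cluster)
set -euo pipefail
DATE=$(date +%Y%m%d_%H%M%S)
ETCDCTL_API=3 etcdctl snapshot save "/backup/etcd-backup-${DATE}.db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
aws s3 cp "/backup/etcd-backup-${DATE}.db" s3://my-backup-bucket/etcd/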
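A backup job that runs on a schedule also needs a retention policy, or the local /backup directory eventually fills the disk. A minimal sketch, assuming GNU or BSD find and a seven-day retention period chosen only for illustration; the helper name is hypothetical.

```shell
#!/bin/bash
# prune_backups DIR DAYS
# Deletes local etcd-backup-*.db files in DIR that were last modified more than
# DAYS days ago (find rounds file age down to whole days).
# The helper name and the retention period are assumptions for illustration.
prune_backups() {
  local dir=$1 days=$2
  find "$dir" -maxdepth 1 -type f -name 'etcd-backup-*.db' -mtime +"$days" -delete
}

# Example: keep roughly one week of local snapshots
#   prune_backups /backup 7
```

The copies uploaded to object storage are not touched by this helper; their retention is better handled with a bucket lifecycle rule.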
Cluster Restore
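# Restore from a snapshot
export ETCDCTL_API=3
etcdctl snapshot restore /backup/etcd-backup.db \
  --data-dir=/var/lib/etcd-restore
# Point etcd (not the kubelet) at the restored data directory.
# On a kubeadm cluster (assumed here) etcd runs as a static Pod: edit
# /etc/kubernetes/manifests/etcd.yaml and change the hostPath of the
# etcd-data volume to /var/lib/etcd-restore; the kubelet then restarts etcd.
# The kubelet --root-dir flag is the kubelet's own state directory and should not be changed for a restore.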
Cost Optimization
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    cpu: "100"                        # same as requests.cpu
    memory: 200Gi                     # same as requests.memory
    pods: "200"
    requests.nvidia.com/gpu: "10"
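Whichever quota dimension is exhausted first caps the namespace. The sketch below, using a hypothetical helper, works out how many identical Pods fit under the quota above for Pods that request 100m CPU and 128Mi memory, as secure-app does.

```shell
#!/bin/bash
# quota_max_pods QUOTA_CPU_M QUOTA_MEM_MI QUOTA_PODS REQ_CPU_M REQ_MEM_MI
# Prints how many identical Pods fit before the first quota limit is reached.
quota_max_pods() {
  local by_cpu=$(( $1 / $4 )) by_mem=$(( $2 / $5 )) max=$3
  if (( by_cpu < max )); then max=$by_cpu; fi
  if (( by_mem < max )); then max=$by_mem; fi
  echo "$max"
}

# 100 CPUs = 100000m, 200Gi = 204800Mi, 200 Pods; each Pod requests 100m / 128Mi
quota_max_pods 100000 204800 200 100 128
```

For small Pods like these the pods count is the binding limit; for Pods requesting a full CPU each, the CPU quota binds first at 100 Pods.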
Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: production-app
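A minAvailable budget only permits a voluntary eviction while the number of healthy Pods stays above the floor. A minimal sketch of that calculation, with a hypothetical helper name:

```shell
#!/bin/bash
# allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
# Voluntary evictions permitted right now under a minAvailable budget.
allowed_disruptions() {
  local allowed=$(( $1 - $2 ))
  if (( allowed < 0 )); then allowed=0; fi
  echo "$allowed"
}

for healthy in 3 2 1; do
  echo "healthy=$healthy allowed=$(allowed_disruptions "$healthy" 2)"
done
```

With minAvailable: 2 a workload needs at least three healthy replicas for a node drain to proceed; with exactly two, every eviction is blocked.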
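Production Checklist
- [ ] Run a current, supported stable Kubernetes release
- [ ] Enable RBAC authorization
- [ ] Configure NetworkPolicy
- [ ] Enable encryption at rest for data stored in etcd
- [ ] Set resource requests and limits
- [ ] Configure health checks (liveness/readiness)
- [ ] Enforce Pod Security Standards via Pod Security Admission (PodSecurityPolicy was removed in Kubernetes 1.25)
- [ ] Configure monitoring and alerting
- [ ] Set Pod disruption budgets
- [ ] Back up etcd regularly
- [ ] Pin image version tags instead of using latest
- [ ] Configure resource quotas
- [ ] Enable audit logging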