Kubernetes Monitoring and Logging: A Complete Guide to Prometheus, Grafana, and ELK


categories: - Kubernetes Operations tags: - Kubernetes - Prometheus - Grafana - ELK - Monitoring - Logging


Deploying Prometheus

Installing the Prometheus Operator


# Install with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d

Example Prometheus configuration


apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  retention: 30d
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: main
      port: web

ServiceMonitor


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    team: frontend
spec:
  namespaceSelector:
    matchNames:
    - production
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
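
The Prometheus resource picks up ServiceMonitors through `serviceMonitorSelector.matchLabels`, and the ServiceMonitor in turn picks up Services through `selector.matchLabels`. As a rough illustration of that equality-based rule (every key/value pair in the selector must be present in the object's labels), here is a minimal Python sketch. The function name `matches` and the second label set are hypothetical, and the sketch models only `matchLabels`, not `matchExpressions`.

```python
def matches(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True if every selector pair is present in the object's labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Selector taken from the Prometheus resource in this guide.
prometheus_selector = {"team": "frontend"}

# Labels of the myapp-monitor ServiceMonitor defined above.
service_monitor_labels = {"team": "frontend"}

# A hypothetical ServiceMonitor owned by another team.
other_labels = {"team": "backend", "tier": "api"}

selected = matches(prometheus_selector, service_monitor_labels)
ignored = matches(prometheus_selector, other_labels)
print(selected, ignored)
```

If a ServiceMonitor never shows up as a scrape target, comparing its labels against the selector in this way is usually the first thing worth checking.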

Deploying Grafana


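# Install with Helm (kube-prometheus-stack already bundles Grafana; use this for a standalone instance)
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set adminPassword="your-password"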

Grafana data source configuration


apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus.monitoring:9090
      access: proxy
      isDefault: true

Custom metrics


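from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Define the metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests')
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

# Record the latency of every call and count each request
@REQUEST_LATENCY.time()
def handle_request():
    REQUEST_COUNT.inc()
    time.sleep(random.random())  # simulate work

if __name__ == '__main__':
    # Expose the metrics at http://localhost:8000/metrics
    start_http_server(8000)
    # The metrics server runs in a background thread, so keep the process alive
    while True:
        handle_request()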
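
The PromQL queries later in this guide group by `method` and filter on `status`, which only works if the counter carries those labels. The sketch below uses plain Python with no client library and shows roughly what a labeled counter looks like in the Prometheus text exposition format served at `/metrics`. The `LabeledCounter` class is hypothetical and exists only for illustration; a real service would normally rely on the label support in `prometheus_client`.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter keyed by label values (illustration only)."""

    def __init__(self, name: str, help_text: str, label_names: tuple[str, ...]):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self.values: dict[tuple[str, ...], float] = defaultdict(float)

    def inc(self, *label_values: str, amount: float = 1.0) -> None:
        if len(label_values) != len(self.label_names):
            raise ValueError("one value is required per label name")
        self.values[label_values] += amount

    def render(self) -> str:
        """Render the counter in the text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, value in sorted(self.values.items()):
            pairs = ",".join(
                f'{k}="{v}"' for k, v in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


requests_total = LabeledCounter(
    "http_requests_total", "Total HTTP requests", ("method", "status")
)
requests_total.inc("GET", "200")
requests_total.inc("GET", "200")
requests_total.inc("POST", "500")
exposition = requests_total.render()
print(exposition)
```

Each distinct combination of label values becomes its own time series, which is why labels with unbounded values (user IDs, full URLs) are generally avoided.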

Log collection: Loki + Promtail


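# Install Loki (the loki-stack chart is published in the grafana Helm repo, https://grafana.github.io/helm-charts)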
helm install loki grafana/loki-stack \
  --namespace monitoring \
  -f loki-values.yaml

Promtail configuration


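server:
  http_listen_port: 3100

positions:
  filename: /tmp/positions.yaml  # where Promtail records how far it has read

clients:
  - url: http://loki.monitoring:3100/loki/api/v1/push

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Expose the pod's "app" label so queries such as {app="myapp"} can match
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: app
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  # Tell Promtail which log files on the node belong to each pod
  - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
    separator: /
    target_label: __path__
    replacement: /var/log/pods/*$1/*.log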

Example log queries


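# Find log lines containing "ERROR" with LogQL
{app="myapp", namespace="production"} |= "ERROR"

# Count error lines over the last 5 minutes
count_over_time({app="myapp"} |= "ERROR" [5m])

# Parse JSON logs and print only the timestamp and message fields
{app="myapp"} | json | line_format "{{.timestamp}} {{.message}}"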
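
The same LogQL expressions can also be sent to Loki's HTTP API. The sketch below only builds a request URL for the `/loki/api/v1/query_range` endpoint with the standard library; it does not send anything. The base URL reuses the `loki.monitoring:3100` address from the Promtail configuration, while the helper name `build_query_range_url` and the default limit of 100 are assumptions made for illustration.

```python
import time
from urllib.parse import urlencode


def build_query_range_url(base_url: str, logql: str, start_ns: int,
                          end_ns: int, limit: int = 100) -> str:
    """Build a query_range URL; start and end are Unix timestamps in nanoseconds."""
    params = urlencode({
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


now_ns = time.time_ns()
five_minutes_ns = 5 * 60 * 1_000_000_000

url = build_query_range_url(
    "http://loki.monitoring:3100",
    '{app="myapp", namespace="production"} |= "ERROR"',
    start_ns=now_ns - five_minutes_ns,
    end_ns=now_ns,
)
print(url)
```

From inside the cluster, the resulting URL could be fetched with `urllib.request.urlopen` or `curl`; from a workstation, a `kubectl port-forward` to the Loki service would be needed first.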

Alerting configuration


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: production
spec:
  groups:
  - name: myapp.rules
    rules:
    - alert: HighErrorRate
      expr: rate(http_errors_total[5m]) > 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }}"
    - alert: HighCPUUsage
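      # CPU seconds is a cumulative counter, so alert on its per-second rate (cores used) per pod
      expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.9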
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage"
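
The `for` field means the expression has to stay true for the whole duration before the alert fires; until then the alert is pending, and a single false evaluation resets the timer. The following is a simplified Python model of that behavior, not Prometheus's actual implementation. The function `alert_states` and the sample data are hypothetical, and the model assumes one rule evaluation per sample.

```python
def alert_states(samples: list[tuple[float, float]], threshold: float,
                 for_seconds: float) -> list[str]:
    """Return the alert state at each (timestamp, value) evaluation.

    States: "inactive" while the condition is false, "pending" while it has
    been true for less than for_seconds, "firing" afterwards.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Error rate sampled once per minute; threshold and duration mirror HighErrorRate.
samples = [(0, 0.05), (60, 0.2), (120, 0.3), (180, 0.02),
           (240, 0.2), (300, 0.2), (360, 0.2), (420, 0.2), (480, 0.2), (540, 0.2)]
states = alert_states(samples, threshold=0.1, for_seconds=300)
print(states)
```

In this sample the brief spike at 60 to 120 seconds never fires, because the rate drops below the threshold before five minutes have passed; only the sustained run starting at 240 seconds reaches the firing state.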

Common monitoring commands


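# Show resource usage for pods in all namespaces (requires metrics-server)
kubectl top pods -A

# Show resource usage for nodes
kubectl top nodes

# Stream logs from the app container of a pod
kubectl logs -f myapp-pod -c app

# List events in all namespaces, sorted by creation time
kubectl get events -A --sort-by='.metadata.creationTimestamp'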

Common PromQL queries


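# CPU usage per pod (cores)
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory working set per pod (bytes)
sum(container_memory_working_set_bytes) by (pod)

# Request rate by HTTP method (requests per second)
sum(rate(http_requests_total[5m])) by (method)

# Rate of 5xx responses per pod (requests per second)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (pod)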
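
`rate()` appears in most of these queries because counters such as `container_cpu_usage_seconds_total` only ever increase, so the raw value says little by itself. As a rough sketch of the idea, the Python below computes a per-second increase from counter samples and treats any drop as a counter reset. It is a simplification: Prometheus also extrapolates towards the window boundaries, which is omitted here. The function name `simple_rate` and the sample values are made up for illustration.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        raise ValueError("at least two samples are needed")
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. container restart): count from zero again.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# CPU seconds consumed by a hypothetical container, sampled every 15 seconds.
cpu_seconds = [(0, 100.0), (15, 107.5), (30, 115.0), (45, 122.5), (60, 130.0)]
cores_used = simple_rate(cpu_seconds)
print(cores_used)

# Same workload, but the counter reset to zero between the 2nd and 3rd sample.
with_reset = [(0, 100.0), (15, 107.5), (30, 7.5), (45, 15.0), (60, 22.5)]
print(simple_rate(with_reset))
```

Both series describe a container using half a core, even though the raw counter values differ widely, which is why thresholds are normally applied to the rate rather than to the counter itself.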
