Kubernetes Monitoring and Logging: A Complete Guide to Prometheus, Grafana, and ELK
categories: Kubernetes Operations
tags: Kubernetes, Prometheus, Grafana, ELK, Monitoring, Logging
Deploying Prometheus
Installing the Prometheus Operator
# Install with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d
Example Prometheus configuration
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  retention: 30d
  serviceAccountName: prometheus
  # Only ServiceMonitors labeled team=frontend are picked up
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: main
        port: web
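ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    team: frontend   # matched by the serviceMonitorSelector of the Prometheus resource
spec:
  # Look for Services in the production namespace
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics   # name of the Service port, not a port number
      interval: 15s
      path: /metrics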
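Both selections above rest on the same matchLabels rule: every key/value pair in the selector must be present on the target object, and extra labels on the target are ignored. The following is a minimal Python sketch of that rule, a simplified model rather than the operator's actual code. The function name `match_labels` and the `tier` label are illustrative assumptions.

```python
def match_labels(selector: dict, labels: dict) -> bool:
    """Return True if every key/value pair in the selector appears in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Selector from the Prometheus resource and labels from the ServiceMonitor
service_monitor_selector = {"team": "frontend"}
service_monitor_labels = {"team": "frontend"}

# Selector from the ServiceMonitor and labels of a hypothetical Service
service_selector = {"app": "myapp"}
service_labels = {"app": "myapp", "tier": "web"}  # "tier" is an assumed extra label

if __name__ == "__main__":
    print(match_labels(service_monitor_selector, service_monitor_labels))
    print(match_labels(service_selector, service_labels))
    print(match_labels(service_selector, {"app": "other"}))
```

In this model an empty selector matches every object, which is why a selector left as `{}` is much broader than one that is omitted or set to specific labels.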
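Deploying Grafana
# Standalone install with Helm (the kube-prometheus-stack chart already bundles a Grafana instance)
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set adminPassword="your-password"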
Grafana data source configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        # Adjust the host to the name of your Prometheus Service
        url: http://prometheus.monitoring:9090
        access: proxy
        isDefault: true
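Custom application metrics
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Define the metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests')
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')


# Record the duration of every call in the histogram
@REQUEST_LATENCY.time()
def handle_request():
    REQUEST_COUNT.inc()
    time.sleep(random.random())  # simulate work


if __name__ == '__main__':
    # Expose the metrics at http://localhost:8000/metrics
    start_http_server(8000)
    # Keep the process alive; the metrics server stops when the script exits
    while True:
        handle_request()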
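To see what Prometheus receives when it scrapes the port opened by `start_http_server`, it helps to look at the text exposition format. The sketch below parses a small hand-written sample using only the standard library. `parse_metrics` is a hypothetical helper, the sample values are illustrative rather than captured output, and the parser assumes sample lines carry no trailing timestamp.

```python
def parse_metrics(text: str) -> dict:
    """Parse 'name{labels} value' lines, skipping comments and blank lines.

    Simplification: assumes no trailing timestamp on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


# Illustrative sample, hand-written in the exposition format
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total 3.0
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.5"} 1.0
http_request_duration_seconds_bucket{le="+Inf"} 3.0
http_request_duration_seconds_count 3.0
http_request_duration_seconds_sum 1.7
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

Against a running instance of the application, the same function could be fed the body of `http://localhost:8000/metrics` instead of `SAMPLE`.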
Log collection: Loki + Promtail
# Install Loki (the loki-stack chart lives in the grafana Helm repo)
helm install loki grafana/loki-stack \
  --namespace monitoring \
  -f loki-values.yaml
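Promtail configuration
server:
  http_listen_port: 9080   # Promtail's own HTTP port; 3100 is Loki's
positions:
  filename: /run/promtail/positions.yaml   # remembers how far each file was read
clients:
  - url: http://loki.monitoring:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # The "app" label is used by the LogQL queries in this guide
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      # Tell Promtail which log files on the node belong to each pod
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log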
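Example log queries
# Filter log lines with LogQL
{app="myapp", namespace="production"} |= "ERROR"
# Count error log lines over the last 5 minutes
count_over_time({app="myapp"} |= "ERROR" [5m])
# Parse JSON logs and print only the timestamp and message fields
{app="myapp"} | json | line_format "{{.timestamp}} {{.message}}"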
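The same LogQL queries can be run outside Grafana through Loki's HTTP API. Below is a minimal standard-library sketch that builds a `/loki/api/v1/query_range` request for the last few minutes. The host is taken from the Promtail client URL above; the helper names `build_query_url` and `fetch_lines` and the default limit of 100 are assumptions made for illustration.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import List, Optional

LOKI_URL = "http://loki.monitoring:3100"  # same host as the Promtail client URL


def build_query_url(query: str, minutes: int = 5, limit: int = 100,
                    now_ns: Optional[int] = None) -> str:
    """Build a query_range URL covering the last `minutes` minutes."""
    end = time.time_ns() if now_ns is None else now_ns
    start = end - minutes * 60 * 1_000_000_000  # timestamps are in nanoseconds
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "limit": limit}
    )
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"


def fetch_lines(query: str) -> List[str]:
    """Run a log query and return the matching lines (needs a reachable Loki)."""
    with urllib.request.urlopen(build_query_url(query), timeout=10) as response:
        body = json.load(response)
    # For log queries, each stream carries entries of the form [timestamp, line]
    return [entry[1] for stream in body["data"]["result"] for entry in stream["values"]]


if __name__ == "__main__":
    print(build_query_url('{app="myapp", namespace="production"} |= "ERROR"'))
```

Only `build_query_url` runs without a cluster; `fetch_lines` has to be called from somewhere that can resolve the `loki.monitoring` Service, for example from a Pod or through a port-forward with `LOKI_URL` adjusted.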
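Alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  # The Prometheus resource must select this rule via ruleSelector / ruleNamespaceSelector
  namespace: production
spec:
  groups:
    - name: myapp.rules
      rules:
        - alert: HighErrorRate
          # More than 0.1 errors per second, averaged over 5 minutes
          expr: rate(http_errors_total[5m]) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }}"
        - alert: HighCPUUsage
          # The metric is a cumulative counter, so compare its rate (CPU cores used), not the raw value
          expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage"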
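The `for` field is easy to misread: an alert does not fire the moment its expression becomes true, but only after the expression has stayed true at every evaluation for the whole duration. A rough Python model of that behavior follows, assuming a 60-second evaluation interval and ignoring details of the real rule engine. The function `alert_states` and the sample values are hypothetical.

```python
from typing import List


def alert_states(values: List[float], threshold: float,
                 for_seconds: int, interval_seconds: int) -> List[str]:
    """Return the alert state after each evaluation: inactive, pending, or firing."""
    states = []
    active_since = None  # time of the first evaluation in the current breach
    for step, value in enumerate(values):
        now = step * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # A single evaluation at or below the threshold resets the timer
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical error rates, one value per evaluation
    error_rates = [0.05, 0.2, 0.3, 0.05, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
    for step, state in enumerate(alert_states(error_rates, 0.1, 300, 60)):
        print(f"t={step * 60:>3}s  {state}")
```

With these values the short breach at 60 to 120 seconds never fires, and the second breach only fires once it has lasted the full 5 minutes. That is the trade-off behind `for`: fewer alerts from brief spikes at the cost of a delayed notification.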
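Common monitoring commands
# Show resource usage of Pods in all namespaces (requires metrics-server)
kubectl top pods -A
# Show resource usage of nodes
kubectl top nodes
# Follow the logs of the "app" container in a Pod
kubectl logs -f myapp-pod -c app
# List events in all namespaces, oldest first
kubectl get events -A --sort-by='.metadata.creationTimestamp'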
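Common PromQL queries
# CPU usage per Pod, in cores
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
# Memory usage per Pod (working set, in bytes)
sum(container_memory_working_set_bytes{container!=""}) by (pod)
# Request rate per HTTP method, in requests per second
sum(rate(http_requests_total[5m])) by (method)
# Rate of 5xx responses per Pod, in requests per second
sum(rate(http_requests_total{status=~"5.."}[5m])) by (pod)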
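Most of the queries above wrap a counter in `rate(...[5m])`, which turns an ever-growing total into a per-second average over the window. The arithmetic can be sketched in a few lines of Python. This is a simplified model: Prometheus additionally extrapolates to the window boundaries, which is omitted here, and the function name `simple_rate` and the sample data are assumptions.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over the span covered by samples.

    samples: (timestamp_seconds, counter_value) pairs, oldest first.
    A drop in value is treated as a counter reset (the process restarted at 0).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # After a reset the counter restarts from zero, so the whole new value counts
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


if __name__ == "__main__":
    steady = [(0, 0), (150, 15), (300, 30)]
    with_reset = [(0, 100), (60, 130), (120, 160), (180, 10), (240, 40)]
    print(simple_rate(steady))
    print(simple_rate(with_reset))
```

The reset handling is the reason `rate()` should be applied to the raw counter before any `sum()`: once several series are summed, a restart of one of them no longer shows up as a drop that can be detected.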