The Complete Guide to Prometheus Monitoring: From Beginner to Expert

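1. Introduction to Prometheus

1.1 What is Prometheus?

Prometheus is an open-source monitoring and alerting system. It was originally developed at SoundCloud and is now a graduated project of the Cloud Native Computing Foundation (CNCF).

Key features:

  • A multi-dimensional data model: each time series is identified by a metric name and a set of key/value labels
  • PromQL, a powerful query language
  • No dependency on distributed storage; each server node is autonomous
  • Pull-based collection over HTTP, with push supported for short-lived jobs through the Pushgateway
  • Easy to deploy and operate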

1.2 Architecture


┌──────────────────────────────────────────────────────────────┐
│                      Prometheus Server                       │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │      TSDB      │  │   Retrieval    │  │  HTTP Server   │  │
│  │(time series DB)│  │   (scraping)   │  │     (API)      │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                  │                           │
                  ▼                           ▼
        ┌──────────────────┐        ┌──────────────────┐
        │     Grafana      │        │   Alertmanager   │
        │ (visualization)  │        │ (alert handling) │
        └──────────────────┘        └──────────────────┘

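2. Installation and Configuration

2.1 Installing from the binary release


# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz

# Extract the archive
tar xvf prometheus-2.47.0.linux-amd64.tar.gz

# Create a dedicated system user (the systemd unit in 2.3 runs as this user)
sudo useradd --system --no-create-home --shell /bin/false prometheus

# Create the configuration, rules, and data directories
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

# Copy the binaries
sudo cp prometheus-2.47.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.47.0.linux-amd64/promtool /usr/local/bin/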
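
Before copying the binaries, it can be worth checking the downloaded archive against the SHA-256 digest published with the release (the `sha256sums.txt` file on the release page). The Python sketch below shows one way to do that. The helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the expected digest has to be copied from the release page; it is not reproduced here.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a value copied from sha256sums.txt."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A typical call would be `matches_published_digest("prometheus-2.47.0.linux-amd64.tar.gz", digest_from_release_page)`, where the second argument is supplied by the reader.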

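2.2 Configuration file


# /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often alerting and recording rules are evaluated

# Where alerts are sent (Alertmanager listens on port 9093 by default)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

# Alerting rule files to load
rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus scrapes its own metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host metrics exposed by node_exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']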
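
Values such as `scrape_interval: 15s` use the Prometheus duration syntax, which combines integers with the units `y`, `w`, `d`, `h`, `m`, `s`, and `ms`. As a rough illustration of how these strings map to seconds, here is a minimal Python sketch. `parse_duration` is a hypothetical helper, not part of any Prometheus client library, and it assumes a day is 24 hours, a week is 7 days, and a year is 365 days.

```python
import re

# Units must appear from largest to smallest, e.g. "1h30m".
_DURATION_RE = re.compile(
    r"(?:(\d+)y)?(?:(\d+)w)?(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?(?:(\d+)ms)?"
)
# Seconds per unit, in the same order as the regex groups.
_UNIT_SECONDS = (365 * 86400, 7 * 86400, 86400, 3600, 60, 1, 0.001)


def parse_duration(text: str) -> float:
    """Convert a duration string such as '15s' or '1h30m' to seconds."""
    match = _DURATION_RE.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return sum(
        int(amount) * seconds
        for amount, seconds in zip(match.groups(), _UNIT_SECONDS)
        if amount is not None
    )
```

With this sketch, `parse_duration("15s")` gives 15 and `parse_duration("15d")` gives 1296000, the retention period used later in this guide expressed in seconds.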

2.3 systemd service


# /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus Monitoring System
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --web.enable-lifecycle

Restart=on-failure

[Install]
WantedBy=multi-user.target
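
The `--web.enable-lifecycle` flag in the unit above enables the `/-/reload` HTTP endpoint, so a configuration change can be applied without restarting the service. The following Python sketch shows how such a reload could be triggered. The function names are hypothetical, and the base URL assumes Prometheus is listening on `localhost:9090` as configured in this guide.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build the POST request for the /-/reload lifecycle endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )


def reload_config(base_url: str = "http://localhost:9090", timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for error responses, for example
    when the lifecycle API is not enabled.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as response:
        return response.status
```

Running `promtool check config` first is still advisable, since an invalid file makes the reload fail and Prometheus keeps its previous configuration.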

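3. Core Concepts

3.1 Data model

Metric types:

  • Counter: a cumulative value that only increases (it resets to zero when the process restarts)
  • Gauge: a value that can go up and down
  • Histogram: observations counted into cumulative buckets
  • Summary: quantiles computed on the client side

# Counter example
http_requests_total{method="GET", status="200"}

# Gauge example
cpu_usage{core="0"}

# Histogram example
http_request_duration_seconds_bucket{le="0.1"}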
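
To make the "cumulative" part of histograms concrete: the `le="0.1"` bucket counts every observation less than or equal to 0.1, and each larger bucket includes everything counted by the smaller ones. The Python sketch below illustrates that counting rule. `cumulative_buckets` is a hypothetical helper and the bucket boundaries in the usage note are made up for illustration.

```python
import math
from typing import Dict, Iterable, Sequence


def cumulative_buckets(
    observations: Iterable[float], upper_bounds: Sequence[float]
) -> Dict[str, int]:
    """Count observations into cumulative `le` buckets, ending with +Inf."""
    bounds = sorted(upper_bounds) + [math.inf]
    counts = {bound: 0 for bound in bounds}
    for value in observations:
        for bound in bounds:
            if value <= bound:
                # Cumulative: the value is counted in every bucket whose
                # upper bound is at or above it.
                counts[bound] += 1
    return {("+Inf" if math.isinf(b) else str(b)): c for b, c in counts.items()}
```

For request durations of 0.05, 0.2, 0.7, and 3.0 seconds with bounds 0.1, 0.5, and 1.0, the buckets come out as 1, 2, 3, and 4 (the `+Inf` bucket always equals the total number of observations).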

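3.2 PromQL queries


# Select all series of the http_requests_total metric
http_requests_total

# Filter by label
http_requests_total{method="GET"}

# Per-second rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# Other functions and aggregation operators
increase(http_requests_total[1h])
topk(10, http_requests_total)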
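
The idea behind `rate()` can be sketched in a few lines of Python: add up the increases between consecutive samples, treat any decrease as a counter reset, and divide by the elapsed time. This is a simplification under stated assumptions, not the actual implementation: the real `rate()` also extrapolates to the boundaries of the range window, which `simple_rate` (a hypothetical name) does not.

```python
from typing import Sequence, Tuple


def simple_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Approximate per-second rate from (timestamp_seconds, counter_value) pairs."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # Counter reset (for example a process restart): the counter
            # started again from zero, so the new value is the increase.
            increase += current
        else:
            increase += current - previous
    return increase / elapsed
```

For example, a counter that goes from 0 to 120 over 60 seconds yields a rate of 2.0 requests per second.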

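4. Monitoring Targets

4.1 Monitoring Linux hosts


# The 'node' job in /etc/prometheus/prometheus.yml (see 2.2).
# node_exporter must be running on the target host; it listens on port 9100 by default.

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']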

4.2 Monitoring Docker


# Monitor Docker containers with cAdvisor
# (add this job under scrape_configs in prometheus.yml)

- job_name: 'cadvisor'
  static_configs:
    - targets:
      - localhost:8080

4.3 Alerting rules


# /etc/prometheus/rules/node_alerts.yml

groups:
  - name: node_alerts
    rules:
      # CPU usage = 100 minus the average idle percentage across all cores
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on server {{ $labels.instance }} is above 80%"

      # Memory usage = (total - available) / total
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"

      # Less than 15% free space on the root filesystem
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
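5. Alertmanager Configuration


# /etc/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  # SMTP settings are global options; the smtp_* keys are not valid inside email_configs.
  smtp_smarthost: 'smtp.example.com:465'
  # A sender address is required; reusing the account address here is an assumption.
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-password'

route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    # Alerts labelled severity=critical go to the webhook receiver
    - matchers:
        - severity="critical"
      receiver: 'critical-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'

  - name: 'critical-alerts'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'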
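
The routing tree above can be read as: try each child route in order, use the first one whose label conditions all match, and otherwise fall back to the top-level receiver. The Python sketch below captures only that basic rule. It is a simplification: it ignores nested routes, the `continue` option, and regular-expression matchers, and `pick_receiver` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# (required label values, receiver name) for each child route, in configuration order.
ChildRoute = Tuple[Dict[str, str], str]


def pick_receiver(labels: Dict[str, str], child_routes: List[ChildRoute], default: str) -> str:
    """Return the receiver of the first child route whose labels all match."""
    for required, receiver in child_routes:
        if all(labels.get(name) == value for name, value in required.items()):
            return receiver
    # No child route matched: the alert stays with the top-level receiver.
    return default
```

Applied to the configuration in this section, a `DiskSpaceLow` alert with `severity: critical` is delivered to `critical-alerts`, while the `warning` alerts go to `default`.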

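6. Grafana Integration


# Install Grafana (Debian/Ubuntu; the Grafana APT repository must be configured first)
sudo apt-get install -y grafana

# Start the service and enable it at boot
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

# Web UI
# http://localhost:3000
# Default username: admin
# Default password: admin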

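Adding a data source: in the Grafana web UI, add a data source of type Prometheus and set its URL to http://localhost:9090.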
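
The data source can also be created through Grafana's HTTP API instead of the web UI, which is convenient for scripted setups. The Python sketch below only builds the request for the `/api/datasources` endpoint. The function name and the data source name `Prometheus` are assumptions, and authentication (an API token or basic-auth header) is deliberately left out and must be added before the request is sent.

```python
import json
import urllib.request


def build_datasource_request(
    grafana_url: str = "http://localhost:3000",
    prometheus_url: str = "http://localhost:9090",
    name: str = "Prometheus",
) -> urllib.request.Request:
    """Build a POST request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend, not the browser, queries Prometheus
        "isDefault": True,
    }
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```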

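7. Performance Tuning

7.1 Storage


# Retention is set with command-line flags (for example in the ExecStart line
# of the systemd unit), not in prometheus.yml.
# When both limits are set, whichever is reached first triggers data removal.

--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=100GB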
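
When choosing retention limits, a rough capacity estimate helps. The Prometheus documentation gives the rule of thumb: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with samples typically taking 1 to 2 bytes each. The Python sketch below applies that formula. The function name is hypothetical, and the ingestion rate in the usage note is an invented figure for illustration.

```python
def estimate_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,
) -> float:
    """Estimate TSDB disk usage with the rule of thumb from the Prometheus docs."""
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample
```

For 15 days of retention at an assumed 100,000 samples per second and 2 bytes per sample, the estimate is 259,200,000,000 bytes, roughly 259 GB. In that scenario a 100GB size limit would be reached well before the 15-day time limit.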

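7.2 Scraping


scrape_configs:
  - job_name: 'kubernetes-nodes'
    # Discover one target per cluster node through the Kubernetes API
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Copy every Kubernetes node label onto the target as a Prometheus label
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)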
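
The `labelmap` action matches the regular expression against every label name and, for each match, copies the value to a new label named after the first capture group. The Python sketch below mimics that behaviour for the default replacement (`$1`). `labelmap` here is a hypothetical stand-in, and the label values in the usage note are invented.

```python
import re
from typing import Dict


def labelmap(labels: Dict[str, str], regex: str) -> Dict[str, str]:
    """Copy labels whose names fully match `regex` to the name in group 1."""
    pattern = re.compile(regex)
    result = dict(labels)
    for name, value in labels.items():
        # Relabeling regexes are anchored, hence fullmatch rather than search.
        match = pattern.fullmatch(name)
        if match:
            result[match.group(1)] = value
    return result
```

Given `{"__meta_kubernetes_node_label_zone": "eu-1"}`, the result gains a `zone="eu-1"` label. In Prometheus itself, labels starting with `__` are dropped once relabeling finishes, so only the copied label remains on the scraped series.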

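8. Troubleshooting

8.1 Prometheus fails to start


# Validate the configuration
promtool check config /etc/prometheus/prometheus.yml

# Show the last 100 log lines
journalctl -u prometheus -n 100

8.2 Performance problems


# Inspect Prometheus's own HTTP metrics
curl -s http://localhost:9090/metrics | grep prometheus_http

# Analyze queries interactively in the expression browser:
# http://localhost:9090/graph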

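9. Summary

This guide walked through a complete Prometheus setup: installation and configuration, the data model and PromQL, scrape targets and alerting rules, Alertmanager, Grafana integration, performance tuning, and troubleshooting.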
