A Gorgeous Monitoring System: Prometheus + Alertmanager + Grafana

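Grafana + Prometheus is one of the trendiest monitoring stacks today. Prometheus collects fine-grained metrics, Grafana presents them in a polished web console, and Alertmanager routes alerts to a variety of channels. Together they make a gorgeous, refined monitoring system.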

Environment

The operating system in this example is Rocky Linux 8.8
Grafana version: v11.1.3
Prometheus version: v2.53.1
Alertmanager version: v0.27.0

Install Grafana

Go to the official installation page and install Grafana

[root@grafana grafana]# dnf install -y https://dl.grafana.com/oss/release/grafana-11.1.3-1.x86_64.rpm

Start the grafana service and enable it at boot

[root@grafana grafana]# systemctl enable grafana-server --now

Allow the web UI port through the firewall

firewall-cmd --zone=public --add-port=3000/tcp --permanent
firewall-cmd --reload
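Before moving on, it can help to confirm that Grafana is actually answering. The following is a minimal sketch, not part of the original setup: the function name grafana_is_healthy is made up for illustration, and it assumes curl is installed and Grafana listens on its default port 3000.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Grafana): succeeds when the
# health endpoint reports a working database connection.
grafana_is_healthy() {
    host="${1:-localhost}"
    port="${2:-3000}"
    # /api/health is Grafana's unauthenticated health-check endpoint.
    curl -fsS "http://${host}:${port}/api/health" |
        grep -q '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage: grafana_is_healthy && echo "Grafana is up"
```

If the check fails right after the service starts, retrying a few seconds later is usually enough, since Grafana needs a moment to initialize.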

Install Prometheus

Go to the official download page and download the latest version of Prometheus

[root@grafana grafana]# cd /opt
[root@grafana opt]# wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
[root@grafana opt]# tar zxvf ./prometheus-2.53.1.linux-amd64.tar.gz
[root@grafana opt]# mv prometheus-2.53.1.linux-amd64 prometheus-2.53.1
[root@grafana opt]# cd prometheus-2.53.1

To run it directly without a system service: ./prometheus --config.file=prometheus.yml &

Create the system service

[root@grafana prometheus-2.53.1]# useradd --no-create-home --shell /bin/false prometheus
[root@grafana prometheus-2.53.1]# chown -R prometheus:prometheus /opt/prometheus-2.53.1

[root@grafana prometheus-2.53.1]# cat > /usr/lib/systemd/system/prometheus.service <<-'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus-2.53.1/prometheus \
    --config.file /opt/prometheus-2.53.1/prometheus.yml \
    --storage.tsdb.path /opt/prometheus-2.53.1/data \
    --web.console.templates=/opt/prometheus-2.53.1/consoles \
    --web.console.libraries=/opt/prometheus-2.53.1/console_libraries \
    --web.listen-address=:9090

[Install]
WantedBy=multi-user.target
EOF
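Since the unit file points at prometheus.yml, it is worth validating that file before every start or restart. The sketch below wraps promtool, which ships in the same tarball as prometheus. The PROM_DIR variable and the function name are illustrative assumptions, not something the original article defines.

```shell
#!/bin/sh
# Assumed install location, matching the paths used in this article.
PROM_DIR="${PROM_DIR:-/opt/prometheus-2.53.1}"

# Hypothetical wrapper: "promtool check config" exits non-zero when the
# file has syntax errors or references rule files that fail to load.
validate_prometheus_config() {
    "${PROM_DIR}/promtool" check config "${PROM_DIR}/prometheus.yml"
}

# Usage: validate_prometheus_config && systemctl restart prometheus
```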

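Reload systemd so it picks up the new unit file, then start the prometheus service and enable it at boot

[root@grafana prometheus-2.53.1]# systemctl daemon-reload && systemctl enable prometheus && systemctl start prometheus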

Allow the prometheus port through the firewall

firewall-cmd --zone=public --add-port=9090/tcp --permanent
firewall-cmd --reload

Open yourip:9090 in a browser
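The same check can be scripted from the command line. This is a rough sketch that polls the /-/ready endpoint of Prometheus until it answers. The function name wait_for_prometheus and the PROM_WAIT_SECONDS variable are hypothetical, and curl is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: poll the readiness endpoint until it answers or
# the attempts run out. /-/ready returns HTTP 200 once Prometheus is
# able to serve traffic.
wait_for_prometheus() {
    url="${1:-http://localhost:9090}/-/ready"
    attempts="${2:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -fsS -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        # Pause between attempts (2 seconds unless overridden).
        sleep "${PROM_WAIT_SECONDS:-2}"
    done
    return 1
}

# Usage: wait_for_prometheus http://localhost:9090 15 && echo "Prometheus is ready"
```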

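Install the Node Exporter client and display its data in a Grafana dashboard

Prometheus collects monitoring data by having a job pull (scrape) metrics from each configured target (a monitored server or other collection endpoint). On the target side, the node exporter service generates and exposes those metrics. In this example, Node Exporter is installed on the same machine so that Prometheus can scrape it as a demonstration.

Go to the official download page and download the latest version of node exporter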

[root@grafana prometheus-2.53.1]# cd /opt
[root@grafana opt]# wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
[root@grafana opt]# tar zxvf ./node_exporter-1.8.2.linux-amd64.tar.gz
[root@grafana opt]# mv node_exporter-1.8.2.linux-amd64 node_exporter-1.8.2
[root@grafana opt]# cd node_exporter-1.8.2

To run it directly without a system service: ./node_exporter &

Create the system service

[root@grafana node_exporter-1.8.2]# cat > /usr/lib/systemd/system/node-exporter.service <<-'EOF'
[Unit]
Description=This is prometheus node exporter
After=docker.service

[Service]
Type=simple
ExecStart=/opt/node_exporter-1.8.2/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the node-exporter system service and enable it at boot

[root@grafana node_exporter-1.8.2]# systemctl daemon-reload && systemctl enable node-exporter.service && systemctl start node-exporter.service

Allow the node-exporter port through the firewall

firewall-cmd --zone=public --add-port=9100/tcp --permanent
firewall-cmd --reload
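Node exporter serves its data as plain text on /metrics, so a quick way to see whether it is producing anything is to count the node_* samples. A small sketch, assuming curl is installed and the exporter runs on the default port 9100 (the function name is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: count the node_* samples exposed by node exporter.
count_node_metrics() {
    url="${1:-http://localhost:9100/metrics}"
    # HELP/TYPE lines start with '#', so only real samples match '^node_'.
    curl -fsS "$url" | grep -c '^node_'
}

# Usage: count_node_metrics
```

A healthy exporter should print a number well above zero. A count of 0, or a curl error, suggests the service is not running or the port is blocked.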

Add the monitored node-exporter host to prometheus.yml, under scrape_configs

  - job_name: 'grafana_node_exporter'
    static_configs:
      - targets: ['192.168.88.98:9100']

Restart the prometheus service

[root@grafana node_exporter-1.8.2]# systemctl restart prometheus
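After the restart, the Prometheus HTTP API can confirm that the new job is being scraped. The sketch below queries the built-in up metric for one job. The function name job_is_up is hypothetical, and the grep pattern assumes the compact JSON that the query endpoint normally returns.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when at least one target of the given
# job reports up == 1.
job_is_up() {
    job="$1"
    base="${2:-http://localhost:9090}"
    # -G sends the url-encoded data as a GET query string to /api/v1/query.
    curl -fsS -G "${base}/api/v1/query" \
        --data-urlencode "query=up{job=\"${job}\"}" |
        grep -q '"value":\[[0-9.]*,"1"]'
}

# Usage: job_is_up grafana_node_exporter && echo "target is being scraped"
```

The same information is visible in the web UI under Status > Targets. Note that the first scrape may take up to one scrape interval after the restart.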

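In Grafana, add this Prometheus server (which now scrapes the node-exporter) as a data source and name it "prometheus-grafana" so that dashboards can use it later

At the lower right of the screen, import the dashboard you want to use: "Node Exporter Dashboard 20240520 通用分組版"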
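As an alternative to clicking through the UI, Grafana can also load data sources from provisioning files at startup. The following sketch writes such a file. It assumes the RPM's default provisioning directory /etc/grafana/provisioning/datasources and a Prometheus reachable at http://localhost:9090. The function name and the file name prometheus.yaml are illustrative choices.

```shell
#!/bin/sh
# Hypothetical helper: write a Grafana data source provisioning file.
# Grafana reads the YAML files in provisioning/datasources when it starts.
write_prometheus_datasource() {
    dir="${1:-/etc/grafana/provisioning/datasources}"
    url="${2:-http://localhost:9090}"
    cat > "${dir}/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: prometheus-grafana
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}

# Usage: write_prometheus_datasource && systemctl restart grafana-server
```

The data source name matches the "prometheus-grafana" name used in this article, so an imported dashboard can select it the same way as one created by hand.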

Install Alertmanager

Go to the official download page and download the latest version of alertmanager

[root@grafana prometheus-2.53.1]# cd /opt
[root@grafana opt]# wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
[root@grafana opt]# tar zxvf ./alertmanager-0.27.0.linux-amd64.tar.gz
[root@grafana opt]# mv alertmanager-0.27.0.linux-amd64 alertmanager-0.27.0
[root@grafana opt]# mkdir -p /opt/alertmanager-0.27.0/data

Create the system service

[root@grafana opt]# cat >/usr/lib/systemd/system/alertmanager.service<<EOF
[Unit]
Description=alertmanager

[Service]
WorkingDirectory=/opt/alertmanager-0.27.0
ExecStart=/opt/alertmanager-0.27.0/alertmanager --config.file=/opt/alertmanager-0.27.0/alertmanager.yml --storage.path=/opt/alertmanager-0.27.0/data --web.listen-address=:9093 --data.retention=120h
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Allow the alertmanager port through the firewall

firewall-cmd --zone=public --add-port=9093/tcp --permanent
firewall-cmd --reload

Start the alertmanager service and enable it at boot

[root@grafana opt]# systemctl daemon-reload
[root@grafana opt]# systemctl enable alertmanager --now

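Open yourip:9093 in a browser

Create an alerting rule. The following uses the official example:
groups is the root key of the YAML file; if any target cannot be scraped for more than 5 minutes, an alert fires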

[root@grafana opt]# nano /opt/prometheus-2.53.1/alert.yml

groups:
  - name: example
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
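Rule files are easy to break with an indentation slip, so a syntax check before loading them is cheap insurance. This sketch uses promtool's check rules subcommand. PROM_DIR and the function name are assumptions that follow the paths used in this article.

```shell
#!/bin/sh
# Assumed install location, matching the paths used in this article.
PROM_DIR="${PROM_DIR:-/opt/prometheus-2.53.1}"

# Hypothetical wrapper: "promtool check rules" parses the rule file and
# exits non-zero if the YAML or any expression is invalid.
check_alert_rules() {
    "${PROM_DIR}/promtool" check rules "${1:-${PROM_DIR}/alert.yml}"
}

# Usage: check_alert_rules
```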

Integrate Alertmanager alerting into Prometheus

[root@grafana opt]# nano /opt/prometheus-2.53.1/prometheus.yml

...
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.88.98:9093"]
          # - alertmanager:9093
...
......

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - alert.yml
...
......

[root@grafana opt]# systemctl restart prometheus

Confirm on the Prometheus page that the alerting rule has been loaded

Send alerts via Gmail or Telegram and verify the result

Edit the Alertmanager configuration file

[root@grafana opt]# nano /opt/alertmanager-0.27.0/alertmanager.yml

global:
  resolve_timeout: 1m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'sender@gmail.com'
  smtp_auth_username: 'sender@gmail.com'
  smtp_auth_password: 'zhxxlvxxixxxbxxt'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
    - send_resolved: true
      to: 'receiver@168.com'
      headers:
        subject: 'Prometheus Mail Alerts'
  - name: 'telegram'
    telegram_configs:
    - api_url: 'https://api.telegram.org'
      bot_token: '5x50xx54:AAxxxxuttbxfLCfBkgxxyGsZIu-F_xxFuxx'
      chat_id: -6xx22xx12
      disable_notifications: false
...
......
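The alertmanager tarball also contains amtool, which can validate this file and show which receiver a given set of labels would be routed to. The sketch below wraps both commands. AM_DIR and the two function names are illustrative assumptions.

```shell
#!/bin/sh
# Assumed install location, matching the paths used in this article.
AM_DIR="${AM_DIR:-/opt/alertmanager-0.27.0}"

# Hypothetical wrapper: parse the config and list the receivers found.
check_alertmanager_config() {
    "${AM_DIR}/amtool" check-config "${1:-${AM_DIR}/alertmanager.yml}"
}

# Hypothetical wrapper: print the receiver(s) that an alert carrying the
# given labels would be routed to, without sending anything.
route_for_alert() {
    "${AM_DIR}/amtool" config routes test \
        --config.file="${AM_DIR}/alertmanager.yml" "$@"
}

# Usage:
#   check_alertmanager_config
#   route_for_alert alertname=InstanceDown severity=page
```

With the configuration above, the route test is expected to name the email receiver, because the top-level route sends everything there. Changing route.receiver to 'telegram' should change the answer accordingly.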

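Relaying mail through Gmail requires first setting up a Google "app password"; that value goes in smtp_auth_password

Telegram alerts require first requesting a dedicated bot from BotFather, then filling the bot token and the alert chat ID into "telegram_configs"

Alertmanager used to need an external plugin to send Telegram alerts. Since v0.24 this is a built-in feature, and this article uses the built-in method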
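One way to find the chat ID is to send any message in the target chat and then ask the Telegram Bot API for the bot's recent updates. This is a rough sketch: the function name is hypothetical, and the grep pattern assumes the compact JSON layout in which "id" is the first field of each "chat" object, which is how the API usually responds.

```shell
#!/bin/sh
# Hypothetical helper: list the chat ids visible in the bot's pending
# updates, using the Bot API's getUpdates method.
list_telegram_chat_ids() {
    token="$1"
    curl -fsS "https://api.telegram.org/bot${token}/getUpdates" |
        grep -o '"chat":{"id":-\{0,1\}[0-9]*' |
        sed 's/.*"id"://' |
        sort -u
}

# Usage: list_telegram_chat_ids "$BOT_TOKEN"
```

Group chat IDs are negative numbers, which is why the chat_id in the configuration above starts with a minus sign. If nothing is printed, the bot has probably not received a message in that chat yet.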

Restart the service

[root@grafana opt]# systemctl restart alertmanager

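Shut down one of the monitored targets (a MinIO machine in this example) and confirm that Alertmanager sends the alert as expected

If the telegram receiver is selected, the alert arrives in Telegram

The Alertmanager web UI shows the alert that was sent successfully

If the telegram receiver is selected, the web UI shows the alert sent successfully via Telegram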
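The notification path can also be exercised without shutting down a real machine, by posting a synthetic alert to Alertmanager's v2 API. A minimal sketch follows. The function name, the TestAlert alert name, and the manual-test instance label are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: push a synthetic alert to Alertmanager so the
# email or Telegram delivery can be tested on its own.
send_test_alert() {
    base="${1:-http://localhost:9093}"
    curl -fsS -X POST "${base}/api/v2/alerts" \
        -H 'Content-Type: application/json' \
        -d '[{"labels":{"alertname":"TestAlert","severity":"page","instance":"manual-test"},"annotations":{"summary":"Manual test alert"}}]'
}

# Usage: send_test_alert http://192.168.88.98:9093
```

With group_wait set to 30s as above, the notification should arrive roughly half a minute after the call. Because the alert is not re-sent, Alertmanager treats it as resolved once resolve_timeout has passed.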

This article draws on the following links:

Alertmanager configuration

Alertmanager with Slack, PagerDuty, and Gmail

tomy

A systems engineer from Taiwan who has always been passionate about learning, building, applying, and sharing open source technologies.
