
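RabbitMQ Alert Rule Configuration

Overview

Alert rules define the conditions and logic that trigger an alert. Well-tuned rules identify real problems while keeping false positives and missed alerts low. This article covers how to configure RabbitMQ alert rules, along with rule templates and best practices.

Core Concepts

Alert Rule Structure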

yaml
alert_name:
  condition: trigger condition
  duration: how long the condition must hold
  severity: severity level
  labels: labels
  annotations: annotations

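Rule Types

| Type | Description | Example |
| --- | --- | --- |
| Threshold rule | Fires when a value crosses a threshold | Memory usage > 80% |
| Trend rule | Fires on a change in trend | Message backlog keeps growing |
| State rule | Fires on a state change | Node goes from running to stopped |
| Combined rule | Fires when several conditions hold together | High memory and a message backlog |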
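
The four rule types can each be written as a small predicate. The following is a minimal sketch, not part of any RabbitMQ or Prometheus API; the function names, the 80% limit and the `running`/`stopped` state names are assumptions chosen for illustration.

```php
<?php

// Threshold rule: fires when a single value exceeds a fixed limit.
function thresholdRule(float $value, float $limit): bool
{
    return $value > $limit;
}

// Trend rule: fires when every sample is larger than the one before it.
function trendRule(array $samples): bool
{
    if (count($samples) < 2) {
        return false;
    }
    for ($i = 1; $i < count($samples); $i++) {
        if ($samples[$i] <= $samples[$i - 1]) {
            return false;
        }
    }
    return true;
}

// State rule: fires on one specific transition between two states.
function stateRule(string $previous, string $current): bool
{
    return $previous === 'running' && $current === 'stopped';
}

// Combined rule: fires only when both conditions hold at the same time.
function combinedRule(float $memoryUsage, array $queueSamples): bool
{
    return thresholdRule($memoryUsage, 80.0) && trendRule($queueSamples);
}
```

For example, `combinedRule(85.0, [100, 200, 300])` is true, while the same backlog with 70% memory usage is not.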

Alert State Transitions

Normal -> Pending -> Firing -> Resolved -> Normal
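
The transition chain can be sketched as a single function that maps the current state to the next one. This is an illustrative sketch under stated assumptions: the function name, the lowercase state strings and the `$heldFor`/`$duration` parameters (both in seconds) are hypothetical, and a rule only leaves `pending` once its condition has held for the full duration.

```php
<?php

// Returns the next alert state, given the current state, whether the rule
// condition holds right now, and how long it has held so far.
function nextAlertState(string $state, bool $conditionMet, int $heldFor, int $duration): string
{
    switch ($state) {
        case 'normal':
        case 'resolved':
            // A resolved alert falls back to normal on the next evaluation,
            // or starts a new pending period if the condition holds again.
            return $conditionMet ? 'pending' : 'normal';
        case 'pending':
            if (!$conditionMet) {
                return 'normal';
            }
            return $heldFor >= $duration ? 'firing' : 'pending';
        case 'firing':
            return $conditionMet ? 'firing' : 'resolved';
        default:
            throw new InvalidArgumentException("Unknown state: {$state}");
    }
}
```

A condition that clears while the alert is still pending returns straight to normal without sending any notification, which is what filters out short spikes.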

Configuration Examples

Prometheus Alert Rules

yaml
groups:
  - name: rabbitmq_alerts
    interval: 30s
    rules:
      - alert: RabbitMQNodeDown
        expr: rabbitmq_up == 0
        for: 1m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "RabbitMQ node down"
          description: "RabbitMQ node {{ $labels.instance }} has been down for more than 1 minute"
          runbook_url: "https://wiki.example.com/rabbitmq/node-down"

      - alert: RabbitMQMemoryHigh
        expr: (rabbitmq_memory_used_bytes / rabbitmq_memory_limit_bytes) * 100 > 80
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "RabbitMQ memory usage high"
          description: "Node {{ $labels.instance }} memory usage is {{ $value | printf \"%.2f\" }}%"
          runbook_url: "https://wiki.example.com/rabbitmq/memory-high"

      - alert: RabbitMQMemoryCritical
        expr: (rabbitmq_memory_used_bytes / rabbitmq_memory_limit_bytes) * 100 > 90
        for: 2m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "RabbitMQ memory usage critically high"
          description: "Node {{ $labels.instance }} memory usage is {{ $value | printf \"%.2f\" }}% and is about to trigger flow control"

      - alert: RabbitMQMemoryAlarm
        expr: rabbitmq_mem_alarm == 1
        for: 0m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "RabbitMQ memory alarm triggered"
          description: "Node {{ $labels.instance }} has raised a memory alarm; all publishers will be blocked"

      - alert: RabbitMQDiskSpaceLow
        expr: rabbitmq_disk_free_bytes / 1024 / 1024 / 1024 < 10
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "RabbitMQ disk space low"
          description: "Node {{ $labels.instance }} has {{ $value | printf \"%.2f\" }} GB of free disk space left"

      - alert: RabbitMQDiskSpaceCritical
        expr: rabbitmq_disk_free_bytes / 1024 / 1024 / 1024 < 5
        for: 1m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "RabbitMQ disk space critically low"
          description: "Node {{ $labels.instance }} has {{ $value | printf \"%.2f\" }} GB of free disk space left; act immediately"

      - alert: RabbitMQDiskAlarm
        expr: rabbitmq_disk_free_alarm == 1
        for: 0m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "RabbitMQ disk alarm triggered"
          description: "Node {{ $labels.instance }} has raised a disk alarm; all publishers will be blocked"

      - alert: RabbitMQQueueMessagesHigh
        expr: rabbitmq_queue_messages > 50000
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "Queue backlog high"
          description: "Queue {{ $labels.queue }} holds {{ $value }} messages; consumers may be a bottleneck"

      - alert: RabbitMQQueueMessagesCritical
        expr: rabbitmq_queue_messages > 100000
        for: 3m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "Queue backlog critical"
          description: "Queue {{ $labels.queue }} holds {{ $value }} messages; act immediately"

      - alert: RabbitMQNoConsumer
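        # With "and", the result carries the left-hand value, so the message
        # count goes first to make {{ $value }} report the number of messages.
        expr: rabbitmq_queue_messages > 1000 and rabbitmq_queue_consumers == 0
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "Queue has no consumers"
          description: "Queue {{ $labels.queue }} has {{ $value }} messages but no consumers"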

      - alert: RabbitMQConnectionHigh
        expr: rabbitmq_connections > 800
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "Too many connections"
          description: "Node {{ $labels.instance }} has {{ $value }} connections; consider tuning connection pooling"

      - alert: RabbitMQConnectionCritical
        expr: rabbitmq_connections > 950
        for: 2m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "Connection count near limit"
          description: "Node {{ $labels.instance }} has {{ $value }} connections and is about to reach the limit"

      - alert: RabbitMQFileDescriptorsHigh
        expr: (rabbitmq_fd_used / rabbitmq_fd_total) * 100 > 80
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "File descriptor usage high"
          description: "Node {{ $labels.instance }} file descriptor usage is {{ $value | printf \"%.2f\" }}%"

      - alert: RabbitMQFileDescriptorsCritical
        expr: (rabbitmq_fd_used / rabbitmq_fd_total) * 100 > 90
        for: 2m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "File descriptor usage critically high"
          description: "Node {{ $labels.instance }} file descriptor usage is {{ $value | printf \"%.2f\" }}%"

      - alert: RabbitMQClusterPartition
        expr: rabbitmq_partitions > 0
        for: 0m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "Cluster network partition"
          description: "Node {{ $labels.instance }} detected a network partition; act immediately"

      - alert: RabbitMQQueueGrowing
        expr: deriv(rabbitmq_queue_messages[5m]) > 10
        for: 10m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "Queue keeps growing"
          description: "Queue {{ $labels.queue }} keeps growing at {{ $value | printf \"%.2f\" }} messages/second"

      - alert: RabbitMQConsumerLag
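        # Ready messages are the ones still waiting for a consumer. Unacked
        # messages are a separate set, so subtracting them does not measure lag.
        expr: rabbitmq_queue_messages_ready > 10000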
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "Consumer lag high"
          description: "Queue {{ $labels.queue }} consumer lag is {{ $value }} messages"

      - alert: RabbitMQMessageRedeliverHigh
        expr: (rate(rabbitmq_messages_redelivered_total[5m]) / rate(rabbitmq_messages_delivered_total[5m])) * 100 > 10
        for: 5m
        labels:
          severity: warning
          service: rabbitmq
        annotations:
          summary: "Message redelivery rate high"
          description: "Message redelivery rate is {{ $value | printf \"%.2f\" }}%; consumers may be failing"

      - alert: RabbitMQNodeNotRunning
        expr: rabbitmq_node_running == 0
        for: 1m
        labels:
          severity: critical
          service: rabbitmq
        annotations:
          summary: "RabbitMQ node not running"
          description: "Node {{ $labels.node }} is not in the running state"
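
The `RabbitMQQueueGrowing` rule relies on `deriv()`, which Prometheus computes as a per-second slope using simple linear regression over the samples in the range. The sketch below reproduces that calculation so a threshold such as `> 10` can be sanity-checked against sample data. The function name and the `[timestamp, value]` sample format are assumptions; returning `0.0` for fewer than two samples is a simplification, since Prometheus returns no result in that case.

```php
<?php

// Least-squares slope of [timestampSeconds, value] samples, in units per second.
function regressionSlope(array $samples): float
{
    $n = count($samples);
    if ($n < 2) {
        return 0.0; // simplification: not enough data to fit a line
    }

    $t0 = $samples[0][0]; // shift timestamps to keep the sums small
    $sumT = $sumV = $sumTV = $sumTT = 0.0;
    foreach ($samples as [$t, $v]) {
        $t -= $t0;
        $sumT += $t;
        $sumV += $v;
        $sumTV += $t * $v;
        $sumTT += $t * $t;
    }

    $denominator = $n * $sumTT - $sumT * $sumT;
    if ($denominator == 0.0) {
        return 0.0; // all samples share one timestamp
    }

    return ($n * $sumTV - $sumT * $sumV) / $denominator;
}
```

A queue that grows from 100 to 1900 messages over 90 seconds in even steps has a slope of 20 messages per second, which would satisfy the `> 10` condition.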

PHP Alert Rule Engine

php
<?php

class AlertRuleEngine
{
    private $rules;
    private $stateStore;
    
    public function __construct($stateStore = null)
    {
        $this->rules = $this->loadDefaultRules();
        $this->stateStore = $stateStore ?: new FileStateStore('/tmp/alert_states.json');
    }
    
    private function loadDefaultRules()
    {
        return [
            // Each condition falls back to a safe default when a metric is
            // missing, so incomplete metric arrays do not raise warnings.
            'node_down' => [
                'name' => 'Node down',
                'condition' => function($metrics) {
                    // A missing status is treated the same as a down node.
                    return empty($metrics['node_status']);
                },
                'duration' => 60,
                'severity' => 'critical',
                'message' => 'The RabbitMQ node is down',
                'runbook' => 'https://wiki.example.com/rabbitmq/node-down',
            ],

            'memory_high' => [
                'name' => 'Memory usage high',
                'condition' => function($metrics) {
                    return ($metrics['memory_usage'] ?? 0) > 80;
                },
                'duration' => 300,
                'severity' => 'warning',
                'message' => 'Memory usage is above 80%',
                'runbook' => 'https://wiki.example.com/rabbitmq/memory-high',
            ],

            'memory_critical' => [
                'name' => 'Memory usage critically high',
                'condition' => function($metrics) {
                    return ($metrics['memory_usage'] ?? 0) > 90;
                },
                'duration' => 120,
                'severity' => 'critical',
                'message' => 'Memory usage is above 90% and is about to trigger flow control',
                'runbook' => 'https://wiki.example.com/rabbitmq/memory-critical',
            ],

            'memory_alarm' => [
                'name' => 'Memory alarm triggered',
                'condition' => function($metrics) {
                    return ($metrics['memory_alarm'] ?? false) === true;
                },
                'duration' => 0,
                'severity' => 'critical',
                'message' => 'A memory alarm has been raised; all publishers will be blocked',
                'runbook' => 'https://wiki.example.com/rabbitmq/memory-alarm',
            ],

            'disk_low' => [
                'name' => 'Disk space low',
                'condition' => function($metrics) {
                    return ($metrics['disk_free'] ?? PHP_INT_MAX) < 10;
                },
                'duration' => 300,
                'severity' => 'warning',
                'message' => 'Less than 10 GB of disk space is free',
                'runbook' => 'https://wiki.example.com/rabbitmq/disk-low',
            ],

            'disk_critical' => [
                'name' => 'Disk space critically low',
                'condition' => function($metrics) {
                    return ($metrics['disk_free'] ?? PHP_INT_MAX) < 5;
                },
                'duration' => 60,
                'severity' => 'critical',
                'message' => 'Less than 5 GB of disk space is free; act immediately',
                'runbook' => 'https://wiki.example.com/rabbitmq/disk-critical',
            ],

            'disk_alarm' => [
                'name' => 'Disk alarm triggered',
                'condition' => function($metrics) {
                    return ($metrics['disk_alarm'] ?? false) === true;
                },
                'duration' => 0,
                'severity' => 'critical',
                'message' => 'A disk alarm has been raised; all publishers will be blocked',
                'runbook' => 'https://wiki.example.com/rabbitmq/disk-alarm',
            ],

            'queue_messages_high' => [
                'name' => 'Queue backlog',
                'condition' => function($metrics) {
                    return ($metrics['queue_messages'] ?? 0) > 50000;
                },
                'duration' => 300,
                'severity' => 'warning',
                'message' => 'The queue backlog exceeds 50000 messages',
                'runbook' => 'https://wiki.example.com/rabbitmq/queue-backlog',
            ],

            'queue_messages_critical' => [
                'name' => 'Queue backlog critical',
                'condition' => function($metrics) {
                    return ($metrics['queue_messages'] ?? 0) > 100000;
                },
                'duration' => 180,
                'severity' => 'critical',
                'message' => 'The queue backlog exceeds 100000 messages; act immediately',
                'runbook' => 'https://wiki.example.com/rabbitmq/queue-backlog-critical',
            ],

            'no_consumer' => [
                'name' => 'Queue has no consumers',
                'condition' => function($metrics) {
                    return ($metrics['queue_consumer_count'] ?? null) === 0
                        && ($metrics['queue_messages'] ?? 0) > 1000;
                },
                'duration' => 300,
                'severity' => 'warning',
                'message' => 'The queue has messages but no consumers',
                'runbook' => 'https://wiki.example.com/rabbitmq/no-consumer',
            ],

            'connection_high' => [
                'name' => 'Too many connections',
                'condition' => function($metrics) {
                    return ($metrics['connections_total'] ?? 0) > 800;
                },
                'duration' => 300,
                'severity' => 'warning',
                'message' => 'The connection count exceeds 800',
                'runbook' => 'https://wiki.example.com/rabbitmq/connection-high',
            ],

            'fd_high' => [
                'name' => 'File descriptor usage high',
                'condition' => function($metrics) {
                    return ($metrics['fd_usage'] ?? 0) > 80;
                },
                'duration' => 300,
                'severity' => 'warning',
                'message' => 'File descriptor usage is above 80%',
                'runbook' => 'https://wiki.example.com/rabbitmq/fd-high',
            ],

            'cluster_partition' => [
                'name' => 'Cluster network partition',
                'condition' => function($metrics) {
                    return ($metrics['cluster_partition'] ?? false) === true;
                },
                'duration' => 0,
                'severity' => 'critical',
                'message' => 'A cluster network partition was detected',
                'runbook' => 'https://wiki.example.com/rabbitmq/partition',
            ],
        ];
    }
    
    public function addRule($name, array $rule)
    {
        $this->rules[$name] = $rule;
    }
    
    public function removeRule($name)
    {
        unset($this->rules[$name]);
    }
    
    public function evaluate(array $metrics)
    {
        $alerts = [];
        $now = time();

        foreach ($this->rules as $ruleName => $rule) {
            $conditionMet = (bool) $rule['condition']($metrics);
            $state = $this->stateStore->get($ruleName);
            $status = $state['status'] ?? null;

            if (!$conditionMet) {
                // Firing -> Resolved: notify once, then drop the state so the
                // rule is back to Normal and can fire again later.
                if ($status === 'firing') {
                    $alerts[] = $this->createAlert($ruleName, $rule, $metrics, 'resolved');
                }
                if ($state !== null) {
                    $this->stateStore->remove($ruleName);
                }
                continue;
            }

            if ($status === 'firing') {
                // Still firing: keep reporting the alert.
                $alerts[] = $this->createAlert($ruleName, $rule, $metrics, 'firing');
                continue;
            }

            // Normal -> Pending: remember when the condition first held.
            // A leftover "resolved" entry is treated like no state at all.
            $isNew = $status !== 'pending';
            if ($isNew) {
                $state = [
                    'status' => 'pending',
                    'started_at' => $now,
                    'first_value' => $metrics,
                ];
            }

            // Pending -> Firing once the condition has held for the full
            // duration; a duration of 0 fires on the first evaluation.
            $shouldFire = ($now - $state['started_at']) >= $rule['duration'];
            if ($shouldFire) {
                $state['status'] = 'firing';
                $alerts[] = $this->createAlert($ruleName, $rule, $metrics, 'firing');
            }

            if ($isNew || $shouldFire) {
                $this->stateStore->set($ruleName, $state);
            }
        }

        return $alerts;
    }

    private function createAlert($ruleName, $rule, $metrics, $status)
    {
        return [
            'rule_name' => $ruleName,
            'name' => $rule['name'],
            'status' => $status,
            'severity' => $rule['severity'],
            'message' => $rule['message'],
            // Rules added at runtime may not define a runbook.
            'runbook' => $rule['runbook'] ?? null,
            'metrics' => $metrics,
            'timestamp' => date('Y-m-d H:i:s'),
        ];
    }
    
    public function getRuleStates()
    {
        return $this->stateStore->getAll();
    }
    
    public function getRules()
    {
        return $this->rules;
    }
}

class FileStateStore
{
    private $file;
    
    public function __construct($file)
    {
        $this->file = $file;
    }
    
    public function get($key)
    {
        $data = $this->load();
        return $data[$key] ?? null;
    }
    
    public function set($key, $value)
    {
        $data = $this->load();
        $data[$key] = $value;
        $this->save($data);
    }
    
    public function remove($key)
    {
        $data = $this->load();
        unset($data[$key]);
        $this->save($data);
    }
    
    public function getAll()
    {
        return $this->load();
    }
    
    private function load()
    {
        if (file_exists($this->file)) {
            return json_decode(file_get_contents($this->file), true) ?: [];
        }
        return [];
    }
    
    private function save($data)
    {
        $dir = dirname($this->file);
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true);
        }
        // LOCK_EX keeps concurrent evaluations from interleaving their writes.
        file_put_contents($this->file, json_encode($data, JSON_PRETTY_PRINT), LOCK_EX);
    }
}
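
The engine only calls `get`, `set`, `remove` and `getAll` on its state store, so the file-backed store can be swapped for another implementation. The sketch below is a hypothetical in-memory variant (the class name `InMemoryStateStore` is not part of the original code) that may be convenient in unit tests, where writing to `/tmp` is undesirable.

```php
<?php

// In-memory counterpart to a file-backed store: same four methods, no I/O.
// State is lost when the process exits, so this suits tests, not production.
class InMemoryStateStore
{
    private $data = [];

    public function get($key)
    {
        return $this->data[$key] ?? null;
    }

    public function set($key, $value)
    {
        $this->data[$key] = $value;
    }

    public function remove($key)
    {
        unset($this->data[$key]);
    }

    public function getAll()
    {
        return $this->data;
    }
}
```

It would be passed to the constructor in place of the default, for example `new AlertRuleEngine(new InMemoryStateStore())`.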

Zabbix Alert Rules

xml
<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
    <version>5.0</version>
    <templates>
        <template>
            <template>Template RabbitMQ Alerts</template>
            <triggers>
                <trigger>
                    <name>RabbitMQ node down</name>
                    <expression>{Template RabbitMQ:rabbitmq.status[status].last()}=0</expression>
                    <priority>HIGH</priority>
                    <description>The RabbitMQ node is down; check it immediately</description>
                </trigger>

                <trigger>
                    <name>RabbitMQ memory usage high</name>
                    <expression>{Template RabbitMQ:rabbitmq.memory_usage.last()}>80</expression>
                    <priority>WARNING</priority>
                    <description>Memory usage is above 80%</description>
                </trigger>

                <trigger>
                    <name>RabbitMQ memory usage critically high</name>
                    <expression>{Template RabbitMQ:rabbitmq.memory_usage.last()}>90</expression>
                    <priority>HIGH</priority>
                    <description>Memory usage is above 90% and is about to trigger flow control</description>
                </trigger>

                <trigger>
                    <name>RabbitMQ memory alarm triggered</name>
                    <expression>{Template RabbitMQ:rabbitmq.node[mem_alarm].last()}=1</expression>
                    <priority>DISASTER</priority>
                    <description>A memory alarm has been raised; all publishers will be blocked</description>
                </trigger>

                <trigger>
                    <name>RabbitMQ disk space low</name>
                    <expression>{Template RabbitMQ:rabbitmq.disk_free_gb.last()}&lt;10</expression>
                    <priority>WARNING</priority>
                    <description>Less than 10 GB of disk space is free</description>
                </trigger>

                <trigger>
                    <name>RabbitMQ disk alarm triggered</name>
                    <expression>{Template RabbitMQ:rabbitmq.node[disk_alarm].last()}=1</expression>
                    <priority>DISASTER</priority>
                    <description>A disk alarm has been raised; all publishers will be blocked</description>
                </trigger>

                <trigger>
                    <name>RabbitMQ message backlog high</name>
                    <expression>{Template RabbitMQ:rabbitmq.overview[messages_total].last()}>50000</expression>
                    <priority>WARNING</priority>
                    <description>The message backlog exceeds 50000 messages</description>
                </trigger>

                <trigger>
                    <name>RabbitMQ too many connections</name>
                    <expression>{Template RabbitMQ:rabbitmq.overview[connections].last()}>800</expression>
                    <priority>WARNING</priority>
                    <description>The connection count exceeds 800</description>
                </trigger>
            </triggers>
        </template>
    </templates>
</zabbix_export>

Practical Scenarios

Scenario 1: Dynamic Alert Rules

php
<?php

class DynamicAlertRules
{
    private $ruleEngine;
    private $configFile;
    
    public function __construct(AlertRuleEngine $ruleEngine, $configFile)
    {
        $this->ruleEngine = $ruleEngine;
        $this->configFile = $configFile;
    }
    
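    public function loadRulesFromConfig()
    {
        if (!file_exists($this->configFile)) {
            return;
        }
        $config = json_decode(file_get_contents($this->configFile), true);

        // JSON cannot carry closures. This sketch assumes each rule names a
        // "metric", an "operator" (">" or "<") and a "threshold" instead.
        foreach ($config['rules'] ?? [] as $name => $rule) {
            if (!isset($rule['metric'], $rule['threshold'])) {
                continue;
            }
            $rule['condition'] = function ($metrics) use ($rule) {
                $value = $metrics[$rule['metric']] ?? null;
                if ($value === null) {
                    return false;
                }
                return ($rule['operator'] ?? '>') === '<'
                    ? $value < $rule['threshold']
                    : $value > $rule['threshold'];
            };
            $this->ruleEngine->addRule($name, $rule);
        }
    }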
    
    public function addBusinessRule($name, $queuePattern, $threshold, $duration = 300)
    {
        $this->ruleEngine->addRule($name, [
            'name' => "Business alert: {$name}",
            'condition' => function($metrics) use ($queuePattern, $threshold) {
                // Fire as soon as any queue matching the pattern is over the threshold.
                foreach ($metrics['queues'] ?? [] as $queue) {
                    if (fnmatch($queuePattern, $queue['name'])) {
                        if ($queue['messages'] > $threshold) {
                            return true;
                        }
                    }
                }
                return false;
            },
            'duration' => $duration,
            'severity' => 'warning',
            'message' => "A queue matching {$queuePattern} holds more than {$threshold} messages",
            'runbook' => null,
        ]);
    }
    
    public function addTimeBasedRule($name, $baseRule, $timeRanges)
    {
        $originalRule = $this->ruleEngine->getRules()[$baseRule] ?? null;
        
        if (!$originalRule) {
            return false;
        }
        
        $this->ruleEngine->addRule($name, [
            'name' => $originalRule['name'] . ' (time-restricted)',
            'condition' => function($metrics) use ($originalRule, $timeRanges) {
                $hour = (int)date('H');
                $dayOfWeek = (int)date('w');
                
                $inRange = false;
                foreach ($timeRanges as $range) {
                    if ($hour >= $range['start'] && $hour < $range['end']) {
                        if (empty($range['days']) || in_array($dayOfWeek, $range['days'])) {
                            $inRange = true;
                            break;
                        }
                    }
                }
                
                if (!$inRange) {
                    return false;
                }
                
                return $originalRule['condition']($metrics);
            },
            'duration' => $originalRule['duration'],
            'severity' => $originalRule['severity'],
            'message' => $originalRule['message'],
            'runbook' => $originalRule['runbook'] ?? null,
        ]);
        
        return true;
    }
}
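
The time-range check in `addTimeBasedRule` compares `$hour >= start && $hour < end`, so a window such as 22:00 to 06:00 never matches. If overnight windows are needed, one possible approach is to treat a start greater than the end as wrapping past midnight. The helper below is a hypothetical sketch (the name `hourInWindow` is not part of the original code) and treats a window whose start equals its end as empty, which is an assumption.

```php
<?php

// Returns true when $hour (0-23) falls inside [start, end). A window whose
// start is greater than its end is treated as wrapping past midnight.
function hourInWindow(int $hour, int $start, int $end): bool
{
    if ($start === $end) {
        return false; // assumption: an equal start and end means an empty window
    }
    if ($start < $end) {
        return $hour >= $start && $hour < $end;
    }
    // Overnight window, e.g. 22 -> 6 covers 22, 23, 0, 1, ..., 5.
    return $hour >= $start || $hour < $end;
}
```

Note that the `days` filter in the original code checks the current day only, so an overnight window that should count as part of the previous day would need separate handling.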

Scenario 2: Testing Alert Rules

php
<?php

class AlertRuleTester
{
    private $ruleEngine;
    
    public function __construct(AlertRuleEngine $ruleEngine)
    {
        $this->ruleEngine = $ruleEngine;
    }
    
    // Each test case has the form ['metrics' => [...], 'expected' => bool].
    public function testRule($ruleName, array $testCases)
    {
        $results = [];
        $rule = $this->ruleEngine->getRules()[$ruleName] ?? null;

        if (!$rule) {
            return ['error' => 'Rule not found'];
        }

        foreach ($testCases as $caseName => $case) {
            $conditionMet = (bool) $rule['condition']($case['metrics']);

            $results[$caseName] = [
                'metrics' => $case['metrics'],
                'condition_met' => $conditionMet,
                'expected' => $case['expected'],
                'passed' => $conditionMet === $case['expected'],
            ];
        }

        return $results;
    }

    public function testAllRules()
    {
        $rules = $this->ruleEngine->getRules();
        $results = [];

        // Healthy baseline covering every metric the default rules read.
        $baseline = [
            'node_status' => true,
            'memory_usage' => 50,
            'memory_alarm' => false,
            'disk_free' => 50,
            'disk_alarm' => false,
            'queue_messages' => 1000,
            'queue_consumer_count' => 1,
            'connections_total' => 100,
            'fd_usage' => 10,
            'cluster_partition' => false,
        ];

        // Scenario => [metric overrides, names of the rules expected to match].
        $testScenarios = [
            'normal' => [[], []],
            'memory_warning' => [['memory_usage' => 85], ['memory_high']],
            'disk_critical' => [['disk_free' => 3], ['disk_low', 'disk_critical']],
            'queue_backlog' => [['queue_messages' => 80000], ['queue_messages_high']],
        ];

        foreach ($rules as $ruleName => $rule) {
            $cases = [];
            foreach ($testScenarios as $scenario => [$overrides, $expectedRules]) {
                $cases[$scenario] = [
                    'metrics' => array_merge($baseline, $overrides),
                    'expected' => in_array($ruleName, $expectedRules, true),
                ];
            }
            $results[$ruleName] = $this->testRule($ruleName, $cases);
        }

        return $results;
    }
}

Common Problems and Solutions

Problem 1: Conflicting Alert Rules

Symptom: the same metric triggers multiple alerts.

Solution

yaml
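inhibit_rules:
  # Alertmanager configuration: while a critical alert is firing, mute
  # warning alerts for the same service and instance. The warning and
  # critical rules use different alert names, so 'alertname' must not be
  # listed under 'equal'.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service', 'instance']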
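
An inhibit rule only mutes a target alert when the source and target alerts carry the same value for every label listed under `equal`. The sketch below mimics that matching in plain PHP to show why the choice of labels matters. The function names and the alert array shape are assumptions for illustration; this is not Alertmanager's implementation.

```php
<?php

// Builds a comparison key from the labels listed in $equal. A missing label
// is compared as an empty string.
function labelKey(array $alert, array $equal): string
{
    $parts = [];
    foreach ($equal as $label) {
        $parts[] = $alert[$label] ?? '';
    }
    return json_encode($parts);
}

// Drops every warning alert for which a critical alert exists with the same
// values for all labels in $equal.
function inhibitWarnings(array $alerts, array $equal): array
{
    $criticalKeys = [];
    foreach ($alerts as $alert) {
        if ($alert['severity'] === 'critical') {
            $criticalKeys[labelKey($alert, $equal)] = true;
        }
    }

    return array_values(array_filter($alerts, function ($alert) use ($criticalKeys, $equal) {
        return $alert['severity'] !== 'warning'
            || !isset($criticalKeys[labelKey($alert, $equal)]);
    }));
}
```

With a `RabbitMQMemoryHigh` warning and a `RabbitMQMemoryCritical` alert on the same instance, matching on `service` and `instance` drops the warning, whereas matching on `alertname` and `instance` drops nothing because the two alert names differ.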

Problem 2: Alert Flapping

Symptom: an alert fires and resolves repeatedly.

Solution

yaml
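rules:
  - alert: RabbitMQMemoryHigh
    expr: (rabbitmq_memory_used_bytes / rabbitmq_memory_limit_bytes) * 100 > 80
    # Require the condition to hold for 5 minutes before the alert fires.
    for: 5m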
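
A hold time filters out short spikes. A complementary technique, not taken from the configuration above, is hysteresis: the alert turns on above one threshold and only turns off below a lower one. The sketch below illustrates the idea for a custom rule engine; the function name and the 80/75 thresholds are assumptions.

```php
<?php

// Hysteresis: an inactive alert turns on above $high, and an active alert
// stays on until the value drops below $low. Values oscillating between the
// two thresholds therefore do not flip the state.
function applyHysteresis(bool $active, float $value, float $high, float $low): bool
{
    if ($active) {
        return $value >= $low;
    }
    return $value > $high;
}
```

For the readings 78, 82, 79, 81, 76, 74 with `$high = 80` and `$low = 75`, the state changes twice (on at 82, off at 74), whereas a plain `> 80` check would change state four times.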

Problem 3: Too Many Alert Rules

Symptom: the rules become hard to manage and maintain.

Solution

php
class RuleGroupManager
{
    public function groupRules($rules)
    {
        $groups = [];
        
        foreach ($rules as $name => $rule) {
            $category = $this->categorizeRule($name);
            $groups[$category][$name] = $rule;
        }
        
        return $groups;
    }
    
    // Assigns a category from keywords in the rule name; the first match wins.
    private function categorizeRule($name)
    {
        if (strpos($name, 'memory') !== false || strpos($name, 'disk') !== false
            || strpos($name, 'fd') !== false) {
            return 'resource';
        }
        if (strpos($name, 'queue') !== false || strpos($name, 'message') !== false
            || strpos($name, 'consumer') !== false) {
            return 'queue';
        }
        if (strpos($name, 'connection') !== false || strpos($name, 'channel') !== false) {
            return 'connection';
        }
        if (strpos($name, 'cluster') !== false || strpos($name, 'node') !== false) {
            return 'cluster';
        }
        return 'other';
    }
}

Best Practices

1. Rule Naming Convention

<service>_<category>_<condition>_<severity>
rabbitmq_memory_high_warning
rabbitmq_disk_low_critical
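
A convention like this is easier to keep when it is checked automatically. The following sketch validates a rule name with a regular expression. The function name is hypothetical, and the allowed severities (`info`, `warning`, `critical`) and the lowercase alphanumeric segments are assumptions derived from the examples above.

```php
<?php

// Checks a rule name against <service>_<category>_<condition>_<severity>.
// At least two segments are required between the service and the severity;
// extra segments are allowed so categories such as "queue_messages" still pass.
function isValidRuleName(string $name): bool
{
    $pattern = '/^[a-z][a-z0-9]*(_[a-z0-9]+){2,}_(info|warning|critical)$/';
    return preg_match($pattern, $name) === 1;
}
```

A check like this could run in CI against the keys of the rule set, so that a name missing its severity suffix is caught before deployment.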

2. Duration Settings

| Severity | Recommended duration |
| --- | --- |
| Critical | 0-2 minutes |
| Warning | 3-5 minutes |
| Info | 5-10 minutes |

3. Document Your Rules

yaml
- alert: RabbitMQMemoryHigh
  expr: (rabbitmq_memory_used_bytes / rabbitmq_memory_limit_bytes) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
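    summary: "RabbitMQ memory usage high"
    description: "Node {{ $labels.instance }} memory usage is {{ $value }}%"
    runbook_url: "https://wiki.example.com/rabbitmq/memory-high"
    impact: "May trigger flow control and reduce message throughput"
    action: "Inspect memory usage; consider adding memory or optimizing message handling"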

Related Links