Appearance
RabbitMQ 告警规则配置
概述
告警规则定义了触发告警的条件和逻辑。合理配置告警规则可以准确识别问题,避免误报和漏报。本文将详细介绍 RabbitMQ 告警规则的配置方法、规则模板和最佳实践。
核心知识点
告警规则结构
yaml
alert_name:
condition: 触发条件
duration: 持续时间
severity: 严重程度
labels: 标签
annotations: 注释规则类型
| 类型 | 说明 | 示例 |
|---|---|---|
| 阈值规则 | 超过阈值触发 | 内存使用率 > 80% |
| 趋势规则 | 趋势变化触发 | 消息堆积持续增长 |
| 状态规则 | 状态变化触发 | 节点从运行变为停止 |
| 组合规则 | 多条件组合触发 | 内存高且消息堆积 |
告警状态流转
Normal -> Pending -> Firing -> Resolved -> Normal配置示例
Prometheus 告警规则
yaml
groups:
- name: rabbitmq_alerts
interval: 30s
rules:
- alert: RabbitMQNodeDown
expr: rabbitmq_up == 0
for: 1m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "RabbitMQ 节点宕机"
description: "RabbitMQ 节点 {{ $labels.instance }} 已宕机超过 1 分钟"
runbook_url: "https://wiki.example.com/rabbitmq/node-down"
- alert: RabbitMQMemoryHigh
expr: (rabbitmq_memory_used_bytes / rabbitmq_memory_limit_bytes) * 100 > 80
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "RabbitMQ 内存使用率过高"
description: "节点 {{ $labels.instance }} 内存使用率 {{ $value | printf \"%.2f\" }}%"
runbook_url: "https://wiki.example.com/rabbitmq/memory-high"
- alert: RabbitMQMemoryCritical
expr: (rabbitmq_memory_used_bytes / rabbitmq_memory_limit_bytes) * 100 > 90
for: 2m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "RabbitMQ 内存使用率严重过高"
description: "节点 {{ $labels.instance }} 内存使用率 {{ $value | printf \"%.2f\" }}%,即将触发流控"
- alert: RabbitMQMemoryAlarm
expr: rabbitmq_mem_alarm == 1
for: 0m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "RabbitMQ 内存告警已触发"
description: "节点 {{ $labels.instance }} 已触发内存告警,所有生产者将被阻塞"
- alert: RabbitMQDiskSpaceLow
expr: rabbitmq_disk_free_bytes / 1024 / 1024 / 1024 < 10
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "RabbitMQ 磁盘空间不足"
description: "节点 {{ $labels.instance }} 剩余磁盘空间 {{ $value | printf \"%.2f\" }} GB"
- alert: RabbitMQDiskSpaceCritical
expr: rabbitmq_disk_free_bytes / 1024 / 1024 / 1024 < 5
for: 1m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "RabbitMQ 磁盘空间严重不足"
description: "节点 {{ $labels.instance }} 剩余磁盘空间 {{ $value | printf \"%.2f\" }} GB,请立即处理"
- alert: RabbitMQDiskAlarm
expr: rabbitmq_disk_free_alarm == 1
for: 0m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "RabbitMQ 磁盘告警已触发"
description: "节点 {{ $labels.instance }} 已触发磁盘告警,所有生产者将被阻塞"
- alert: RabbitMQQueueMessagesHigh
expr: rabbitmq_queue_messages > 50000
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "队列消息堆积过多"
description: "队列 {{ $labels.queue }} 消息数量 {{ $value }},可能存在消费瓶颈"
- alert: RabbitMQQueueMessagesCritical
expr: rabbitmq_queue_messages > 100000
for: 3m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "队列消息堆积严重"
description: "队列 {{ $labels.queue }} 消息数量 {{ $value }},请立即处理"
- alert: RabbitMQNoConsumer
expr: rabbitmq_queue_consumers == 0 and rabbitmq_queue_messages > 1000
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "队列无消费者"
description: "队列 {{ $labels.queue }} 有 {{ $value }} 条消息但无消费者"
- alert: RabbitMQConnectionHigh
expr: rabbitmq_connections > 800
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "连接数过多"
description: "节点 {{ $labels.instance }} 连接数 {{ $value }},建议优化连接池"
- alert: RabbitMQConnectionCritical
expr: rabbitmq_connections > 950
for: 2m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "连接数接近上限"
description: "节点 {{ $labels.instance }} 连接数 {{ $value }},即将达到上限"
- alert: RabbitMQFileDescriptorsHigh
expr: (rabbitmq_fd_used / rabbitmq_fd_total) * 100 > 80
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "文件描述符使用率过高"
description: "节点 {{ $labels.instance }} 文件描述符使用率 {{ $value | printf \"%.2f\" }}%"
- alert: RabbitMQFileDescriptorsCritical
expr: (rabbitmq_fd_used / rabbitmq_fd_total) * 100 > 90
for: 2m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "文件描述符使用率严重过高"
description: "节点 {{ $labels.instance }} 文件描述符使用率 {{ $value | printf \"%.2f\" }}%"
- alert: RabbitMQClusterPartition
expr: rabbitmq_partitions > 0
for: 0m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "集群网络分区"
description: "节点 {{ $labels.instance }} 检测到网络分区,请立即处理"
- alert: RabbitMQQueueGrowing
expr: deriv(rabbitmq_queue_messages[5m]) > 10
for: 10m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "队列消息持续增长"
description: "队列 {{ $labels.queue }} 消息数量持续增长,增长速率 {{ $value | printf \"%.2f\" }} 条/秒"
- alert: RabbitMQConsumerLag
expr: rabbitmq_queue_messages_ready - rabbitmq_queue_messages_unacked > 10000
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "消费者延迟过高"
description: "队列 {{ $labels.queue }} 消费者延迟 {{ $value }} 条消息"
- alert: RabbitMQMessageRedeliverHigh
expr: (rate(rabbitmq_messages_redelivered_total[5m]) / rate(rabbitmq_messages_delivered_total[5m])) * 100 > 10
for: 5m
labels:
severity: warning
service: rabbitmq
annotations:
summary: "消息重投递率过高"
description: "消息重投递率 {{ $value | printf \"%.2f\" }}%,可能存在消费者问题"
- alert: RabbitMQNodeNotRunning
expr: rabbitmq_node_running == 0
for: 1m
labels:
severity: critical
service: rabbitmq
annotations:
summary: "RabbitMQ 节点未运行"
description: "节点 {{ $labels.node }} 未处于运行状态"PHP 告警规则引擎
php
<?php
class AlertRuleEngine
{
private $rules;
private $stateStore;
public function __construct($stateStore = null)
{
$this->rules = $this->loadDefaultRules();
$this->stateStore = $stateStore ?: new FileStateStore('/tmp/alert_states.json');
}
private function loadDefaultRules()
{
return [
'node_down' => [
'name' => '节点宕机',
'condition' => function($metrics) {
return !$metrics['node_status'];
},
'duration' => 60,
'severity' => 'critical',
'message' => 'RabbitMQ 节点已宕机',
'runbook' => 'https://wiki.example.com/rabbitmq/node-down',
],
'memory_high' => [
'name' => '内存使用率过高',
'condition' => function($metrics) {
return $metrics['memory_usage'] > 80;
},
'duration' => 300,
'severity' => 'warning',
'message' => '内存使用率超过 80%',
'runbook' => 'https://wiki.example.com/rabbitmq/memory-high',
],
'memory_critical' => [
'name' => '内存使用率严重过高',
'condition' => function($metrics) {
return $metrics['memory_usage'] > 90;
},
'duration' => 120,
'severity' => 'critical',
'message' => '内存使用率超过 90%,即将触发流控',
'runbook' => 'https://wiki.example.com/rabbitmq/memory-critical',
],
'memory_alarm' => [
'name' => '内存告警触发',
'condition' => function($metrics) {
return $metrics['memory_alarm'] === true;
},
'duration' => 0,
'severity' => 'critical',
'message' => '内存告警已触发,所有生产者将被阻塞',
'runbook' => 'https://wiki.example.com/rabbitmq/memory-alarm',
],
'disk_low' => [
'name' => '磁盘空间不足',
'condition' => function($metrics) {
return $metrics['disk_free'] < 10;
},
'duration' => 300,
'severity' => 'warning',
'message' => '磁盘剩余空间不足 10GB',
'runbook' => 'https://wiki.example.com/rabbitmq/disk-low',
],
'disk_critical' => [
'name' => '磁盘空间严重不足',
'condition' => function($metrics) {
return $metrics['disk_free'] < 5;
},
'duration' => 60,
'severity' => 'critical',
'message' => '磁盘剩余空间不足 5GB,请立即处理',
'runbook' => 'https://wiki.example.com/rabbitmq/disk-critical',
],
'disk_alarm' => [
'name' => '磁盘告警触发',
'condition' => function($metrics) {
return $metrics['disk_alarm'] === true;
},
'duration' => 0,
'severity' => 'critical',
'message' => '磁盘告警已触发,所有生产者将被阻塞',
'runbook' => 'https://wiki.example.com/rabbitmq/disk-alarm',
],
'queue_messages_high' => [
'name' => '队列消息堆积',
'condition' => function($metrics) {
return $metrics['queue_messages'] > 50000;
},
'duration' => 300,
'severity' => 'warning',
'message' => '队列消息堆积超过 50000 条',
'runbook' => 'https://wiki.example.com/rabbitmq/queue-backlog',
],
'queue_messages_critical' => [
'name' => '队列消息堆积严重',
'condition' => function($metrics) {
return $metrics['queue_messages'] > 100000;
},
'duration' => 180,
'severity' => 'critical',
'message' => '队列消息堆积超过 100000 条,请立即处理',
'runbook' => 'https://wiki.example.com/rabbitmq/queue-backlog-critical',
],
'no_consumer' => [
'name' => '队列无消费者',
'condition' => function($metrics) {
return $metrics['queue_consumer_count'] === 0 && $metrics['queue_messages'] > 1000;
},
'duration' => 300,
'severity' => 'warning',
'message' => '队列有消息但无消费者',
'runbook' => 'https://wiki.example.com/rabbitmq/no-consumer',
],
'connection_high' => [
'name' => '连接数过多',
'condition' => function($metrics) {
return $metrics['connections_total'] > 800;
},
'duration' => 300,
'severity' => 'warning',
'message' => '连接数超过 800',
'runbook' => 'https://wiki.example.com/rabbitmq/connection-high',
],
'fd_high' => [
'name' => '文件描述符使用率过高',
'condition' => function($metrics) {
return $metrics['fd_usage'] > 80;
},
'duration' => 300,
'severity' => 'warning',
'message' => '文件描述符使用率超过 80%',
'runbook' => 'https://wiki.example.com/rabbitmq/fd-high',
],
'cluster_partition' => [
'name' => '集群网络分区',
'condition' => function($metrics) {
return $metrics['cluster_partition'] === true;
},
'duration' => 0,
'severity' => 'critical',
'message' => '检测到集群网络分区',
'runbook' => 'https://wiki.example.com/rabbitmq/partition',
],
];
}
public function addRule($name, array $rule)
{
$this->rules[$name] = $rule;
}
public function removeRule($name)
{
unset($this->rules[$name]);
}
public function evaluate(array $metrics)
{
$alerts = [];
$now = time();
foreach ($this->rules as $ruleName => $rule) {
$conditionMet = $rule['condition']($metrics);
$state = $this->stateStore->get($ruleName);
if ($conditionMet) {
if ($state === null) {
$this->stateStore->set($ruleName, [
'status' => 'pending',
'started_at' => $now,
'first_value' => $metrics,
]);
} elseif ($state['status'] === 'pending') {
$elapsed = $now - $state['started_at'];
if ($elapsed >= $rule['duration']) {
$this->stateStore->set($ruleName, [
'status' => 'firing',
'started_at' => $state['started_at'],
'first_value' => $state['first_value'],
]);
$alerts[] = $this->createAlert($ruleName, $rule, $metrics, 'firing');
}
} elseif ($state['status'] === 'firing') {
$alerts[] = $this->createAlert($ruleName, $rule, $metrics, 'firing');
}
} else {
if ($state !== null && $state['status'] === 'firing') {
$this->stateStore->set($ruleName, [
'status' => 'resolved',
'resolved_at' => $now,
]);
$alerts[] = $this->createAlert($ruleName, $rule, $metrics, 'resolved');
} elseif ($state !== null) {
$this->stateStore->remove($ruleName);
}
}
}
return $alerts;
}
private function createAlert($ruleName, $rule, $metrics, $status)
{
return [
'rule_name' => $ruleName,
'name' => $rule['name'],
'status' => $status,
'severity' => $rule['severity'],
'message' => $rule['message'],
'runbook' => $rule['runbook'],
'metrics' => $metrics,
'timestamp' => date('Y-m-d H:i:s'),
];
}
public function getRuleStates()
{
return $this->stateStore->getAll();
}
public function getRules()
{
return $this->rules;
}
}
class FileStateStore
{
private $file;
public function __construct($file)
{
$this->file = $file;
}
public function get($key)
{
$data = $this->load();
return $data[$key] ?? null;
}
public function set($key, $value)
{
$data = $this->load();
$data[$key] = $value;
$this->save($data);
}
public function remove($key)
{
$data = $this->load();
unset($data[$key]);
$this->save($data);
}
public function getAll()
{
return $this->load();
}
private function load()
{
if (file_exists($this->file)) {
return json_decode(file_get_contents($this->file), true) ?: [];
}
return [];
}
private function save($data)
{
$dir = dirname($this->file);
if (!is_dir($dir)) {
mkdir($dir, 0755, true);
}
file_put_contents($this->file, json_encode($data, JSON_PRETTY_PRINT));
}
}Zabbix 告警规则
xml
<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
<version>5.0</version>
<templates>
<template>
<template>Template RabbitMQ Alerts</template>
<triggers>
<trigger>
<name>RabbitMQ 节点宕机</name>
<expression>{Template RabbitMQ:rabbitmq.status[status].last()}=0</expression>
<priority>HIGH</priority>
<description>RabbitMQ 节点已宕机,请立即检查</description>
</trigger>
<trigger>
<name>RabbitMQ 内存使用率过高</name>
<expression>{Template RabbitMQ:rabbitmq.memory_usage.last()}>80</expression>
<priority>WARNING</priority>
<description>内存使用率超过 80%</description>
</trigger>
<trigger>
<name>RabbitMQ 内存使用率严重过高</name>
<expression>{Template RabbitMQ:rabbitmq.memory_usage.last()}>90</expression>
<priority>HIGH</priority>
<description>内存使用率超过 90%,即将触发流控</description>
</trigger>
<trigger>
<name>RabbitMQ 内存告警触发</name>
<expression>{Template RabbitMQ:rabbitmq.node[mem_alarm].last()}=1</expression>
<priority>DISASTER</priority>
<description>内存告警已触发,所有生产者将被阻塞</description>
</trigger>
<trigger>
<name>RabbitMQ 磁盘空间不足</name>
<expression>{Template RabbitMQ:rabbitmq.disk_free_gb.last()}<10</expression>
<priority>WARNING</priority>
<description>磁盘剩余空间不足 10GB</description>
</trigger>
<trigger>
<name>RabbitMQ 磁盘告警触发</name>
<expression>{Template RabbitMQ:rabbitmq.node[disk_alarm].last()}=1</expression>
<priority>DISASTER</priority>
<description>磁盘告警已触发,所有生产者将被阻塞</description>
</trigger>
<trigger>
<name>RabbitMQ 消息堆积过多</name>
<expression>{Template RabbitMQ:rabbitmq.overview[messages_total].last()}>50000</expression>
<priority>WARNING</priority>
<description>消息堆积超过 50000 条</description>
</trigger>
<trigger>
<name>RabbitMQ 连接数过多</name>
<expression>{Template RabbitMQ:rabbitmq.overview[connections].last()}>800</expression>
<priority>WARNING</priority>
<description>连接数超过 800</description>
</trigger>
</triggers>
</template>
</templates>
</zabbix_export>实际应用场景
场景一:动态告警规则
php
<?php
class DynamicAlertRules
{
private $ruleEngine;
private $configFile;
public function __construct(AlertRuleEngine $ruleEngine, $configFile)
{
$this->ruleEngine = $ruleEngine;
$this->configFile = $configFile;
}
public function loadRulesFromConfig()
{
if (!file_exists($this->configFile)) {
return;
}
$config = json_decode(file_get_contents($this->configFile), true);
foreach ($config['rules'] ?? [] as $name => $rule) {
$this->ruleEngine->addRule($name, $rule);
}
}
public function addBusinessRule($name, $queuePattern, $threshold, $duration = 300)
{
$this->ruleEngine->addRule($name, [
'name' => "业务告警: {$name}",
'condition' => function($metrics) use ($queuePattern, $threshold) {
foreach ($metrics['queues'] ?? [] as $queue) {
if (fnmatch($queuePattern, $queue['name'])) {
if ($queue['messages'] > $threshold) {
return true;
}
}
}
return false;
},
'duration' => $duration,
'severity' => 'warning',
'message' => "匹配 {$queuePattern} 的队列消息超过 {$threshold}",
]);
}
public function addTimeBasedRule($name, $baseRule, $timeRanges)
{
$originalRule = $this->ruleEngine->getRules()[$baseRule] ?? null;
if (!$originalRule) {
return false;
}
$this->ruleEngine->addRule($name, [
'name' => $originalRule['name'] . ' (时间增强)',
'condition' => function($metrics) use ($originalRule, $timeRanges) {
$hour = (int)date('H');
$dayOfWeek = (int)date('w');
$inRange = false;
foreach ($timeRanges as $range) {
if ($hour >= $range['start'] && $hour < $range['end']) {
if (empty($range['days']) || in_array($dayOfWeek, $range['days'])) {
$inRange = true;
break;
}
}
}
if (!$inRange) {
return false;
}
return $originalRule['condition']($metrics);
},
'duration' => $originalRule['duration'],
'severity' => $originalRule['severity'],
'message' => $originalRule['message'],
]);
return true;
}
}场景二:告警规则测试
php
<?php
class AlertRuleTester
{
private $ruleEngine;
public function __construct(AlertRuleEngine $ruleEngine)
{
$this->ruleEngine = $ruleEngine;
}
public function testRule($ruleName, array $testCases)
{
$results = [];
$rule = $this->ruleEngine->getRules()[$ruleName] ?? null;
if (!$rule) {
return ['error' => 'Rule not found'];
}
foreach ($testCases as $caseName => $metrics) {
$conditionMet = $rule['condition']($metrics);
$results[$caseName] = [
'metrics' => $metrics,
'condition_met' => $conditionMet,
'expected' => $caseName === 'should_trigger',
'passed' => $conditionMet === ($caseName === 'should_trigger'),
];
}
return $results;
}
public function testAllRules()
{
$rules = $this->ruleEngine->getRules();
$results = [];
$testScenarios = [
'normal' => [
'memory_usage' => 50,
'disk_free' => 50,
'queue_messages' => 1000,
'connections_total' => 100,
],
'memory_warning' => [
'memory_usage' => 85,
'disk_free' => 50,
'queue_messages' => 1000,
'connections_total' => 100,
],
'disk_critical' => [
'memory_usage' => 50,
'disk_free' => 3,
'queue_messages' => 1000,
'connections_total' => 100,
],
'queue_backlog' => [
'memory_usage' => 50,
'disk_free' => 50,
'queue_messages' => 80000,
'connections_total' => 100,
],
];
foreach ($rules as $ruleName => $rule) {
$results[$ruleName] = $this->testRule($ruleName, $testScenarios);
}
return $results;
}
}常见问题与解决方案
问题一:告警规则冲突
现象:同一指标触发多个告警。
解决方案:
yaml
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']问题二:告警抖动
现象:告警频繁触发和恢复。
解决方案:
yaml
rules:
- alert: RabbitMQMemoryHigh
for: 5m问题三:告警规则过多
现象:难以管理和维护。
解决方案:
php
class RuleGroupManager
{
public function groupRules($rules)
{
$groups = [];
foreach ($rules as $name => $rule) {
$category = $this->categorizeRule($name);
$groups[$category][$name] = $rule;
}
return $groups;
}
private function categorizeRule($name)
{
if (strpos($name, 'memory') !== false || strpos($name, 'disk') !== false) {
return 'resource';
}
if (strpos($name, 'queue') !== false || strpos($name, 'message') !== false) {
return 'queue';
}
if (strpos($name, 'connection') !== false || strpos($name, 'channel') !== false) {
return 'connection';
}
if (strpos($name, 'cluster') !== false || strpos($name, 'node') !== false) {
return 'cluster';
}
return 'other';
}
}最佳实践
1. 规则命名规范
<service>_<category>_<condition>_<severity>
rabbitmq_memory_high_warning
rabbitmq_disk_low_critical2. 持续时间设置
| 严重程度 | 建议持续时间 |
|---|---|
| Critical | 0-2 分钟 |
| Warning | 3-5 分钟 |
| Info | 5-10 分钟 |
3. 规则文档化
yaml
- alert: RabbitMQMemoryHigh
expr: (rabbitmq_memory_used_bytes / rabbitmq_memory_limit_bytes) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "RabbitMQ 内存使用率过高"
description: "节点 {{ $labels.instance }} 内存使用率 {{ $value }}%"
runbook_url: "https://wiki.example.com/rabbitmq/memory-high"
impact: "可能导致流控,影响消息吞吐"
action: "检查内存使用情况,考虑增加内存或优化消息处理"