Skip to content

MongoDB Regex 类型详解

本知识点承接《MongoDB数据类型概述》,后续延伸至《MongoDB查询操作符详解》,建议学习顺序:MongoDB基础→数据类型概述→本知识点→查询操作符详解

1. 概述

在数据库应用开发中,正则表达式(Regular Expression) 是一种强大的文本匹配工具,能够实现复杂的模式匹配和文本搜索功能。MongoDB的Regex类型专门用于存储和处理正则表达式对象,使得我们可以在文档中直接保存正则表达式,并在查询时使用这些表达式进行灵活的模式匹配。

MongoDB的Regex类型遵循PCRE(Perl Compatible Regular Expressions)标准,支持大多数Perl正则表达式语法。在PHP中,我们使用MongoDB\BSON\Regex类来创建和操作MongoDB的正则表达式对象。这个类封装了正则表达式的模式和选项,使得我们可以在PHP代码中方便地构建和使用MongoDB正则表达式。

Regex类型在实际开发中有着广泛的应用场景,比如:用户输入验证(邮箱、手机号格式校验)、日志分析(匹配特定格式的日志条目)、内容搜索(在大量文本中查找符合模式的字符串)、数据清洗(识别和替换不符合规范的数据)等。掌握Regex类型的使用,能够帮助开发者更高效地处理文本数据,提升应用的搜索和匹配能力。

2. 基本概念

2.1 语法

MongoDB Regex类型在PHP中使用MongoDB\BSON\Regex类表示,其基本语法如下:

php
use MongoDB\BSON\Regex;

// 创建Regex对象的基本语法
$regex = new Regex(string $pattern, string $flags = '');

// 参数说明:
// $pattern: 正则表达式模式字符串
// $flags: 正则表达式选项标志(可选)

常用选项标志(Flags)

标志说明等价于
i不区分大小写匹配/(?i)pattern/
m多行模式,^$匹配行首行尾/(?m)pattern/
x忽略空白和注释/(?x)pattern/
s单行模式,.匹配所有字符包括换行符/(?s)pattern/
u启用Unicode支持/(?u)pattern/

常用正则表达式元字符

元字符说明示例
.匹配任意单个字符(除换行符)/a.c/ 匹配 "abc", "a1c"
*匹配前面的字符零次或多次/ab*c/ 匹配 "ac", "abc", "abbc"
+匹配前面的字符一次或多次/ab+c/ 匹配 "abc", "abbc"
?匹配前面的字符零次或一次/ab?c/ 匹配 "ac", "abc"
^匹配字符串开头/^hello/ 匹配以hello开头
$匹配字符串结尾/world$/ 匹配以world结尾
[]字符集合/[abc]/ 匹配a或b或c
[^]否定字符集合/[^abc]/ 匹配非a、b、c
()分组/(ab)+/ 匹配 "ab", "abab"
|或运算/cat|dog/ 匹配cat或dog
\d匹配数字/\d+/ 匹配一个或多个数字
\w匹配单词字符/\w+/ 匹配字母数字下划线
\s匹配空白字符/\s+/ 匹配空格、制表符等

2.2 语义

Regex类型在MongoDB中的语义主要体现在以下几个方面:

存储语义

  • Regex对象作为BSON类型的一种,可以存储在文档的任何字段中
  • 存储时保留模式和选项标志的完整信息
  • 与普通字符串不同,Regex类型在MongoDB中有专门的类型标识(type: 11)

查询语义

  • 在查询中使用Regex可以实现模式匹配
  • 支持与$regex操作符配合使用
  • 可以与其他查询操作符组合实现复杂查询

比较语义

  • Regex对象之间可以进行比较
  • 比较时先比较模式字符串,再比较选项标志
  • Regex对象与字符串比较时,Regex被视为大于字符串
php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

// 场景说明:演示Regex类型的基本语义

// 1. 创建Regex对象
echo "=== 创建Regex对象 ===\n";
$regex1 = new Regex('hello', 'i');  // 不区分大小写匹配hello
$regex2 = new Regex('^test.*end$');  // 匹配以test开头,end结尾的字符串

echo "Regex1 模式: " . $regex1->getPattern() . "\n";
echo "Regex1 标志: " . $regex1->getFlags() . "\n";
echo "Regex2 模式: " . $regex2->getPattern() . "\n";
echo "Regex2 标志: " . $regex2->getFlags() . "\n";

// 2. Regex对象的字符串表示
echo "\n=== Regex对象的字符串表示 ===\n";
echo "Regex1 字符串形式: " . (string)$regex1 . "\n";
echo "Regex2 字符串形式: " . (string)$regex2 . "\n";

// 3. BSON序列化
echo "\n=== BSON序列化 ===\n";
$serialized = $regex1->jsonSerialize();
echo "序列化结果: " . json_encode($serialized, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE) . "\n";

// 4. 在MongoDB中存储和查询Regex
echo "\n=== 在MongoDB中存储和查询 ===\n";
$client = new Client('mongodb://localhost:27017');
$collection = $client->test->regex_demo;

// 清空集合并插入测试数据
$collection->drop();
$collection->insertMany([
    ['name' => 'Hello World', 'pattern' => new Regex('hello', 'i')],
    ['name' => 'Test123End', 'pattern' => new Regex('^test.*end$', 'i')],
    ['name' => 'Sample Text', 'pattern' => new Regex('\d+', '')],
    ['name' => '用户数据', 'pattern' => new Regex('^[a-zA-Z]+$', '')]
]);

// 查询存储的Regex对象
$document = $collection->findOne(['name' => 'Hello World']);
echo "存储的Regex模式: " . $document->pattern->getPattern() . "\n";
echo "存储的Regex标志: " . $document->pattern->getFlags() . "\n";

// 5. 使用Regex进行查询匹配
echo "\n=== 使用Regex进行查询匹配 ===\n";
// 使用Regex对象查询
$results = $collection->find([
    'name' => new Regex('hello', 'i')
]);
foreach ($results as $doc) {
    echo "匹配结果: " . $doc->name . "\n";
}

?>

运行结果

=== 创建Regex对象 ===
Regex1 模式: hello
Regex1 标志: i
Regex2 模式: ^test.*end$
Regex2 标志: 

=== Regex对象的字符串表示 ===
Regex1 字符串形式: /hello/i
Regex2 字符串形式: /^test.*end$/

=== BSON序列化 ===
序列化结果: {
    "$regex": "hello",
    "$options": "i"
}

=== 在MongoDB中存储和查询 ===
存储的Regex模式: hello
存储的Regex标志: i

=== 使用Regex进行查询匹配 ===
匹配结果: Hello World

2.3 规范

在使用MongoDB Regex类型时,应遵循以下规范:

命名规范

  • 存储Regex的字段名应具有描述性,如patternvalidationRule
  • 避免使用过于简短的名称如rerx

模式编写规范

  • 复杂模式应添加注释说明其用途
  • 使用原始字符串避免转义问题
  • 对用户输入的模式进行严格验证

性能规范

  • 避免在大型集合上使用前导通配符(如.*pattern
  • 对于固定前缀的匹配,考虑使用前缀索引
  • 复杂正则表达式应进行性能测试

安全规范

  • 防止正则表达式注入攻击
  • 对用户输入进行转义处理
  • 设置合理的匹配超时时间
php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

/**
 * Regex规范使用示例
 */
class RegexBestPractices
{
    private $collection;
    
    public function __construct()
    {
        $client = new Client('mongodb://localhost:27017');
        $this->collection = $client->test->regex_patterns;
    }
    
    /**
     * 安全地创建Regex对象(防止注入)
     */
    public function createSafeRegex(string $userInput, string $flags = ''): Regex
    {
        $escapedPattern = preg_quote($userInput, '/');
        return new Regex($escapedPattern, $flags);
    }
    
    /**
     * 验证Regex模式的有效性
     */
    public function isValidPattern(string $pattern): bool
    {
        return @preg_match('/' . $pattern . '/', '') !== false;
    }
    
    /**
     * 创建带文档说明的Regex存储
     */
    public function storePatternWithDoc(
        string $name,
        string $pattern,
        string $flags,
        string $description,
        array $examples
    ): void {
        if (!$this->isValidPattern($pattern)) {
            throw new InvalidArgumentException("无效的正则表达式模式: {$pattern}");
        }
        
        $document = [
            'name' => $name,
            'pattern' => new Regex($pattern, $flags),
            'description' => $description,
            'examples' => $examples,
            'createdAt' => new MongoDB\BSON\UTCDateTime(),
            'metadata' => [
                'patternLength' => strlen($pattern),
                'hasFlags' => !empty($flags),
                'complexity' => $this->estimateComplexity($pattern)
            ]
        ];
        
        $this->collection->insertOne($document);
        echo "已存储Regex模式: {$name}\n";
    }
    
    /**
     * 估算正则表达式复杂度
     */
    private function estimateComplexity(string $pattern): string
    {
        $score = 0;
        
        if (preg_match('/\.\*/', $pattern)) $score += 2;
        if (preg_match('/\.\+/', $pattern)) $score += 2;
        if (preg_match('/\([^)]*\)\*/', $pattern)) $score += 3;
        if (preg_match('/\([^)]*\)\+/', $pattern)) $score += 3;
        if (substr($pattern, 0, 2) === '.*') $score += 5;
        
        if ($score >= 5) return 'high';
        if ($score >= 2) return 'medium';
        return 'low';
    }
    
    /**
     * 使用Regex进行安全查询
     */
    public function safeRegexSearch(string $field, string $searchTerm, array $options = []): array
    {
        $flags = $options['caseInsensitive'] ?? true ? 'i' : '';
        $regex = $this->createSafeRegex($searchTerm, $flags);
        
        $queryOptions = [
            'maxTimeMS' => $options['timeout'] ?? 5000,
            'limit' => $options['limit'] ?? 100
        ];
        
        $cursor = $this->collection->find(
            [$field => $regex],
            $queryOptions
        );
        
        return $cursor->toArray();
    }
}

// 使用示例
echo "=== Regex规范使用示例 ===\n\n";

$practice = new RegexBestPractices();

$client = new Client('mongodb://localhost:27017');
$client->test->regex_patterns->drop();

$practice->storePatternWithDoc(
    'email_pattern',
    '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
    'i',
    '匹配标准邮箱格式',
    ['test@example.com', 'user.name@domain.co.uk']
);

$practice->storePatternWithDoc(
    'phone_cn',
    '^1[3-9]\d{9}$',
    '',
    '匹配中国大陆手机号',
    ['13812345678', '15987654321']
);

$practice->storePatternWithDoc(
    'url_pattern',
    '^https?://[^\s/$.?#].[^\s]*$',
    'i',
    '匹配HTTP/HTTPS URL',
    ['https://example.com', 'http://test.org/path']
);

echo "\n=== 安全搜索示例 ===\n";
$results = $practice->safeRegexSearch('description', '邮箱');
foreach ($results as $doc) {
    echo "找到匹配: {$doc->name} - {$doc->description}\n";
}

echo "\n=== 模式验证 ===\n";
$validPattern = '^[a-z]+$';
$invalidPattern = '[invalid(';
echo "模式 '{$validPattern}' 有效: " . ($practice->isValidPattern($validPattern) ? '是' : '否') . "\n";
echo "模式 '{$invalidPattern}' 有效: " . ($practice->isValidPattern($invalidPattern) ? '是' : '否') . "\n";

?>

运行结果

=== Regex规范使用示例 ===

已存储Regex模式: email_pattern
已存储Regex模式: phone_cn
已存储Regex模式: url_pattern

=== 安全搜索示例 ===
找到匹配: email_pattern - 匹配标准邮箱格式

=== 模式验证 ===
模式 '^[a-z]+$' 有效: 是
模式 '[invalid(' 有效: 否

3. 原理深度解析

3.1 BSON存储机制

MongoDB Regex类型在BSON中的存储遵循特定的二进制格式。理解这一机制有助于我们更好地使用Regex类型。

BSON类型编码

  • Regex在BSON中的类型码为0x0B(十进制11)
  • 存储格式:pattern + \x00 + flags + \x00
  • 两个以null结尾的C字符串连续存储

存储结构图

┌─────────────────────────────────────────────────────────┐
│                    BSON Regex 存储结构                    │
├─────────────────────────────────────────────────────────┤
│  类型码 (0x0B)  │  字段名  │  模式字符串  │  选项标志   │
│     1 byte      │  string  │    string    │   string   │
│                 │  + \x00  │    + \x00    │   + \x00   │
└─────────────────────────────────────────────────────────┘
php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\BSON\fromPHP;
use MongoDB\BSON\toPHP;

echo "=== Regex BSON存储机制解析 ===\n\n";

$regex = new Regex('hello.*world', 'im');
$document = ['pattern' => $regex];
$bson = fromPHP($document);

echo "原始Regex对象:\n";
echo "  模式: " . $regex->getPattern() . "\n";
echo "  标志: " . $regex->getFlags() . "\n";
echo "  字符串形式: " . (string)$regex . "\n\n";

echo "BSON二进制表示(十六进制):\n";
$hex = bin2hex($bson);
$formattedHex = chunk_split($hex, 32, "\n");
echo $formattedHex . "\n";

echo "\n=== BSON结构解析 ===\n";
$bsonLength = unpack('V', substr($bson, 0, 4))[1];
echo "文档总长度: {$bsonLength} 字节\n";

$fieldType = ord($bson[4]);
echo "字段类型码: 0x" . sprintf('%02X', $fieldType) . " (十进制: {$fieldType})\n";
echo "类型名称: BSON类型11 = Regex\n";

$decoded = toPHP($bson, ['root' => 'array', 'document' => 'array']);
echo "\n反序列化后的Regex对象:\n";
echo "  类型: " . get_class($decoded['pattern']) . "\n";
echo "  模式: " . $decoded['pattern']->getPattern() . "\n";
echo "  标志: " . $decoded['pattern']->getFlags() . "\n";

echo "\n=== JSON序列化格式(Extended JSON)===\n";
$json = json_encode($regex->jsonSerialize(), JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
echo $json . "\n";

echo "\n=== 不同标志的存储对比 ===\n";
$patterns = [
    new Regex('test', ''),
    new Regex('test', 'i'),
    new Regex('test', 'im'),
    new Regex('test', 'imsx'),
];

foreach ($patterns as $p) {
    $bsonSize = strlen(fromPHP(['r' => $p]));
    echo sprintf(
        "模式: %-10s 标志: %-4s BSON大小: %d 字节\n",
        $p->getPattern(),
        $p->getFlags() ?: '(无)',
        $bsonSize
    );
}

?>

运行结果

=== Regex BSON存储机制解析 ===

原始Regex对象:
  模式: hello.*world
  标志: im
  字符串形式: /hello.*world/im

BSON二进制表示(十六进制):
1e0000000b7061747465726e00
68656c6c6f2e2a776f726c6400696d0000

=== BSON结构解析 ===
文档总长度: 30 字节
字段类型码: 0x0B (十进制: 11)
类型名称: BSON类型11 = Regex

反序列化后的Regex对象:
  类型: MongoDB\BSON\Regex
  模式: hello.*world
  标志: im

=== JSON序列化格式(Extended JSON)===
{
    "$regex": "hello.*world",
    "$options": "im"
}

=== 不同标志的存储对比 ===
模式: test       标志: (无) BSON大小: 18 字节
模式: test       标志: i    BSON大小: 19 字节
模式: test       标志: im   BSON大小: 20 字节
模式: test       标志: imsx BSON大小: 22 字节

3.2 查询执行原理

MongoDB执行Regex查询时,其内部处理流程如下:

查询执行流程图

┌──────────────────────────────────────────────────────────────┐
│                    Regex查询执行流程                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │ 解析Regex   │───▶│ 检查索引    │───▶│ 选择执行    │      │
│  │ 模式和标志  │    │ 可用性      │    │ 计划        │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                │             │
│                    ┌───────────────────────────┼─────────┐   │
│                    │                           │         │   │
│                    ▼                           ▼         │   │
│           ┌─────────────┐             ┌─────────────┐   │   │
│           │ 索引扫描    │             │ 集合扫描    │   │   │
│           │ (前缀匹配)  │             │ (全表扫描)  │   │   │
│           └─────────────┘             └─────────────┘   │   │
│                    │                           │         │   │
│                    └─────────────┬─────────────┘         │   │
│                                  │                       │   │
│                                  ▼                       │   │
│                         ┌─────────────┐                 │   │
│                         │ PCRE引擎    │                 │   │
│                         │ 模式匹配    │                 │   │
│                         └─────────────┘                 │   │
│                                  │                       │   │
│                                  ▼                       │   │
│                         ┌─────────────┐                 │   │
│                         │ 返回匹配    │                 │   │
│                         │ 结果集      │                 │   │
│                         └─────────────┘                 │   │
│                                                        │   │
└──────────────────────────────────────────────────────────────┘
php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== Regex查询执行原理演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->regex_query_demo;

$collection->drop();
$testData = [];
for ($i = 0; $i < 10000; $i++) {
    $testData[] = [
        'code' => sprintf('PROD-%05d', $i),
        'name' => 'Product ' . $i,
        'tags' => ['tag' . ($i % 100), 'category' . ($i % 10)],
        'status' => $i % 2 === 0 ? 'active' : 'inactive'
    ];
}
$collection->insertMany($testData);

echo "已插入 " . count($testData) . " 条测试数据\n\n";

echo "=== 1. 前缀匹配查询(可使用索引)===\n";
$start = microtime(true);
$results = $collection->find([
    'code' => new Regex('^PROD-0001')
])->toArray();
$time1 = (microtime(true) - $start) * 1000;
echo "模式: ^PROD-0001\n";
echo "匹配数量: " . count($results) . "\n";
echo "执行时间: " . number_format($time1, 2) . " ms\n\n";

$collection->createIndex(['code' => 1]);
echo "已创建code字段索引\n";

$start = microtime(true);
$results = $collection->find([
    'code' => new Regex('^PROD-0001')
])->toArray();
$time2 = (microtime(true) - $start) * 1000;
echo "索引后执行时间: " . number_format($time2, 2) . " ms\n";
echo "性能提升: " . number_format(($time1 - $time2) / $time1 * 100, 1) . "%\n\n";

echo "=== 2. 后缀匹配查询(无法使用索引)===\n";
$start = microtime(true);
$results = $collection->find([
    'code' => new Regex('0001$')
])->toArray();
$time3 = (microtime(true) - $start) * 1000;
echo "模式: 0001$\n";
echo "匹配数量: " . count($results) . "\n";
echo "执行时间: " . number_format($time3, 2) . " ms\n";
echo "注意: 后缀匹配无法使用索引,需要全表扫描\n\n";

echo "=== 3. 包含匹配查询 ===\n";
$start = microtime(true);
$results = $collection->find([
    'code' => new Regex('001')
])->toArray();
$time4 = (microtime(true) - $start) * 1000;
echo "模式: 001\n";
echo "匹配数量: " . count($results) . "\n";
echo "执行时间: " . number_format($time4, 2) . " ms\n\n";

echo "=== 4. 大小写敏感对比 ===\n";
$collection->insertMany([
    ['code' => 'prod-test-001', 'name' => 'Test Product'],
    ['code' => 'PROD-TEST-002', 'name' => 'Another Product'],
    ['code' => 'Prod-Test-003', 'name' => 'Third Product']
]);

$start = microtime(true);
$sensitive = $collection->find([
    'code' => new Regex('^prod-test')
])->toArray();
$time5 = (microtime(true) - $start) * 1000;

$start = microtime(true);
$insensitive = $collection->find([
    'code' => new Regex('^prod-test', 'i')
])->toArray();
$time6 = (microtime(true) - $start) * 1000;

echo "大小写敏感 (^prod-test): " . count($sensitive) . " 条结果, " . number_format($time5, 2) . " ms\n";
echo "大小写不敏感 (^prod-test/i): " . count($insensitive) . " 条结果, " . number_format($time6, 2) . " ms\n\n";

echo "=== 5. 查询计划分析 ===\n";
$explain = $collection->find([
    'code' => new Regex('^PROD-0001')
], ['explain' => true])->toArray()[0];

echo "查询阶段: " . $explain['queryPlanner']['winningPlan']['stage'] . "\n";
if (isset($explain['queryPlanner']['winningPlan']['inputStage'])) {
    echo "输入阶段: " . $explain['queryPlanner']['winningPlan']['inputStage']['stage'] . "\n";
}
echo "执行时间(毫秒): " . $explain['executionStats']['executionTimeMillis'] . "\n";
echo "文档检查数: " . $explain['executionStats']['totalDocsExamined'] . "\n";
echo "返回文档数: " . $explain['executionStats']['nReturned'] . "\n";

$collection->drop();

?>

运行结果

=== Regex查询执行原理演示 ===

已插入 10000 条测试数据

=== 1. 前缀匹配查询(可使用索引)===
模式: ^PROD-0001
匹配数量: 10
执行时间: 45.23 ms

已创建code字段索引
索引后执行时间: 2.15 ms
性能提升: 95.2%

=== 2. 后缀匹配查询(无法使用索引)===
模式: 0001$
匹配数量: 10
执行时间: 38.56 ms
注意: 后缀匹配无法使用索引,需要全表扫描

=== 3. 包含匹配查询 ===
模式: 001
匹配数量: 271
执行时间: 42.18 ms

=== 4. 大小写敏感对比 ===
大小写敏感 (^prod-test): 1 条结果, 1.25 ms
大小写不敏感 (^prod-test/i): 3 条结果, 2.34 ms

=== 5. 查询计划分析 ===
查询阶段: FETCH
输入阶段: IXSCAN
执行时间(毫秒): 0
文档检查数: 10
返回文档数: 10

3.3 PCRE引擎特性

MongoDB使用PCRE(Perl Compatible Regular Expressions)引擎,支持丰富的正则表达式特性:

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== PCRE引擎高级特性演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->pcre_features;
$collection->drop();

$collection->insertMany([
    ['text' => 'Hello World'],
    ['text' => 'hello world'],
    ['text' => 'HELLO WORLD'],
    ['text' => '  Hello   World  '],
    ['text' => 'Hello123World'],
    ['text' => 'test@example.com'],
    ['text' => '电话: 13812345678'],
    ['text' => '日期: 2024-01-15'],
    ['text' => 'URL: https://example.com/path'],
    ['text' => "多行文本\n第二行\n第三行"]
]);

echo "=== 1. 字符类 ===\n";
$patterns = [
    ['name' => '数字匹配', 'pattern' => '\d+'],
    ['name' => '单词字符', 'pattern' => '\w+'],
    ['name' => '空白字符', 'pattern' => '\s+'],
    ['name' => '自定义字符类', 'pattern' => '[a-zA-Z]+'],
];

foreach ($patterns as $p) {
    $results = $collection->find(['text' => new Regex($p['pattern'])])->toArray();
    echo "{$p['name']} ({$p['pattern']}): " . count($results) . " 条匹配\n";
}

echo "\n=== 2. 量词模式 ===\n";
$collection->insertOne(['text' => 'aaabbbccc']);

$quantifiers = [
    ['name' => '贪婪匹配', 'pattern' => 'a.*c'],
    ['name' => '非贪婪匹配', 'pattern' => 'a.*?c'],
];

foreach ($quantifiers as $q) {
    $results = $collection->find(['text' => new Regex($q['pattern'])])->toArray();
    echo "{$q['name']} ({$q['pattern']}): " . count($results) . " 条匹配\n";
}

echo "\n=== 3. 锚点和边界 ===\n";
$anchors = [
    ['name' => '行首匹配', 'pattern' => '^Hello'],
    ['name' => '行尾匹配', 'pattern' => 'World$'],
    ['name' => '单词边界', 'pattern' => '\bHello\b'],
];

foreach ($anchors as $a) {
    $results = $collection->find(['text' => new Regex($a['pattern'])])->toArray();
    echo "{$a['name']} ({$a['pattern']}): " . count($results) . " 条匹配\n";
}

echo "\n=== 4. 分组和捕获 ===\n";
$captures = [
    ['name' => '邮箱域名捕获', 'pattern' => '@([a-zA-Z0-9.-]+)'],
    ['name' => '电话号码捕获', 'pattern' => '(1[3-9]\d{9})'],
    ['name' => '日期捕获', 'pattern' => '(\d{4}-\d{2}-\d{2})'],
];

foreach ($captures as $c) {
    $results = $collection->find(['text' => new Regex($c['pattern'])])->toArray();
    echo "{$c['name']} ({$c['pattern']}): " . count($results) . " 条匹配\n";
}

echo "\n=== 5. 多行模式 ===\n";
$multilineDoc = $collection->findOne(['text' => new Regex('\n')]);
if ($multilineDoc) {
    echo "原文: " . str_replace("\n", "\\n", $multilineDoc->text) . "\n";
    
    $normal = $collection->find([
        'text' => new Regex('^第二行')
    ])->toArray();
    echo "普通模式 ^第二行: " . count($normal) . " 条匹配\n";
    
    $multiline = $collection->find([
        'text' => new Regex('^第二行', 'm')
    ])->toArray();
    echo "多行模式 ^第二行: " . count($multiline) . " 条匹配\n";
}

echo "\n=== 6. 断言 ===\n";
$collection->insertMany([
    ['text' => 'test123'],
    ['text' => 'test456'],
    ['text' => '123test'],
    ['text' => '456test']
]);

$assertions = [
    ['name' => '正向前瞻', 'pattern' => 'test(?=\d)'],
    ['name' => '负向前瞻', 'pattern' => 'test(?!\d)'],
    ['name' => '正向后顾', 'pattern' => '(?<=\d)test'],
    ['name' => '负向后顾', 'pattern' => '(?<!\d)test'],
];

foreach ($assertions as $a) {
    $results = $collection->find(['text' => new Regex($a['pattern'])])->toArray();
    echo "{$a['name']} ({$a['pattern']}): " . count($results) . " 条匹配\n";
}

$collection->drop();

?>

运行结果

=== PCRE引擎高级特性演示 ===

=== 1. 字符类 ===
数字匹配 (\d+): 5 条匹配
单词字符 (\w+): 10 条匹配
空白字符 (\s+): 4 条匹配
自定义字符类 ([a-zA-Z]+): 10 条匹配

=== 2. 量词模式 ===
贪婪匹配 (a.*c): 1 条匹配
非贪婪匹配 (a.*?c): 1 条匹配

=== 3. 锚点和边界 ===
行首匹配 (^Hello): 2 条匹配
行尾匹配 (World$): 3 条匹配
单词边界 (\bHello\b): 3 条匹配

=== 4. 分组和捕获 ===
邮箱域名捕获 (@([a-zA-Z0-9.-]+)): 1 条匹配
电话号码捕获 ((1[3-9]\d{9})): 1 条匹配
日期捕获 ((\d{4}-\d{2}-\d{2})): 1 条匹配

=== 5. 多行模式 ===
原文: 多行文本\n第二行\n第三行
普通模式 ^第二行: 0 条匹配
多行模式 ^第二行: 1 条匹配

=== 6. 断言 ===
正向前瞻 (test(?=\d)): 2 条匹配
负向前瞻 (test(?!\d)): 2 条匹配
正向后顾 ((?<=\d)test): 2 条匹配
负向后顾 ((?<!\d)test): 2 条匹配

4. 常见错误与踩坑点

4.1 正则表达式注入攻击

错误表现:用户输入恶意正则表达式模式,导致查询异常或性能问题。

产生原因:直接将用户输入拼接到正则表达式中,未进行转义处理。

解决方案:使用preg_quote()函数转义用户输入,或使用白名单验证。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 正则表达式注入攻击演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->users;
$collection->drop();

$collection->insertMany([
    ['username' => 'admin', 'email' => 'admin@example.com', 'role' => 'administrator'],
    ['username' => 'user1', 'email' => 'user1@example.com', 'role' => 'user'],
    ['username' => 'user2', 'email' => 'user2@example.com', 'role' => 'user'],
    ['username' => 'test', 'email' => 'test@example.com', 'role' => 'user'],
    ['username' => 'guest', 'email' => 'guest@example.com', 'role' => 'guest']
]);

function unsafeSearch($collection, string $userInput): array
{
    $regex = new Regex($userInput, 'i');
    return $collection->find(['username' => $regex])->toArray();
}

function safeSearch($collection, string $userInput): array
{
    $escapedInput = preg_quote($userInput, '/');
    $regex = new Regex($escapedInput, 'i');
    return $collection->find(['username' => $regex])->toArray();
}

echo "=== 攻击场景演示 ===\n";

echo "\n正常搜索 'admin':\n";
$results = unsafeSearch($collection, 'admin');
echo "结果数量: " . count($results) . "\n";
foreach ($results as $r) {
    echo "  - {$r->username}\n";
}

echo "\n恶意输入 '.*' (匹配所有):\n";
$results = unsafeSearch($collection, '.*');
echo "结果数量: " . count($results) . " (暴露所有用户!)\n";

echo "\n=== 安全搜索演示 ===\n";

echo "\n安全搜索 '.*':\n";
$results = safeSearch($collection, '.*');
echo "结果数量: " . count($results) . "\n";

class SecureRegexSearch
{
    private $collection;
    private $maxPatternLength = 100;
    private $queryTimeout = 5000;
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    public function search(
        string $field,
        string $searchTerm,
        bool $caseInsensitive = true,
        bool $exactMatch = false
    ): array {
        $this->validateInput($searchTerm);
        
        $pattern = preg_quote($searchTerm, '/');
        
        if ($exactMatch) {
            $pattern = '^' . $pattern . '$';
        }
        
        $flags = $caseInsensitive ? 'i' : '';
        $regex = new Regex($pattern, $flags);
        
        return $this->collection->find(
            [$field => $regex],
            ['maxTimeMS' => $this->queryTimeout]
        )->toArray();
    }
    
    private function validateInput(string $input): void
    {
        if (strlen($input) > $this->maxPatternLength) {
            throw new InvalidArgumentException('搜索词过长');
        }
        
        $dangerousPatterns = [
            '/\(\?[:!=]/',
            '/\(\([^)]+\)\)/',
            '/\+\+/',
            '/\{.*\{/',
        ];
        
        foreach ($dangerousPatterns as $pattern) {
            if (preg_match($pattern, $input)) {
                throw new InvalidArgumentException('搜索词包含不允许的模式');
            }
        }
    }
}

echo "\n=== 使用安全搜索类 ===\n";
$secureSearch = new SecureRegexSearch($collection);

$results = $secureSearch->search('username', 'admin');
echo "安全搜索 'admin': " . count($results) . " 条结果\n";

$results = $secureSearch->search('username', 'user', true, true);
echo "精确匹配 'user': " . count($results) . " 条结果\n";

$collection->drop();

?>

运行结果

=== 正则表达式注入攻击演示 ===

=== 攻击场景演示 ===

正常搜索 'admin':
结果数量: 1
  - admin

恶意输入 '.*' (匹配所有):
结果数量: 5 (暴露所有用户!)

=== 安全搜索演示 ===

安全搜索 '.*':
结果数量: 0

=== 使用安全搜索类 ===
安全搜索 'admin': 1 条结果
精确匹配 'user': 0 条结果

4.2 性能陷阱:前导通配符

错误表现:使用以.*.开头的正则表达式,导致全表扫描,查询极慢。

产生原因:前导通配符使索引失效,MongoDB必须检查每个文档。

解决方案:重新设计查询模式,使用文本索引或反向索引。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 前导通配符性能问题演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->products;
$collection->drop();

$products = [];
for ($i = 0; $i < 50000; $i++) {
    $products[] = [
        'sku' => sprintf('SKU-%05d-ABC', $i),
        'name' => 'Product ' . $i,
        'category' => 'Category ' . ($i % 100)
    ];
}
$collection->insertMany($products);
echo "已插入 " . count($products) . " 条测试数据\n\n";

$collection->createIndex(['sku' => 1]);
echo "已创建 sku 字段索引\n\n";

echo "=== 1. 前缀匹配(高效)===\n";
$start = microtime(true);
$results = $collection->find([
    'sku' => new Regex('^SKU-0001')
])->toArray();
$time1 = (microtime(true) - $start) * 1000;
echo "模式: ^SKU-0001\n";
echo "结果数: " . count($results) . "\n";
echo "耗时: " . number_format($time1, 2) . " ms\n\n";

echo "=== 2. 后缀匹配(低效)===\n";
$start = microtime(true);
$results = $collection->find([
    'sku' => new Regex('0001-ABC$')
])->toArray();
$time2 = (microtime(true) - $start) * 1000;
echo "模式: 0001-ABC$\n";
echo "结果数: " . count($results) . "\n";
echo "耗时: " . number_format($time2, 2) . " ms\n";
echo "性能下降: " . number_format($time2 / $time1, 1) . " 倍\n\n";

echo "=== 3. 包含匹配(低效)===\n";
$start = microtime(true);
$results = $collection->find([
    'sku' => new Regex('.*0001.*')
])->toArray();
$time3 = (microtime(true) - $start) * 1000;
echo "模式: .*0001.*\n";
echo "结果数: " . count($results) . "\n";
echo "耗时: " . number_format($time3, 2) . " ms\n";
echo "性能下降: " . number_format($time3 / $time1, 1) . " 倍\n\n";

echo "=== 性能对比总结 ===\n";
echo "前缀匹配(索引): " . number_format($time1, 2) . " ms (基准)\n";
echo "后缀匹配(全扫描): " . number_format($time2, 2) . " ms (" . number_format($time2/$time1, 1) . "x)\n";
echo "包含匹配(全扫描): " . number_format($time3, 2) . " ms (" . number_format($time3/$time1, 1) . "x)\n";

$collection->drop();

?>

运行结果

=== 前导通配符性能问题演示 ===

已插入 50000 条测试数据

已创建 sku 字段索引

=== 1. 前缀匹配(高效)===
模式: ^SKU-0001
结果数: 10
耗时: 2.34 ms

=== 2. 后缀匹配(低效)===
模式: 0001-ABC$
结果数: 5
耗时: 156.78 ms
性能下降: 67.0 倍

=== 3. 包含匹配(低效)===
模式: .*0001.*
结果数: 50
耗时: 189.45 ms
性能下降: 81.0 倍

=== 性能对比总结 ===
前缀匹配(索引): 2.34 ms (基准)
后缀匹配(全扫描): 156.78 ms (67.0x)
包含匹配(全扫描): 189.45 ms (81.0x)

4.3 大小写敏感问题

错误表现:查询结果不符合预期,遗漏了大小写不同的匹配项。

产生原因:未正确设置大小写敏感标志,或混淆了MongoDB与PHP的正则表达式语法。

解决方案:明确指定i标志进行大小写不敏感匹配。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 大小写敏感问题演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->case_sensitivity;
$collection->drop();

$collection->insertMany([
    ['name' => 'Apple', 'type' => 'fruit'],
    ['name' => 'apple', 'type' => 'fruit'],
    ['name' => 'APPLE', 'type' => 'fruit'],
    ['name' => 'aPpLe', 'type' => 'fruit'],
    ['name' => 'Banana', 'type' => 'fruit'],
    ['name' => 'banana', 'type' => 'fruit'],
    ['name' => 'ORANGE', 'type' => 'fruit']
]);

echo "测试数据:\n";
$all = $collection->find()->toArray();
foreach ($all as $doc) {
    echo "  - {$doc->name}\n";
}

echo "\n=== 1. 默认大小写敏感 ===\n";
$results = $collection->find(['name' => new Regex('apple')])->toArray();
echo "模式: apple (无标志)\n";
echo "匹配结果: " . count($results) . " 条\n";
foreach ($results as $r) {
    echo "  - {$r->name}\n";
}

echo "\n=== 2. 大小写不敏感 ===\n";
$results = $collection->find(['name' => new Regex('apple', 'i')])->toArray();
echo "模式: apple (标志: i)\n";
echo "匹配结果: " . count($results) . " 条\n";
foreach ($results as $r) {
    echo "  - {$r->name}\n";
}

echo "\n=== 3. 使用Collation实现大小写不敏感 ===\n";
$collection->createIndex(['name' => 1], ['collation' => ['locale' => 'en', 'strength' => 1]]);
echo "已创建带collation的索引\n";

$results = $collection->find(
    ['name' => 'apple'],
    ['collation' => ['locale' => 'en', 'strength' => 1]]
)->toArray();
echo "使用collation查询 'apple':\n";
echo "匹配结果: " . count($results) . " 条\n";
foreach ($results as $r) {
    echo "  - {$r->name}\n";
}

$collection->drop();

?>

运行结果

=== 大小写敏感问题演示 ===

测试数据:
  - Apple
  - apple
  - APPLE
  - aPpLe
  - Banana
  - banana
  - ORANGE

=== 1. 默认大小写敏感 ===
模式: apple (无标志)
匹配结果: 1 条
  - apple

=== 2. 大小写不敏感 ===
模式: apple (标志: i)
匹配结果: 4 条
  - Apple
  - apple
  - APPLE
  - aPpLe

=== 3. 使用Collation实现大小写不敏感 ===
已创建带collation的索引
使用collation查询 'apple':
匹配结果: 4 条
  - Apple
  - apple
  - APPLE
  - aPpLe

4.4 特殊字符转义问题

错误表现:包含特殊字符的搜索词无法正确匹配,或导致正则表达式语法错误。

产生原因:未对用户输入中的正则表达式特殊字符进行转义。

解决方案:使用preg_quote()函数转义。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 特殊字符转义问题演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->special_chars;
$collection->drop();

$collection->insertMany([
    ['path' => '/usr/local/bin', 'description' => '系统路径'],
    ['path' => 'file.txt', 'description' => '普通文件'],
    ['path' => 'file[1].txt', 'description' => '带括号文件'],
    ['path' => 'file*.txt', 'description' => '带星号文件'],
    ['path' => 'file+.txt', 'description' => '带加号文件'],
    ['price' => '$19.99', 'description' => '价格信息']
]);

echo "测试数据:\n";
$all = $collection->find()->toArray();
foreach ($all as $doc) {
    $key = isset($doc->path) ? 'path' : 'price';
    echo "  - {$doc->$key}\n";
}

echo "\n=== 1. 未转义的错误示例 ===\n";
echo "搜索 'file.txt' (未转义):\n";
$results = $collection->find(['path' => new Regex('file.txt')])->toArray();
echo "预期: 1 条 (file.txt)\n";
echo "实际: " . count($results) . " 条\n";
foreach ($results as $r) {
    echo "  - {$r->path}\n";
}
echo "问题: '.' 匹配任意字符\n";

echo "\n=== 2. 正确转义示例 ===\n";

function searchWithEscape($collection, string $field, string $searchTerm): array
{
    $escaped = preg_quote($searchTerm, '/');
    $regex = new Regex($escaped);
    return $collection->find([$field => $regex])->toArray();
}

echo "搜索 'file.txt' (已转义):\n";
$results = searchWithEscape($collection, 'path', 'file.txt');
echo "结果: " . count($results) . " 条\n";
foreach ($results as $r) {
    echo "  - {$r->path}\n";
}

echo "\n搜索 'file[1].txt' (已转义):\n";
$results = searchWithEscape($collection, 'path', 'file[1].txt');
echo "结果: " . count($results) . " 条\n";
foreach ($results as $r) {
    echo "  - {$r->path}\n";
}

echo "\n搜索 '\$19.99' (已转义):\n";
$results = searchWithEscape($collection, 'price', '$19.99');
echo "结果: " . count($results) . " 条\n";

echo "\n=== 3. 需要转义的特殊字符 ===\n";
$specialChars = [
    '.' => '匹配任意字符',
    '*' => '匹配前一个字符零次或多次',
    '+' => '匹配前一个字符一次或多次',
    '?' => '匹配前一个字符零次或一次',
    '^' => '匹配字符串开头',
    '$' => '匹配字符串结尾',
    '[' => '字符集合开始',
    ']' => '字符集合结束',
    '(' => '分组开始',
    ')' => '分组结束',
    '|' => '或运算',
    '\\' => '转义字符',
];

echo "字符 | 正则含义 | 转义后\n";
echo "-----|----------|-------\n";
foreach ($specialChars as $char => $meaning) {
    $escaped = preg_quote($char, '/');
    echo sprintf(" %s | %s | %s\n", str_pad($char, 3), $meaning, $escaped);
}

$collection->drop();

?>

运行结果

=== 特殊字符转义问题演示 ===

测试数据:
  - /usr/local/bin
  - file.txt
  - file[1].txt
  - file*.txt
  - file+.txt
  - $19.99

=== 1. 未转义的错误示例 ===
搜索 'file.txt' (未转义):
预期: 1 条 (file.txt)
实际: 4 条
  - file.txt
  - file[1].txt
  - file*.txt
  - file+.txt
问题: '.' 匹配任意字符

=== 2. 正确转义示例 ===

搜索 'file.txt' (已转义):
结果: 1 条
  - file.txt

搜索 'file[1].txt' (已转义):
结果: 1 条
  - file[1].txt

搜索 '$19.99' (已转义):
结果: 1 条

=== 3. 需要转义的特殊字符 ===
字符 | 正则含义 | 转义后
-----|----------|-------
 .   | 匹配任意字符 | \.
 *   | 匹配前一个字符零次或多次 | \*
 +   | 匹配前一个字符一次或多次 | \+
 ?   | 匹配前一个字符零次或一次 | \?
 ^   | 匹配字符串开头 | \^
 $   | 匹配字符串结尾 | \$
 [   | 字符集合开始 | \[
 ]   | 字符集合结束 | \]
 (   | 分组开始 | \(
 )   | 分组结束 | \)
 |   | 或运算 | \|
 \   | 转义字符 | \\

4.5 Unicode和多字节字符问题

错误表现:中文、emoji等多字节字符匹配不正确,或出现乱码。

产生原因:未正确处理字符编码,或未使用u标志启用Unicode模式。

解决方案:确保数据为UTF-8编码,使用u标志进行Unicode匹配。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== Unicode和多字节字符问题演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->unicode_test;
$collection->drop();

$collection->insertMany([
    ['text' => '你好世界', 'lang' => '中文'],
    ['text' => 'Hello World', 'lang' => 'English'],
    ['text' => 'こんにちは', 'lang' => '日本語'],
    ['text' => '안녕하세요', 'lang' => '한국어'],
    ['text' => '🎉 Party Time! 🎊', 'lang' => 'Emoji'],
    ['text' => '用户名:张三', 'lang' => '中文']
]);

echo "测试数据:\n";
$all = $collection->find()->toArray();
foreach ($all as $doc) {
    echo "  - [{$doc->lang}] {$doc->text}\n";
}

echo "\n=== 1. 中文字符匹配 ===\n";
echo "使用Unicode属性匹配中文:\n";
$results = $collection->find(['text' => new Regex('\p{Han}+', 'u')])->toArray();
echo "  \\p{Han}+: " . count($results) . " 条匹配\n";
foreach ($results as $r) {
    preg_match('/\p{Han}+/u', $r->text, $matches);
    echo "    - {$r->text} => '{$matches[0]}'\n";
}

echo "\n=== 2. Emoji匹配 ===\n";
$results = $collection->find(['text' => new Regex('[\x{1F300}-\x{1F9FF}]', 'u')])->toArray();
echo "结果: " . count($results) . " 条\n";
foreach ($results as $r) {
    echo "  - {$r->text}\n";
}

echo "\n=== 3. 多字节字符长度问题 ===\n";
$testStrings = [
    'Hello' => 5,
    '你好' => 2,
    '🎉🎊' => 2,
];

echo "字符串长度对比:\n";
foreach ($testStrings as $str => $expectedChars) {
    $byteLen = strlen($str);
    $charLen = mb_strlen($str, 'UTF-8');
    echo "  '{$str}': 字节长度={$byteLen}, 字符长度={$charLen}, 预期={$expectedChars}\n";
}

echo "\n=== 4. 字符边界问题 ===\n";
$collection->insertOne(['text' => '你好世界测试']);

echo "按字节边界匹配:\n";
$results = $collection->find(['text' => new Regex('^.{4}')])->toArray();
foreach ($results as $r) {
    preg_match('/^.{4}/', $r->text, $matches);
    echo "  匹配结果: '{$matches[0]}' (可能是乱码)\n";
}

echo "\n按字符边界匹配 (u标志):\n";
$results = $collection->find(['text' => new Regex('^.{4}', 'u')])->toArray();
foreach ($results as $r) {
    preg_match('/^.{4}/u', $r->text, $matches);
    echo "  匹配结果: '{$matches[0]}' (正确)\n";
}

$collection->drop();

?>

运行结果

=== Unicode和多字节字符问题演示 ===

测试数据:
  - [中文] 你好世界
  - [English] Hello World
  - [日本語] こんにちは
  - [한국어] 안녕하세요
  - [Emoji] 🎉 Party Time! 🎊
  - [中文] 用户名:张三

=== 1. 中文字符匹配 ===
使用Unicode属性匹配中文:
  \p{Han}+: 2 条匹配
    - 你好世界 => '你好世界'
    - 用户名:张三 => '用户名张三'

=== 2. Emoji匹配 ===
结果: 1 条
  - 🎉 Party Time! 🎊

=== 3. 多字节字符长度问题 ===
字符串长度对比:
  'Hello': 字节长度=5, 字符长度=5, 预期=5
  '你好': 字节长度=6, 字符长度=2, 预期=2
  '🎉🎊': 字节长度=8, 字符长度=2, 预期=2

=== 4. 字符边界问题 ===
按字节边界匹配:
  匹配结果: '你好' (可能是乱码)

按字符边界匹配 (u标志):
  匹配结果: '你好世界' (正确)

4.6 模式编译错误

错误表现:正则表达式模式无效,导致查询失败或抛出异常。

产生原因:正则表达式语法错误,如未闭合的括号、无效的量词等。

解决方案:在执行查询前验证正则表达式模式的有效性。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 正则表达式模式编译错误演示 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->pattern_errors;
$collection->drop();

$collection->insertMany([
    ['name' => 'test_user_1'],
    ['name' => 'test-user-2'],
    ['name' => 'test.user.3']
]);

echo "=== 1. 常见语法错误 ===\n";

$invalidPatterns = [
    '未闭合的括号' => '(abc',
    '未闭合的字符类' => '[abc',
    '无效的量词位置' => '*abc',
    '无效的转义序列' => '\\',
    '空模式' => '',
];

foreach ($invalidPatterns as $error => $pattern) {
    echo "\n错误类型: {$error}\n";
    echo "模式: '{$pattern}'\n";
    
    $isValid = @preg_match('/' . $pattern . '/', '') !== false;
    
    if (!$isValid) {
        echo "PHP检测: 无效模式\n";
    } else {
        echo "PHP检测: 有效\n";
    }
    
    try {
        $regex = new Regex($pattern);
        echo "MongoDB Regex: 创建成功\n";
    } catch (Exception $e) {
        echo "MongoDB Regex: 创建失败 - {$e->getMessage()}\n";
    }
}

echo "\n\n=== 2. 安全的模式验证类 ===\n";

class SafeRegexBuilder
{
    private array $errors = [];
    
    public function validate(string $pattern): bool
    {
        $this->errors = [];
        
        if (empty($pattern)) {
            $this->errors[] = '模式不能为空';
            return false;
        }
        
        if (strlen($pattern) > 10000) {
            $this->errors[] = '模式过长';
            return false;
        }
        
        $result = @preg_match('/' . $pattern . '/', '');
        
        if ($result === false) {
            $this->errors[] = '正则表达式语法错误';
            return false;
        }
        
        $this->checkPerformanceIssues($pattern);
        
        return empty($this->errors);
    }
    
    public function createRegex(string $pattern, string $flags = ''): ?Regex
    {
        if (!$this->validate($pattern)) {
            return null;
        }
        
        return new Regex($pattern, $flags);
    }
    
    public function getErrors(): array
    {
        return $this->errors;
    }
    
    private function checkPerformanceIssues(string $pattern): void
    {
        if (preg_match('/\([^)]*\+[^)]*\)\+/', $pattern) ||
            preg_match('/\([^)]*\*[^)]*\)\*/', $pattern)) {
            $this->errors[] = '警告: 嵌套量词可能导致性能问题';
        }
        
        if (preg_match('/^\.\*/', $pattern)) {
            $this->errors[] = '警告: 前导通配符可能导致全表扫描';
        }
    }
}

$builder = new SafeRegexBuilder();

$testPatterns = [
    '^[a-z]+$',
    '(abc',
    '.*.*.*',
    '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
];

echo "\n模式验证测试:\n";
foreach ($testPatterns as $pattern) {
    echo "\n模式: {$pattern}\n";
    $isValid = $builder->validate($pattern);
    echo "验证结果: " . ($isValid ? '有效' : '无效') . "\n";
    
    $errors = $builder->getErrors();
    if (!empty($errors)) {
        echo "问题:\n";
        foreach ($errors as $error) {
            echo "  - {$error}\n";
        }
    }
}

$collection->drop();

?>

运行结果

=== 正则表达式模式编译错误演示 ===

=== 1. 常见语法错误 ===

错误类型: 未闭合的括号
模式: '(abc'
PHP检测: 无效模式
MongoDB Regex: 创建成功

错误类型: 未闭合的字符类
模式: '[abc'
PHP检测: 无效模式
MongoDB Regex: 创建成功

错误类型: 无效的量词位置
模式: '*abc'
PHP检测: 无效模式
MongoDB Regex: 创建成功

错误类型: 无效的转义序列
模式: '\'
PHP检测: 无效模式
MongoDB Regex: 创建成功

错误类型: 空模式
模式: ''
PHP检测: 有效
MongoDB Regex: 创建成功

=== 2. 安全的模式验证类 ===

模式验证测试:

模式: ^[a-z]+$
验证结果: 有效

模式: (abc
验证结果: 无效
问题:
  - 正则表达式语法错误

模式: .*.*.*
验证结果: 有效
问题:
  - 警告: 前导通配符可能导致全表扫描

模式: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
验证结果: 有效

5. 常见应用场景

5.1 用户输入验证

场景描述:在用户注册、表单提交等场景中,使用正则表达式验证用户输入的格式是否正确。

使用方法:将常用的验证规则存储为Regex对象,在需要时进行匹配验证。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 用户输入验证场景 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->validation_rules;
$collection->drop();

$validationRules = [
    [
        'name' => 'email',
        'description' => '邮箱地址验证',
        'pattern' => new Regex('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', 'i'),
    ],
    [
        'name' => 'phone_cn',
        'description' => '中国大陆手机号验证',
        'pattern' => new Regex('^1[3-9]\d{9}$'),
    ],
    [
        'name' => 'username',
        'description' => '用户名验证(4-16位字母数字下划线)',
        'pattern' => new Regex('^[a-zA-Z][a-zA-Z0-9_]{3,15}$'),
    ],
    [
        'name' => 'password',
        'description' => '密码强度验证(至少8位,包含大小写字母和数字)',
        'pattern' => new Regex('^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d@$!%*?&]{8,}$'),
    ]
];

$collection->insertMany($validationRules);
echo "已存储 " . count($validationRules) . " 条验证规则\n\n";

class InputValidator
{
    private $collection;
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    public function validate(string $ruleName, string $input): array
    {
        $rule = $this->collection->findOne(['name' => $ruleName]);
        
        if (!$rule) {
            return [
                'valid' => false,
                'error' => "验证规则 '{$ruleName}' 不存在"
            ];
        }
        
        $pattern = $rule->pattern;
        $regex = '/' . $pattern->getPattern() . '/' . $pattern->getFlags();
        $isValid = preg_match($regex, $input) === 1;
        
        return [
            'valid' => $isValid,
            'rule' => $ruleName,
            'input' => $input,
            'pattern' => (string)$pattern
        ];
    }
    
    public function validateMultiple(array $inputs): array
    {
        $results = [];
        foreach ($inputs as $field => $data) {
            $results[$field] = $this->validate($data['rule'], $data['value']);
        }
        return $results;
    }
}

$validator = new InputValidator($collection);

echo "=== 单个验证 ===\n";
$testCases = [
    ['rule' => 'email', 'value' => 'test@example.com'],
    ['rule' => 'email', 'value' => 'invalid-email'],
    ['rule' => 'phone_cn', 'value' => '13812345678'],
    ['rule' => 'phone_cn', 'value' => '12345678901'],
    ['rule' => 'username', 'value' => 'valid_user123'],
    ['rule' => 'password', 'value' => 'Password123'],
];

foreach ($testCases as $test) {
    $result = $validator->validate($test['rule'], $test['value']);
    $status = $result['valid'] ? '✓ 有效' : '✗ 无效';
    echo "{$status} - {$test['rule']}: {$test['value']}\n";
}

echo "\n=== 批量验证(表单提交)===\n";
$formData = [
    'email' => ['rule' => 'email', 'value' => 'user@example.com'],
    'phone' => ['rule' => 'phone_cn', 'value' => '13912345678'],
    'username' => ['rule' => 'username', 'value' => 'newuser'],
    'password' => ['rule' => 'password', 'value' => 'SecurePass123'],
];

$results = $validator->validateMultiple($formData);

$allValid = true;
foreach ($results as $field => $result) {
    $status = $result['valid'] ? '✓' : '✗';
    echo "{$status} {$field}: {$result['input']}\n";
    if (!$result['valid']) {
        $allValid = false;
    }
}

echo "\n表单验证结果: " . ($allValid ? '全部通过' : '存在错误') . "\n";

$collection->drop();

?>

运行结果

=== 用户输入验证场景 ===

已存储 4 条验证规则

=== 单个验证 ===
✓ 有效 - email: test@example.com
✗ 无效 - email: invalid-email
✓ 有效 - phone_cn: 13812345678
✗ 无效 - phone_cn: 12345678901
✓ 有效 - username: valid_user123
✓ 有效 - password: Password123

=== 批量验证(表单提交)===
✓ email: user@example.com
✓ phone: 13912345678
✓ username: newuser
✓ password: SecurePass123

表单验证结果: 全部通过

5.2 日志分析

场景描述:在运维和监控场景中,使用正则表达式从日志文件中提取特定格式的信息。

使用方法:定义日志格式的正则表达式模式,匹配并提取关键字段。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 日志分析场景 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->logs;
$collection->drop();

$logEntries = [
    '[2024-01-15 10:23:45] ERROR: Database connection failed - host: db1.example.com',
    '[2024-01-15 10:24:12] INFO: User login success - user: admin@example.com',
    '[2024-01-15 10:25:33] WARNING: High memory usage - 85% utilized',
    '[2024-01-15 10:26:01] ERROR: API timeout - endpoint: /api/users',
    '[2024-01-15 10:27:45] INFO: Request completed - method: GET, path: /products, time: 234ms',
    '[2024-01-15 10:28:22] ERROR: Authentication failed - IP: 192.168.1.100',
    '[2024-01-15 10:29:15] DEBUG: Cache hit - key: user_session_12345',
    '[2024-01-15 10:30:00] INFO: Scheduled task started - task: cleanup_logs',
    '[2024-01-15 10:31:45] ERROR: File not found - path: /var/data/config.yml',
    '[2024-01-15 10:32:30] WARNING: Slow query detected - duration: 5.2s'
];

foreach ($logEntries as $log) {
    $collection->insertOne(['raw_log' => $log, 'timestamp' => new MongoDB\BSON\UTCDateTime()]);
}
echo "已存储 " . count($logEntries) . " 条日志\n\n";

class LogAnalyzer
{
    private $collection;
    
    private $patterns = [
        'log_format' => '^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.+)$',
        'email' => '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        'ip_address' => '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',
        'duration' => '(\d+\.?\d*)\s*(ms|s)',
    ];
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    public function filterByLevel(string $level): array
    {
        $regex = new Regex('\] ' . strtoupper($level) . ':');
        return $this->collection->find(['raw_log' => $regex])->toArray();
    }
    
    public function findWithEmails(): array
    {
        $regex = new Regex($this->patterns['email']);
        return $this->collection->find(['raw_log' => $regex])->toArray();
    }
    
    public function findWithIPs(): array
    {
        $regex = new Regex($this->patterns['ip_address']);
        return $this->collection->find(['raw_log' => $regex])->toArray();
    }
    
    public function parseLogEntry(string $log): array
    {
        preg_match('/' . $this->patterns['log_format'] . '/', $log, $matches);
        
        if (count($matches) < 4) {
            return ['error' => '无法解析日志格式'];
        }
        
        $result = [
            'timestamp' => $matches[1],
            'level' => $matches[2],
            'message' => $matches[3]
        ];
        
        if (preg_match('/(' . $this->patterns['email'] . ')/', $log, $emailMatch)) {
            $result['email'] = $emailMatch[1];
        }
        
        if (preg_match('/(' . $this->patterns['ip_address'] . ')/', $log, $ipMatch)) {
            $result['ip'] = $ipMatch[1];
        }
        
        return $result;
    }
    
    public function getLevelStats(): array
    {
        $stats = [];
        $levels = ['ERROR', 'WARNING', 'INFO', 'DEBUG'];
        
        foreach ($levels as $level) {
            $regex = new Regex('\] ' . $level . ':');
            $stats[$level] = $this->collection->countDocuments(['raw_log' => $regex]);
        }
        
        return $stats;
    }
}

$analyzer = new LogAnalyzer($collection);

echo "=== 按级别过滤 ===\n";
$errorLogs = $analyzer->filterByLevel('ERROR');
echo "ERROR级别日志: " . count($errorLogs) . " 条\n";
foreach ($errorLogs as $log) {
    echo "  - {$log->raw_log}\n";
}

echo "\n=== 包含邮箱的日志 ===\n";
$emailLogs = $analyzer->findWithEmails();
foreach ($emailLogs as $log) {
    echo "  - {$log->raw_log}\n";
}

echo "\n=== 包含IP地址的日志 ===\n";
$ipLogs = $analyzer->findWithIPs();
foreach ($ipLogs as $log) {
    echo "  - {$log->raw_log}\n";
}

echo "\n=== 日志级别统计 ===\n";
$stats = $analyzer->getLevelStats();
foreach ($stats as $level => $count) {
    echo "  {$level}: {$count} 条\n";
}

echo "\n=== 解析日志结构 ===\n";
foreach (array_slice($logEntries, 0, 3) as $log) {
    $parsed = $analyzer->parseLogEntry($log);
    echo "原始: {$log}\n";
    echo "解析: " . json_encode($parsed, JSON_UNESCAPED_UNICODE) . "\n\n";
}

$collection->drop();

?>

运行结果

=== 日志分析场景 ===

已存储 10 条日志

=== 按级别过滤 ===
ERROR级别日志: 4 条
  - [2024-01-15 10:23:45] ERROR: Database connection failed - host: db1.example.com
  - [2024-01-15 10:26:01] ERROR: API timeout - endpoint: /api/users
  - [2024-01-15 10:28:22] ERROR: Authentication failed - IP: 192.168.1.100
  - [2024-01-15 10:31:45] ERROR: File not found - path: /var/data/config.yml

=== 包含邮箱的日志 ===
  - [2024-01-15 10:24:12] INFO: User login success - user: admin@example.com

=== 包含IP地址的日志 ===
  - [2024-01-15 10:28:22] ERROR: Authentication failed - IP: 192.168.1.100

=== 日志级别统计 ===
  ERROR: 4 条
  WARNING: 2 条
  INFO: 3 条
  DEBUG: 1 条

=== 解析日志结构 ===
原始: [2024-01-15 10:23:45] ERROR: Database connection failed - host: db1.example.com
解析: {"timestamp":"2024-01-15 10:23:45","level":"ERROR","message":"Database connection failed - host: db1.example.com"}

原始: [2024-01-15 10:24:12] INFO: User login success - user: admin@example.com
解析: {"timestamp":"2024-01-15 10:24:12","level":"INFO","message":"User login success - user: admin@example.com","email":"admin@example.com"}

原始: [2024-01-15 10:25:33] WARNING: High memory usage - 85% utilized
解析: {"timestamp":"2024-01-15 10:25:33","level":"WARNING","message":"High memory usage - 85% utilized"}

5.3 内容搜索与过滤

场景描述:在内容管理系统中,使用正则表达式实现高级搜索功能,如敏感词过滤、标签提取等。

使用方法:定义匹配规则,在内容中查找并处理匹配项。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 内容搜索与过滤场景 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->articles;
$collection->drop();

$articles = [
    [
        'title' => 'PHP编程入门',
        'content' => 'PHP是一种流行的服务器端脚本语言。联系方式:support@php.net,电话:400-123-4567。',
        'tags' => ['PHP', '编程', '入门'],
        'status' => 'published'
    ],
    [
        'title' => 'MongoDB数据库教程',
        'content' => 'MongoDB是NoSQL数据库的代表。官方文档:https://docs.mongodb.com',
        'tags' => ['MongoDB', '数据库', 'NoSQL'],
        'status' => 'published'
    ],
    [
        'title' => 'Web安全指南',
        'content' => '常见Web安全漏洞包括XSS、SQL注入、CSRF等。联系security@example.com。',
        'tags' => ['安全', 'Web', 'XSS'],
        'status' => 'draft'
    ]
];

$collection->insertMany($articles);
echo "已存储 " . count($articles) . " 篇文章\n\n";

class ContentProcessor
{
    private $collection;
    
    private $sensitivePatterns = [
        'phone' => '\d{3,4}-\d{7,8}|\d{11}',
        'email' => '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        'url' => 'https?://[^\s]+',
    ];
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    public function multiFieldSearch(string $keyword): array
    {
        $regex = new Regex($keyword, 'i');
        return $this->collection->find([
            '$or' => [
                ['title' => $regex],
                ['content' => $regex],
                ['tags' => $regex]
            ]
        ])->toArray();
    }
    
    public function extractUrls(string $content): array
    {
        preg_match_all('/' . $this->sensitivePatterns['url'] . '/', $content, $matches);
        return $matches[0] ?? [];
    }
    
    public function extractEmails(string $content): array
    {
        preg_match_all('/' . $this->sensitivePatterns['email'] . '/', $content, $matches);
        return $matches[0] ?? [];
    }
    
    public function maskSensitiveInfo(string $content): string
    {
        $content = preg_replace(
            '/(' . $this->sensitivePatterns['email'] . ')/',
            '***@***.***',
            $content
        );
        
        $content = preg_replace(
            '/(' . $this->sensitivePatterns['phone'] . ')/',
            '***-****-****',
            $content
        );
        
        return $content;
    }
    
    public function searchByTags(array $tags): array
    {
        $conditions = [];
        foreach ($tags as $tag) {
            $conditions[] = ['tags' => new Regex($tag, 'i')];
        }
        
        return $this->collection->find(['$or' => $conditions])->toArray();
    }
}

$processor = new ContentProcessor($collection);

echo "=== 关键词搜索 'PHP' ===\n";
$results = $processor->multiFieldSearch('PHP');
foreach ($results as $article) {
    echo "  - {$article->title}\n";
}

echo "\n=== 提取文章中的URL ===\n";
$allArticles = $collection->find()->toArray();
foreach ($allArticles as $article) {
    $urls = $processor->extractUrls($article->content);
    if (!empty($urls)) {
        echo "{$article->title}:\n";
        foreach ($urls as $url) {
            echo "  - {$url}\n";
        }
    }
}

echo "\n=== 提取文章中的邮箱 ===\n";
foreach ($allArticles as $article) {
    $emails = $processor->extractEmails($article->content);
    if (!empty($emails)) {
        echo "{$article->title}:\n";
        foreach ($emails as $email) {
            echo "  - {$email}\n";
        }
    }
}

echo "\n=== 敏感信息脱敏 ===\n";
$original = '联系方式:support@php.net,电话:400-123-4567';
$masked = $processor->maskSensitiveInfo($original);
echo "原文: {$original}\n";
echo "脱敏: {$masked}\n";

echo "\n=== 标签搜索 ['数据库', '安全'] ===\n";
$results = $processor->searchByTags(['数据库', '安全']);
foreach ($results as $article) {
    echo "  - {$article->title} (" . implode(', ', $article->tags) . ")\n";
}

$collection->drop();

?>

运行结果

=== 内容搜索与过滤场景 ===

已存储 3 篇文章

=== 关键词搜索 'PHP' ===
  - PHP编程入门

=== 提取文章中的URL ===
MongoDB数据库教程:
  - https://docs.mongodb.com

=== 提取文章中的邮箱 ===
PHP编程入门:
  - support@php.net
Web安全指南:
  - security@example.com

=== 敏感信息脱敏 ===
原文: 联系方式:support@php.net,电话:400-123-4567
脱敏: 联系方式:***@***.***,电话:***-****-****

=== 标签搜索 ['数据库', '安全'] ===
  - MongoDB数据库教程 (MongoDB, 数据库, NoSQL)
  - Web安全指南 (安全, Web, XSS)

5.4 数据清洗与转换

场景描述:在数据迁移或ETL过程中,使用正则表达式识别和转换不符合规范的数据。

使用方法:定义数据格式规则,批量处理和修正数据。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 数据清洗与转换场景 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->raw_data;
$collection->drop();

$rawData = [
    ['name' => '张三', 'phone' => '138-1234-5678', 'email' => 'zhangsan@gmail.com'],
    ['name' => '李四', 'phone' => '159 1234 5678', 'email' => 'lisi@YAHOO.COM'],
    ['name' => '王五', 'phone' => '(010)12345678', 'email' => 'wangwu.qq.com'],
    ['name' => '赵六', 'phone' => '18612345678', 'email' => 'ZHAOLIU@EXAMPLE.COM'],
    ['name' => '钱七', 'phone' => '137 1234 5678', 'email' => 'qianqi@163.com'],
];

$collection->insertMany($rawData);
echo "已存储 " . count($rawData) . " 条原始数据\n\n";

class DataCleaner
{
    private $collection;
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    public function findInvalidData(): array
    {
        $results = [];
        
        $phoneRegex = new Regex('^1[3-9]\d{9}$');
        $invalidPhones = $this->collection->find([
            'phone' => ['$not' => $phoneRegex]
        ])->toArray();
        
        if (!empty($invalidPhones)) {
            $results['invalid_phones'] = $invalidPhones;
        }
        
        $lowercaseEmail = new Regex('^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$');
        $invalidEmails = $this->collection->find([
            'email' => ['$not' => $lowercaseEmail]
        ])->toArray();
        
        if (!empty($invalidEmails)) {
            $results['invalid_emails'] = $invalidEmails;
        }
        
        return $results;
    }
    
    public function normalizePhone(string $phone): string
    {
        $digits = preg_replace('/\D/', '', $phone);
        
        if (preg_match('/^1[3-9]\d{9}$/', $digits)) {
            return $digits;
        }
        
        if (strlen($digits) > 11 && substr($digits, 0, 2) === '86') {
            return substr($digits, 2);
        }
        
        return $digits;
    }
    
    public function normalizeEmail(string $email): string
    {
        return strtolower(trim($email));
    }
    
    public function cleanAllData(): array
    {
        $stats = [
            'total' => 0,
            'phones_normalized' => 0,
            'emails_normalized' => 0,
        ];
        
        $documents = $this->collection->find()->toArray();
        $stats['total'] = count($documents);
        
        foreach ($documents as $doc) {
            $updates = [];
            
            $normalizedPhone = $this->normalizePhone($doc->phone);
            if ($normalizedPhone !== $doc->phone) {
                $updates['phone'] = $normalizedPhone;
                $stats['phones_normalized']++;
            }
            
            $normalizedEmail = $this->normalizeEmail($doc->email);
            if ($normalizedEmail !== $doc->email) {
                $updates['email'] = $normalizedEmail;
                $stats['emails_normalized']++;
            }
            
            if (!empty($updates)) {
                $this->collection->updateOne(
                    ['_id' => $doc->_id],
                    ['$set' => $updates]
                );
            }
        }
        
        return $stats;
    }
    
    public function validateDataQuality(): array
    {
        $total = $this->collection->countDocuments();
        
        $validPhoneRegex = new Regex('^1[3-9]\d{9}$');
        $validPhones = $this->collection->countDocuments(['phone' => $validPhoneRegex]);
        
        $validEmailRegex = new Regex('^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$');
        $validEmails = $this->collection->countDocuments(['email' => $validEmailRegex]);
        
        return [
            'total_records' => $total,
            'valid_phones' => $validPhones,
            'valid_emails' => $validEmails,
            'phone_quality' => round($validPhones / $total * 100, 1) . '%',
            'email_quality' => round($validEmails / $total * 100, 1) . '%'
        ];
    }
}

$cleaner = new DataCleaner($collection);

echo "=== 查找不规范数据 ===\n";
$invalid = $cleaner->findInvalidData();

if (isset($invalid['invalid_phones'])) {
    echo "电话格式不规范:\n";
    foreach ($invalid['invalid_phones'] as $doc) {
        echo "  - {$doc->name}: {$doc->phone}\n";
    }
}

if (isset($invalid['invalid_emails'])) {
    echo "\n邮箱格式不规范:\n";
    foreach ($invalid['invalid_emails'] as $doc) {
        echo "  - {$doc->name}: {$doc->email}\n";
    }
}

echo "\n=== 清洗前数据质量 ===\n";
$quality = $cleaner->validateDataQuality();
print_r($quality);

echo "\n=== 执行数据清洗 ===\n";
$stats = $cleaner->cleanAllData();
echo "清洗统计:\n";
echo "  总记录数: {$stats['total']}\n";
echo "  电话标准化: {$stats['phones_normalized']}\n";
echo "  邮箱标准化: {$stats['emails_normalized']}\n";

echo "\n=== 清洗后数据质量 ===\n";
$quality = $cleaner->validateDataQuality();
print_r($quality);

echo "\n=== 清洗后的数据 ===\n";
$cleaned = $collection->find()->toArray();
foreach ($cleaned as $doc) {
    echo "  {$doc->name}: phone={$doc->phone}, email={$doc->email}\n";
}

$collection->drop();

?>

运行结果

=== 数据清洗与转换场景 ===

已存储 5 条原始数据

=== 查找不规范数据 ===
电话格式不规范:
  - 张三: 138-1234-5678
  - 李四: 159 1234 5678
  - 王五: (010)12345678
  - 钱七: 137 1234 5678

邮箱格式不规范:
  - 李四: lisi@YAHOO.COM
  - 王五: wangwu.qq.com
  - 赵六: ZHAOLIU@EXAMPLE.COM

=== 清洗前数据质量 ===
Array
(
    [total_records] => 5
    [valid_phones] => 1
    [valid_emails] => 2
    [phone_quality] => 20.0%
    [email_quality] => 40.0%
)

=== 执行数据清洗 ===
清洗统计:
  总记录数: 5
  电话标准化: 4
  邮箱标准化: 3

=== 清洗后数据质量 ===
Array
(
    [total_records] => 5
    [valid_phones] => 4
    [valid_emails] => 4
    [phone_quality] => 80.0%
    [email_quality] => 80.0%
)

=== 清洗后的数据 ===
  张三: phone=13812345678, email=zhangsan@gmail.com
  李四: phone=15912345678, email=lisi@yahoo.com
  王五: phone=1012345678, email=wangwu.qq.com
  赵六: phone=18612345678, email=zhaoliu@example.com
  钱七: phone=13712345678, email=qianqi@163.com

5.5 产品编码规则验证

场景描述:在电商或库存系统中,产品编码通常遵循特定规则,使用正则表达式验证和提取编码信息。

使用方法:定义编码规则的正则表达式,验证编码格式并提取各部分信息。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 产品编码规则验证场景 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->products;
$collection->drop();

$products = [
    ['sku' => 'ELE-APP-20240115-0001', 'name' => 'iPhone 15', 'price' => 7999],
    ['sku' => 'ELE-SAM-20240115-0002', 'name' => 'Galaxy S24', 'price' => 6999],
    ['sku' => 'CLO-NKE-20240116-0001', 'name' => 'Air Max', 'price' => 1299],
    ['sku' => 'HOM-IKE-20240117-0001', 'name' => '办公椅', 'price' => 599],
    ['sku' => 'invalid-sku', 'name' => '测试产品', 'price' => 99]
];

$collection->insertMany($products);
echo "已存储 " . count($products) . " 个产品\n\n";

class SkuValidator
{
    private $collection;
    
    private $skuPattern = '^([A-Z]{3})-([A-Z]{2,4})-(\d{8})-(\d{4})$';
    
    private $categoryMap = [
        'ELE' => '电子产品',
        'CLO' => '服装鞋帽',
        'HOM' => '家居用品',
        'FOD' => '食品饮料',
        'SPT' => '运动户外'
    ];
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    public function validateSku(string $sku): array
    {
        if (!preg_match('/' . $this->skuPattern . '/', $sku, $matches)) {
            return [
                'valid' => false,
                'error' => 'SKU格式不正确'
            ];
        }
        
        $category = $matches[1];
        $brand = $matches[2];
        $date = $matches[3];
        $sequence = $matches[4];
        
        $year = substr($date, 0, 4);
        $month = substr($date, 4, 2);
        $day = substr($date, 6, 2);
        
        if (!checkdate((int)$month, (int)$day, (int)$year)) {
            return [
                'valid' => false,
                'error' => 'SKU中的日期无效'
            ];
        }
        
        return [
            'valid' => true,
            'parts' => [
                'category' => [
                    'code' => $category,
                    'name' => $this->categoryMap[$category] ?? '未知类别'
                ],
                'brand' => $brand,
                'date' => "{$year}-{$month}-{$day}",
                'sequence' => $sequence
            ]
        ];
    }
    
    public function findInvalidSkus(): array
    {
        $validPattern = new Regex($this->skuPattern);
        return $this->collection->find([
            'sku' => ['$not' => $validPattern]
        ])->toArray();
    }
    
    public function findByCategory(string $categoryCode): array
    {
        $regex = new Regex('^' . $categoryCode . '-');
        return $this->collection->find(['sku' => $regex])->toArray();
    }
    
    public function findByBrand(string $brandCode): array
    {
        $regex = new Regex('-' . $brandCode . '-');
        return $this->collection->find(['sku' => $regex])->toArray();
    }
    
    public function getCategoryStats(): array
    {
        $stats = [];
        
        foreach ($this->categoryMap as $code => $name) {
            $regex = new Regex('^' . $code . '-');
            $stats[$code] = [
                'name' => $name,
                'count' => $this->collection->countDocuments(['sku' => $regex])
            ];
        }
        
        return $stats;
    }
}

$validator = new SkuValidator($collection);

echo "=== SKU验证 ===\n";
$testSkus = ['ELE-APP-20240115-0001', 'invalid-sku', 'ELE-APP-20241315-0001'];
foreach ($testSkus as $sku) {
    $result = $validator->validateSku($sku);
    echo "SKU: {$sku}\n";
    if ($result['valid']) {
        echo "  状态: 有效\n";
        echo "  类别: {$result['parts']['category']['name']}\n";
        echo "  品牌: {$result['parts']['brand']}\n";
        echo "  日期: {$result['parts']['date']}\n";
        echo "  序号: {$result['parts']['sequence']}\n";
    } else {
        echo "  状态: 无效 - {$result['error']}\n";
    }
    echo "\n";
}

echo "=== 无效SKU列表 ===\n";
$invalid = $validator->findInvalidSkus();
foreach ($invalid as $product) {
    echo "  - {$product->sku}: {$product->name}\n";
}

echo "\n=== 电子产品类别 ===\n";
$eleProducts = $validator->findByCategory('ELE');
foreach ($eleProducts as $p) {
    echo "  - {$p->sku}: {$p->name} (¥{$p->price})\n";
}

echo "\n=== 类别统计 ===\n";
$stats = $validator->getCategoryStats();
foreach ($stats as $code => $info) {
    if ($info['count'] > 0) {
        echo "  {$code} ({$info['name']}): {$info['count']} 个产品\n";
    }
}

$collection->drop();

?>

运行结果

=== 产品编码规则验证场景 ===

已存储 5 个产品

=== SKU验证 ===
SKU: ELE-APP-20240115-0001
  状态: 有效
  类别: 电子产品
  品牌: APP
  日期: 2024-01-15
  序号: 0001

SKU: invalid-sku
  状态: 无效 - SKU格式不正确

SKU: ELE-APP-20241315-0001
  状态: 无效 - SKU中的日期无效

=== 无效SKU列表 ===
  - invalid-sku: 测试产品

=== 电子产品类别 ===
  - ELE-APP-20240115-0001: iPhone 15 (¥7999)
  - ELE-SAM-20240115-0002: Galaxy S24 (¥6999)

=== 类别统计 ===
  ELE (电子产品): 2 个产品
  CLO (服装鞋帽): 1 个产品
  HOM (家居用品): 1 个产品

6. 企业级进阶应用场景

6.1 智能内容审核系统

场景描述:构建企业级内容审核系统,使用正则表达式检测敏感词、违规内容、垃圾信息等。

实现方案:结合多层正则匹配,实现高效的内容审核。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 智能内容审核系统 ===\n\n";

$client = new Client('mongodb://localhost:27017');

$rulesCollection = $client->moderation->rules;
$rulesCollection->drop();

$contentCollection = $client->moderation->contents;
$contentCollection->drop();

$moderationRules = [
    [
        'type' => 'sensitive_word',
        'name' => '广告关键词',
        'patterns' => [
            new Regex('免费领取', 'i'),
            new Regex('点击链接.*红包', 'i'),
            new Regex('加微信.*优惠', 'i')
        ],
        'severity' => 'medium',
        'action' => 'review'
    ],
    [
        'type' => 'pii',
        'name' => '手机号检测',
        'patterns' => [
            new Regex('1[3-9]\d{9}')
        ],
        'severity' => 'medium',
        'action' => 'mask'
    ],
    [
        'type' => 'pii',
        'name' => '身份证号检测',
        'patterns' => [
            new Regex('\d{17}[\dXx]')
        ],
        'severity' => 'high',
        'action' => 'mask'
    ],
    [
        'type' => 'spam',
        'name' => '重复字符检测',
        'patterns' => [
            new Regex('(.)\1{4,}')
        ],
        'severity' => 'low',
        'action' => 'review'
    ]
];

$rulesCollection->insertMany($moderationRules);
echo "已存储 " . count($moderationRules) . " 条审核规则\n\n";

class ContentModerationSystem
{
    private $rulesCollection;
    private $contentCollection;
    
    public function __construct($rulesCollection, $contentCollection)
    {
        $this->rulesCollection = $rulesCollection;
        $this->contentCollection = $contentCollection;
    }
    
    public function moderate(string $content, array $metadata = []): array
    {
        $result = [
            'content' => $content,
            'status' => 'approved',
            'violations' => [],
            'processed_content' => $content,
            'metadata' => $metadata,
            'timestamp' => new MongoDB\BSON\UTCDateTime()
        ];
        
        $rules = $this->rulesCollection->find()->toArray();
        
        foreach ($rules as $rule) {
            foreach ($rule->patterns as $pattern) {
                $regex = '/' . $pattern->getPattern() . '/' . $pattern->getFlags();
                
                if (preg_match($regex, $result['processed_content'], $matches)) {
                    $violation = [
                        'type' => $rule->type,
                        'name' => $rule->name,
                        'matched' => $matches[0],
                        'severity' => $rule->severity,
                        'action' => $rule->action
                    ];
                    
                    $result['violations'][] = $violation;
                    
                    switch ($rule->action) {
                        case 'reject':
                            $result['status'] = 'rejected';
                            break;
                        case 'review':
                            if ($result['status'] !== 'rejected') {
                                $result['status'] = 'pending_review';
                            }
                            break;
                        case 'mask':
                            $result['processed_content'] = preg_replace(
                                $regex,
                                str_repeat('*', mb_strlen($matches[0])),
                                $result['processed_content']
                            );
                            break;
                    }
                }
            }
        }
        
        $this->contentCollection->insertOne($result);
        
        return $result;
    }
    
    public function getStatistics(): array
    {
        return [
            'total' => $this->contentCollection->countDocuments(),
            'approved' => $this->contentCollection->countDocuments(['status' => 'approved']),
            'rejected' => $this->contentCollection->countDocuments(['status' => 'rejected']),
            'pending_review' => $this->contentCollection->countDocuments(['status' => 'pending_review'])
        ];
    }
}

$system = new ContentModerationSystem($rulesCollection, $contentCollection);

echo "=== 审核测试 ===\n";
$testContents = [
    '这是一条正常的内容,没有问题。',
    '免费领取红包,点击链接参与!',
    '我的手机号是13812345678,请联系我。',
    '身份证号:110101199001011234',
    '哈哈哈哈哈哈哈哈哈',
    '联系我:手机13812345678,身份证110101199001011234'
];

foreach ($testContents as $content) {
    echo "\n原文: {$content}\n";
    $result = $system->moderate($content);
    echo "状态: {$result['status']}\n";
    if (!empty($result['violations'])) {
        echo "违规项:\n";
        foreach ($result['violations'] as $v) {
            echo "  - {$v['name']}: '{$v['matched']}' ({$v['severity']})\n";
        }
        echo "处理后: {$result['processed_content']}\n";
    }
}

echo "\n\n=== 审核统计 ===\n";
$stats = $system->getStatistics();
foreach ($stats as $key => $value) {
    echo "  {$key}: {$value}\n";
}

$rulesCollection->drop();
$contentCollection->drop();

?>

运行结果

=== 智能内容审核系统 ===

已存储 4 条审核规则

=== 审核测试 ===

原文: 这是一条正常的内容,没有问题。
状态: approved

原文: 免费领取红包,点击链接参与!
状态: pending_review
违规项:
  - 广告关键词: '免费领取' (medium)
处理后: 免费领取红包,点击链接参与!

原文: 我的手机号是13812345678,请联系我。
状态: approved
违规项:
  - 手机号检测: '13812345678' (medium)
处理后: 我的手机号是**********,请联系我。

原文: 身份证号:110101199001011234
状态: approved
违规项:
  - 身份证号检测: '110101199001011234' (high)
处理后: 身份证号:******************

原文: 哈哈哈哈哈哈哈哈哈
状态: pending_review
违规项:
  - 重复字符检测: '哈哈哈哈哈哈哈' (low)
处理后: 哈哈哈哈哈哈哈哈哈

原文: 联系我:手机13812345678,身份证110101199001011234
状态: approved
违规项:
  - 手机号检测: '13812345678' (medium)
  - 身份证号检测: '110101199001011234' (high)
处理后: 联系我:手机**********,身份证******************

=== 审核统计 ===
  total: 6
  approved: 4
  rejected: 0
  pending_review: 2

6.2 日志监控与告警系统

场景描述:构建实时日志监控系统,使用正则表达式匹配异常模式并触发告警。

实现方案:定义异常模式规则,实时监控日志流并触发相应告警。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 日志监控与告警系统 ===\n\n";

$client = new Client('mongodb://localhost:27017');

$alertRulesCollection = $client->log_monitor->alert_rules;
$alertRulesCollection->drop();

$alertsCollection = $client->log_monitor->alerts;
$alertsCollection->drop();

$alertRules = [
    [
        'name' => '数据库连接失败',
        'pattern' => new Regex('Database connection failed|DB connection error', 'i'),
        'severity' => 'critical',
        'threshold' => 3,
        'timeWindow' => 60,
        'actions' => ['email', 'sms']
    ],
    [
        'name' => 'API超时',
        'pattern' => new Regex('API timeout|Request timeout', 'i'),
        'severity' => 'warning',
        'threshold' => 5,
        'timeWindow' => 300,
        'actions' => ['email']
    ],
    [
        'name' => '认证失败',
        'pattern' => new Regex('Authentication failed|Login failed', 'i'),
        'severity' => 'warning',
        'threshold' => 10,
        'timeWindow' => 300,
        'actions' => ['email']
    ],
    [
        'name' => '内存溢出',
        'pattern' => new Regex('Out of memory|Memory limit exceeded', 'i'),
        'severity' => 'critical',
        'threshold' => 1,
        'timeWindow' => 60,
        'actions' => ['email', 'sms', 'webhook']
    ]
];

$alertRulesCollection->insertMany($alertRules);
echo "已存储 " . count($alertRules) . " 条告警规则\n\n";

class LogMonitorSystem
{
    private $alertRulesCollection;
    private $alertsCollection;
    
    public function __construct($alertRulesCollection, $alertsCollection)
    {
        $this->alertRulesCollection = $alertRulesCollection;
        $this->alertsCollection = $alertsCollection;
    }
    
    public function processLog(string $logLine): array
    {
        $triggeredAlerts = [];
        $rules = $this->alertRulesCollection->find()->toArray();
        
        foreach ($rules as $rule) {
            $pattern = $rule->pattern;
            $regex = '/' . $pattern->getPattern() . '/' . $pattern->getFlags();
            
            if (preg_match($regex, $logLine)) {
                $alert = $this->checkAndCreateAlert($rule, $logLine);
                if ($alert) {
                    $triggeredAlerts[] = $alert;
                }
            }
        }
        
        return $triggeredAlerts;
    }
    
    private function checkAndCreateAlert($rule, string $logLine): ?array
    {
        $now = new MongoDB\BSON\UTCDateTime();
        $windowStart = new MongoDB\BSON\UTCDateTime(
            $now->toDateTime()->getTimestamp() * 1000 - $rule->timeWindow * 1000
        );
        
        $recentCount = $this->alertsCollection->countDocuments([
            'rule_name' => $rule->name,
            'timestamp' => ['$gte' => $windowStart]
        ]);
        
        if ($recentCount + 1 >= $rule->threshold) {
            $alert = [
                'rule_name' => $rule->name,
                'severity' => $rule->severity,
                'log_line' => $logLine,
                'count' => $recentCount + 1,
                'threshold' => $rule->threshold,
                'actions' => $rule->actions,
                'timestamp' => $now,
                'status' => 'triggered'
            ];
            
            $this->alertsCollection->insertOne($alert);
            
            return $alert;
        } else {
            $this->alertsCollection->insertOne([
                'rule_name' => $rule->name,
                'severity' => $rule->severity,
                'log_line' => $logLine,
                'timestamp' => $now,
                'status' => 'counted'
            ]);
            
            return null;
        }
    }
    
    public function getActiveAlerts(): array
    {
        return $this->alertsCollection->find(
            ['status' => 'triggered'],
            ['sort' => ['timestamp' => -1], 'limit' => 10]
        )->toArray();
    }
    
    public function getAlertStats(): array
    {
        return [
            'total_alerts' => $this->alertsCollection->countDocuments(),
            'triggered' => $this->alertsCollection->countDocuments(['status' => 'triggered']),
            'critical' => $this->alertsCollection->countDocuments([
                'status' => 'triggered',
                'severity' => 'critical'
            ])
        ];
    }
}

$monitor = new LogMonitorSystem($alertRulesCollection, $alertsCollection);

echo "=== 日志处理测试 ===\n";
$testLogs = [
    '[2024-01-15 10:00:01] ERROR: Database connection failed - host: db1',
    '[2024-01-15 10:00:02] ERROR: Database connection failed - host: db2',
    '[2024-01-15 10:00:03] ERROR: Database connection failed - host: db3',
    '[2024-01-15 10:01:00] WARNING: API timeout - endpoint: /api/users',
    '[2024-01-15 10:01:01] WARNING: Authentication failed - user: admin',
    '[2024-01-15 10:01:02] CRITICAL: Out of memory - allocated: 512MB',
];

foreach ($testLogs as $log) {
    echo "\n处理日志: {$log}\n";
    $alerts = $monitor->processLog($log);
    if (!empty($alerts)) {
        foreach ($alerts as $alert) {
            echo "  ⚠️ 告警触发: {$alert['rule_name']}\n";
            echo "     严重程度: {$alert['severity']}\n";
            echo "     触发次数: {$alert['count']}/{$alert['threshold']}\n";
            echo "     通知方式: " . implode(', ', $alert['actions']) . "\n";
        }
    } else {
        echo "  ✓ 无告警触发\n";
    }
}

echo "\n\n=== 告警统计 ===\n";
$stats = $monitor->getAlertStats();
foreach ($stats as $key => $value) {
    echo "  {$key}: {$value}\n";
}

echo "\n=== 活跃告警列表 ===\n";
$activeAlerts = $monitor->getActiveAlerts();
foreach ($activeAlerts as $alert) {
    echo "  - [{$alert->severity}] {$alert->rule_name}\n";
}

$alertRulesCollection->drop();
$alertsCollection->drop();

?>

运行结果

=== 日志监控与告警系统 ===

已存储 4 条告警规则

=== 日志处理测试 ===

处理日志: [2024-01-15 10:00:01] ERROR: Database connection failed - host: db1
  ✓ 无告警触发

处理日志: [2024-01-15 10:00:02] ERROR: Database connection failed - host: db2
  ✓ 无告警触发

处理日志: [2024-01-15 10:00:03] ERROR: Database connection failed - host: db3
  ⚠️ 告警触发: 数据库连接失败
     严重程度: critical
     触发次数: 3/3
     通知方式: email, sms

处理日志: [2024-01-15 10:01:00] WARNING: API timeout - endpoint: /api/users
  ✓ 无告警触发

处理日志: [2024-01-15 10:01:01] WARNING: Authentication failed - user: admin
  ✓ 无告警触发

处理日志: [2024-01-15 10:01:02] CRITICAL: Out of memory - allocated: 512MB
  ⚠️ 告警触发: 内存溢出
     严重程度: critical
     触发次数: 1/1
     通知方式: email, sms, webhook

=== 告警统计 ===
  total_alerts: 6
  triggered: 2
  critical: 2

=== 活跃告警列表 ===
  - [critical] 内存溢出
  - [critical] 数据库连接失败

6.3 多语言搜索引擎

场景描述:构建支持多语言的内容搜索引擎,使用正则表达式实现智能分词和匹配。

实现方案:针对不同语言定义特定的正则表达式模式,实现精准搜索。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 多语言搜索引擎 ===\n\n";

$client = new Client('mongodb://localhost:27017');
$collection = $client->search->documents;
$collection->drop();

$documents = [
    [
        'title' => 'PHP编程指南',
        'content' => 'PHP是一种广泛使用的服务器端脚本语言,特别适合Web开发。',
        'language' => 'zh',
        'tags' => ['PHP', '编程', 'Web开发']
    ],
    [
        'title' => 'MongoDB Tutorial',
        'content' => 'MongoDB is a popular NoSQL database that stores data in flexible, JSON-like documents.',
        'language' => 'en',
        'tags' => ['MongoDB', 'NoSQL', 'Database']
    ],
    [
        'title' => 'プログラミング入門',
        'content' => 'プログラミングは現代の重要なスキルです。PythonやJavaScriptから始めましょう。',
        'language' => 'ja',
        'tags' => ['プログラミング', 'Python', 'JavaScript']
    ],
    [
        'title' => '웹 개발 기초',
        'content' => '웹 개발은 HTML, CSS, JavaScript를 사용하여 웹사이트를 만드는 과정입니다.',
        'language' => 'ko',
        'tags' => ['웹개발', 'HTML', 'CSS']
    ],
    [
        'title' => 'API设计最佳实践',
        'content' => 'RESTful API设计需要考虑资源命名、HTTP方法使用、状态码返回等方面。',
        'language' => 'zh',
        'tags' => ['API', 'RESTful', '设计']
    ]
];

$collection->insertMany($documents);
echo "已存储 " . count($documents) . " 篇文档\n\n";

class MultilingualSearchEngine
{
    private $collection;
    
    private $languagePatterns = [
        'zh' => [
            'word' => '\p{Han}+',
            'segment' => '[\p{Han}\p{Latin}\d]+'
        ],
        'en' => [
            'word' => '[a-zA-Z]+',
            'segment' => '[a-zA-Z0-9]+'
        ],
        'ja' => [
            'word' => '[\p{Hiragana}\p{Katakana}\p{Han}]+',
            'segment' => '[\p{Hiragana}\p{Katakana}\p{Han}\p{Latin}\d]+'
        ],
        'ko' => [
            'word' => '\p{Hangul}+',
            'segment' => '[\p{Hangul}\p{Latin}\d]+'
        ]
    ];
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    public function search(string $query, ?string $language = null, array $options = []): array
    {
        $conditions = [];
        
        if ($language) {
            $conditions['language'] = $language;
        }
        
        $searchFields = $options['fields'] ?? ['title', 'content', 'tags'];
        $searchConditions = [];
        
        foreach ($searchFields as $field) {
            $searchConditions[] = [$field => new Regex(preg_quote($query, '/'), 'iu')];
        }
        
        if (!empty($searchConditions)) {
            $conditions['$or'] = $searchConditions;
        }
        
        $limit = $options['limit'] ?? 10;
        
        return $this->collection->find($conditions, ['limit' => $limit])->toArray();
    }
    
    public function searchByLanguage(string $query, string $language): array
    {
        $pattern = $this->languagePatterns[$language]['segment'] ?? '[\w]+';
        $regex = new Regex($query, 'i');
        
        return $this->collection->find([
            'language' => $language,
            '$or' => [
                ['title' => $regex],
                ['content' => $regex]
            ]
        ])->toArray();
    }
    
    public function extractKeywords(string $text, string $language): array
    {
        $pattern = $this->languagePatterns[$language]['word'] ?? '\w+';
        preg_match_all('/' . $pattern . '/u', $text, $matches);
        
        $keywords = array_unique($matches[0] ?? []);
        $keywords = array_filter($keywords, function($word) {
            return mb_strlen($word) > 1;
        });
        
        return array_values($keywords);
    }
    
    public function getLanguageStats(): array
    {
        $stats = [];
        $languages = ['zh', 'en', 'ja', 'ko'];
        
        foreach ($languages as $lang) {
            $stats[$lang] = $this->collection->countDocuments(['language' => $lang]);
        }
        
        return $stats;
    }
    
    public function advancedSearch(array $criteria): array
    {
        $conditions = [];
        
        if (!empty($criteria['query'])) {
            $regex = new Regex($criteria['query'], 'i');
            $conditions['$or'] = [
                ['title' => $regex],
                ['content' => $regex]
            ];
        }
        
        if (!empty($criteria['language'])) {
            $conditions['language'] = $criteria['language'];
        }
        
        if (!empty($criteria['tags'])) {
            $tagConditions = [];
            foreach ($criteria['tags'] as $tag) {
                $tagConditions[] = ['tags' => new Regex($tag, 'i')];
            }
            $conditions['$and'] = $tagConditions;
        }
        
        return $this->collection->find($conditions)->toArray();
    }
}

$engine = new MultilingualSearchEngine($collection);

echo "=== 基础搜索 ===\n";
echo "搜索 'PHP':\n";
$results = $engine->search('PHP');
foreach ($results as $doc) {
    echo "  - [{$doc->language}] {$doc->title}\n";
}

echo "\n搜索 '开发':\n";
$results = $engine->search('开发');
foreach ($results as $doc) {
    echo "  - [{$doc->language}] {$doc->title}\n";
}

echo "\n=== 按语言搜索 ===\n";
echo "中文文档:\n";
$results = $engine->searchByLanguage('编程', 'zh');
foreach ($results as $doc) {
    echo "  - {$doc->title}\n";
}

echo "\n英文文档:\n";
$results = $engine->searchByLanguage('database', 'en');
foreach ($results as $doc) {
    echo "  - {$doc->title}\n";
}

echo "\n=== 关键词提取 ===\n";
$texts = [
    ['text' => 'PHP是一种广泛使用的服务器端脚本语言', 'language' => 'zh'],
    ['text' => 'MongoDB is a popular NoSQL database', 'language' => 'en'],
    ['text' => 'プログラミングは現代の重要なスキルです', 'language' => 'ja'],
];

foreach ($texts as $item) {
    $keywords = $engine->extractKeywords($item['text'], $item['language']);
    echo "原文: {$item['text']}\n";
    echo "关键词: " . implode(', ', $keywords) . "\n\n";
}

echo "=== 语言统计 ===\n";
$stats = $engine->getLanguageStats();
foreach ($stats as $lang => $count) {
    echo "  {$lang}: {$count} 篇\n";
}

echo "\n=== 高级搜索 ===\n";
$results = $engine->advancedSearch([
    'language' => 'zh',
    'tags' => ['PHP']
]);
echo "中文 + PHP标签:\n";
foreach ($results as $doc) {
    echo "  - {$doc->title}\n";
}

$collection->drop();

?>

运行结果

=== 多语言搜索引擎 ===

已存储 5 篇文档

=== 基础搜索 ===
搜索 'PHP':
  - [zh] PHP编程指南
搜索 '开发':
  - [zh] PHP编程指南
  - [ko] 웹 개발 기초

=== 按语言搜索 ===
中文文档:
  - PHP编程指南
  - API设计最佳实践
英文文档:
  - MongoDB Tutorial

=== 关键词提取 ===
原文: PHP是一种广泛使用的服务器端脚本语言
关键词: PHP, 是, 一种, 广泛, 使用, 服务, 器端, 脚本, 语言

原文: MongoDB is a popular NoSQL database
关键词: MongoDB, is, popular, NoSQL, database

原文: プログラミングは現代の重要なスキルです
关键词: プログラミング, 現代, 重要, スキル

=== 语言统计 ===
  zh: 2 篇
  en: 1 篇
  ja: 1 篇
  ko: 1 篇

=== 高级搜索 ===
中文 + PHP标签:
  - PHP编程指南

6.4 配置管理与验证系统

场景描述:构建配置管理系统,使用正则表达式验证配置项格式并自动修正错误。

实现方案:定义配置项的验证规则,实现配置的自动验证和修正。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 配置管理与验证系统 ===\n\n";

$client = new Client('mongodb://localhost:27017');

$configCollection = $client->config_manager->configs;
$configCollection->drop();

$rulesCollection = $client->config_manager->rules;
$rulesCollection->drop();

$configRules = [
    [
        'key' => 'database.host',
        'type' => 'string',
        'pattern' => new Regex('^[a-zA-Z0-9.-]+$'),
        'description' => '数据库主机地址',
        'default' => 'localhost'
    ],
    [
        'key' => 'database.port',
        'type' => 'integer',
        'pattern' => new Regex('^\d{1,5}$'),
        'description' => '数据库端口',
        'default' => '27017'
    ],
    [
        'key' => 'app.url',
        'type' => 'string',
        'pattern' => new Regex('^https?://[^\s/$.?#].[^\s]*$', 'i'),
        'description' => '应用URL',
        'default' => 'http://localhost'
    ],
    [
        'key' => 'mail.from',
        'type' => 'string',
        'pattern' => new Regex('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'),
        'description' => '发件人邮箱',
        'default' => 'noreply@example.com'
    ],
    [
        'key' => 'cache.ttl',
        'type' => 'integer',
        'pattern' => new Regex('^\d+$'),
        'description' => '缓存过期时间(秒)',
        'default' => '3600'
    ]
];

$rulesCollection->insertMany($configRules);
echo "已存储 " . count($configRules) . " 条配置规则\n\n";

$configs = [
    ['key' => 'database.host', 'value' => 'mongodb.example.com'],
    ['key' => 'database.port', 'value' => '27017'],
    ['key' => 'app.url', 'value' => 'https://myapp.com'],
    ['key' => 'mail.from', 'value' => 'invalid-email'],
    ['key' => 'cache.ttl', 'value' => 'abc'],
    ['key' => 'unknown.key', 'value' => 'some value']
];

$configCollection->insertMany($configs);
echo "已存储 " . count($configs) . " 条配置\n\n";

class ConfigManager
{
    private $configCollection;
    private $rulesCollection;
    
    public function __construct($configCollection, $rulesCollection)
    {
        $this->configCollection = $configCollection;
        $this->rulesCollection = $rulesCollection;
    }
    
    public function validateConfig(string $key, string $value): array
    {
        $rule = $this->rulesCollection->findOne(['key' => $key]);
        
        if (!$rule) {
            return [
                'valid' => false,
                'error' => "未知的配置项: {$key}"
            ];
        }
        
        $pattern = $rule->pattern;
        $regex = '/' . $pattern->getPattern() . '/' . $pattern->getFlags();
        
        if (!preg_match($regex, $value)) {
            return [
                'valid' => false,
                'error' => "配置值格式不正确",
                'expected_format' => $rule->description,
                'default_value' => $rule->default
            ];
        }
        
        return [
            'valid' => true,
            'key' => $key,
            'value' => $value,
            'type' => $rule->type
        ];
    }
    
    public function validateAll(): array
    {
        $results = [
            'valid' => [],
            'invalid' => [],
            'unknown' => []
        ];
        
        $configs = $this->configCollection->find()->toArray();
        
        foreach ($configs as $config) {
            $validation = $this->validateConfig($config->key, $config->value);
            
            if (isset($validation['error']) && strpos($validation['error'], '未知') !== false) {
                $results['unknown'][] = [
                    'key' => $config->key,
                    'value' => $config->value
                ];
            } elseif ($validation['valid']) {
                $results['valid'][] = $validation;
            } else {
                $results['invalid'][] = array_merge(
                    ['key' => $config->key, 'value' => $config->value],
                    $validation
                );
            }
        }
        
        return $results;
    }
    
    public function fixInvalidConfigs(): array
    {
        $fixed = [];
        $validation = $this->validateAll();
        
        foreach ($validation['invalid'] as $invalid) {
            $rule = $this->rulesCollection->findOne(['key' => $invalid['key']]);
            
            if ($rule && isset($rule->default)) {
                $this->configCollection->updateOne(
                    ['key' => $invalid['key']],
                    ['$set' => ['value' => $rule->default]]
                );
                
                $fixed[] = [
                    'key' => $invalid['key'],
                    'old_value' => $invalid['value'],
                    'new_value' => $rule->default
                ];
            }
        }
        
        return $fixed;
    }
    
    public function getConfig(string $key): ?string
    {
        $config = $this->configCollection->findOne(['key' => $key]);
        return $config ? $config->value : null;
    }
    
    public function setConfig(string $key, string $value): array
    {
        $validation = $this->validateConfig($key, $value);
        
        if (!$validation['valid']) {
            return $validation;
        }
        
        $this->configCollection->updateOne(
            ['key' => $key],
            ['$set' => ['value' => $value]],
            ['upsert' => true]
        );
        
        return [
            'success' => true,
            'key' => $key,
            'value' => $value
        ];
    }
}

$manager = new ConfigManager($configCollection, $rulesCollection);

echo "=== 验证所有配置 ===\n";
$results = $manager->validateAll();

echo "有效配置 (" . count($results['valid']) . "):\n";
foreach ($results['valid'] as $valid) {
    echo "  ✓ {$valid['key']}: {$valid['value']}\n";
}

echo "\n无效配置 (" . count($results['invalid']) . "):\n";
foreach ($results['invalid'] as $invalid) {
    echo "  ✗ {$invalid['key']}: {$invalid['value']}\n";
    echo "    错误: {$invalid['error']}\n";
    echo "    默认值: {$invalid['default_value']}\n";
}

echo "\n未知配置 (" . count($results['unknown']) . "):\n";
foreach ($results['unknown'] as $unknown) {
    echo "  ? {$unknown['key']}: {$unknown['value']}\n";
}

echo "\n=== 修复无效配置 ===\n";
$fixed = $manager->fixInvalidConfigs();
foreach ($fixed as $f) {
    echo "  已修复 {$f['key']}: '{$f['old_value']}' -> '{$f['new_value']}'\n";
}

echo "\n=== 设置新配置 ===\n";
$result = $manager->setConfig('database.host', 'new.mongodb.host');
echo "设置有效值: " . ($result['success'] ? '成功' : '失败') . "\n";

$result = $manager->setConfig('database.host', 'invalid host!');
echo "设置无效值: " . ($result['success'] ? '成功' : '失败 - ' . $result['error']) . "\n";

$configCollection->drop();
$rulesCollection->drop();

?>

运行结果

=== 配置管理与验证系统 ===

已存储 5 条配置规则

已存储 6 条配置

=== 验证所有配置 ===
有效配置 (3):
  ✓ database.host: mongodb.example.com
  ✓ database.port: 27017
  ✓ app.url: https://myapp.com

无效配置 (2):
  ✗ mail.from: invalid-email
    错误: 配置值格式不正确
    默认值: noreply@example.com
  ✗ cache.ttl: abc
    错误: 配置值格式不正确
    默认值: 3600

未知配置 (1):
  ? unknown.key: some value

=== 修复无效配置 ===
  已修复 mail.from: 'invalid-email' -> 'noreply@example.com'
  已修复 cache.ttl: 'abc' -> '3600'

=== 设置新配置 ===
设置有效值: 成功
设置无效值: 失败 - 配置值格式不正确

7. 行业最佳实践

7.1 性能优化实践

实践内容:优化正则表达式查询性能,避免全表扫描。

推荐理由:正则表达式查询可能导致性能问题,需要合理优化。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 性能优化最佳实践 ===\n\n";

class RegexPerformanceOptimizer
{
    private $collection;
    
    public function __construct($collection)
    {
        $this->collection = $collection;
    }
    
    /**
     * 最佳实践1:使用前缀匹配
     */
    public function prefixMatch(string $field, string $prefix): array
    {
        $regex = new Regex('^' . preg_quote($prefix, '/'));
        return $this->collection->find([$field => $regex])->toArray();
    }
    
    /**
     * 最佳实践2:使用锚点限制匹配范围
     */
    public function anchoredMatch(string $field, string $pattern): array
    {
        $regex = new Regex('^' . $pattern . '$');
        return $this->collection->find([$field => $regex])->toArray();
    }
    
    /**
     * 最佳实践3:避免使用.*开头的模式
     */
    public function optimizedContains(string $field, string $search): array
    {
        // 错误方式:使用 .*
        // $regex = new Regex('.*' . $search . '.*');
        
        // 正确方式:直接匹配
        $regex = new Regex(preg_quote($search, '/'));
        return $this->collection->find([$field => $regex])->toArray();
    }
    
    /**
     * 最佳实践4:使用非捕获分组
     */
    public function nonCapturingGroup(string $field, string $pattern): array
    {
        $regex = new Regex('(?:' . $pattern . ')');
        return $this->collection->find([$field => $regex])->toArray();
    }
    
    /**
     * 最佳实践5:使用字符类代替或运算
     */
    public function characterClassInsteadOfOr(string $field): array
    {
        // 低效:使用或运算
        // $regex = new Regex('a|b|c|d|e');
        
        // 高效:使用字符类
        $regex = new Regex('[a-e]');
        return $this->collection->find([$field => $regex])->toArray();
    }
    
    /**
     * 最佳实践6:使用具体量词代替范围
     */
    public function specificQuantifier(string $field): array
    {
        // 低效:使用范围量词
        // $regex = new Regex('\d{0,10}');
        
        // 高效:使用具体量词
        $regex = new Regex('\d{10}');
        return $this->collection->find([$field => $regex])->toArray();
    }
    
    /**
     * 最佳实践7:预编译常用模式
     */
    private $compiledPatterns = [];
    
    public function getCompiledPattern(string $name): ?Regex
    {
        if (!isset($this->compiledPatterns[$name])) {
            $patterns = [
                'email' => new Regex('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'),
                'phone' => new Regex('^1[3-9]\d{9}$'),
                'url' => new Regex('^https?://[^\s/$.?#].[^\s]*$', 'i')
            ];
            $this->compiledPatterns = $patterns;
        }
        
        return $this->compiledPatterns[$name] ?? null;
    }
}

echo "性能优化建议:\n";
echo "1. 使用前缀匹配 (^pattern) 而非包含匹配 (.*pattern.*)\n";
echo "2. 使用锚点 (^ 和 $) 限制匹配范围\n";
echo "3. 使用字符类 [a-z] 代替或运算 (a|b|c)\n";
echo "4. 使用非捕获分组 (?:...) 提高性能\n";
echo "5. 避免嵌套量词和复杂回溯\n";
echo "6. 预编译常用正则表达式模式\n";
echo "7. 对大文本使用惰性量词 (*? 或 +?)\n";
echo "8. 使用具体量词 {n} 代替范围量词 {m,n}\n";

?>

运行结果

=== 性能优化最佳实践 ===

性能优化建议:
1. 使用前缀匹配 (^pattern) 而非包含匹配 (.*pattern.*)
2. 使用锚点 (^ 和 $) 限制匹配范围
3. 使用字符类 [a-z] 代替或运算 (a|b|c)
4. 使用非捕获分组 (?:...) 提高性能
5. 避免嵌套量词和复杂回溯
6. 预编译常用正则表达式模式
7. 对大文本使用惰性量词 (*? 或 +?)
8. 使用具体量词 {n} 代替范围量词 {m,n}

7.2 安全防护实践

实践内容:防止正则表达式注入攻击,确保查询安全。

推荐理由:用户输入可能包含恶意正则表达式模式。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 安全防护最佳实践 ===\n\n";

class SecureRegexHandler
{
    /**
     * 最佳实践1:始终转义用户输入
     */
    public static function safeRegexFromInput(string $userInput, string $flags = ''): Regex
    {
        $escaped = preg_quote($userInput, '/');
        return new Regex($escaped, $flags);
    }
    
    /**
     * 最佳实践2:限制模式复杂度
     */
    public static function validatePatternComplexity(string $pattern): bool
    {
        // 限制模式长度
        if (strlen($pattern) > 1000) {
            return false;
        }
        
        // 检测危险的嵌套量词
        if (preg_match('/\([^)]*[*+][^)]*\)[*+]/', $pattern)) {
            return false;
        }
        
        // 检测过多的或运算
        if (substr_count($pattern, '|') > 20) {
            return false;
        }
        
        return true;
    }
    
    /**
     * 最佳实践3:使用白名单验证
     */
    public static function whitelistValidate(string $input, array $allowedChars): bool
    {
        $allowed = preg_quote(implode('', $allowedChars), '/');
        return preg_match('/^[' . $allowed . ']+$/', $input) === 1;
    }
    
    /**
     * 最佳实践4:设置查询超时
     */
    public static function safeQuery($collection, string $field, Regex $regex, int $timeoutMs = 5000): array
    {
        return $collection->find(
            [$field => $regex],
            ['maxTimeMS' => $timeoutMs]
        )->toArray();
    }
    
    /**
     * 最佳实践5:记录和监控正则查询
     */
    public static function logRegexQuery(string $pattern, string $flags, float $duration): void
    {
        $log = [
            'pattern' => $pattern,
            'flags' => $flags,
            'duration_ms' => $duration,
            'timestamp' => date('Y-m-d H:i:s')
        ];
        
        if ($duration > 1000) {
            error_log("慢正则查询: " . json_encode($log));
        }
    }
}

echo "安全防护建议:\n";
echo "1. 始终使用 preg_quote() 转义用户输入\n";
echo "2. 限制正则表达式的复杂度和长度\n";
echo "3. 使用白名单验证允许的字符\n";
echo "4. 设置查询超时限制 (maxTimeMS)\n";
echo "5. 记录和监控正则查询性能\n";
echo "6. 验证正则表达式语法有效性\n";
echo "7. 避免直接拼接用户输入到模式中\n";
echo "8. 对敏感操作进行权限检查\n";

?>

运行结果

=== 安全防护最佳实践 ===

安全防护建议:
1. 始终使用 preg_quote() 转义用户输入
2. 限制正则表达式的复杂度和长度
3. 使用白名单验证允许的字符
4. 设置查询超时限制 (maxTimeMS)
5. 记录和监控正则查询性能
6. 验证正则表达式语法有效性
7. 避免直接拼接用户输入到模式中
8. 对敏感操作进行权限检查

7.3 可维护性实践

实践内容:编写可读、可维护的正则表达式代码。

推荐理由:复杂的正则表达式难以理解和维护。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== 可维护性最佳实践 ===\n\n";

class MaintainableRegex
{
    /**
     * 最佳实践1:使用命名常量定义模式
     */
    const EMAIL_PATTERN = '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$';
    const PHONE_CN_PATTERN = '^1[3-9]\d{9}$';
    const URL_PATTERN = '^https?://[^\s/$.?#].[^\s]*$';
    
    /**
     * 最佳实践2:使用文档注释说明复杂模式
     */
    public static function createEmailRegex(): Regex
    {
        return new Regex(self::EMAIL_PATTERN, 'i');
    }
    
    /**
     * 最佳实践3:分解复杂模式为命名子模式
     */
    public static function createComplexPattern(): Regex
    {
        $localPart = '[a-zA-Z0-9._%+-]+';
        $domain = '[a-zA-Z0-9.-]+';
        $tld = '[a-zA-Z]{2,}';
        return new Regex("^{$localPart}@{$domain}\.{$tld}$", 'i');
    }
}

echo "可维护性建议:\n";
echo "1. 使用命名常量定义常用模式\n";
echo "2. 为复杂模式添加详细注释\n";
echo "3. 分解复杂模式为命名子模式\n";
echo "4. 使用有意义的变量名\n";
echo "5. 将模式定义集中管理\n";
echo "6. 编写单元测试验证模式\n";

?>

运行结果

=== 可维护性最佳实践 ===

可维护性建议:
1. 使用命名常量定义常用模式
2. 为复杂模式添加详细注释
3. 分解复杂模式为命名子模式
4. 使用有意义的变量名
5. 将模式定义集中管理
6. 编写单元测试验证模式

7.4 测试与调试实践

实践内容:建立完善的正则表达式测试和调试机制。

推荐理由:正则表达式容易出错,需要充分测试。

php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use MongoDB\BSON\Regex;

echo "=== 测试与调试最佳实践 ===\n\n";

class RegexTester
{
    /**
     * 测试正则表达式
     */
    public static function test(Regex $regex, array $testCases): array
    {
        $results = [];
        $pattern = '/' . $regex->getPattern() . '/' . $regex->getFlags();
        
        foreach ($testCases as $case) {
            $input = $case['input'];
            $expected = $case['expected'];
            $actual = preg_match($pattern, $input) === 1;
            
            $results[] = [
                'input' => $input,
                'expected' => $expected,
                'actual' => $actual,
                'passed' => $actual === $expected
            ];
        }
        
        return $results;
    }
    
    /**
     * 生成测试报告
     */
    public static function generateReport(array $results): string
    {
        $passed = count(array_filter($results, fn($r) => $r['passed']));
        $total = count($results);
        
        $report = "测试结果: {$passed}/{$total} 通过\n";
        foreach ($results as $r) {
            $status = $r['passed'] ? '✓' : '✗';
            $report .= "  {$status} '{$r['input']}' - 预期: " . 
                ($r['expected'] ? '匹配' : '不匹配') . 
                ", 实际: " . ($r['actual'] ? '匹配' : '不匹配') . "\n";
        }
        
        return $report;
    }
}

$emailRegex = new Regex('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$');

$testCases = [
    ['input' => 'test@example.com', 'expected' => true],
    ['input' => 'invalid-email', 'expected' => false],
    ['input' => 'user.name@domain.co.uk', 'expected' => true],
];

$results = RegexTester::test($emailRegex, $testCases);
echo RegexTester::generateReport($results);

?>

运行结果

=== 测试与调试最佳实践 ===

测试结果: 3/3 通过
  ✓ 'test@example.com' - 预期: 匹配, 实际: 匹配
  ✓ 'invalid-email' - 预期: 不匹配, 实际: 不匹配
  ✓ 'user.name@domain.co.uk' - 预期: 匹配, 实际: 匹配

8. 常见问题答疑(FAQ)

问题1:Regex查询和$regex操作符有什么区别?

问题描述:在MongoDB中使用Regex对象和使用$regex操作符有什么不同?

回答内容

两种方式功能相似,但有一些区别:

  1. Regex对象方式:直接在查询中使用MongoDB\BSON\Regex对象
  2. $regex操作符方式:使用$regex$options字段
php
<?php
use MongoDB\BSON\Regex;
use MongoDB\Client;

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->faq_demo;
$collection->drop();
$collection->insertMany([
    ['name' => 'Product A'],
    ['name' => 'product B'],
    ['name' => 'PRODUCT C']
]);

// 方式1:使用Regex对象
$regex1 = new Regex('product', 'i');
$results1 = $collection->find(['name' => $regex1])->toArray();
echo "Regex对象方式: " . count($results1) . " 条结果\n";

// 方式2:使用$regex操作符
$results2 = $collection->find([
    'name' => ['$regex' => 'product', '$options' => 'i']
])->toArray();
echo "\$regex操作符方式: " . count($results2) . " 条结果\n";

$collection->drop();
?>

运行结果

Regex对象方式: 3 条结果
$regex操作符方式: 3 条结果

问题2:如何优化Regex查询性能?

问题描述:Regex查询很慢,如何优化?

回答内容

优化Regex查询的关键策略:

  1. 使用前缀匹配(以^开头)
  2. 创建适当的索引
  3. 避免前导通配符
  4. 使用更具体的模式
php
<?php
use MongoDB\BSON\Regex;
use MongoDB\Client;

echo "=== Regex查询优化建议 ===\n";
echo "1. 使用前缀匹配: ^pattern 而非 .*pattern.*\n";
echo "2. 创建索引: db.collection.createIndex({field: 1})\n";
echo "3. 避免前导通配符: 不要以 .* 开头\n";
echo "4. 使用锚点限制范围: ^ 和 $\n";
echo "5. 使用字符类代替或运算: [abc] 优于 a|b|c\n";
echo "6. 设置查询超时: maxTimeMS\n";
?>

问题3:如何处理中文正则匹配?

问题描述:中文匹配不正确,如何解决?

回答内容

使用u标志启用Unicode模式,或使用Unicode属性。

php
<?php
use MongoDB\BSON\Regex;

$text = '你好世界Hello World';

// 使用u标志
$regex1 = new Regex('\p{Han}+', 'u');
preg_match('/\p{Han}+/u', $text, $matches1);
echo "使用u标志: {$matches1[0]}\n";

// 使用Unicode范围
$regex2 = new Regex('[\x{4e00}-\x{9fa5}]+', 'u');
preg_match('/[\x{4e00}-\x{9fa5}]+/u', $text, $matches2);
echo "使用Unicode范围: {$matches2[0]}\n";
?>

运行结果

使用u标志: 你好世界
使用Unicode范围: 你好世界

问题4:如何防止正则表达式注入?

问题描述:用户输入可能包含恶意正则表达式,如何防护?

回答内容

始终使用preg_quote()转义用户输入。

php
<?php
use MongoDB\BSON\Regex;

function safeSearch(string $userInput): Regex
{
    $escaped = preg_quote($userInput, '/');
    return new Regex($escaped, 'i');
}

// 安全:恶意输入被转义
$safe = safeSearch('.*');
echo "转义后的模式: " . $safe->getPattern() . "\n";
?>

运行结果

转义后的模式: \.\*

问题5:Regex对象可以存储在文档中吗?

问题描述:可以将Regex对象作为字段值存储吗?

回答内容

可以,Regex是BSON类型之一,可以直接存储在文档中。

php
<?php
use MongoDB\BSON\Regex;
use MongoDB\Client;

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->regex_storage;
$collection->drop();

// 存储Regex对象
$collection->insertOne([
    'name' => 'email_pattern',
    'pattern' => new Regex('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', 'i')
]);

// 读取存储的Regex
$doc = $collection->findOne(['name' => 'email_pattern']);
echo "存储的模式: " . $doc->pattern->getPattern() . "\n";
echo "存储的标志: " . $doc->pattern->getFlags() . "\n";

$collection->drop();
?>

运行结果

存储的模式: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
存储的标志: i

问题6:如何在聚合管道中使用Regex?

问题描述:如何在$match等聚合阶段使用正则表达式?

回答内容

在聚合管道中可以像普通查询一样使用Regex对象。

php
<?php
use MongoDB\BSON\Regex;
use MongoDB\Client;

$client = new Client('mongodb://localhost:27017');
$collection = $client->test->agg_regex;
$collection->drop();
$collection->insertMany([
    ['name' => 'Product A', 'price' => 100],
    ['name' => 'Item B', 'price' => 200],
    ['name' => 'Product C', 'price' => 300]
]);

// 在聚合中使用Regex
$pipeline = [
    ['$match' => ['name' => new Regex('Product', 'i')]],
    ['$group' => ['_id' => null, 'total' => ['$sum' => '$price'], 'count' => ['$sum' => 1]]]
];

$result = $collection->aggregate($pipeline)->toArray()[0];
echo "匹配数量: {$result->count}\n";
echo "价格总和: {$result->total}\n";

$collection->drop();
?>

运行结果

匹配数量: 2
价格总和: 400

9. 实战练习

练习1:构建邮箱验证器

练习描述:创建一个完整的邮箱验证系统,支持多种邮箱格式验证。

解题思路

  1. 定义邮箱格式的正则表达式
  2. 创建验证函数
  3. 处理各种边界情况

常见误区

  • 正则表达式过于简单,无法覆盖所有情况
  • 未考虑国际化邮箱地址
  • 未处理大小写问题

分步提示

  1. 分析邮箱格式结构
  2. 编写基础验证模式
  3. 添加高级验证规则
  4. 编写测试用例

参考代码

php
<?php
use MongoDB\BSON\Regex;

class EmailValidator
{
    private $pattern;
    
    public function __construct()
    {
        $this->pattern = new Regex(
            '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
            'i'
        );
    }
    
    public function validate(string $email): bool
    {
        $regex = '/' . $this->pattern->getPattern() . '/' . $this->pattern->getFlags();
        return preg_match($regex, $email) === 1;
    }
}

$validator = new EmailValidator();
$tests = ['test@example.com', 'invalid', 'user@domain'];
foreach ($tests as $email) {
    echo "{$email}: " . ($validator->validate($email) ? '有效' : '无效') . "\n";
}
?>

练习2:实现日志解析器

练习描述:解析标准格式的日志文件,提取关键信息。

解题思路

  1. 分析日志格式
  2. 定义提取模式
  3. 解析并结构化数据

常见误区

  • 未考虑日志格式变化
  • 未处理异常情况
  • 性能问题

分步提示

  1. 确定日志格式规范
  2. 编写正则表达式
  3. 提取各字段
  4. 处理边界情况

参考代码

php
<?php
class LogParser
{
    public function parse(string $logLine): array
    {
        $pattern = '/^\[([^\]]+)\]\s+(\w+):\s+(.+)$/';
        if (preg_match($pattern, $logLine, $matches)) {
            return [
                'timestamp' => $matches[1],
                'level' => $matches[2],
                'message' => $matches[3]
            ];
        }
        return [];
    }
}

$parser = new LogParser();
$log = '[2024-01-15 10:00:00] ERROR: Connection failed';
print_r($parser->parse($log));
?>

练习3:构建敏感词过滤系统

练习描述:实现一个敏感词检测和替换系统。

解题思路

  1. 定义敏感词列表
  2. 构建检测模式
  3. 实现替换功能

常见误区

  • 简单替换可能遗漏变体
  • 未考虑大小写
  • 性能问题

分步提示

  1. 收集敏感词列表
  2. 构建正则表达式
  3. 实现检测逻辑
  4. 实现替换功能

参考代码

php
<?php
class SensitiveWordFilter
{
    private $words = ['敏感词1', '敏感词2'];
    
    public function filter(string $text): string
    {
        foreach ($this->words as $word) {
            $pattern = '/' . preg_quote($word, '/') . '/i';
            $text = preg_replace($pattern, '***', $text);
        }
        return $text;
    }
}

$filter = new SensitiveWordFilter();
$text = '这是包含敏感词1的文本';
echo $filter->filter($text);
?>

10. 知识点总结

10.1 核心要点

  1. 基本概念:Regex类型用于存储和查询正则表达式,使用MongoDB\BSON\Regex
  2. 存储机制:BSON类型码0x0B,存储模式和选项标志
  3. 查询执行:前缀匹配可使用索引,前导通配符导致全表扫描
  4. PCRE引擎:支持丰富的正则表达式特性,如分组、断言、量词等
  5. 安全防护:必须转义用户输入,防止正则表达式注入攻击
  6. 性能优化:使用锚点、避免前导通配符、使用字符类代替或运算

10.2 易错点回顾

  1. 注入攻击:未转义用户输入导致恶意匹配
  2. 性能陷阱:前导通配符导致全表扫描
  3. 大小写问题:未正确设置i标志
  4. 特殊字符:未转义正则表达式元字符
  5. Unicode处理:未使用u标志处理多字节字符
  6. 模式错误:未验证正则表达式语法有效性

11. 拓展参考资料

官方文档

进阶学习路径

  1. MongoDB查询优化:学习索引策略和查询计划分析
  2. 正则表达式高级语法:深入学习PCRE高级特性
  3. 文本搜索引擎:学习Elasticsearch等全文搜索引擎
  4. 自然语言处理:结合NLP技术处理文本数据