An attempt to reproduce the Cloudflare 2025-11-18 outage
Project repository
Deployment steps
s0 - Deploy the ClickHouse cluster and initialize it in a fault-free state

s1 - Deploy the Workers KV cluster, which stores and propagates the configuration file

s3 - Deploy the application site and the proxy services: FL (with bot-manager on and off) and FL2

The user's site is reachable through all three proxy services

s4 - Deploy the batch client that simulates user traffic
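
As a rough illustration of what such a batch client has to track, the sketch below tallies responses per target URL and prints a summary in a format similar to the output shown in the fault demonstration. The `Tally` class and the rule "only HTTP 200 counts as success" are assumptions made for this example, not code taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Running counters for one target URL (hypothetical helper)."""
    success: int = 0
    blocked: int = 0

    @property
    def total(self) -> int:
        return self.success + self.blocked

    def record(self, status_code: int) -> None:
        # Assumption: only HTTP 200 is a success. Anything else (for example
        # a 403 from a bot rule or a 5xx from a crashed proxy) is "blocked".
        if status_code == 200:
            self.success += 1
        else:
            self.blocked += 1

    def report(self, url: str) -> str:
        # Guard against division by zero when nothing was recorded yet.
        rate = 100.0 * self.success / self.total if self.total else 0.0
        return "\n".join([
            f"---- {url} ----",
            f"Success rate: {rate:.2f}%",
            f"Success: {self.success}",
            f"Blocked: {self.blocked}",
            f"Total: {self.total}",
            "-" * 26,
        ])


# Usage: feed in status codes collected by whatever HTTP client is used.
tally = Tally()
for code in [200, 200, 403, 500]:
    tally.record(code)
summary = tally.report("http://proxy-server-fl2:50001/")
print(summary)
```

Keeping the tally separate from the HTTP calls means the same counters can be reused for each of the three proxy endpoints.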

Demonstrating the fault
- Apply the ClickHouse permission change
clickhouse-node1 :) GRANT SELECT ON r0.* TO test_user;
GRANT SELECT ON r0.* TO test_user
Query id: 5e583e6b-2601-4f12-88f3-163f3ac021dc
Ok.
0 rows in set. Elapsed: 0.003 sec.
- Observe the number of entries in the Workers KV cache double
[CK] cache updated, 4 columns
[CK] cache updated, 8 columns
- fl-bot-manager-off is unaffected
- fl-bot-manager-on starts misclassifying requests as bot traffic
- fl2 starts crashing repeatedly


- Client requests fail on a large scale
---- http://proxy-server-fl-bot-manager-off:50001/ ----
Success rate: 100.00%
Success: 116
Blocked: 0
Total: 116
--------------------------
---- http://proxy-server-fl-bot-manager-on:50001/ ----
Success rate: 0.00%
Success: 0
Blocked: 100
Total: 100
--------------------------
---- http://proxy-server-fl2:50001/ ----
Success rate: 0.00%
Success: 0
Blocked: 106
Total: 106
--------------------------
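
The three different outcomes above can be sketched as three ways of consuming the same oversized feature list. This is a simplified model under stated assumptions, not the project's code: the limit `MAX_FEATURES = 5`, the score threshold, the placeholder score, and all function names are hypothetical. The idea is that one proxy never reads the features, one falls back to a bot score of zero when loading fails, and one lets the error propagate and crash.

```python
MAX_FEATURES = 5  # hypothetical fixed limit; the demo's real value is not stated


class FeatureFileTooLarge(Exception):
    """Raised when the feature list exceeds the preallocated limit."""


def load_features(columns, limit=MAX_FEATURES):
    if len(columns) > limit:
        raise FeatureFileTooLarge(f"{len(columns)} features exceed limit {limit}")
    return list(columns)


def handle_fl_off(columns):
    # Bot management disabled: the feature list is never consulted.
    return 200


def handle_fl_on(columns, block_below=30):
    # Assumed behaviour: a failed load yields a bot score of 0, and a
    # customer rule blocks every request scoring below `block_below`.
    try:
        load_features(columns)
        score = 99  # placeholder score for ordinary human traffic
    except FeatureFileTooLarge:
        score = 0
    return 200 if score >= block_below else 403


def handle_fl2(columns):
    # Assumed behaviour: the error is not handled, so the worker crashes.
    load_features(columns)
    return 200


healthy = ["f1", "f2", "f3", "f4"]  # 4 columns, as in the first cache log line
doubled = healthy * 2               # 8 columns after the GRANT

for name, handler in [("off", handle_fl_off), ("on", handle_fl_on), ("fl2", handle_fl2)]:
    try:
        print(name, handler(doubled))
    except FeatureFileTooLarge as exc:
        print(name, "crashed:", exc)
```

With the healthy four-column list all three handlers return 200, which matches the state before the permission change.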
Demonstrating recovery
- Roll back the ClickHouse permission change
clickhouse-node1 :) REVOKE SELECT ON r0.* FROM test_user;
REVOKE SELECT ON r0.* FROM test_user
Query id: 3ebc9b84-5f87-41e1-8fcb-28422bed0dd4
Ok.
0 rows in set. Elapsed: 0.021 sec.
- All requests return to normal
Helo, have a nice day!
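
One way to picture why the GRANT doubles the cache from 4 to 8 columns and the REVOKE restores it: a metadata lookup that filters by table name but not by database returns one row per column per visible database. The sketch below simulates this with an in-memory list. The `default` database name, the `features` table name, and the column names are hypothetical placeholders; only `r0` appears in the commands above.

```python
from typing import NamedTuple, Optional


class ColumnRow(NamedTuple):
    database: str
    table: str
    name: str


FEATURES = ["f1", "f2", "f3", "f4"]  # hypothetical column names

# The same table exists in both databases (names are assumptions).
ALL_ROWS = [
    ColumnRow(db, "features", col)
    for db in ("default", "r0")
    for col in FEATURES
]


def visible_columns(granted_dbs, table, database: Optional[str] = None):
    """Return column names the user can see, optionally pinned to one database."""
    return [
        row.name
        for row in ALL_ROWS
        if row.database in granted_dbs
        and row.table == table
        and (database is None or row.database == database)
    ]


before_grant = visible_columns({"default"}, "features")
after_grant = visible_columns({"default", "r0"}, "features")
after_revoke = visible_columns({"default"}, "features")
pinned = visible_columns({"default", "r0"}, "features", database="default")

print(len(before_grant), len(after_grant), len(after_revoke), len(pinned))
```

Pinning the lookup to a single database keeps the result at four columns regardless of grants, which is an alternative to rolling the permission back.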