Incident Overview
Our DMP monitoring platform alerted a replication failure with SQL thread stopped. Diagnostic commands revealed:
.png)
SHOW SLAVE STATUS\G;
-- Error: Worker 1 failed executing transaction '44bbb836-...'
-- Error_code: 1197 (max_binlog_cache_size exceeded)
SELECT * FROM performance_schema.replication_applier_status_by_worker;
Critical Finding:
Primary's max_binlog_cache_size: 10GB
Replica's max_binlog_cache_size: 10MB
Technical Deep Dive
Key Parameters
max_binlog_cache_sizeMaximum total cache for all client binlogsbinlog_cache_sizePer-client binlog cache allocation

Why It Failed:
A multi-statement transaction required >10MB binlog space at replica, while primary allowed 10GB.
Resolution Steps
1. Immediate Fix:
SET GLOBAL max_binlog_cache_size=10240000000; -- Match primary's 10GB
START SLAVE;
2. Prevention Checklist:
- Audit parameter consistency cluster-wide
- Monitor binlog usage trends
- Set alerts for replication errors 1197
Key Takeaways
- Consistency Matters: Always verify parameter parity across replication nodes
- Monitor Proactively: Track binlog growth for large transactions
- Dynamic Adjustment: Know which parameters can be changed online
Critical Warning:
Mismatched binlog settings can silently break replication during large transactions!

%20(2048%20x%201000%20%E5%83%8F%E7%B4%A0)%20(3).png)

%20(2048%20x%201000%20%E5%83%8F%E7%B4%A0)%20(2).png)
