Incident Overview
Our DMP monitoring platform alerted a replication failure with SQL thread stopped. Diagnostic commands revealed:
.png)
SHOW SLAVE STATUS\G;
-- Error: Worker 1 failed executing transaction '44bbb836-...'
-- Error_code: 1197 (max_binlog_cache_size exceeded)
SELECT * FROM performance_schema.replication_applier_status_by_worker;
Critical Finding:
Primary's max_binlog_cache_size
: 10GB
Replica's max_binlog_cache_size
: 10MB
Technical Deep Dive
Key Parameters
max_binlog_cache_size
Maximum total cache for all client binlogs
binlog_cache_size
Per-client binlog cache allocation

Why It Failed:
A multi-statement transaction required >10MB binlog space at replica, while primary allowed 10GB.
Resolution Steps
1. Immediate Fix:
SET GLOBAL max_binlog_cache_size=10240000000; -- Match primary's 10GB
START SLAVE;
2. Prevention Checklist:
- Audit parameter consistency cluster-wide
- Monitor binlog usage trends
- Set alerts for replication errors 1197
Key Takeaways
- Consistency Matters: Always verify parameter parity across replication nodes
- Monitor Proactively: Track binlog growth for large transactions
- Dynamic Adjustment: Know which parameters can be changed online
Critical Warning:
Mismatched binlog settings can silently break replication during large transactions!