June 13, 2025

How max_binlog_cache_size Mismatch Broke Our Cluster

A critical case study where mismatched max_binlog_cache_size values between primary and replica caused replication failure, with actionable solutions.

Incident Overview

Our DMP monitoring platform alerted on a replication failure: the SQL thread had stopped. Diagnostic commands revealed:

SHOW SLAVE STATUS\G
-- Last_SQL_Error: Worker 1 failed executing transaction '44bbb836-...'
-- Error_code: 1197 (max_binlog_cache_size exceeded)

SELECT * FROM performance_schema.replication_applier_status_by_worker;

Critical Finding:
Primary's max_binlog_cache_size: 10GB
Replica's max_binlog_cache_size: 10MB
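The mismatch is easy to confirm directly on each node. A minimal check (run on both primary and replica, then compare):

```sql
-- Run on the primary and on the replica; the two values should match.
SELECT @@global.max_binlog_cache_size AS max_binlog_cache_size,
       @@global.binlog_cache_size     AS binlog_cache_size;

-- Equivalent form:
SHOW GLOBAL VARIABLES LIKE '%binlog_cache_size';
```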

Technical Deep Dive

Key Parameters

  • max_binlog_cache_size: the maximum binlog cache a single transaction may consume; a transaction exceeding it fails with error 1197
  • binlog_cache_size: the initial per-session binlog cache; transactions that outgrow it spill to a temporary disk file
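Whether sessions are already overflowing the in-memory cache can be read from the server's status counters (a quick check, assuming MySQL 5.7+):

```sql
-- Binlog_cache_disk_use counts transactions that spilled past
-- binlog_cache_size into a temporary file; a high ratio of
-- disk_use to cache_use suggests the per-session cache is undersized.
SHOW GLOBAL STATUS LIKE 'Binlog_cache%';
```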

Why It Failed:
A multi-statement transaction generated more than 10MB of binlog events. The primary, with its 10GB limit, committed it and wrote it to the binlog; the replica's SQL thread then hit its own 10MB limit while applying the same transaction and stopped with error 1197.
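The failure mode can be reproduced on a throwaway test server by shrinking the limit and running one oversized transaction (an illustrative sketch only; table names are hypothetical, never run this on production):

```sql
-- On a disposable test instance only. 4096 is the variable's minimum.
SET GLOBAL max_binlog_cache_size = 4096;
-- Reconnect so the new limit applies to your session, then:
START TRANSACTION;
INSERT INTO t SELECT * FROM big_table;  -- hypothetical tables
COMMIT;
-- Expect: ERROR 1197 (HY000): Multi-statement transaction required
-- more than 'max_binlog_cache_size' bytes of storage
```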

Resolution Steps

1. Immediate Fix:

SET GLOBAL max_binlog_cache_size=10737418240; -- Match primary's 10GB (10 * 1024^3 bytes)
START SLAVE;
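SET GLOBAL only lasts until the next restart. On MySQL 8.0+ the change can be persisted in one statement; on older versions, pin it in the config file (a sketch):

```sql
-- MySQL 8.0+: persists across restarts without editing my.cnf
SET PERSIST max_binlog_cache_size = 10737418240;

-- Pre-8.0: add to my.cnf under [mysqld] instead:
--   max_binlog_cache_size = 10G
```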

2. Prevention Checklist:

  • Audit parameter consistency cluster-wide
  • Monitor binlog usage trends
  • Set alerts for replication error 1197
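The audit item above can be scripted by pulling replication-relevant variables from performance_schema on every node and diffing the results (assumes MySQL 5.7+; extend the variable list to taste):

```sql
-- Run on each node and compare the output across the cluster.
SELECT variable_name, variable_value
FROM performance_schema.global_variables
WHERE variable_name IN ('max_binlog_cache_size',
                        'binlog_cache_size',
                        'binlog_format',
                        'max_allowed_packet')
ORDER BY variable_name;
```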

Key Takeaways

  1. Consistency Matters: Always verify parameter parity across replication nodes
  2. Monitor Proactively: Track binlog growth for large transactions
  3. Dynamic Adjustment: Know which parameters can be changed online

Critical Warning:
Mismatched binlog settings can silently break replication during large transactions!
