Introduction
During database operations, whether for migration, scaling, or production deployment, encountering situations where the replica falls behind and struggles to catch up is almost inevitable. These issues can be quite frustrating.
Here are a few reasons I've personally encountered:
buffer_poolset too large, causing MySQL to use SWAP.- Backups causing the SQL thread replay to wait on Metadata Lock (MDL).
- Large transactions.
- Slow queries degrading replica performance.
- Parallel replication causing the replication lag monitoring to stick at 1 second.
- Network issues.
Specific circumstances require specific analysis, which we won't elaborate on here.
If conventional troubleshooting hasn't resolved the issue, below are some parameter adjustment ideas based on my experience for your reference.
Idea 1: Sync-Related Parameters
When catching up on lag, we can adjust the following parameters to improve log disk flushing efficiency. These can be reverted after the replica catches up.
sync_binlog=0
sync_master_info=10000 #default
sync_relay_log=10000 #default
sync_relay_log_info=10000 #default
Idea 2: Buffer and Concurrency Related Parameters
Consider increasing the innodb_buffer_pool_sizeso that the SQL thread can execute replay faster.
If memory space is tight, you can appropriately adjust the innodb_change_buffer_max_sizepercentage (this is feasible only if there are no read operations; typically, lagging replicas have no active business connections).
innodb_buffer_pool_size=24G #24*1024*1024*1024
innodb_change_buffer_max_size=50
innodb_thread_concurrency=0
innodb_adaptive_hash_index=0
Risks of increasing innodb_buffer_pool_size:
- Excessive memory allocation can trigger SWAP or Out-Of-Memory (OOM). Sufficient memory must be reserved for the operating system and other MySQL components. It's recommended that the buffer pool not exceed 70% of physical memory.
- Adjustments should be made incrementally, monitoring system memory closely, to avoid setting it too high in one step.
Idea 3: Replica-Related Parameters
Consider Enabling Parallel Replication
Enable parallel replication. For versions 8.0+, consider using WRITESET-based dependency tracking. Monitor the replication threads; if they are insufficient, consider increasing the number of worker threads. However, it's not advisable to exceed the number of CPU cores or the value of the innodb_thread_concurrencyparameter.
replica_preserve_commit_order adds a layer of locking to preserve commit order. When chasing low latency, it's recommended to turn this OFF; it can be re-enabled after the replica is caught up and back online.
replica_parallel_type=LOGICAL_CLOCK
replica_parallel_workers=16
replica_preserve_commit_order=OFF
I personally do not recommend modifying the following parameters. While they might boost performance, they can also cause commits on the primary to wait. However, adjustments might be acceptable for some non-real-time business scenarios.
The two parameters related to binary log group commit are:
binlog_group_commit_sync_delay: Delays the fsync disk write by a specified number of microseconds.binlog_group_commit_sync_no_delay_count: Specifies the number of transactions to accumulate before performing an fsync.
Consider Disabling log_replica_updates
Disabling log_replica_updatestypically requires a restart to take effect. However, experienced users can modify it using gdb without a full restart, only a restart of the replication threads is needed. If you lack gdb experience, this is not recommended as it could lead to a crash.
Also, understand your architecture. If there is no binary log backup or cascading replicas, and no business connections directly reading the binary log, it can likely be safely disabled.
Idea 4: For MGR Architectures
In Group Replication (MGR) architectures, consider temporarily switching to asynchronous replication and turning off replica_preserve_commit_order. Once the lag is eliminated, you can re-join the cluster.
Idea 5: General Performance Parameters
Other performance parameters, if set according to standard templates, generally shouldn't cause major issues. After applying the above adjustments, the replication lag will most likely decrease, and reducing it to zero is just a matter of time.
Appendix: Parallel Replication Backlog Statistics Interpretation
Example log output:
2021-01-10T16:08:39.947611+08:00 85441 [Note] Multi-threaded slave statistics for channel '';seconds elapsed = 120;events assigned = 4005889;worker queues filled over overrun level = 0;waited due a Worker queue full = 0;waited due the total size = 0;waited at clock conflicts = 6918018179200 waited (count) when Workers occupied = 0 waited when Workers occupied = 0Interpretation of the key metrics:
seconds elapsed = 120: Output interval (every 120 seconds).events assigned = 4005889: Total number of events assigned for execution.worker queues filled over overrun level = 0: Number of times a worker's private queue length exceeded the threshold.waited due a Worker queue full = 0: Number of waits caused by a full worker queue.waited due the total size = 0: Number of times the total size limit was exceeded.waited at clock conflicts = 6918018179200: Wait time (in nanoseconds) due to logical clock conflicts.waited (count) when Workers occupied = 0: Total count of waits because all workers were busy.waited when Workers occupied = 0: Total time waited (in nanoseconds) because all workers were busy.

%20(2048%20x%201000%20%E5%83%8F%E7%B4%A0)%20(3).png)

%20(2048%20x%201000%20%E5%83%8F%E7%B4%A0)%20(2).png)
