Percona XtraDB Cluster (PXC) removed the wsrep_group_commit_queue component in version 8.0.41 because of multiple bugs that caused deadlocks. As a DBA who has handled numerous crash and hang tickets, I have identified this component as the root cause of many of them. Here is an analysis of one such case.
Symptom Description
The customer ran two PXC clusters in different data centers, synchronized via MySQL native replication, with applications accessing their local nodes. [Figure: architecture diagram]

During peak traffic, multiple PXC nodes frequently hung, stopped processing writes, and then crashed once a semaphore wait had lasted more than 600 seconds.
Error Log Analysis
MySQL's error.log showed:
2025-07-07T05:10:25.772284Z 0 [ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds...
InnoDB's SEMAPHORES section revealed two threads waiting for an RW-latch held by pthread ID 140395374548736.
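When triaging several nodes, it can help to pull these fatal lines out of each error log automatically. The following is a minimal Python sketch, assuming the log line format shown above; the function name `find_fatal_waits` is hypothetical and not part of any MySQL or Percona tooling.

```python
import re
from datetime import datetime

# Matches the fatal InnoDB semaphore-wait line quoted above (MY-012872).
FATAL_WAIT = re.compile(
    r"^(?P<ts>\S+Z) \d+ \[ERROR\] \[MY-012872\] \[InnoDB\] \[FATAL\] "
    r"Semaphore wait has lasted > (?P<secs>\d+) seconds"
)

def find_fatal_waits(lines):
    """Return (timestamp, seconds) for every fatal semaphore-wait line."""
    hits = []
    for line in lines:
        m = FATAL_WAIT.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
            hits.append((ts, int(m.group("secs"))))
    return hits

# The single line quoted in this post, used as sample input.
sample = [
    "2025-07-07T05:10:25.772284Z 0 [ERROR] [MY-012872] [InnoDB] [FATAL] "
    "Semaphore wait has lasted > 600 seconds...",
]
for ts, secs in find_fatal_waits(sample):
    print(ts.isoformat(), secs)
```

In practice the `lines` argument would be an open file handle on the node's error log.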
Core File Analysis
GDB identified the hanging thread (thd1) waiting at wsrep_wait_for_turn_in_group_commit (line 468) for COND_wsrep_group_commit. The queue (wsrep_group_commit_queue) contained four THDs, with thd2 (ID 42) being the front element.
Deadlock Pattern
- thd1 waits for thd2 via COND_wsrep_group_commit
- thd2 waits for thd3 (the real leader) via m_stage_cond_binlog
- thd3 (leader) should remove thd2 from the queue but failed due to an empty binlog event
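The pattern above can be written down as a small wait-for graph. The sketch below is only an illustration of how the chain is read: the dictionary is transcribed from the three bullets, and `wait_chain` is a hypothetical helper, not something GDB or PXC provides.

```python
# Who each thread is waiting for, and on which condition variable.
# thd3 has no entry: per the analysis above it is not blocked on another thread.
WAITS_ON = {
    "thd1": ("thd2", "COND_wsrep_group_commit"),
    "thd2": ("thd3", "m_stage_cond_binlog"),
}

def wait_chain(start, waits_on):
    """Follow wait edges from `start`; return (path, has_cycle)."""
    path, seen = [start], {start}
    current = start
    while current in waits_on:
        current = waits_on[current][0]
        if current in seen:
            # We came back to a thread already on the path: a wait cycle.
            return path + [current], True
        path.append(current)
        seen.add(current)
    return path, False

chain, cyclic = wait_chain("thd1", WAITS_ON)
print(" -> ".join(chain), "| cycle:", cyclic)
```

On this reading the chain ends at thd3 without forming a cycle in the condition-variable waits, which matches the description: the hang comes from the leader never dequeuing thd2, so thd1's turn never arrives.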
Root Cause
When the replication SQL thread applies a no-op update (e.g., UPDATE t SET d=d WHERE id=1), it skips innobase_commit, so its entry is left stuck in wsrep_group_commit_queue. This blocks every subsequent transaction queued behind it.
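The effect of a skipped dequeue can be shown with a toy model. This is a deliberately simplified sketch, assuming only what is described above (an ordered queue whose entries are removed on the commit path); `ToyCommitQueue` and its methods are hypothetical and do not mirror the real PXC code.

```python
from collections import deque

class ToyCommitQueue:
    """Toy stand-in for an ordered commit queue; not the PXC implementation."""

    def __init__(self):
        self.queue = deque()
        self.committed = []

    def enqueue(self, thd):
        self.queue.append(thd)

    def try_commit(self, thd, is_noop=False):
        """Let `thd` finish if it is at the front; return True if it could proceed."""
        if not self.queue or self.queue[0] != thd:
            return False              # not this thread's turn yet
        if is_noop:
            return True               # commit hook skipped: the entry is never popped
        self.queue.popleft()          # normal path: the commit hook dequeues the entry
        self.committed.append(thd)
        return True

q = ToyCommitQueue()
for thd in ("sql_thread", "app1", "app2"):
    q.enqueue(thd)

q.try_commit("sql_thread", is_noop=True)   # finishes, but stays at the front
blocked = [t for t in ("app1", "app2") if not q.try_commit(t)]
print("blocked:", blocked, "| queue:", list(q.queue))
```

With `is_noop=False` the same sequence drains normally, which is why the problem only shows up when a no-op transaction passes through the queue.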
Reproduction Steps
- Deploy two PXC clusters linked by MySQL native replication
- Create conflicting updates:
-- On Primary
UPDATE test.t SET d=1 WHERE id=1;
-- On Secondary
UPDATE test.t SET d=1 WHERE id=1; -- Creates empty binlog event
- Stress with sysbench:
sysbench oltp_insert --threads=10 --time=600 --tables=1 --table-size=1000 \
--mysql-host=127.0.0.1 --mysql-port=10001 run
TPS drops to 0 once the deadlock occurs.
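To spot the moment TPS hits 0 without watching the terminal, the sysbench output can be scanned for consecutive zero-TPS intervals. This sketch assumes sysbench is run with the additional `--report-interval=1` option (not part of the command above) so that it prints per-interval lines; the sample lines are made-up illustrations of that format, not captured output, and `first_stall` is a hypothetical helper.

```python
import re

# Per-interval report line, e.g. "[ 3s ] thds: 10 tps: 0.00 qps: 0.00 ..."
INTERVAL = re.compile(r"^\[\s*(?P<sec>\d+)s\s*\].*?\btps:\s*(?P<tps>[\d.]+)")

def first_stall(lines, min_zero_intervals=3):
    """Return the second at which TPS first stays at 0 for N consecutive reports."""
    run_start, run_len = None, 0
    for line in lines:
        m = INTERVAL.match(line)
        if not m:
            continue                      # skip banners and summary lines
        if float(m.group("tps")) == 0.0:
            if run_len == 0:
                run_start = int(m.group("sec"))
            run_len += 1
            if run_len >= min_zero_intervals:
                return run_start
        else:
            run_len = 0                   # throughput recovered, reset the run
    return None

# Illustrative input only (values are invented for the example).
sample_report = [
    "[ 1s ] thds: 10 tps: 350.10 qps: 350.10 (r/w/o: 0.00/350.10/0.00) lat (ms,95%): 40.37 err/s: 0.00 reconn/s: 0.00",
    "[ 2s ] thds: 10 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00",
    "[ 3s ] thds: 10 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00",
    "[ 4s ] thds: 10 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00",
]
print("stall began at second:", first_stall(sample_report))
```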
Solution
Percona fixed this in 8.0.41 by completely removing the problematic wsrep_group_commit_queue component.
Key Takeaways
- The deadlock requires:
a) Replication applying no-op updates
b) Concurrent group commits
c) Busy workload
- Understanding usage scenarios helps reproduce elusive bugs.