Background
A MySQL 8.0 upgrade test environment encountered corrupted backups generated by daily XtraBackup routines. The compressed tarball (xxx.tar.gz
) failed decompression midway with errors:
gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Despite no alert notifications, subsequent tests with older backups replicated the issue, highlighting the risk of undetected corruption.
Key Questions
- Why did valid backup scripts produce corrupted archives?
- How to ensure backup integrity when automated processes fail silently?
Root Cause Analysis
The backup script used xtrabackup
with streaming compression:
xtrabackup ... --stream=tar ... | gzip > xxx.tar.gz
Hidden Issue:
- Concurrent Cron Jobs: System logs revealed two simultaneous
crond
processes executing the backup script at the same time:
Mar 6 00:00:01 localhost CROND[6212]: (root) CMD (sh backup.sh)
Mar 6 00:00:01 localhost CROND[6229]: (root) CMD (sh backup.sh)
- Data Overwrite: Both instances wrote to the same
xxx.tar.gz
file, corrupting the archive through overlapping writes.
Why No Alert?
- The script checked
xtrabackup
exit codes and log keywords likecompleted OK
, but failed to detect concurrent execution. - Compression errors only surfaced during decompression, outside the script’s validation scope.
Solutions
1. Add Process Mutual Exclusion: Use flock
to ensure single-instance execution:
flock -n /tmp/backup.lock sh backup.sh || echo "Backup already running"
2. Implement Recovery Drills: Regularly test backups by restoring to a staging environment.
3. Monitor Cron Health: Add alerts for abnormal crond
process counts:
# Check crond processes (expected count: 1)
if [ "$(pgrep crond | wc -l)" -ne 1 ]; then
echo "CRITICAL: Multiple crond processes detected!" | mail -s "Backup Alert" admin@example.com
fi
Recommendations
- Validate Backups Proactively: Automate integrity checks (e.g.,
xz -t
,mysqlbackup verify
). - Avoid Overlapping Jobs: Use dedicated cron schedules for critical tasks.
- Log Correlation: Centralize cron logs with tools like ELK Stack for anomaly detection.
Conclusion
This incident underscores the importance of validating backup workflows beyond superficial success checks. By combining process isolation, recovery testing, and proactive monitoring, teams can mitigate silent failures and ensure data resilience.