Removing Irreparable Corrupt Inodes with XFS
Because of its extensive use of tree structures, the XFS advanced journalling filesystem is particularly sensitive to in-memory corruption.
One of the systems I administrate has two XFS filesystems on two separate IDE drives: a single 35GB bootable partition with the root filesystem containing applications, configuration files, system logs and temporary files, and a 150GB partition for backups of other systems done remotely using rsnapshot. This system is a clone, with an AMD Athlon XP 1800+ CPU and 1GB of generic non-ECC RAM.
The root filesystem is consistently corrupted, in seemingly random places, whenever the system is shut down or rebooted, even when this is done properly using the shutdown command. I sent a fairly detailed report to the XFS mailing list when I last observed this problem in August 2004. Today I upgraded the system to Sarge and vanilla Linux kernel 2.6.11.11. Unfortunately, the problem persists, which leads me to believe that my problem is indeed hardware-related.
Thankfully, the backup filesystem is always okay, and corruption to the root filesystem doesn't seem to occur while the system is running: no error 990s are generated while the machine is up, but they appear immediately after a reboot.
The generic error 990s cannot be repaired by xfs_repair. The workaround I have found so far involves using xfs_db to muck with the filesystem's internal structures and delete the corrupted inode, after which I recover the file from backups or reinstall the package to which the file belongs.
The following xfs_db incantation achieves this:
xfs_db -x -c 'inode XXX' -c 'write core.nextents 0' -c 'write core.size 0' /dev/hdXX
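Since the write is destructive, it can help to look at the inode before zeroing it. The following is only a sketch of how I might wrap the two steps: zero_inode_cmds is a hypothetical helper name, not part of xfsprogs, and it merely prints the commands rather than running them. The first printed command opens the device read-only (-r) and uses xfs_db's print command on the two fields; the second is the incantation above with the inode and device filled in. I am assuming the filesystem is unmounted (for a root filesystem, that means working from a rescue boot) before the printed commands are actually run.

```shell
#!/bin/sh
# Hypothetical helper (not part of xfsprogs): prints, but does not run,
# the xfs_db commands for inspecting and then zeroing one inode.
# Usage: zero_inode_cmds INODE DEVICE
zero_inode_cmds() {
    inode=$1
    dev=$2

    # Accept only a plain decimal inode number.
    case $inode in
        ''|*[!0-9]*)
            echo "inode must be a decimal number" >&2
            return 1
            ;;
    esac

    # A device (or image file) argument is required.
    if [ -z "$dev" ]; then
        echo "device is required" >&2
        return 1
    fi

    # Step 1: read-only look at the fields that are about to be overwritten.
    printf "xfs_db -r -c 'inode %s' -c 'print core.nextents core.size' %s\n" \
        "$inode" "$dev"

    # Step 2: the write itself, in expert mode, as given in the text.
    printf "xfs_db -x -c 'inode %s' -c 'write core.nextents 0' -c 'write core.size 0' %s\n" \
        "$inode" "$dev"
}
```

Calling zero_inode_cmds with an inode number and a device prints both commands, so they can be reviewed and pasted one at a time; the inode number itself still has to come from wherever the corruption was reported.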
In a perfect world, I would replace this machine with another one, ideally with branded ECC RAM, but considering that this computer does its job well apart from the post-reboot corruption, the above workaround has let us keep postponing the purchase of new hardware.
