26 May 2005

Pull My Plug

XFS and the Plug-Pulling Test

On Tuesday, May 24, I went to a client to help them run a unique burn test on the Debian GNU/Linux setup I had designed for their upcoming Point of Sale (POS) systems. Our objective was to see how resilient the new setup would be against interruptions caused by being turned off improperly.

These systems feature the XFS advanced journalling filesystem and Linux kernel 2.4.30. We had two POS machines available for testing: a standard IBM SureOne, and a cheaper IBM SurePOS 300. Both machines have VIA chipsets (VT8601, VT82C686), VIA C3 processors, and Western Digital Caviar WD400BB hard drives.

The test is simple, but interesting. It involves running Brad Fitzpatrick's diskchecker.pl utility to keep each system's drives and filesystems busy writing data to disk, pulling the plug on both machines, starting up again, then running diskchecker.pl in verify mode to check the data against a record on a separate, uninterrupted server. Rinse. Lather. Repeat.
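To make the verify step concrete, here is a minimal Python sketch of the idea. It is not diskchecker.pl's actual code: the block size, the `record` dictionary (standing in for what the uninterrupted server remembers), and the function names are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this sketch


def checksum(data: bytes) -> str:
    """Return a hex digest identifying the contents of one block."""
    return hashlib.sha256(data).hexdigest()


def verify(path, record):
    """Check a test file against what the server recorded.

    `record` maps block index -> checksum for every block the client
    reported as safely written before the plug was pulled. Returns the
    sorted list of block indexes whose on-disk contents do not match.
    """
    try:
        f = open(path, "rb")
    except FileNotFoundError:
        # The whole file vanished, so every recorded block is lost.
        return sorted(record)
    bad = []
    with f:
        for index in sorted(record):
            f.seek(index * BLOCK_SIZE)
            data = f.read(BLOCK_SIZE)  # short or empty read if truncated
            if checksum(data) != record[index]:
                bad.append(index)
    return bad
```

An empty list means every block the server knew about survived the power cut. Anything else is a block that was reported as written but is missing or damaged on disk.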


Why this test? We're testing both machines for stores too small to warrant having a separate, UPS-backed server running PostgreSQL. These POS units run everything, from PostgreSQL to the POS back-end and front-end software.

The tests are perfect for us since PostgreSQL uses fsync() to make sure data gets written to disk. It's also worth mentioning that part of the standardized installation for these units includes running hdparm to disable the drives' write cache, which is enabled by default.
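For readers unfamiliar with fsync(), here is a minimal Python sketch of the write-then-fsync pattern. It is an illustration only, not PostgreSQL's implementation, and the function name `durable_append` is made up for this example. Note that fsync() can only push data as far as the drive: a drive with its write cache enabled may report success before the data is actually on the platters, which is why we disable the cache.

```python
import os


def durable_append(path, payload: bytes) -> int:
    """Append payload to path and return only after fsync() succeeds.

    Hypothetical helper for illustration. The caller should treat the
    data as "written" only once this function has returned.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        written = 0
        # os.write() may write fewer bytes than asked, so loop until done.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # Ask the kernel to flush its buffers for this file to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

Only after `durable_append()` returns would a test client tell the server that the data is on disk.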

So I'm back here today, Thursday, to check on the results, and they're pretty pleasing. The tests have been running continuously since Tuesday, with the staff pulling plugs arbitrarily at least a couple of times a day, plus at least 10 times by me on Tuesday. We've only had two runs with errors. One seems to have been caused by diskchecker.pl's difficulty with handling user interruption via Ctrl+C. The cause of the other is unconfirmed, but it was possibly a network problem when the power supply of the small switch connecting the test boxes and the server was disconnected.

We've not had any kernel panic situations, and have had only one minor filesystem consistency problem with a disconnected inode (whose file was empty, for whatever it's worth) that a manual run of xfs_repair placed safely in lost+found.

Hooray for XFS and GNU/Linux!!!

We're continuing tests through the week, to see how well the hardware takes it. These POS units are pretty sturdy, kudos to IBM, and should hold up without breaking a sweat.

Posted by Federico Sevilla III at 13:53 | Comments (3)
Comments
Re: Pull My Plug

The hardware tests? All good by this time? :)

Posted by: Clair at June 01,2005 13:42
Re: Pull My Plug

So far, so good. We replaced the diskchecker.pl test with a custom Python script I wrote that continuously writes timestamp data concurrently to a PostgreSQL table on a local database and another on a control machine. The systems have held up equally well, although the diskchecker.pl test stressed the hard drives more than the PostgreSQL tests did.

We ran the tests on some older clones, though, which didn't hold up as well as the IBM SureOne and the IBM SurePOS 300.

Posted by: Federico Sevilla III at June 02,2005 10:57
Re: Pull My Plug

Well at least things look ok :)

Posted by: Clair at June 03,2005 13:32