Tuesday, December 7, 2010

Weird Linux filesystem problems

Like most weird problems, you only start to notice it when something behaves oddly or slower than usual. The same was true in this case. It started with me noticing my internet connection lagging like hell, and as always the first thought was that someone in my family was doing some excessive down- or uploading. So I fired up the router configuration interface of my Tomato-powered Linksys WRT54GL and wanted to check the current transfer speeds.

However, my first assumption was wrong. The transfer speeds were relatively low, albeit there was a constant up- and download going on. The router interface felt really slow, so the next thing I checked was the router logs. And what I saw there didn't please me. There were a lot of log messages that looked like these (I removed the timestamps):

user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: NET: 6 messages suppressed.


I didn't like that at all. Not knowing any better, my first thought was that some kind of malicious software or outside attacker was responsible for these connections. A quick check of the connection table proved me wrong once again (damn those sci-fi fantasies!). It was all inbound connections originating from the same IP. The domain was very well known to me: it was the address of another router of mine, which occasionally connects to transfer rstats data via SMB tunneled over SSH.
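For reference, this is roughly how you can inspect the conntrack table on a Tomato (or any older Linux 2.4/2.6) router over SSH; the exact /proc paths may differ depending on kernel and netfilter version, and the IP is just an example:

# count the tracked connections and compare against the limit
wc -l /proc/net/ip_conntrack
cat /proc/sys/net/ipv4/ip_conntrack_max
# list only the entries coming from one suspicious source address
grep 192.168.1.2 /proc/net/ip_conntrack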

So let's check the remote router via SSH. It showed much the same behaviour: constant up- and downloads, albeit on a low level. I'll spare you the log excerpt as it looked the same as on the other router. So what was causing the connection tables to fill up? The next thing I checked was the Linux box, which should provide me with more insight, as the local router forwards its log messages to it while only retaining a small part of them on the router itself due to space constraints. The same Linux box is also where the rstats data is stored via tunneled SMB, as outlined earlier. However, it wasn't possible to connect to that server. I double-checked everything, though there wasn't a whole lot that could have gone wrong as it was on the local network.
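Just to illustrate the kind of setup I mean (the host name, share name, user and paths here are made up for this sketch, not my actual configuration), such an SMB-over-SSH transfer boils down to something like this:

# forward a local port to the SMB port on the file server through SSH
ssh -f -N -L 1139:127.0.0.1:139 backup@fileserver
# push the rstats data through the tunnel with smbclient
smbclient //127.0.0.1/backup -p 1139 -U backup -c "put /var/lib/rstats rstats"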

I also tried to directly access the Linux machine, though it wouldn't react to keyboard input. So as a last resort I simply turned it off and on again (somehow that reminds me of The IT Crowd). On the next boot I was greeted with even more problems. Fsck started, as this machine hadn't been checked for quite some time, and it reported file system errors. Oddly it hung at around 8%, so that wasn't a lot of help either. I grabbed a Gentoo LiveCD and started fsck from there. I got a lot of „error reading block <n>, force rewrite?“ messages (where <n> is a block number). I had no other choice than to confirm, considering that otherwise the problem would just be found again on the next scan (plus the machine had no really important data, so I didn't care). Fsck hung again at around 8%. Something bad was happening here. I repeated the whole procedure a few times, and interestingly each time fsck progressed a bit further. I always confirmed the force rewrite requests.
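For the record, running the check from the LiveCD looked roughly like this (assuming the affected partition is /dev/hda3, as in the dmesg output further below; adjust the device name to your setup):

# make sure the partition is not mounted before checking it
umount /dev/hda3
# force a full check; e2fsck prompts before rewriting bad blocks
e2fsck -f /dev/hda3
# or let it answer yes to all questions, including the force rewrite prompts
e2fsck -f -y /dev/hda3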

A lot of fsck runs later (I didn't count, but it was really like 10 times the same stuff) I had brought the Linux box back into a bootable state. That went fine for a few hours until it started acting up again. This time, however, the box hadn't hung yet, as I was still able to connect via SSH. But dmesg contained the following output at the end:

journal_bmap: journal block not found at offset 12300 on hda3
Aborting journal on device hda3.
ext3_abort called.
EXT3-fs error (device hda3): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data

So I had to reboot again and run another fsck. That still didn't fix the problem at hand, though. After a few days of pondering (and the problem appearing again) I finally found a solution. Using the following commands I first disabled the journal and afterwards re-enabled it (you have to replace /dev/hdxx with the correct identifier of the affected partition):

tune2fs -O ^has_journal /dev/hdxx   # remove the journal
tune2fs -j /dev/hdxx                # add it again
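The partition has to be unmounted (or at least mounted read-only) while doing this, which is why I ran it from the LiveCD. You can verify that the journal is really gone and back again with something like:

# list the filesystem features; "has_journal" should disappear and reappear
tune2fs -l /dev/hdxx | grep -i features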

After that I got one more error (a different one from last time), but this time the file system was not forced into read-only mode:

ext2_free_blocks: bit already cleared for block

After another reboot with the LiveCD I ran another fsck. It found a couple of errors where free space wasn't correctly marked as such. After this fsck I rebooted back into Gentoo and it has been running without any problems ever since. However, I'm now closely monitoring the S.M.A.R.T. values for problems. This can be done manually by invoking smartctl, or by configuring smartd, which results in periodic log messages if any problems occur. Both programs are part of smartmontools. On Windows you can use tools like CrystalDiskInfo.
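A minimal example of both approaches (the device name /dev/hda is just a placeholder for the affected disk):

# one-off manual check: overall health verdict plus the full attribute table
smartctl -H /dev/hda
smartctl -a /dev/hda

# /etc/smartd.conf: monitor all attributes, mail root (and log) on problems
/dev/hda -a -m root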

The most interesting S.M.A.R.T. value in this regard is the Reallocated Sector Count (ID #5), as it represents the number of bad sectors the drive has internally replaced with spare sectors. Current Pending Sector (ID #197) and Offline Uncorrectable (ID #198) are also an indication of severe hard disc problems. However, in a lot of cases it's still possible to continue using the hard disc. Just keep in mind that once an error has occurred, it's very likely that additional errors will follow, so prepare for imminent failure, do proper backups and don't store anything important on a drive that has started showing signs of defect.
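To keep an eye on just these three attributes, something along these lines works (attribute names as printed by smartctl; again, /dev/hda is a placeholder):

# the raw values in the last column should ideally stay at 0
smartctl -A /dev/hda | egrep 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'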