Tuesday, December 7, 2010

Weird Linux filesystem problems

Like most weird problems, you only start to notice this kind when something behaves oddly or slower than usual. The same was true in this case. It started with me noticing that my internet connection was lagging like hell, and as always my first thought was that someone in my family was doing some excessive down- or uploading. So I fired up the router configuration interface of my Tomato-powered Linksys WRT54GL to check the current transfer speeds.

However, my first assumption was wrong. The transfer speeds were relatively low, although there was a constant trickle of up- and download traffic. The router interface felt really slow, so the next thing I checked was the router logs. And what I saw there didn't please me. There were a lot of log messages that looked like these (I removed the timestamps):

user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: ip_conntrack: table full, dropping packet.
user.warn kernel: NET: 6 messages suppressed.


I didn't like that at all. Not knowing any better, my first thought was that some kind of malicious software or outside attacker was responsible for these connections. A quick check of the connection table proved me wrong once again (damn those sci-fi fantasies!). The table was full of inbound connections all originating from the same IP. The address was very well known to me: it belonged to another router of mine, which occasionally connects to transfer rstats data via SMB tunneled over SSH.
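For the curious, the "quick check of the connection table" can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: it assumes the table is available as text in the usual /proc/net/ip_conntrack layout (one connection per line, with the original source address in the first src= field), and the function name is made up.

```python
import re
from collections import Counter

# The first "src=" field on a conntrack line is the original source address.
SRC_RE = re.compile(r"\bsrc=(\S+)")


def count_sources(lines):
    """Count tracked connections per original source address."""
    counts = Counter()
    for line in lines:
        match = SRC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_sources(lines, limit=5):
    """Return the busiest source addresses as (address, count) pairs."""
    return count_sources(lines).most_common(limit)
```

Feeding it the lines of the connection table and looking at top_sources() shows right away whether a single address is hogging the table.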

So let's check the remote router via SSH. It showed much the same behaviour: constant up- and downloads, albeit at a low level. I'll spare you the log excerpt, as it looked the same as on the other router. So what causes the connection tables to fill up? The next thing I checked was the Linux box, which should provide more insight, as the local router forwards its log messages to it and only retains a small part of them itself due to space constraints. The same Linux box stores the rstats data via tunneled SMB, as outlined earlier. However, it wasn't possible to connect to that server. I double-checked everything, though there wasn't a whole lot that could have gone wrong, as it was on the local network.

I also tried to access the Linux machine directly, but it wouldn't react to keyboard input. So as a last resort I simply turned it off and on again (somehow that reminds me of The IT Crowd). On the next boot I was greeted with even more problems. Fsck started, as this machine hadn't been checked for quite some time, and it reported file system errors. Oddly, it hung at around 8%, so that wasn't a lot of help either. I grabbed a Gentoo LiveCD and started fsck from there. I got a lot of „error reading block <n> force rewrite?“ messages (where <n> is a number). I had no choice but to confirm, since otherwise the problem would just be found again on the next scan (plus the machine held no really important data, so I didn't care). Fsck hung again at around 8%. Something bad was happening here. I repeated the whole procedure a few times, and interestingly fsck progressed further each time. I always confirmed the force rewrite requests.

A lot of fsck runs later (I didn't count, but it really was something like 10 times the same stuff) I got the Linux box back into a bootable state. That went fine for a few hours until it started acting up again. This time, however, the box hadn't hung yet, as I was still able to connect via SSH. However, dmesg contained the following output at the end:

journal_bmap: journal block not found at offset 12300 on hda3
Aborting journal on device hda3.
ext3_abort called.
EXT3-fs error (device hda3): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data

So I had to reboot again and run another fsck, but that still didn't fix the problem at hand. After a few days of pondering (and the problem appearing again) I finally found a solution. Using the following commands I first removed the journal and then re-created it (you have to replace /dev/hdxx with the correct identifier of the affected partition):

tune2fs -O ^has_journal /dev/hdxx  # remove the journal (file system unmounted or mounted read-only)
tune2fs -j /dev/hdxx  # add it again

After that I got one more error (a different one than last time), but this time the file system was not forced into read-only mode:

ext2_free_blocks: bit already cleared for block

After another reboot with the LiveCD I ran another fsck. It found a couple of errors where free space wasn't correctly marked as such. After this fsck I rebooted back into Gentoo, and it has been running without any problems ever since. However, I'm now closely monitoring the S.M.A.R.T. values for problems. This can be done manually by invoking smartctl, or by configuring smartd, which writes log messages whenever it detects a problem. Both programs are part of smartmontools. On Windows you can use tools like CrystalDiskInfo.

The most interesting S.M.A.R.T. value in this regard is the Reallocated Sector Count (ID #5), as it represents the number of bad sectors the drive has internally replaced with spare sectors. Current Pending Sector (ID #197) and Uncorrectable Offline Sector (ID #198) are also indications of severe hard disk problems. However, in a lot of cases it's still possible to continue using the hard disk. Just keep in mind that once an error has occurred, it's very likely that additional errors will follow, so prepare for imminent failure, do proper backups and don't store anything important on a drive that has started showing signs of defects.
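If you want to automate the manual check, a rough Python sketch could look like this. It assumes the attribute table as printed by smartctl -A, where the ID is the first column and the raw value the tenth. The function names are my own, and attribute names can differ between drives, so the names below are only labels for the report.

```python
# Attribute IDs discussed above, with the names smartctl commonly prints for them.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_raw_values(smartctl_output):
    """Extract the raw values of the watched attributes from `smartctl -A` text."""
    raw = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED:
                try:
                    raw[attr_id] = int(fields[9])
                except ValueError:
                    pass  # some drives report raw values that are not plain integers
    return raw


def nonzero_attributes(raw):
    """Return a readable line for every watched attribute with a raw value above zero."""
    return [f"{WATCHED[i]} = {value}" for i, value in sorted(raw.items()) if value > 0]
```

Anything returned by nonzero_attributes() is a reason to check your backups. Run from cron, this is a poor man's smartd, though smartd itself is the more robust choice.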

Saturday, September 18, 2010

Duh Zombies?! - Rise of the zombie tomatoes

Recently when I logged in to my Tomato-powered Linksys WRT54GL via SSH I was quite disturbed when top showed quite a lot of zombie processes like this (I excluded the other normal processes):

19082 root 0 Z [cp]
19084 root 0 Z [cp]
19087 root 0 Z [cp]
19090 root 0 Z [cp]
19092 root 0 Z [cp]
19100 root 0 Z [cp]
19103 root 0 Z [cp]
19105 root 0 Z [cp]
19107 root 0 Z [cp]
25903 root 0 Z [cp]
26035 root 0 Z [cp]

I wonder where they come from. A little bit of guessing based on the process IDs, and comparing them with the process IDs of other processes that show up in the log, allowed me to narrow down the time when they „spawned“. This is based on the assumption that Linux assigns process IDs in ascending order (until it hits the maximum, at which point it should start at the bottom again). The log shows no other irregularities, though connection logging, for example, is disabled.

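The brackets around cp most likely just mean that the command line of the process is no longer available, which is the case for kernel threads but also for zombies, so they don't tell much. Other than that the router is behaving normally. When searching the internet I also found out that a zombie can't be killed directly, since it has already exited: it only disappears once its parent process collects its exit status, or once the parent itself exits and init cleans up after it. Failing that, only a reboot gets rid of them. Guess they will stay there for a while, but I will closely monitor the situation.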
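To find out which process left the zombies behind, the parent PID is the interesting bit, since a zombie is an exited process whose parent has not yet collected its exit status. Here is a small Python sketch that groups zombies by parent. It assumes the /proc/<pid>/stat layout (PID, command name in parentheses, state, parent PID, ...), the function names are made up, and I haven't run it on the router itself.

```python
def parse_stat(stat_line):
    """Return (pid, command, state, parent pid) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    command = stat_line[open_paren + 1:close_paren]
    rest = stat_line[close_paren + 1:].split()
    return pid, command, rest[0], int(rest[1])


def zombies_by_parent(stat_lines):
    """Map each parent PID to the list of its zombie children (state 'Z')."""
    result = {}
    for line in stat_lines:
        pid, _command, state, ppid = parse_stat(line)
        if state == "Z":
            result.setdefault(ppid, []).append(pid)
    return result
```

Whatever parent PID turns up is the process that started the cp calls and never waited for them, which is where I'd look next.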

Friday, September 17, 2010

Grub setup for multiple OSes in a neat way

Armed with a grub CD I proceeded with the Windows 7 installation. First I made a backup of the Vista partition just in case, then I formatted it and installed Windows 7. The Windows 7 installation created a Boot directory on my WinXP partition along with bootmgr, the Vista/7 boot manager. The installation also added a note to boot.ini that this is the WinXP boot configuration. If you install Windows Vista/7 on a new hard drive, this stuff will be put onto a hidden partition.

Quote:
;
;Warning: Boot.ini is used on Windows XP and earlier operating systems.
;Warning: Use BCDEDIT.exe to modify Windows Vista boot options.
;

Of course it also updated the MBR to point to the new boot manager.

In order to set things up like I wanted, I first copied the Boot directory and bootmgr to my Win7 partition. Then I booted into Win7 and used bootsect (a tool located on the Win7 install DVD in the boot folder) to write a Windows 7 boot record to the Win7 partition and a WinXP boot record including ntldr to the WinXP partition.

Commands used (I first moved to the directory where bootsect resides):
bootsect /nt60 C: (NT 6.0 boot record for Win7 partition)
bootsect /nt52 D: (NT 5.2 boot record for XP partition)

Then I booted into Ubuntu Linux using the Grub boot CD to set up Grub again. I used the commands below.

From the standard Ubuntu terminal:
sudo grub

Commands for the grub CLI:
find /boot/grub/stage1 //returns where grub is actually installed
root (hd0,0) //type here whatever find returned
setup (hd0) //this installs grub into the MBR, if you want to put it into a specific partition use the (hd0,x) syntax to specify a partition
quit


Finally I cleaned up the boot.ini file using a simple text editor and removed unnecessary options from the Windows 7 boot manager (basically to make it boot instantly once selected from Grub) using bcdedit.

From the Windows 7 CLI (with admin rights):
bcdedit /v //this command displays a list of entries, basically needed to get the ID of the legacy entry
bcdedit /delete {ID} /f //substitute ID with the ID obtained from the previous command

I also removed bootmgr and the Boot directory from the Windows XP partition. In order to accomplish this you first need to take ownership of the files and then grant your user / group the privileges to delete them.

The end result is a clean boot setup with grub as the main boot manager, which is used to boot Ubuntu Linux, Windows XP and Windows 7 without nested selection screens. The great thing about this setup is that it scales: it supports a virtually arbitrary number of boot options without them interfering with each other.

Note that the order in which the different operating systems are installed is largely unimportant, as long as you know what lives where and how to restore an overwritten boot configuration, or how to work around it using a prepared boot manager CD (it's also possible to put one on a separate hard drive or even a USB drive).

Monday, July 19, 2010

Creating a Grub CD

Note: Impatient readers can skip the first 3 paragraphs.

Did it ever happen to you that you installed an additional operating system on your computer only to find the old one is „gone“? It sure happened to me. To elaborate a bit, by „gone“ I actually mean that the boot configuration of the OS was overwritten with that of another OS. Strictly speaking, overwritten is the wrong term, as aside from the MBR things seldom get overwritten, so maybe I should say replaced. In some cases things go fine and the old OS gets integrated into the new boot selection screen. Sometimes things go wrong, though, and this is where my findings might come in handy.

So ok, the other day I wanted to install Windows 7 on my notebook. Note that like 2 years ago I set up the system in a special way that suited my needs. I had Windows Vista (booted it like twice a year – haha in retrospect), Windows XP (my main OS) and Ubuntu Linux (whenever I needed Linux) installed and neatly organized via Grub. Each selection in Grub directly booted the respective OS, no additional selection screen was displayed from the Windows OS loader (bootmgr for Vista and ntldr for XP). If I hadn't set it up that way, it would just display „Microsoft Windows“ and after selecting this option I'd be presented a nested selection menu for Vista and XP. Needless to say this is not what I wanted, that's why I went through all the hassle of setting it up this way.

At this point I wanted to do a little bit more with Vista, though due to some obscure error I couldn't install Service Pack 2. This and the availability of Windows 7 got me thinking about replacing it in the first place. However, I remembered the chore of getting a clean boot configuration, so I wanted to prepare for the worst.

Enter a bootable CD containing grub with my specific configuration. Including the machine-specific (or should I say hard-drive-specific, in the age of virtual machines) configuration is a bonus for convenience: grub comes with a command line, so a generic CD would be enough for emergency purposes. I'm sure there are a hundred different ways to create a grub CD (not really, but surely quite a few). I chose what is most probably not a very elegant solution, but it worked and it was basically stuff I was already experienced with, since I had created a lot of bootable BIOS update CDs based on DOS floppy images (reminds me I should probably blog about that as well). For a pure Grub CD you should probably look here.

Basically I took a DOS grub floppy from here, modified it to suit my needs and then burned it to a CD-RW with bootable floppy disk emulation using Nero Burning ROM - though any other software capable of burning bootable CDs with floppy emulation should be ok. I removed the many selection screens from the grub floppy and put my menu.lst file on it. For this task I used the fancy tool WinImage. Linux users should probably read the info on the website of the grub floppy to get instructions on how to create such a disk. The grub floppy I linked has a lot of fancy features that might come in handy for some, but for me a basic grub disk was all I needed. Basically I removed all files except for the stage1 and stage2 files in /boot/grub and then put my own menu.lst in this directory. I took it directly from the /boot/grub folder of my Linux system. You can also write a configuration from scratch, though that is outside the scope of this blog.

Next step would be testing if it actually works. I recommend adding a dummy entry to either the CD or local grub configuration or changing the title or whatever you like in order to distinguish your local grub from the one on CD. Then you can reboot your computer, try booting from CD and if everything went fine you should be able to boot any of your operating systems from it.

Now that you have backed up your boot configuration you can start messing with other operating systems, especially the ones which like to replace MBRs and the like. Remember that Windows-based OSes need their respective bootloader in order to be started, though most installation procedures don't remove those. If they do, you will still have the new operating system, which can be used to put one in place, assuming the installation went fine.

Wednesday, July 14, 2010

Blog Intro - Windows Reinstall Shenanigans

Omg I really did it. I finally managed to set up this blog and actually post something to it.

Yeah, that sounds a bit lame for an intro, but let me explain a little bit. The initial idea to write this blog occurred to me about half a year ago. Back then I had some funny (read: horrible) problems with my laptop. It all started with wanting to defragment a hard drive partition I have been using for quite some time. It was a 40GB NTFS partition with like 1-2GB free space. The Windows XP defrag utility said the free space is not sufficient to carry out the defragmentation. The Windows Vista – I also had Vista installed on another partition - defrag utility however would carry out the process just fine. Bold as I am I thought that Microsoft may actually have improved the program and started the defragmentation.

Guess I shouldn't have been so bold back then. Note to little kids (and everyone else for that matter): Don't EVER defragment a Windows XP NTFS partition with the defrag utility from Windows Vista. Especially not when the Windows XP utility tells you the free space is not sufficient for the process, though I don't know if this criterion alone is enough. And for god's sake, I don't feel like trying. So, my daring readers, you might want to know what actually happened. Let's put it this way: When you get a blue screen upon boot stating UNKNOWN HARD ERROR, you know something went terribly wrong.

Know that I had installed the system some 2 years before that and kept everything neat and clean, so I didn't feel the need for a reinstall. Fortunately I was able to rescue most of the data on the partition, since it was still readable under Vista, though Windows itself was irrecoverably lost, as I had to find out the hard way. To prepare for the reinstall I wanted to get the install CD for the product key. I had installed a Windows XP Pro that I bought for cheap at a student shop, not the XP Home shipped with the laptop. Although I usually don't lose such stuff, I was unable to find it (what was it about exceptions to the rule? Darn it!). That in itself wouldn't have been a problem if it hadn't happened during the holidays. On a workday I'd have been able to retrieve my key from the shop, as they register it with the buyer in order to prevent illicit use.

Some googling later I had found various ways of recovering the product key of an existing Windows installation, though most of the options weren't for me, as they require being able to boot into it. So I opted for the „get the encoded CD key from the Windows registry and calculate the key from it“ option. Thank god there are Windows registry readers / editors for Linux. I chose chntpw. It can easily be installed as a package in Ubuntu. Usage is pretty straightforward. I copied the software registry hive from %WINDIR%\system32\config\software to my home directory under Ubuntu. Then I invoked chntpw:
chntpw -e ~/software
The program greeted me with some info that the file contains some garbage, probably not the best sign:

chntpw version 0.99.5 070923 (decade), (c) Petter N Hagen
Hive name (from header): <emroot\system32\config\software>
ROOT KEY at offset: 0x001020 * Subkey indexing type is: 686c <lh>
Page at 0x1f9e000 is not 'hbin', assuming file contains garbage at end
File size 33292288 [1fc0000] bytes, containing 7513 pages (+ 1 headerpage)
Used for data: 582757/32096448 blocks/bytes, unused: 5213/812064 blocks/bytes.


Simple registry editor. ? for help.

Now typing '?' lists available commands. To carry out what I wanted to accomplish, I entered cd Microsoft\Windows NT\CurrentVersion to change to the specified key and then I dumped the data I was looking for via hex DigitalProductId. The tool might be used to retrieve other data from the registry as well. If you are done you can quit by entering 'q'. Note that the prompt has no convenient history feature like you are possibly used to from various shells.

An alternative might have been dumphive, which is also available as Ubuntu package. Dumphive converts a registry hive into RegEdit compatible text format.

The final step in order to recover the product key would be to stop by the DragonDesign website, where the whole process is detailed here. They offer a tool to derive the original product key from parts of the previously dumped DigitalProductId. Note that they offer the tool both as a web and as a standalone version. While I don't want to accuse them of harvesting keys for sinister purposes, anyone concerned about security should use the standalone version on a computer which is physically disconnected from the internet, just in case.
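For those who prefer to do the calculation themselves, here is a rough Python sketch of the decoding as it is commonly described for XP-era keys. This is not the DragonDesign tool's code, and two things are assumptions on my part: that the encoded key sits in the 15 bytes starting at offset 52 of DigitalProductId, and that it is a little-endian number written in base 24 with the alphabet shown below. The function name is made up.

```python
# 24-character alphabet assumed for the key (no vowels, no easily confused characters).
KEY_CHARS = "BCDFGHJKMPQRTVWXY2346789"


def decode_product_key(digital_product_id):
    """Decode a 25-character key from the raw DigitalProductId bytes."""
    if len(digital_product_id) < 67:
        raise ValueError("DigitalProductId is too short")
    # Assumption: bytes 52..66 hold the key as one little-endian number.
    number = int.from_bytes(digital_product_id[52:67], "little")
    chars = []
    for _ in range(25):
        # Peel off base-24 digits, least significant (last character) first.
        number, digit = divmod(number, 24)
        chars.append(KEY_CHARS[digit])
    key = "".join(reversed(chars))
    # Present the key in the familiar five groups of five characters.
    return "-".join(key[i:i + 5] for i in range(0, 25, 5))
```

To use it, turn the hex dump from chntpw into bytes (for example with bytes.fromhex on the cleaned-up hex digits) and pass the result to decode_product_key(). Since everything runs locally, no key ever leaves the machine.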

Remember when I said that Windows was irrecoverably lost without actually explaining how I came to that conclusion? Opening the CURRENT_USER registry hive, which is stored in the NTUSER.DAT file in the profile of the respective user, with chntpw revealed that most of the file was corrupted.

chntpw version 0.99.5 070923 (decade), (c) Petter N Hagen
Hive name (from header): <nstellungen\usr\ntuser.dat>
ROOT KEY at offset: 0x001020 * Subkey indexing type is: 666c <lf>
Page at 0x4000 is not 'hbin', assuming file contains garbage at end
File size 8388608 [800000] bytes, containing 3 pages (+ 1 headerpage)
Used for data: 266/12192 blocks/bytes, unused: 0/0 blocks/bytes.

ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
ERROR: not 'nk' node! (strange?)
Simple registry editor. ? for help.

Looks like this file is corrupted. I lost some program configuration I wanted to back up, but other than that it was mostly a lot of time that I lost due to being a little bit too daring.

Explain a little bit, eh? Possibly this post more than anything explains why I don't get seemingly easy things done quickly. Seemingly easy is the catch though, as most things I'll write about here will contain nasty details, pitfalls and the like. I publish my findings in the hope that they might save someone from making the same mistake or from trying harder than necessary to solve a problem.

Enjoy!