The Restoration of lennier

The Failure

During an attempt to build PHP on lennier, the load average climbed to around 20. The system was nearly unresponsive, so the only information I could get was from an xload process I already had running, and I had to reboot. Once it was back up, I found errors on one of the two disks that made up the main volume group. I restarted the build to try to reproduce the problem. After the build completed, there was an error stating that the journal could not be written to, a rather serious problem.

Backing Up

I first rebooted in single-user mode and ran pvmove to move data off the problematic disk so that it could be removed from the volume group. Once that was done, I used rsync to back up the data directory. The data directory contained my home directory and project files, so it was of primary concern. The other part of the volume group contained the root directory of the system installation. Unfortunately, I didn't back this up.
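The backup steps above can be sketched as a small Python helper that only builds and prints the commands, without running them. The volume group name, device name, and backup path are hypothetical placeholders, and the vgreduce step is my assumption about how the disk was finally dropped from the group; the log itself only mentions pvmove and rsync.

```python
import shlex


def evacuation_commands(volume_group, failing_pv, source_dir, backup_dir):
    """Build (but do not run) the commands for evacuating a failing disk.

    All argument values are supplied by the caller; nothing here is taken
    from the real lennier configuration.
    """
    return [
        # Move allocated extents off the failing physical volume.
        ["pvmove", failing_pv],
        # Assumption: drop the now-empty physical volume from the group.
        ["vgreduce", volume_group, failing_pv],
        # Archive-mode copy of the data tree; the trailing slash copies the
        # directory's contents rather than the directory itself.
        ["rsync", "-a", source_dir.rstrip("/") + "/", backup_dir],
    ]


# Hypothetical names, for illustration only.
commands = evacuation_commands("vg0", "/dev/sdb1", "/data", "/mnt/backup/data")
for command in commands:
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before typing anything destructive on a system with a failing disk.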

Restoring The System

I cobbled together a system from some parts I had lying around to create a 900MHz system with 512MB of memory (it was an old E-PC motherboard and just wouldn't accept anything higher than 512MB). It may seem pretty feeble, but it was replacing a 550MHz system with 128MB of memory.

Rebuilding The Disks

I had an ATA RAID controller so I put together a few disks to replace the hard drive that had failed.

Xemacs Issues

Once I started up xemacs, I found, as I expected, that it was missing a lot of the modules that I normally use, for instance JDE. I normally manage emacs modules through emacs itself. I became root, started up xemacs, and ran into one of the longstanding problems I have with the Red Hat/Fedora distributions.

At some point, the engineers had decided that the ftp client's default mode should be passive. Prior to this, the command line option -p was used to turn on passive mode. When they changed the default, they did not provide an option to turn off passive mode, which causes problems with emacs when it tries to download module packages. Passive mode doesn't work with many servers.
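To illustrate the kind of switch that is missing, here is a minimal sketch using Python's standard ftplib, which does expose a toggle between passive and active mode. This is not a fix for the ftp command-line client or for emacs; it only shows what choosing the mode explicitly looks like in a client that allows it. The helper name is hypothetical.

```python
from ftplib import FTP


def make_ftp_client(passive):
    """Return an unconnected FTP client with the transfer mode preset.

    No host is passed to FTP(), so no network connection is made here.
    """
    ftp = FTP()
    # True selects passive mode (the client opens the data connection);
    # False selects active mode (the server connects back to the client).
    ftp.set_pasv(passive)
    return ftp


active_client = make_ftp_client(passive=False)
passive_client = make_ftp_client(passive=True)
```

A caller would then use connect() and login() on the returned object against whichever server hosts the packages.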

NFS Issues

One of the most perplexing issues I had was with NFS. I suspect that part of the problem was with a dual network I had established for transitioning from one internal domain to another. I first found a problem when running amd on lennier. It would start up and work with either the /data tree or /home, but not both. Once I used one, I couldn't use the other. I switched to autofs and discovered that I could mount /data from beliafta and read it, but as soon as I tried to modify a directory under /data from lennier, I would get the message that beliafta was not responding. I tried adding "udp" to the mounts, but this didn't fix the problem. Since beliafta was running HP-UX 11.0, an older, unsupported OS, I decided to try moving the /data source directory from beliafta to hancock, a Solaris 10 system. This exhibited the same problem. Another curious twist is that copying a text file did not cause the hang; only copying binary files did. I tried booting with a 2.6.21 kernel without any success.

Because of this serious issue, I feel I can't use Fedora Core 7 on a production system yet. I'll continue to experiment with it in the hope that the updates will fix the problem. I'll also try to recover data from the experiments in order to properly report the problem. I will need to install either Fedora Core 6 or another OS, such as Red Hat AS 4 or CentOS 5.

Fedora Core 6 Downgrade

Because of the seriousness of the NFS issues, I decided to force install Fedora Core 6. To do that, I needed to get the CD drive working. For some reason, though, the BIOS wouldn't recognize the CD drive, nor, it seemed, any other CD or DVD drive I had. I was suspicious of my stock after the problems with the IDE drives, so I bought a couple of DVD drives from Logic Approach. They didn't work either. After trying the brand new DVD drive that I had intended for gkar, I looked a little more closely at the BIOS. It turns out that there is an "extended system configuration" area that could be cleared. Once I cleared that, the DVD drive showed up. My guess is that this is a feature that prevents someone from installing a removable drive on a "locked down" system (one that had the BIOS password set). Anyway, I was able to install Fedora Core 6.

Ah-ha! The NFS problem was due to a change in NFS. The way my configuration is set up, I have host-specific home directories. I then use a common tmp directory that is linked in each host-specific home directory. The result is that I have two mounts from the same exported directory. The mount options were different for the two mounts. This is apparently a no-no. Although the reference claims that this started with 2.6.18, I didn't notice a problem until 2.6.21 and 2.6.22. As soon as I started changing the mount options for those mounts, the mount hang problems went away.
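A check for this situation can be sketched in Python: scan a mount table in the /proc/mounts layout and flag any export that is mounted more than once with different option sets. The export path, mount points, and options in the sample are hypothetical and only mirror the shape of my setup; the list of export roots is something the caller is assumed to know.

```python
from collections import defaultdict

NFS_TYPES = {"nfs", "nfs4"}


def find_conflicting_nfs_mounts(mounts_text, export_roots):
    """Return {export_root: [(mount_point, options), ...]} for every export
    that is mounted more than once with differing option sets."""
    by_export = defaultdict(list)
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in NFS_TYPES:
            continue  # skip blank lines and non-NFS filesystems
        source, mount_point, _fstype, options = fields[:4]
        for root in export_roots:
            # A mount belongs to an export if it is the export itself
            # or a directory underneath it.
            if source == root or source.startswith(root.rstrip("/") + "/"):
                option_set = frozenset(options.split(","))
                by_export[root].append((mount_point, option_set))
                break
    return {
        root: mounts
        for root, mounts in by_export.items()
        if len({opts for _, opts in mounts}) > 1
    }


# Hypothetical mount table: two mounts from one export, different options.
SAMPLE_MOUNTS = """\
hancock:/export/data/home/lennier /home/me nfs rw,hard,intr 0 0
hancock:/export/data/tmp /home/me/tmp nfs rw,soft,udp 0 0
/dev/mapper/vg0-root / ext3 rw 0 0
"""

conflicts = find_conflicting_nfs_mounts(SAMPLE_MOUNTS, ["hancock:/export/data"])
for export, mounts in conflicts.items():
    print(export)
    for mount_point, options in mounts:
        print("   ", mount_point, ",".join(sorted(options)))
```

On a live system the text could come from reading /proc/mounts, though the kernel adds options there that were never in the automounter maps, so comparing the maps themselves may give cleaner results.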

Restoring MySQL

I've decided to move the MySQL database to vir.