PROBLEM SOLVER

Dealing with disaster

by Bob Toxen

Sooner or later every UNIX system will sustain file system damage. It may be caused by a power failure, a hardware or software failure or an operator error. The chances of these occurrences may be reduced by using proper techniques. Preparing for the inevitable reduces the impact of the damage. When damage does occur, moreover, the proper response may minimize data loss and prevent further damage. File system damage is like a cancer: unless it is stopped, it will grow and destroy more and more files.

PREVENTION

A kilobyte of prevention is worth a gigabyte of cure. If power lines are unreliable or noisy, or if your equipment or data is particularly sensitive, then investing in an uninterruptible power supply is well worth the dollar-per-watt cost. It also pays to keep equipment properly maintained.

Be very careful with the files needed for booting. Other system files, too, should be handled with care. Removing /dev/console or accidentally entering:


  # chmod 666 / usr/file

instead of:


  # chmod 666 /usr/file

can be disastrous. The former instantly renders the root file system unusable and unbootable, because it takes execute (directory search) permission away from the root directory, and therefore from every path that passes through it. Only references relative to the current directory that do not go through the root directory still work.
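
The effect can be reproduced harmlessly on a scratch directory instead of on /. The following is only a sketch: show_lost_search is a hypothetical helper name, and it assumes the mktemp utility is available. Because the superuser bypasses permission checks, the failed lookup is visible only when the sketch is run from an ordinary account.

```shell
# Reproduce the "chmod 666 /" accident on a throwaway directory.
# show_lost_search is a hypothetical helper, not a standard command.
show_lost_search() {
    scratch=$(mktemp -d) || return 1
    echo data > "$scratch/file"
    chmod 666 "$scratch"            # the same mistake, on a harmless target
    ls -ld "$scratch" | cut -c1-10  # mode column: no "x" bits remain
    # An ordinary user can no longer reach the file through the directory.
    # Root bypasses permission checks, so root would still read it here.
    cat "$scratch/file" 2>/dev/null || echo "cannot open file"
    chmod 755 "$scratch"            # undo the damage before cleaning up
    rm -rf "$scratch"
}

show_lost_search
```

The file itself is untouched throughout; only the directory's search permission was lost, which is why restoring the directory's mode is enough to recover.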

Make sure knowledgeable people know how to reset the system, know how to turn the system off, and understand the tradeoffs of connecting other equipment to the same electrical circuit (causing electrical noise or an overload). Ignorance is not bliss but an accident waiting to happen.

MINIMIZING THE IMPACT

The best and simplest way to minimize the impact of a crash is to perform frequent file system backups. Backups should usually be done every day or two, and at least weekly (unless the data is static). Users should be encouraged to record their valuable data on a tape (or floppy), another system, a different disk or, if necessary, the same disk. Backup tapes (or floppies or disks) should be read periodically to make sure they are readable. Some people have learned this lesson the hard way. Alternate between at least two tapes (or sets of floppies) in case the system crashes in the middle of a backup, destroying both the disk and the tape. Store some backups off-site to guard against fire, earthquake and sabotage.
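
Checking that a backup is readable can be made routine. As a minimal sketch, assuming the backup was written with tar (the article does not name a backup program), the hypothetical helper below succeeds only if tar can list every member of the archive. It does not compare the archive's contents with the disk.

```shell
# verify_backup ARCHIVE: succeed only if tar can list the archive's contents.
# verify_backup is a hypothetical helper; ARCHIVE may be a file or a tape device.
verify_backup() {
    tar -tf "$1" > /dev/null 2>&1
}

# Usage sketch (the paths are made up for illustration):
#   tar -cf /backup/home.tar /home && verify_backup /backup/home.tar \
#       || echo "backup is unreadable"
```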

Make provisions for easy recovery in the event the system will not boot. One method is to have some way of booting off a different disk. Another common method is to provide a way to back up the disks with standalone utilities that can be booted in place of the default UNIX kernel. You could also provide a way to overwrite the disk with a bootable UNIX system and essential standalone utilities, all bootable from tape (or floppy). About 10 files are needed for UNIX to boot, including:


/unix  (name may vary)
/dev/console
/dev/md0a (name may vary)
/dev/swap
/etc/init
/etc/inittab (System III & System V)
/etc/rc (System III)
/bin/sh
/bin/csh (Some configurations)
/bin/su (System V)

Also, all directories leading to these files must be readable and executable by all. Some versions and implementations will need different files. One way to find these files is to reboot the system (after properly shutting down) and issue the command:


  # ls -lut / /bin /dev /etc

as soon as you get a single user prompt. This will list the files in the specified directories with the time they were last accessed (read, written or executed as a program) -- sorted by access time. Those files with an access time after the time the system was shut down are probably those needed for booting. For systems with several disks, these critical files should be duplicated on a second disk, and the capability of booting from that disk should be provided. In most implementations, one can boot off any disk or tape.
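
To see what the -u and -t options do without rebooting, the sort order can be tried on a scratch directory. This is only an illustration: by_access_time is a hypothetical wrapper, the file names and dates are made up, and touch is used to set the access times by hand. On some modern systems the file system is mounted so that access times are updated rarely or never, in which case the boot-time technique itself tells you less.

```shell
# by_access_time DIR: list DIR's entries, most recently accessed first.
# Hypothetical wrapper around the ls options used in the article.
by_access_time() {
    ls -ut "$1"
}

demo=$(mktemp -d)
touch "$demo/idle" "$demo/read_at_boot"
touch -a -t 200001010000 "$demo/idle"          # access time: 1 Jan 2000
touch -a -t 200101010000 "$demo/read_at_boot"  # access time: 1 Jan 2001
by_access_time "$demo"                         # read_at_boot is listed first
rm -rf "$demo"
```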

SHUTTING DOWN THE SYSTEM

1. Make sure everyone is logged off including those on dialups and nets.
2. Make sure that printers, tape drives and other peripherals are inactive.
3. Make sure UUCP and similar networking programs are inactive.
4. Make sure various daemons such as mailers, news and networks are inactive.
5. Take the system down to single user mode.
6. Do a ps and kill any process besides process 1, your shell and ps. Do another ps to verify that they all went away. A kill -9 may be needed. Don't worry about gettys that do not go away. This is a harmless problem caused by a defective tty driver. International Technical Seminars offers an excellent class on how to write drivers correctly.
7. Issue a sync command.
8. Turn off the system or press the reset button.

People who have the reboot (or facsimile) program may use it in place of steps 5 through 8.
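
Picking out the processes to kill in single user mode amounts to filtering the ps listing. The sketch below assumes a ps that accepts the POSIX -e and -o options, which older systems may not; survivors is a hypothetical helper name. It only prints candidates and kills nothing.

```shell
# survivors SHELL_PID: read "PID COMMAND" lines on standard input and print
# the ones that should be killed: everything except process 1, the given
# shell and ps itself.
survivors() {
    awk -v shell_pid="$1" '$1 != 1 && $1 != shell_pid && $2 != "ps"'
}

# Usage sketch, in single user mode:
#   ps -e -o pid= -o comm= | survivors "$$"
```

Run as a pipeline, the listing will also show the awk process that does the filtering; it exits by itself and can be ignored.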

WHAT FILE SYSTEM DAMAGE MEANS

In addition to the data in everyone's files, UNIX must keep track of the names of files, their permissions, ownership, time of last modification, links, directories, unused disk portions (free blocks and free inode numbers), counts of files and counts of blocks of data that will fit in each file system -- as well as an assortment of other concerns. When changes are made to such things, they are not immediately written out to disk but instead are kept in memory. If anyone wants to read any of this changed data, UNIX knows to use the copy in memory rather than the old copy on disk. Likewise, any new changes will affect the copy in memory.

Some portions such as the superblock, which keeps track of free blocks, free inodes and such, and the /tmp directory, which contains the temporary files used by editors and compilers, change often. Other areas, such as the /bin directory, do not change often but are read often (every time you execute a program). By keeping this rapidly changing or frequently read data in memory rather than having to read it continually from disk and write it back, UNIX runs much faster than it would otherwise.

This buffer area in memory has a fixed size. If there isn't room to fit in some new data that someone wants to read from or write to a disk, then a portion of the buffer will be written to disk -- if a change has been made. That portion of the buffer then will be available for new data.

There is usually some changed data sitting in the memory buffer waiting to be written to disk. UNIX is in no hurry to write this data to disk. Why should it? If anyone wants it, they can get the memory copy of it. The only problem is that if the system crashes, the disk will contain some old data.

If some of this old data is information on whether a particular block of data is free; is contained in a file; is a list of where the data for a particular file is kept on disk; or is a list of files in a directory, then UNIX will be confused when it is rebooted.

Suppose you just created a file with vi. Imagine that the disk block recording where this file's data is kept has been written to disk, and that the data blocks themselves have been written as well. If the system crashes at that point, the block that records the file's data blocks will have been allocated to a file (rather than being unused), but the data block of the directory in which the file was created will not have been written to disk.

If you then reboot, you will not be able to access that file because its name is not in the directory. Also, if you create another file, it may use the same blocks that were used by your first file, destroying the first file's data. This is why, when rebooting after a crash, fsck must be invoked immediately -- before the file system is changed further.
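
The window during which changes exist only in memory can be shortened by hand with the sync command. As a minimal sketch (save_and_flush is a hypothetical helper, not a standard command), the function below writes a file and then asks the kernel to write out its buffers. On some implementations sync only schedules the writes and may return before they are complete.

```shell
# save_and_flush FILE TEXT: write TEXT to FILE, then ask the kernel to
# write its buffered (changed) blocks out to disk.
save_and_flush() {
    printf '%s\n' "$2" > "$1" && sync
}

# Usage sketch (the file name is made up):
#   save_and_flush notes.txt "edited just before the thunderstorm"
```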

RECOVERING FROM A CRASH

First, log the crash in the system logbook. Include any error messages and any other significant items that will help determine the cause of the crash -- thus minimizing the impact of future crashes of a similar nature. For example, if the system crashed with the error message:


  panic: IO err in swap

displayed several times, one would suspect that either the disk used for swapping or its related controller, device driver or the like was having problems. Similarly, the message:


  panic: parity

appearing more often than, say, once a month probably indicates memory hardware problems. In most implementations, the message will also tell which section of memory the parity error has occurred in. After several such panics, a field service engineer may be able to see a pattern and determine which section of memory should be replaced.
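
A logbook kept as a text file makes such patterns easy to count. The sketch below is one possible arrangement, not a standard tool: the function names and the one-line entry format (date, time, message) are assumptions made for illustration.

```shell
# log_crash LOGFILE MESSAGE...: append a timestamped entry to the logbook.
log_crash() {
    logfile=$1
    shift
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M')" "$*" >> "$logfile"
}

# panics_in_month LOGFILE YYYY-MM TEXT: count that month's entries containing
# TEXT. TEXT is used as a grep pattern, so keep it free of special characters.
panics_in_month() {
    grep -c "^$2.*$3" "$1"
}

# Usage sketch (the logbook path is hypothetical):
#   log_crash /usr/adm/crashlog panic: parity
#   panics_in_month /usr/adm/crashlog "$(date +%Y-%m)" 'panic: parity'
```

A count above one for a single month would match the article's threshold for suspecting memory hardware.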

After logging a crash, reboot the system by pressing the reset button or performing whatever routine you normally use to start up your system. It should come up in single user mode. The very first thing to do at this point is to run fsck -- as in file system consistency checker. It will check each of your file systems to make sure they are not corrupt. This is usually done with the command:


  # /etc/fsck

Some systems are configured to start up fsck automatically. The fsck command will read the file /etc/checklist to get a list of file systems to check.

The /etc/checklist file is a text file that may be edited with vi. It names the disk device that holds each file system, one name per line. The first line should have the name of the root file system. On every line except the first, the name should be the one used in the mount command -- except that there should be a small letter "r" after /dev/, which selects the raw device. For example, if your root file system is on /dev/md0a and you issue the mount commands:


  # /etc/mount /dev/md0c /usr
  # /etc/mount /dev/md1a /mnt
  # /etc/mount /dev/md1c /image
  # /etc/mount /dev/md1b /tmp

then your /etc/checklist file should look like:


  /dev/md0a
  /dev/rmd0c
  /dev/rmd1a
  /dev/rmd1c
  /dev/rmd1b
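
The naming rule is mechanical, so the list can be generated rather than typed. The following is a sketch only; make_checklist is a hypothetical helper that prints the root device unchanged and inserts the "r" after /dev/ for every other device.

```shell
# make_checklist ROOT_DEV MOUNTED_DEV...: print ROOT_DEV unchanged, then each
# mounted device with "r" inserted after /dev/.
make_checklist() {
    printf '%s\n' "$1"
    shift
    for dev in "$@"; do
        printf '%s\n' "$dev" | sed 's|^/dev/|/dev/r|'
    done
}

# Prints the same five lines as the example checklist above.
make_checklist /dev/md0a /dev/md0c /dev/md1a /dev/md1c /dev/md1b
```

Redirecting the output to /etc/checklist would install it; review the result first, since a wrong device name here means a file system that never gets checked.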

If /etc/checklist is not configured, list the file systems to be checked on the command line, like so:


  # /etc/fsck /dev/md0a /dev/rmd0c /dev/rmd1[abc]

The fsck command will then read through each file system (from disk) and check for inconsistencies, such as blocks that are both on the free list and in a file, or files that don't appear in any directory. Each time fsck finds something wrong, it will indicate what the problem is and ask whether it should be fixed. Almost always you will want to type the letter "y" (for yes) followed by RETURN. One case where you might want to type "n" (for no) is when fsck asks you whether it should delete a file and gives only its inode number rather than its name. You will want to find out the file's name before it is removed so you can recover it from backup tape. To find out the name of inode number 387 on md0c, give the commands:


  # /etc/mount /dev/md0c /usr
  # find /usr -inum 387 -print
  # /etc/umount /dev/md0c
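
The lookup can be wrapped for reuse. In this sketch, name_of_inode is a hypothetical helper. It adds the -xdev option, assuming your find supports it, because inode numbers are unique only within one file system and the search should not wander into others mounted below the starting directory.

```shell
# name_of_inode DIR INUM: print every path under DIR whose inode number is
# INUM, staying on DIR's own file system.
name_of_inode() {
    find "$1" -xdev -inum "$2" -print
}

# Usage sketch, with the file system from the example mounted on /usr:
#   name_of_inode /usr 387
```

A file with several hard links has several names for one inode, so more than one path may be printed.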

Another time to say no is when fsck asks for permission to remove /unix or another equally important system file. Recovering from this or other more complex problems is beyond the scope of this article.

Bob Toxen is a member of the technical staff at Silicon Graphics, Inc. He has gained a reputation as a leading uucp expert and is responsible for ports of System V for the Zilog 8000 and System III for the Motorola 68000.
