Tuesday, March 11, 2008

Future of OpenSolaris Boot Environment management

I was quite happy to see this recent post from Ethan Quach proposing an efficient method for sharing the variable parts of /var. It bears a striking resemblance to something that I suggested and and clarified in the past.

Correction June 6, 2009: Links to mail archives at opensolaris.org seem not to be stable. The same messages are available at the following: My initial suggestion and clarification.

But why does this matter? When you are making significant changes to the system, such as during a periodic patch cycle or upgrade, it is generally desirable to...

  1. be able to do so without taking the system down for the duration of the process
  2. be able to abort the operation if you have a change of heart
  3. be able to fail back if you realize that newer isn't better
Consider what is in /var:
  • Mail boxes If the machine is a mail server (using sendmail et. al.) there is a pretty good chance that users have their active mail boxes at /var/mail.
  • In flight mail messages Most machines process some email. For example, if a cron job generates output it is sent to the user via email. Many non-web mail clients invoke /bin/mail or /usr/lib/sendmail to cause mail to be sent. Each message spends somewhere between a few milliseconds and a few days in /var/spool/mqueue or /var/spool/clientmqueue.
  • Print jobs If the machine acts as a print server (even for a directly attached printer) each print job spends a bit of time in /var/spool/lp.
  • Logs When something goes wrong, it is often times useful to look in log messages to figure out why it went wrong. Those are often found under /var/adm.
  • Temporary files that may not be It is rather common for people to stick stuff in /var/tmp and expect to be able to find it sometime in the future.
  • DHCP If a machine is a dhcp server, it will store configuration and/or state information in /var/dhcp.
  • ...
All of those things should be of a stable file format and usable before and after you patch/upgrade/whatever. If you take the traditional Live Upgrade approach, you can patch or upgrade to an alternate boot environment. As part of activating the new environment, a bunch of files are copied between boot environments. According to this page the following things are synchronized:
/var/mail                    OVERWRITE
/var/spool/mqueue            OVERWRITE
/var/spool/cron/crontabs     OVERWRITE
/var/dhcp                    OVERWRITE
/etc/passwd                  OVERWRITE
/etc/shadow                  OVERWRITE
/etc/opasswd                 OVERWRITE
/etc/oshadow                 OVERWRITE
/etc/group                   OVERWRITE
/etc/pwhist                  OVERWRITE
/etc/default/passwd          OVERWRITE
/etc/dfs                     OVERWRITE
/var/log/syslog              APPEND
/var/adm/messages            APPEND
Notice that the default configuration loses your in flight print jobs because /var/spool/lp is not copied. Suppose you have a mail server with a few gigs of mail at /var/mail. Is it a good use of time or disk space to copy /var/mail between boot environments?

A much better solution seems to be to make those directories shared between the boot environments. The way to do this in Live Upgrade and presumably in the future is to remove (or not add) them to /etc/lu/synclist and allocate separate file systems. However, do you really want a file system for /, /var/mail, /var/spool/mqueue, /var/spool/clientmqueue, /var/spool/lp, /var/adm, /var/tmp, /var/dhcp, ...? What if you had someone tell you that you had to monitor every file system on every machine for being out of space? How big would you make all of those file systems so that your monitoring didn't wake you up in the middle of the night?

In the future, it looks as though OpenSolaris will use ZFS to store each boot environment. Among the features of ZFS that make this desirable are snapshots, clones, and rethinking the boundary between disk slices (or volumes) and file systems. If the organization of /var is changed just a bit...

/var/adm -> share/adm
/var/dhcp -> share/dhcp
/var/mail -> share/mail
/var/spool -> share/spool
/var/tmp -> share/tmp
/var/share/adm
/var/share/dhcp
/var/share/mail
/var/share/spool
/var/share/tmp
Then you can get by with having two zfs file systems: / and /var/share. The Snap Upgrade process would then likely do the following:
  1. Take a snapshot of /, clone it, then mount it somewhere usable in subsequent steps (e.g. /mnt/abe)
  2. Do whatever is needed on the alternate boot environment mounted at /mnt/abe.
  3. Unmount the alternate boot environment
When it comes time to activate the new boot environment, there are some files that are likely need to be synchronized using the traditional mechanism. For instance, if someone tried to get into a system by guessing a user's password, there is a reasonable chance that the account was locked via a modification to /etc/shadow. Presumably you don't want to give the bad guy another chance when you activate the new boot environment. Note, however that the files that may need to be synchronized in /etc are nearly always small files and there would not be very many of them. The files in /var/shared would not need to be synchronized. However, just in case the new version of sendmail decides to eat mailboxes, it would be very nice to be able to recover.

This means that activating a boot environment would look like:

  1. Bring the system into single-user mode
  2. Mount the alternate boot environment
  3. Synchronize those files that need to be synchronized
  4. Take a snapshot of /var/shared
  5. Set the boot loader to boot from the new boot environment and offer a failback option to the old boot environment
  6. Reboot
The items in italics are special to boot environment activation. Each one should take a couple seconds or less - adding far less than thirty seconds to the normal reboot process to activate the new boot environment. Failback would be similarly quick.

Now suppose this system is a bit more complicated and has 20 zones on it. Have you ever patched a system with 20 zones on it? Did you start and Friday and finish on Monday? How happy were the users with the "must install in single-user mode" requirement? This same technique should allow you to have two file systems per non-global zone - one for the zone root and one for /var/shared in the zone. Supposing that the reboot processing takes 5 seconds per zone you are looking at an extra minute to reboot rather than a weekend of down time.

Without Live Upgrade or Snap Upgrade, what would backout look like? After you had the system down for patching for a couple days, you could take it down again for a couple days to back the patches out. Or you could go to tape. Neither is an attractive option. With Snap Upgrade you should be able to fail back with your normal reboot time plus a minute.

1 comment:

Hartz said...

I love this idea.

I've got /var/cores split out to a separate file system, (I blogged about it recently), which is simple because it doesn't need synchronizing during activation of the new BE.

I've also moved /var/crash into /var/cores, with a soft-link from /var/crash to /var/cores/crash to "support" existing utilities looking for the files in /var/crash.