BU CDRW Data Format

Bu automatically breaks a dump into multiple volumes whenever the data will not fit on one volume. Each volume holds a standard ISO9660 filesystem with the Rock Ridge extensions, containing the following files.

 data.tar.gz       Gzip compressed tar file of all files on the volume.
                   Compression is on by default but if it is turned off
                   in the RC file, this will be a plain tar file.
 file-list         Text file containing a long directory listing of all
                   files on the volume.
 info              Text file containing volume and backup information.

In addition, the last volume contains the following file.

 MASTER-FILE-LIST  Long directory listing of the files for all volumes
                   combined.  This file contains a volume number
                   label at the top of the list of files for each volume.

The info file for all volumes but the last will look something like this.

 Label:         Quark's home directories
 Date:          Tue May  7 15:31:48 2002
 Type:          Directory list
 Mode:          Full
 Volume size:   1065.45 MB
 Volume number: 1

 Directories
 -----------
 /u1/home

The info file on the last volume has an extra field and shows the total number of volumes. It will look something like this.

 Label:         Quark's home directories
 Date:          Tue May  7 15:31:48 2002
 Type:          Directory list
 Mode:          Full
 Volume size:   1632.62 MB
 Total size:    2698.07 MB
 Volume number: 2 of 2

 Directories
 -----------
 /u1/home
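
The info layout shown above is simple enough to read mechanically. The following is a minimal sketch, not part of bu: the function names parse_info and volume_position are hypothetical, and it assumes every info file uses the same "Name: value" lines followed by a "Directories" heading, as in the samples here. Other dump types may use a different heading.

```python
def parse_info(text):
    """Parse the text of an info file into (fields, directories)."""
    fields = {}
    directories = []
    in_dirs = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_dirs:
            # Skip the dashed underline below the "Directories" heading.
            if set(line) != {"-"}:
                directories.append(line)
        elif line == "Directories":
            in_dirs = True
        elif ":" in line:
            # Split on the first colon only, so times like 15:31:48 survive.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields, directories


def volume_position(fields):
    """Return (number, total); total is None on all but the last volume."""
    number, _, total = fields["Volume number"].partition(" of ")
    return int(number), int(total) if total else None


# Sample text copied from the last-volume example in this document.
sample = """\
 Label:         Quark's home directories
 Date:          Tue May  7 15:31:48 2002
 Type:          Directory list
 Mode:          Full
 Volume size:   1632.62 MB
 Total size:    2698.07 MB
 Volume number: 2 of 2

 Directories
 -----------
 /u1/home
"""

fields, directories = parse_info(sample)
print(fields["Label"], volume_position(fields), directories)
```

A "Total size" field, or a "Volume number" of the form "N of M", identifies the last volume of a set.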

Bu does not split files between volumes the way uncompressed multi-volume tar does; the last file on each volume is always complete. This has several advantages and a couple of disadvantages, listed below. I feel the disadvantages are minimal and far outweighed by the advantages.

Advantages

  • You never have to go through two volumes to restore a single file.
  • The volumes can be restored in any order.
  • It is more robust and dependable.
    Bu uses GNU tar, which has been very robust and dependable over the years for creating and extracting single volume tar files.

    Multi-volume formats that split files to fill the volumes are not necessarily standard or dependable. GNU tar is one of the few utilities I know of that supports multi-volume archives, and it cannot combine them with compression.

    GNU tar handles multi-volume archives well when it accesses them directly without compression, but that did not help bu. During development I tried running multi-volume tar and streaming the data through a pipe to a separate process that handled compression, so that tar was unaware of it. It turned out that even GNU tar had a bug that corrupted the file split between volumes when restoring from a pipe.

    GNU cpio also switches volumes when it reaches the end of the physical media, but it has no option to set a volume size, so it would not work for bu. In addition, cpio could not handle the large device numbers that BSD uses for device files.

  • It is easy to extract the data with no special tools.
    If I had used a multi-volume format, extracting the data would have required either bu itself or complex command lines or scripts. This way, even if bu is not present, such as while recovering from a crash in maintenance mode, only tar and gzip are needed, with simple command line options that almost anybody who uses Unix regularly is familiar with.

    This also adds to the robustness, since file recovery is easier and more dependable.
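
Because each volume carries an ordinary tar file, any standard tar reader can restore from it. As an illustration only, here is a minimal sketch using Python's standard tarfile module in place of the tar and gzip commands. The helper names list_members and restore_file are hypothetical. The mode "r:*" lets tarfile detect whether the archive is gzip compressed or plain, which covers the case where compression was turned off in the RC file.

```python
import tarfile


def list_members(archive_path):
    """Return the names of all members in a volume's tar file."""
    with tarfile.open(archive_path, "r:*") as tar:
        return tar.getnames()


def restore_file(archive_path, member_name, dest_dir="."):
    """Extract a single member from a volume's tar file into dest_dir."""
    with tarfile.open(archive_path, "r:*") as tar:
        # getmember raises KeyError if the name is not in this volume.
        member = tar.getmember(member_name)
        tar.extract(member, path=dest_dir)
```

For example, with a volume mounted at a hypothetical /mnt/cdrom, restore_file("/mnt/cdrom/data.tar.gz", "u1/home/quark/notes.txt", "/tmp/restore") would restore one file. The member name here is an assumption: GNU tar strips the leading "/" from stored names by default, so check the file-list on the volume for the exact names bu recorded.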

Disadvantages

  • You cannot dump a file that is extremely large.
    If a file is too large to fit on the media after compression, bu will gracefully skip it, tell you about it, and go on to the next file. Depending on the compressibility of the file, the maximum file size will range from 650 MB to 2.5 GB or more on a standard 650 MB CD. If you need to dump a larger file, you will have to split it ahead of time.
  • You do not get 100% space utilization.
    Since bu does not split files, it changes volumes when it encounters a file that it thinks will not fit in the remaining space of the current volume. Bu decides this based on the file size, an estimated compression ratio based on the file type, and, of course, the remaining available space on the volume. This leaves a varying amount of unused space on each volume. Bu also reserves a small percentage of the space to allow for the file lists and volume info files.

    Even so, on 650 MB CDs I have been getting an average of about 1.2 GB of data (measured before compression) on each volume, and a maximum of about 2.3 GB. This varies depending on the average file sizes and types.

    To maximize volume usage, bu also continuously monitors the amount of actual data, after compression, that has been written during the dump, so that errors from estimating the compression ratio of individual files do not accumulate.
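
The volume-change rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not bu's actual code: the names plan_volumes, Volume, and demo_ratio, the 2% default reserve, and the demo compression ratios are all hypothetical. It also works purely from estimates, whereas bu corrects its running total with the amount of data actually written.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    files: list = field(default_factory=list)
    used: float = 0.0  # estimated compressed size written so far


def plan_volumes(files, capacity, estimate_ratio, reserve_fraction=0.02):
    """Assign whole files to volumes in order, never splitting a file.

    files is an iterable of (path, size) pairs, estimate_ratio(path)
    returns the assumed compressed/original size ratio, and
    reserve_fraction is the share of each volume held back for the
    file lists and info file (hypothetical default).
    Returns (volumes, skipped).
    """
    usable = capacity * (1.0 - reserve_fraction)
    volumes = [Volume()]
    skipped = []
    for path, size in files:
        estimated = size * estimate_ratio(path)
        if estimated > usable:
            # Too large for any single volume: skip it and move on.
            skipped.append(path)
            continue
        current = volumes[-1]
        if current.used + estimated > usable:
            # Does not fit in the remaining space: start a new volume.
            current = Volume()
            volumes.append(current)
        current.files.append(path)
        current.used += estimated
    return volumes, skipped


def demo_ratio(path):
    # Made-up ratios: text compresses to half, everything else not at all.
    return 0.5 if path.endswith(".txt") else 1.0


demo_files = [("a.txt", 120), ("b.jpg", 50), ("c.jpg", 30), ("d.iso", 500)]
volumes, skipped = plan_volumes(demo_files, capacity=100,
                                estimate_ratio=demo_ratio,
                                reserve_fraction=0.0)
for number, volume in enumerate(volumes, start=1):
    print(number, volume.files, volume.used)
print("skipped:", skipped)
```

In the demo, b.jpg does not fit after a.txt, so a second volume is started and the remaining space on the first is left unused. That is the space utilization cost described above.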