BURP - BackUp and Restore Program

Improved Open Source Backup:
Incorporating inline deduplication and sparse indexing solutions

G. P. E. Keeling

Appendix A - How the original 'burp' works

Burp has a client/server design. The machines to be backed up are the clients, and the machine that stores the backups is the server.

The clients are usually set up to regularly contact the server at short intervals of, for example, 20 minutes.
The server decides whether or not it is time for the client to be backed up, based on configuration options that give it timebands and minimum time elapsed since the last successful backup.

This has several advantages over a system like bacula, where a central server contacts the clients on a schedule. For example:

Firewalls tend to allow outgoing connections far more frequently than incoming connections. Having the clients connect to the server rather than the other way round means that only access to the single server port is required. This is a huge administrative advantage.
Server logic that deals with retrying after failed backups becomes massively simplified. The client will connect again in a few minutes time, and the decision is simply over whether that client has a recent enough successful backup, and whether it is still in timeband. Conversely, having the server initiate connections on a schedule means that it also needs complicated logic for scheduling retries. Indeed, I found this impossible to configure successfully in bacula.

Network connections use the OpenSSL library to secure communications. There is an automatic certificate signing and exchange mechanism to provide confidence and certainty that subsequent communications are with the same peer.

When a client initiates a connection, the server forks a child process to deal with it. A nice feature is that the configuration for the client on the server is read in at this point, meaning that most configuration changes do not need the administrator to reload or restart the server.

A.1. Backup sequence

There are four phases to the backup sequence.

In the first phase, the client scans its file system (based upon user configuration to include or exclude files) and sends the results to the server. The scan includes path names and meta information, such as file system entry type (regular file, directory, soft link, etc), permissions, time stamps, and so on. Retrieving this information is fast because no files are actually opened at this point - it is just file system data.
The server records all this information in a file called the 'manifest'.

In the second phase, the server reads through the manifest from the immediately previous backup and the new manifest. Because the paths in the manifests are always in alphabetical order, only a single pass is required.

If it finds files that do not exist in the previous manifest, it asks the client to transfer the whole file.
If it finds files in both manifests where the modification times are the same, it makes a note that that file has not changed.
If it finds files in both manifests where the modification times are different, it initiates a transfer using librsync. It reads the previous version of the file in order to generate 'signatures'. It sends the signatures to the client, and the client uses them to transfer only blocks that have changed. This results in a 'delta' file appearing on the server. There is more on librsync and its limitations, or problems, below.

When the second phase is complete, the client has no more work to do, and disconnects from the server.
However, the server child process still has more to do.

In the third phase, the server combines the notes that it made of what was and was not transferred, and builds a final manifest for the backup. This is a very quick phase, usually taking only a few seconds.

In the fourth phase, the server goes through the final manifest and the new files and deltas on disk, along with the unchanged files from the previous backup, and finalises the new backup. It needs to apply the deltas to the previous files to generate new files, and it makes hard linked entries to represent the unchanged files.
Optionally, it can then generate reverse deltas to take you from the latest versions of files to the previous versions, and deletes the original file from the previous backup. This can save significant disk space, but could add significant time to this phase, as well as adding to the time needed for restore.

A.2. Storage directories

To do a restore from any backup in a sequence, the server will only ever need to access that backup, and possibly subsequent backups. It will never need to access previous backups in the sequence.
This means that it is always 'safe' to delete the oldest backup in the sequence. This is not the case for backup solutions where you need access to older backups to recreate newer backups (bacula, for example). For such solutions, a satisfactory retention policy is highly problematic.

So, the resultant burp backup directories contain a manifest file, and a heap of (possibly very many) files and (optionally) deltas.

Since the files are stored in a directory structure that reflects the original client, a convenient way to restore a file is just to copy it directly out of the storage directory, rather than using the burp client.

A client's sequence of backups is always self-contained. That is, no information external to them is needed for them to be complete. Additionally, if you are not generating reverse deltas, each individual backup is self-contained. This is an advantage over backup solutions like bacula that use a separate 'catalogue' to store paths, meta data and storage locations. Such solutions add a significant administrative headache and many opportunities to end up with useless backups. You must have provision to back up your catalogue, for example.

A disadvantage to the burp storage structure is that there may be millions of files in a single backup, each needing its own file system entry. This will slow down the fourth phase of a backup significantly. Especially if you have set up your backup server to use Network Attached Storage (NAS), where each 'stat' system call goes over the network.

A.3. Deduplication

Burp's deduplication mechanism operates 'after the fact', as it were. It has a separate program that can be run on a cron job, which scans the storage file system looking for duplicate files. When it finds one, it replaces it with a hard link.
Obviously, this takes significant time when there are millions of files to scan. It also does not deduplicate chunks within files. The deduplication program is not run on a default install of burp.

It would be far preferable to have an 'inline' deduplication mechanism, where the deduplication happens to chunks during the usual backup process. A backup solution that does this is 'bup'.

< Prev Contents Next >

Burp is open and free software. I work on it in my spare time. If you would like this work to continue, please consider making a small donation.

Burp, don't suck.