Improved Open Source Backup:
Incorporating inline deduplication and sparse indexing solutions
G. P. E. Keeling
Appendix A - How the original 'burp' works
Burp has a client/server design. The machines to be backed up are the clients,
and the machine that stores the backups is the server.
The clients are usually set up to regularly contact the server at short
intervals of, for example, 20 minutes.
The server decides whether or not it is time for the client to be backed up,
based on configuration options that give it timebands and minimum time elapsed
since the last successful backup.
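The server-side decision can be sketched roughly as follows (a minimal illustration; the function and parameter names, and the representation of timebands as hour ranges, are assumptions rather than burp's actual configuration syntax):

```python
from datetime import datetime, timedelta

# Illustrative sketch of the server's "is it time to back up?" decision.
# The names and the (start_hour, end_hour) timeband format are assumptions.
def backup_due(last_success: datetime, now: datetime,
               min_interval: timedelta, timebands: list) -> bool:
    """Return True if the client should be backed up now."""
    # Not enough time has elapsed since the last successful backup.
    if now - last_success < min_interval:
        return False
    # The current time must fall inside one of the allowed timebands.
    return any(start <= now.hour < end for start, end in timebands)
```

Because the client polls frequently, a refusal is cheap: the client simply tries again at the next interval.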
This has several advantages over a system like bacula, where a central server
contacts the clients on a schedule. For example:
- Firewalls tend to allow outgoing connections far more frequently than
incoming connections. Having the clients connect to the server rather than
the other way round means that only access to the single server port is
required. This is a huge administrative advantage.
- Server logic that deals with retrying after failed backups becomes massively
simplified. The client will connect again in a few minutes' time, and the

decision is simply over whether that client has a recent enough successful
backup, and whether it is still in timeband. Conversely, having the server
initiate connections on a schedule means that it also needs complicated logic
for scheduling retries. Indeed, I found this impossible to configure
successfully in bacula.
Network connections use the OpenSSL library to secure communications. There
is an automatic certificate signing and exchange mechanism to provide
confidence and certainty that subsequent communications are with the same
machines.
When a client initiates a connection, the server forks a child process to deal
with it. A nice feature is that the configuration for the client on the server
is read in at this point, meaning that most configuration changes do not need
the administrator to reload or restart the server.
A.1. Backup sequence
There are four phases to the backup sequence.
In the first phase, the client scans its file system (based upon user
configuration to include or exclude files) and sends the results to the server.
The scan includes path names and meta information, such as file system entry
type (regular file, directory, soft link, etc.), permissions, time stamps, and
so on. Retrieving this information is fast because no files are actually
opened at this point - it is just file system data.
The server records all this information in a file called the 'manifest'.
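The metadata-only scan can be illustrated with a minimal sketch (the field names and dictionary layout are assumptions; burp's real manifest format differs):

```python
import os
import stat

# Illustrative sketch of the phase-one scan: walk the file system and
# collect only metadata via lstat, never opening file contents.
def scan(top):
    entries = []
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()          # keep the traversal in alphabetical order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat does not follow symlinks
            entries.append({
                "path": path,
                "type": "l" if stat.S_ISLNK(st.st_mode) else "f",
                "mode": st.st_mode,
                "mtime": st.st_mtime,
                "size": st.st_size,
            })
    return entries
```

Since only `lstat` calls are made, the scan touches inodes but never data blocks, which is why this phase is fast.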
In the second phase, the server reads through the manifest from the immediately
previous backup and the new manifest. Because the paths in the manifests are
always in alphabetical order, only a single pass is required.
If it finds files that do not exist in the previous manifest, it asks the
client to transfer the whole file.
If it finds files in both manifests where the modification times are the same,
it makes a note that the file has not changed.
If it finds files in both manifests where the modification times are different,
it initiates a transfer using librsync. It reads the previous version of the
file in order to generate 'signatures'. It sends the signatures to the client,
and the client uses them to transfer only blocks that have changed. This
results in a 'delta' file appearing on the server. There is more on librsync
and its limitations, or problems, below.
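The single-pass comparison of the two alphabetically ordered manifests can be sketched like this (a simplified model: real manifest entries carry far more metadata than a path and a modification time, and the action names are illustrative):

```python
# Sketch of the phase-two single pass over two manifests sorted by path.
# Each manifest is a list of (path, mtime) pairs.
def compare(old, new):
    actions = []
    i = j = 0
    while j < len(new):
        if i < len(old) and old[i][0] < new[j][0]:
            i += 1                                    # deleted since last backup
        elif i < len(old) and old[i][0] == new[j][0]:
            if old[i][1] == new[j][1]:
                actions.append(("unchanged", new[j][0]))
            else:
                actions.append(("delta", new[j][0]))  # librsync transfer
            i += 1
            j += 1
        else:
            actions.append(("whole", new[j][0]))      # new file: send everything
            j += 1
    return actions
```

Because both inputs are sorted, each entry is examined once, so the pass is linear in the size of the manifests.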
When the second phase is complete, the client has no more work to do, and
disconnects from the server.
However, the server child process still has more to do.
In the third phase, the server combines the notes that it made of what was
and was not transferred, and builds a final manifest for the backup. This is
a very quick phase, usually taking only a few seconds.
In the fourth phase, the server goes through the final manifest and the new
files and deltas on disk, along with the unchanged files from the previous
backup, and finalises the new backup. It needs to apply the deltas to the
previous files to generate new files, and it makes hard linked entries to
represent the unchanged files.
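The fourth phase can be sketched as follows (a simplified illustration: `apply_delta` is a hypothetical stand-in for librsync's patch operation, and the directory layout is an assumption):

```python
import os
import shutil

# Illustrative sketch of phase four: for each manifest entry, either
# hard-link the unchanged file from the previous backup, move a whole new
# file into place, or apply a delta to the previous version.
# apply_delta is a hypothetical stand-in for librsync patching.
def finalise(manifest, prev_dir, work_dir, new_dir):
    for action, relpath in manifest:
        dst = os.path.join(new_dir, relpath)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if action == "unchanged":
            # A hard link consumes a directory entry but no extra data blocks.
            os.link(os.path.join(prev_dir, relpath), dst)
        elif action == "whole":
            shutil.move(os.path.join(work_dir, relpath), dst)
        elif action == "delta":
            apply_delta(os.path.join(prev_dir, relpath),
                        os.path.join(work_dir, relpath + ".delta"), dst)
```

The hard links are what make unchanged files essentially free in each new backup, at the cost of one file system entry per file.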
Optionally, it can then generate reverse deltas to take you from the latest
versions of files to the previous versions, and deletes the original file from
the previous backup. This can save significant disk space, but could add
significant time to this phase, as well as adding to the time needed for
restores.
A.2. Storage directories
To do a restore from any backup in a sequence, the server will only ever need
to access that backup, and possibly subsequent backups. It will never need to
access previous backups in the sequence.
This means that it is always 'safe' to delete the oldest backup in the
sequence. This is not the case for backup solutions where you need access to
older backups to recreate newer backups (bacula, for example). For such
solutions, a satisfactory retention policy is highly problematic.
So, the resultant burp backup directories contain a manifest file, and a heap
of (possibly very many) files and (optionally) deltas.
Since the files are stored in a directory structure that reflects the original
client, a convenient way to restore a file is just to copy it directly out
of the storage directory, rather than using the burp client.
A client's sequence of backups is always self-contained. That is, no
information external to them is needed for them to be complete. Additionally,
if you are not generating reverse deltas, each individual backup is
self-contained. This is an advantage over backup solutions like bacula that
use a separate 'catalogue' to store paths, meta data and storage locations.
Such solutions add a significant administrative headache and many opportunities
to end up with useless backups. You must have provision to back up your
catalogue, for example.
A disadvantage to the burp storage structure is that there may be millions of
files in a single backup, each needing its own file system entry. This will
slow down the fourth phase of a backup significantly, especially if you have
set up your backup server to use Network Attached Storage (NAS), where each
'stat' system call goes over the network.
Burp's deduplication mechanism operates 'after the fact', as it were.
It has a separate program that can be run on a cron job, which scans the
storage file system looking for duplicate files. When it finds one, it replaces
it with a hard link.
Obviously, this takes significant time when there are millions of files to
scan. It also does not deduplicate chunks within files. The deduplication
program is not run on a default install of burp.
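A simplified sketch of such a scan-and-link pass follows (the real program is more careful, for instance in how it selects candidates before comparing them; the use of a SHA-1 digest of the whole file here is an assumption for illustration):

```python
import hashlib
import os
from collections import defaultdict

# Sketch of after-the-fact whole-file deduplication: group files by a
# content digest, then replace each duplicate with a hard link to one
# kept copy. Hashing every file in full is a simplifying assumption.
def dedup(top):
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).digest()
            by_hash[digest].append(path)
    for paths in by_hash.values():
        keep, *dups = paths
        for dup in dups:
            os.unlink(dup)      # remove the duplicate...
            os.link(keep, dup)  # ...and replace it with a hard link
```

Note that every file must be read in full to be hashed, which is why the pass is so slow over millions of files, and why whole-file hashing can never deduplicate chunks within files.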
It would be far preferable to have an 'inline' deduplication mechanism, where
the deduplication happens to chunks during the usual backup process. A
backup solution that does this is 'bup'.