Improved Open Source Backup:
Incorporating inline deduplication and sparse indexing solutions
G. P. E. Keeling
Appendix B - Design of the proof of concept programs
In order to ensure a solidly functional rabin algorithm and disk storage
format before embarking on the complex asymetric network i/o that will be
required for the first iteration of the software, I designed and implemented
some proof of concept programs.
The idea was to have programs that could use simple i/o so that data streams
could simply be piped between them, or written to or read from files.
So, there are four programs: 'client', 'server', 'tneilc' and 'revres', and
they operate in pairs.
files on disk backup stream backup stream manifest file
----------> client -------> ----------> server ---> signature files
files on disk restore stream restore stream manifest file
<---------- tneilc <------- <---------- revres <--- signature files
'client': Given a list of files, it reads them from the disk, splits them into
variable length chunks, generates signatures, and outputs a backup stream
(either to a file, or on standard output). There are options available for
tweaking the rabin fingerprinting parameters.
Usage: client [options] <input file> ...
-o output file
-w sliding window size (between 17 and 63)
-a average block size
-m minimum block size
-x maximum block size
'server': Reads a backup stream on standard input or from a file, and converts
the file names, signatures and data to a storage format on disk. The storage
format consists of three parts:
a) Data files, containing actual chunks of data.
b) Signature files, containing the signatures for the data files.
c) A manifest, containing the paths of the original files, and a list of the
signatures that represent the data making up the original file.
The server will read any existing signatures on disk into memory at the start
If the server has seen a chunk in the backup stream already, it will not save
it twice. It will instead save the signature reference to the manifest and
discard the duplicate data.
Usage: server [options] <input file> ...
-d <directory> directory for output
'revres': The reverse of 'server'. It reads a manifest file and
converts the data it references into a restore stream, which can be output
to a file, or to standard output.
Usage: revres [options] <input file> ...
-d base directory
-o output file
'tneilc': The reverse of 'client'. It reads a restore stream, and converts it
back into the files that were originally backed up.
Usage: tneilc [options] <input file> ...
-d <directory> directory for output
Examples of the expected usage of these programs:
To back up two files, named 'a' and 'b':
client README | server -d /tmp/storage
To restore them:
revres -d /tmp/storage/ /tmp/storage/man/0 | tneilc -d /tmp/restore
B.1. Backup stream
In the original burp, each peace of information sent over the wire starts
with a single byte that indicates the type of information. The next four
bytes after that are a hexidecimal number representing the length of
the rest of the line. This means that I can have a maximum of 65535 bytes
in a line. I have found that this schema works well, and will continue to
use it for this project.
For the proof of concept program, I will use:
'f' to mean 'file name'.
's' to mean 'signature'.
'a' to mean 'data' (I reserve 'd' to mean 'directory' later on).
So, when backing up a small file called 'testfile', the backup stream might
look something like the following. For ease of reading, I have split it onto
a0014This is my testfile.
The signature is made up of the rabin fingerprint (16 characters) and the
md5sum (32 characters) of the data in the file ('This is my testfile').
There may be many signature/data pairs per file backed up.
B.2. Restore stream
This is the same as the backup stream, except that there are no 's'
(signature) lines required.
B.3. Storage format
There are three parts to the storage format.
The manifest (instructions on how to recreate a set of files), the signatures,
and the actual data.
In the prototype, there will be three top level diretories in the storage:
'man', 'dat' and 'sig'.
Each backup will have a single manifest. The first manifest will be '0', and
subsequent ones will increment that name by one each time.
The 'dat' and 'sig' directories will each contain a path and file named like
this, where each component of the path and name is a hexidecimal number:
The signature file references the data file with the same path and name.
When enough data is written to this file (the number of chunks to be determined
via some experimentation), it is closed and a new one is
opened with an incremented file name:
When the file part gets to 'FFFF', a new subdirectory '0000/0001' will be
created, and the file within it will start from '0000' again.
Since popular file systems, like ext3, only support 32000 subdirectories,
The subdirectory increments will be limited to a maximum of 30000.
So, there can be 30000*30000*65535=58981500000000 data files, which should
be enough! Most file systems will run out of inodes before the number of
storage data files run out.
The contents of the manifest file will look like the following, where the
format of each line is following the same outline as described for the backup
The signature lines reference locations in the data and signature files, the
first three components being the file itself, and the last being the index
of the signature/data.
The signature files will look like this:
And so on, for however many signatures there are in the signature/data file.
These signatures are loaded by the server before it reads the backup stream
for the client, and uses them for deduplication.
The data files will contain the chunks for each of the signatures, each
preceeded by 'a<four characters of hex>'.
Note that the server does not have to read the signatures in order to retrieve
the data for a restore, because the references in the manifest also point
directly to the correct location in the data files.
Also, this format allows for client compression and encryption of chunks on
the client side at some future date. This will not be done for this project,
but I am ensuring that the decisions that I make now do not preclude that.