| Improved Open Source Backup:
Incorporating inline deduplication and sparse indexing solutions
G. P. E. Keeling< Prev
Contents
Next >Appendix B - Design of the proof of concept programs
In order to ensure a solidly functional rabin algorithm and disk storage
format before embarking on the complex asymetric network i/o that will be
required for the first iteration of the software, I designed and implemented
some proof of concept programs.
 
The idea was to have programs that could use simple i/o so that data streams
could simply be piped between them, or written to or read from files.
 
So, there are four programs: 'client', 'server', 'tneilc' and 'revres', and
they operate in pairs.
 
files on disk      backup stream   backup stream           manifest file
----------> client ------->        ----------> server ---> signature files
                                                           data files
files on disk      restore stream  restore stream          manifest file
<---------- tneilc <-------        <---------- revres <--- signature files
                                                           data files
'client': Given a list of files, it reads them from the disk, splits them into
variable length chunks, generates signatures, and outputs a backup stream
(either to a file, or on standard output). There are options available for
tweaking the rabin fingerprinting parameters.
 
Usage: client [options] <input file> ...
 -o output file
 -w sliding window size (between 17 and 63)
 -a average block size
 -m minimum block size
 -x maximum block size
 
'server': Reads a backup stream on standard input or from a file, and converts
the file names, signatures and data to a storage format on disk. The storage
format consists of three parts:
a) Data files, containing actual chunks of data.
 b) Signature files, containing the signatures for the data files.
 c) A manifest, containing the paths of the original files, and a list of the
   signatures that represent the data making up the original file.
 The server will read any existing signatures on disk into memory at the start
of execution.
 If the server has seen a chunk in the backup stream already, it will not save
it twice. It will instead save the signature reference to the manifest and
discard the duplicate data.
 
Usage: server [options] <input file> ...
 -d <directory> directory for output
 
'revres': The reverse of 'server'. It reads a manifest file and
converts the data it references into a restore stream, which can be output
to a file, or to standard output.
 
Usage: revres [options] <input file> ...
 -d base directory
 -o output file
 
'tneilc': The reverse of 'client'. It reads a restore stream, and converts it
back into the files that were originally backed up.
 
Usage: tneilc [options] <input file> ...
 -d <directory> directory for output
 
Examples of the expected usage of these programs:
 
To back up two files, named 'a' and 'b':
client README | server -d /tmp/storage
To restore them:
revres -d /tmp/storage/ /tmp/storage/man/0 | tneilc -d /tmp/restore
 B.1. Backup stream
In the original burp, each peace of information sent over the wire starts
with a single byte that indicates the type of information. The next four
bytes after that are a hexidecimal number representing the length of 
the rest of the line. This means that I can have a maximum of 65535 bytes
in a line. I have found that this schema works well, and will continue to
use it for this project.
For the proof of concept program, I will use:
 'f' to mean 'file name'.
 's' to mean 'signature'.
 'a' to mean 'data' (I reserve 'd' to mean 'directory' later on).
 So, when backing up a small file called 'testfile', the backup stream might
look something like the following. For ease of reading, I have split it onto
separate lines.
 
f0008testfile
s00300000006E18F503EA810FFB91AF5E8C4874BA3F8CE6DF96A9
a0014This is my testfile.
 
The signature is made up of the rabin fingerprint (16 characters) and the
md5sum (32 characters) of the data in the file ('This is my testfile').
There may be many signature/data pairs per file backed up.
 B.2. Restore stream
This is the same as the backup stream, except that there are no 's'
(signature) lines required.
 B.3. Storage format
There are three parts to the storage format.
The manifest (instructions on how to recreate a set of files), the signatures,
and the actual data.
 
In the prototype, there will be three top level diretories in the storage:
'man', 'dat' and 'sig'.
Each backup will have a single manifest. The first manifest will be '0', and
subsequent ones will increment that name by one each time.
The 'dat' and 'sig' directories will each contain a path and file named like
this, where each component of the path and name is a hexidecimal number:
 '0000/0000/0000'
 The signature file references the data file with the same path and name.
 When enough data is written to this file (the number of chunks to be determined
via some experimentation), it is closed and a new one is
opened with an incremented file name:
 '0000/0000/0001'
 
When the file part gets to 'FFFF', a new subdirectory '0000/0001' will be
created, and the file within it will start from '0000' again.
Since popular file systems, like ext3, only support 32000 subdirectories,
The subdirectory increments will be limited to a maximum of 30000.
 
So, there can be 30000*30000*65535=58981500000000 data files, which should
be enough! Most file systems will run out of inodes before the number of
storage data files run out.
 
The contents of the manifest file will look like the following, where the
format of each line is following the same outline as described for the backup
stream:
 
f0009testfile2
s00130000/0000/0000/0000
s00130000/0000/0000/0001
s00130000/0000/0000/0000
s00130000/0000/0000/0001
s00130000/0000/0000/0002
f0009testfile3
s00130000/0000/0000/0000
 
The signature lines reference locations in the data and signature files, the
first three components being the file itself, and the last being the index
of the signature/data.
 
The signature files will look like this:
 
s0030F45FE5EFF25857E918C552AEFA3CFD39A8ECB1F4E8D09196
s0030A423F6E2AF45C796EA35BD8C9ABC4D328133349EB80DC52A
s00301C9472B37A8DDC2106A9EBC217CF44AAC33A4AB50641A3A5
s0030933AEF79A6ED384202DA19419BC5F72908596D9D2E8B7C8F
s0030C1BEF23979ADDAADACDD8BEDF8BA67A91090777FEBD69266
And so on, for however many signatures there are in the signature/data file. 
These signatures are loaded by the server before it reads the backup stream
for the client, and uses them for deduplication.
 
 
The data files will contain the chunks for each of the signatures, each
preceeded by 'a<four characters of hex>'.
< Prev
Contents
Next >Note that the server does not have to read the signatures in order to retrieve
the data for a restore, because the references in the manifest also point
directly to the correct location in the data files.
 Also, this format allows for client compression and encryption of chunks on
the client side at some future date. This will not be done for this project,
but I am ensuring that the decisions that I make now do not preclude that.
 |