Improved Open Source Backup:
Incorporating inline deduplication and sparse indexing solutions
G. P. E. Keeling
2. Software development methodology
Disclaimer: This methodology stuff is nonsense waffle required by the degree
course. My actual methodology was more like
this.
I generally followed the Agile method of XP ('Extreme Programming'), having
found that it worked reasonably well in a previous computer science module.
Wells (2009) provides the following diagram, which gives a general outline of
the XP process.
The coding was done on a Debian Linux platform, in C, although C++ was
considered. A major factor influencing this decision was my more extensive
experience with C, meaning that I would be able to utilise the limited time
more effectively, rather than dealing with problems arising from C++
inexperience. However, considerable effort was made to make sure that my
C structures followed object-oriented principles. Another factor in the
decision to use C was that most of the original burp code was written in C,
albeit with the occasional C++ module for interfacing with Windows APIs.
Most of the testing during development was done on the Debian Linux platform,
with occasional testing using a Windows client cross-compiled with a mingw-w64
toolchain. This enabled quick progress whilst ensuring that nothing major
was broken on the Windows platform.
During the implementation of the second iteration in particular, once it was
clear that essential functionality would be complete before the project
deadline, a fair chunk of time was spent cleaning up existing burp code.
Code redundancy was reduced and old code started to be shifted towards a more
object-oriented approach. The structure of the source tree was
altered significantly in order to logically separate server and client code.
This was all done for the purpose of reducing 'technical debt',
which is very desirable from an Agile viewpoint.
2.1. Initial design decisions
I now needed to make some design decisions about exactly which algorithms
to use, and about the changes that needed to be made to the design of burp to
support these new proposals.
However, I felt that it was important to design the software in such a way
that I could easily switch between different choices of algorithm at some
point in the future. Consequently, it is not critically important to the
design and implementation of the software that the best algorithms are chosen
from the start.
I chose MD5 for the strong checksum.
This was because I was familiar with it (and MD4) through experience with
rsync, librsync, and the original burp, which uses MD5 to
checksum whole files. It is a suitable, low risk choice for strong chunk
checksums in the new software. An MD5 checksum also fits into 128 bits,
which saves memory compared to something like the SHA family
(Eastlake, 2001) - which use a minimum of 160 bits - but still provides good
uniqueness. That is, two different chunks
of data are extremely unlikely to have matching MD5 checksums. For a match,
the chunks also need to have matching weak checksums, which further reduces
the chance of collisions.
Experimentation with other algorithms may be performed at some future point,
though.
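To make the above concrete, the following is a minimal sketch, not taken from the burp source, of how a 128-bit strong chunk checksum could be computed using OpenSSL's MD5 routines. The chunk_strong_checksum() helper is a hypothetical name used only for this illustration, and error handling is omitted.

/*
 * Sketch only: compute a 128-bit (16 byte) MD5 checksum for a chunk of
 * data using OpenSSL. Not taken from the burp source.
 */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

/* Hypothetical helper: fill 'out' with the MD5 checksum of the chunk. */
static void chunk_strong_checksum(const unsigned char *chunk, size_t len,
	unsigned char out[MD5_DIGEST_LENGTH])
{
	MD5_CTX ctx;
	MD5_Init(&ctx);
	MD5_Update(&ctx, chunk, len);
	MD5_Final(out, &ctx);
}

int main(void)
{
	const char *data="example chunk of data";
	unsigned char sum[MD5_DIGEST_LENGTH];
	int i;

	chunk_strong_checksum((const unsigned char *)data,
		strlen(data), sum);

	/* Print the 16 byte (128-bit) checksum as hexadecimal. */
	for(i=0; i<MD5_DIGEST_LENGTH; i++)
		printf("%02x", sum[i]);
	printf("\n");
	return 0;
}

Storing only these 16 bytes per chunk, rather than a 20 byte or larger SHA digest, is where the memory saving mentioned above comes from.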
Lillibridge et al. used their own Two-Threshold Two-Divisor chunking
algorithm (Eshghi, 2005), and assert that it "produces variable-sized chunks
with smaller size variation than other chunking algorithms, leading to
superior deduplication".
However, I chose Rabin fingerprinting (Rabin, 1981) for this task. It gives a
suitable weak checksum, and the implementation that I produced seems suitably
fast, as will be seen in the test results.
Again, at some future point, experimentation with other algorithms may be
performed.
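To illustrate the general idea of a rolling weak checksum being used to find chunk boundaries, the following is a simplified sketch. It uses a Rabin-Karp style polynomial rolling hash rather than a true Rabin fingerprint over GF(2), and the window size, multiplier and boundary mask are illustrative assumptions rather than the values used in the new software.

/*
 * Sketch only: content-defined chunking with a rolling weak checksum.
 * This is a Rabin-Karp style polynomial rolling hash, not a true Rabin
 * fingerprint, and the constants below are illustrative assumptions.
 */
#include <stdio.h>
#include <stdint.h>

#define WINDOW	48		/* bytes in the rolling window (assumed) */
#define MULT	16777619UL	/* polynomial multiplier (assumed) */
#define MASK	0x1FFF		/* low 13 bits: roughly 8KB average chunks */

int main(void)
{
	/* In real use this would be streamed file data. */
	static unsigned char buf[256*1024];
	unsigned char window[WINDOW]={0};
	size_t i, len=sizeof(buf), chunk_start=0;
	uint64_t fp=0, mult_pow=1;

	/* Fill the buffer with arbitrary test data. */
	for(i=0; i<len; i++) buf[i]=(unsigned char)(i*131+7);

	/* Precompute MULT^(WINDOW-1) for removing the outgoing byte. */
	for(i=0; i<WINDOW-1; i++) mult_pow*=MULT;

	for(i=0; i<len; i++)
	{
		unsigned char in=buf[i];
		unsigned char out=window[i%WINDOW];
		window[i%WINDOW]=in;

		/* Slide the window: drop 'out', append 'in'. */
		fp=(fp - out*mult_pow)*MULT + in;

		/* Declare a chunk boundary when the low bits of the
		   fingerprint hit a fixed pattern. */
		if((fp & MASK)==MASK)
		{
			printf("chunk at %zu, length %zu\n",
				chunk_start, i+1-chunk_start);
			chunk_start=i+1;
		}
	}
	if(chunk_start<len)
		printf("chunk at %zu, length %zu\n",
			chunk_start, len-chunk_start);
	return 0;
}

Because the boundary decision depends only on the bytes inside the window, inserting or deleting data near the start of a file only disturbs chunk boundaries locally, which is what allows deduplication to keep matching chunks later in the file.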