Improved Open Source Backup:
Incorporating inline deduplication and sparse indexing solutions
G. P. E. Keeling
11. Evaluation of the project as a whole
In order to produce the new software, I initially produced a list of
objectives. Within the time allocated to this project, I was able to
achieve all of the objectives, as follows.
Core: Complete a search and review of popular existing open source
network backup solutions, explaining the areas in which current backup
offerings are flawed.
This was done early on in the project, and the results can be read in
'Appendix F - Open source competition'.
Core: Complete a literature search and review of relevant algorithms and
techniques that will be needed to implement a new software engine.
This was also done early on in the project, and the information learnt was
used in the design of the new software engine.
Core: Design and develop the new software.
The design was fairly quick, but the actual development took a significant
amount of time, including proof of concept programs
and iterations of the final software. The iteration used for the
test results was completed with around a month remaining before the deadline.
Advanced: By conducting suitable tests and analysis, prove that the
new software is superior to other offerings.
The new software was tested and shown to be either comparable or superior to
other solutions in most areas, with the exception of client memory usage. This
will be addressed before the initial release of the new software.
Advanced: Demonstrate that sparse indexing is an effective solution to
the disk deduplication bottleneck.
The bottleneck problem was demonstrated by another deduplicating solution
(backshift), which took longer than two days to complete a single backup.
Analysis of the new software's results for server memory usage, disk space and
the time taken for backing up showed that sparse indexing is an effective
solution to the disk deduplication bottleneck.
Core: Complete the final project report.
The final project report was completed before the deadline.
Overall, I was pleased with the way that the project went. I put a lot of
my spare time into this, and with a little more work the result has the
potential to be useful to many people.
The two hardest parts of the project were implementing the multiple streams
of asynchronous I/O, and the testing of the various software.
The former would have been slightly simpler had I kept the file
system scan as a separate phase before backing up the data, which would
have required only two streams in each direction instead of three.
Testing the various backup solutions was hard because of the amount of time
involved. I would very often run a test sequence over a few days,
and find at the end that I would have to run it again for some reason. For
example, network compression may have been turned on, meaning the results could
not be compared fairly against other solutions that were not doing network
compression.
Fortunately, as mentioned in the Intermediate Project Report, I began the
testing ahead of schedule and was therefore able to resolve these issues and
complete the testing on time.
The decision to continue to code in C was the correct one, as I think I would
not have finished the iterations in time had I tried to move to C++. However,
the object-oriented style of coding that I am moving towards seems effective.
In the end, I find it surprising that the ideas utilised during this project
have not already been combined in open source software.
One possible explanation is that the hard work involved in developing
this kind of software leads the authors to want to
sell the results. I do not intend to do that, but I can see a potential route
to remuneration via providing support for the software instead. That is, if
it really does turn out to be useful to other people.
It would be fantastic to work on this software full time.