5. Comparative testing
Once the first iteration was basically working, I designed a procedure with
which to test the various backup solutions.
The testing was done using a Linux server and client, because this greatly
simplifies setup and the measurement of resource utilisation. The machines
were connected to each other using a 100Mb/s switch.
Server
  CPU:    Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (dual core)
  RAM:    4GB
  OS:     Linux version 3.2.0-3-amd64 (Debian 3.2.21-3)
  Disk 1: ATA WDC WD400BB-00JH 05.0 PQ (40GB - for the OS)
  Disk 2: ATA ST3400620A 3.AA PQ (400GB - for the storage)
Client
  CPU:    Intel(R) Atom(TM) CPU D510 @ 1.66GHz (quad core)
  RAM:    4GB
  OS:     Linux version 3.2.0-4-amd64 (Debian 3.2.46-1+deb7u1)
  Disk 1: ATA SAMSUNG HD501LJ CR10 PQ: 0 ANSI: 5 (500GB)
There were two test sequences for each backup software. In both cases, files
were initially copied into a directory on the client computer for the purposes
of being backed up.
a) Many small files - I downloaded 59 different Linux kernel source packages
from http://kernel.org/ and unpacked them.
This resulted in 1535717 files and directories, and 20GB (20001048kb) of
data, which is an average of about 13kb per file.
b) One large file - I used a 22GB VirtualBox VDI image file of a Windows 7
machine. I took one copy of this, started and stopped the Windows 7 virtual
machine, and then took another copy of the file, which had now changed.
Each sequence had the following steps, each of which targets a potential
weakness of backup software. For example, updating the timestamp of a large
file could cause the whole file to be copied across the network even though
none of the data has changed. A sketch of how the data-changing steps might be
scripted follows the list.
- Perform a backup.
- Perform a backup without changing anything.
- Perform a backup after changing some of the data.
For the small files, I randomly scrambled the files in one of the kernel
directories.
For the large file, I used the rebooted VDI image.
- Perform a backup after updating the timestamp on some of the files.
For the small files, I updated all of the timestamps in one of the kernel
directories without changing the data.
For the large file, I updated its timestamp without changing its data.
- Perform a backup after renaming some of the files.
For the small files, I created a new directory and moved half of the kernel
sources into it.
- Perform a backup after deleting some of the files.
For the small files, I deleted half of them.
For the large file, I truncated it to 11GB.
- Restore all the files from each of the six backups.
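The following is a minimal sketch of how the data changes in steps three to
six might be scripted, for illustration only. The directory and file names are
placeholders; the submitted test scripts remain the authoritative versions.

#!/usr/bin/env python3
# Illustrative sketch only: paths, the choice of kernel tree and the
# interpretation of "scrambled" are assumptions, not the submitted scripts.
import os
import random
import shutil

DATA = "/home/test/smallfiles"            # placeholder: unpacked kernel sources
TREE = os.path.join(DATA, "linux-3.9")    # placeholder: one kernel tree to modify
VDI = "/home/test/largefile/win7.vdi"     # placeholder: the large VDI image

def walk_files(top):
    for root, _, names in os.walk(top):
        for name in names:
            yield os.path.join(root, name)

def scramble(top):
    # Step 3, small files: rewrite each file with its bytes shuffled.
    for path in walk_files(top):
        with open(path, "rb") as f:
            data = bytearray(f.read())
        random.shuffle(data)
        with open(path, "wb") as f:
            f.write(data)

def touch_all(top):
    # Step 4: update every timestamp without changing any data.
    for path in walk_files(top):
        os.utime(path)

def rename_half(top):
    # Step 5, small files: move half of the kernel trees into a new directory.
    trees = sorted(os.listdir(top))
    dest = os.path.join(top, "moved")
    os.mkdir(dest)
    for tree in trees[:len(trees) // 2]:
        shutil.move(os.path.join(top, tree), dest)

def delete_half(top):
    # Step 6, small files: delete half of the remaining kernel trees.
    trees = sorted(t for t in os.listdir(top) if t != "moved")
    for tree in trees[:len(trees) // 2]:
        shutil.rmtree(os.path.join(top, tree))

def truncate_large(path):
    # Step 6, large file: truncate the image to 11GB.
    os.truncate(path, 11 * 1024 ** 3)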
These measurements were taken for each backup or restore:
- The time taken.
- The cumulative disk space used after each backup.
- The cumulative number of file system nodes used by the backup.
- Bytes sent over the network from the server.
- Bytes sent over the network from the client.
- Maximum memory usage on the server.
- Maximum memory usage on the client.
I would also have liked to measure the CPU utilisation, but I was not able
to find a satisfactory practical way to do this for each piece of
software.
To get the time taken and the memory usage statistics, I used the
GNU 'time' program.
To get the disk space statistics, I used the 'du' command.
To get the number of file system nodes, I used the 'find' command, piped to
'wc'.
To get the network statistics, I used the Linux firewall, 'iptables', to count
the bytes going to and from particular TCP ports.
Immediately before each test, I would reset the firewall counters and flush
the disk cache on both server and client.
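The sketch below illustrates how these measurements might be gathered around a
single backup run. The backup command, storage directory and TCP port are
placeholders, iptables accounting rules for the backup port are assumed to
have been added beforehand, and the 'du' and 'find' figures would be gathered
on the server, where the storage lives; the submitted scripts contain the
exact logic.

#!/usr/bin/env python3
# Illustrative sketch only: the backup command, storage path and server
# port are placeholders, and iptables rules matching the port (one INPUT
# and one OUTPUT rule) are assumed to exist already.
import subprocess

BACKUP_CMD = ["some-backup-client", "--backup"]  # placeholder client command
STORAGE = "/storage/backups"                     # placeholder storage directory
PORT = "4971"                                    # placeholder server TCP port

def sh(cmd):
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout

# Reset the firewall byte counters and flush the disk cache before the test.
sh("iptables -Z")
sh("sync; echo 3 > /proc/sys/vm/drop_caches")

# GNU time's verbose output, written to stderr, includes the elapsed time
# and the maximum resident set size.
result = subprocess.run(["/usr/bin/time", "-v"] + BACKUP_CMD,
                        capture_output=True, text=True)
print(result.stderr)

# Cumulative disk space and number of file system nodes used by the storage.
print(sh("du -sk " + STORAGE))
print(sh("find " + STORAGE + " | wc -l"))

# Byte counts to and from the backup port, taken from the firewall counters.
print(sh("iptables -nvxL | grep " + PORT))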
Since each test sequence could take a long time, I wrote scripts to automate
the testing process so that it could run without intervention. The scripts had
to be customised for each backup software under test, and they are included
with the software as part of the submitted materials for this project.
I ensured that each software was configured with the same features for each
test; there would be no compression on the network or in the storage,
and there would be encryption on the network but not in the storage.
For those software that had no native network support, this meant running the
software over secure shell (ssh).
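For example, a run of rsync --link-dest over ssh with network compression
explicitly disabled, matching the encrypted-but-uncompressed configuration
used throughout, might look roughly like the sketch below. The host name and
paths are placeholders; the submitted scripts contain the exact invocations.

#!/usr/bin/env python3
# Illustrative sketch only: host name and paths are placeholders.
import subprocess

SRC = "/home/test/smallfiles/"            # placeholder source directory
DEST = "server:/storage/backups/current"  # placeholder destination on the server
PREVIOUS = "/storage/backups/previous"    # previous backup, used for hard links

subprocess.run([
    "rsync", "-a",                        # archive mode: preserve attributes
    "--link-dest=" + PREVIOUS,            # hard link files unchanged since the previous backup
    "-e", "ssh -o Compression=no",        # encrypted transport with compression turned off
    SRC, DEST,
], check=True)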
During the initial testing, I discovered that burp's network library was
automatically compressing on the network, leading to incorrect results. I made
a one line patch to the versions of burp under test in order to turn off
the network compression. This will become a configurable option in future
versions. I redid the initial testing of both versions of burp for this reason.
These are the candidate software for the initial tests. For more detailed
information on each of them, see the Bibliography and Appendix F.
burp-1.3.36 was the latest version of the original burp at the time the testing
was started.
burp-2.0.0 is the first iteration of the new software.
Key to feature columns:
  1 - Good cross platform support
  2 - Native network support
  3 - Good attribute support
  4 - Good imaging support
  5 - Notifications and scheduling
  6 - Retention periods
  7 - Resume on interrupt
  8 - Hard link farm
  9 - Inline deduplication

Software                 |  1  |  2  |  3  |  4  |  5   |  6   |  7  |  8  |  9
-------------------------+-----+-----+-----+-----+------+------+-----+-----+-----
amanda 3.3.1             | No  | No  | Yes | No  | Yes* | Yes  | No  | No  | No
backshift 1.20           | No  | No  | Yes | No  | No*  | Yes  | Yes | No  | Yes
backuppc 3.2.1           | No  | No  | Yes | No  | Yes  | Yes  | Yes | Yes | No
bacula 5.2.13            | Yes | Yes | Yes | No  | Yes  | Yes  | No  | No  | No
bup-0.25                 | No  | No  | No  | Yes | No*  | No   | Yes | No  | Yes
obnam 1.1                | No  | No  | Yes | Yes | No*  | Yes  | Yes | No  | Yes
rdiff-backup 1.2.8       | No  | No  | Yes | No  | No*  | No*  | No  | Yes | No
rsync 3.0.9 --link-dest  | No  | Yes | Yes | No  | No*  | No*  | Yes | Yes | No
tar 1.26                 | No  | No  | Yes | No  | No*  | No*  | No  | No  | No
urbackup-1.2.4           | No  | Yes | Yes | Yes | Yes  | Yes  | Yes | No  | Yes
burp 1.3.36              | Yes | Yes | Yes | No  | Yes  | Yes  | Yes | Yes | No
burp 2.0.0/1             | Yes | Yes | Yes | Yes | Yes  | No   | Yes | No  | Yes
* Possible with management via external tools, such as cron.
'Good cross platform support' means that the software is able to back up and
restore a set of files on Unix/Linux/Mac/Windows computers, and is able to use
the Windows backup API.
'Good imaging support' means that the software is able to perform image backups
and restores efficiently.
'Hard link farm' means that the software saves its data in individual files,
one for each file that is backed up, and may hard link unchanged versions
together. This can be beneficial on small backup systems, and the files can be
copied for restore using standard file system tools. However, with large
backup sets, such farms become unwieldy due to file system overhead.
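As a minimal sketch of the idea, assuming placeholder paths and that the
previous and current backup directories live on the same file system:

#!/usr/bin/env python3
# Illustrative sketch of a hard link farm: one file system node per backed
# up file, with unchanged files hard linked to the previous backup.
import filecmp
import os
import shutil

def backup(source, previous, current):
    for root, _, names in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.join(current, rel), exist_ok=True)
        for name in names:
            src = os.path.join(root, name)
            old = os.path.join(previous, rel, name)
            new = os.path.join(current, rel, name)
            if os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: share the old data via a hard link
            else:
                shutil.copy2(src, new)  # changed or new: store another full copy

Every backed up file still needs at least one file system node per backup, so
a backup set like the small-file test data quickly accumulates millions of
nodes. (Tools such as rsync --link-dest normally decide whether a file has
changed from its size and timestamp rather than the full content comparison
used in this sketch.)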
5.1. Conclusions from feature comparison
When comparing the feature lists prior to doing any testing, it seems that the
original burp is already one of the best choices in the field, and some
of its deficiencies are addressed by the newly developed software.
For example, the only other software with good cross platform support is
bacula. Apart from the newly developed software, none of the solutions
offering inline deduplication has good cross platform support.
I expected urbackup (Raiber, 2011) to be a good contender, but it turns out that
it doesn't work well on Linux, as described in the next section.
Many of the technically interesting offerings, such as bup-0.25
(Pennarun, 2010), lack features that help with simplifying the administration
of a central server. A few of them, for example backshift (Stromberg, 2012)
and obnam (Wirzenius, 2007), are not really network based solutions and
require remote filesystems to be mounted externally so that they appear to the
software as local filesystems. These suffer accordingly in the testing that
follows.
You may have noticed that the newly developed software has lost the ability
to delete old backups. This will be addressed in a future iteration beyond
the scope of this project, but the planned concept is explained in a
following chapter about further iterations.