## Lock-Free Deduplication

The three elements of lock-free deduplication are:

* Use a variable-size chunking algorithm to split files into chunks
* Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without a centralized indexing database
* Apply a *two-step fossil collection* algorithm to remove chunks that become unreferenced after a backup is deleted

The variable-size chunking algorithm, also called Content-Defined Chunking, is well known and has been adopted by many backup tools. Its main advantage over the fixed-size chunking algorithm (as used by rsync) is that the rolling hash is only used to search for boundaries between chunks, after which a far more collision-resistant hash function like MD5 or SHA256 is applied to each chunk. In contrast, the fixed-size chunking algorithm must perform a lookup in the known hash table every time the rolling hash window is shifted by one byte in order to detect insertions or deletions, which significantly reduces chunk splitting performance.

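To make the boundary search concrete, here is a minimal sketch of content-defined chunking with a buzhash-style rolling hash, placed in an illustrative `backup` package that the later sketches also build on. The window size, boundary mask, minimum chunk size, table seeding, and the choice of SHA-256 as the strong hash are assumptions for illustration, not Duplicacy's actual parameters:

```go
package backup

import (
	"crypto/sha256"
	"math/bits"
)

const (
	windowSize = 48    // bytes covered by the rolling hash window
	mask       = 0xFFF // declare a boundary when the low 12 bits of the hash are zero
	minSize    = 1024  // never cut a chunk smaller than this
)

// table maps each byte value to a pseudo-random 32-bit value.
var table [256]uint32

func init() {
	seed := uint32(2654435761)
	for i := range table {
		// simple xorshift generator, good enough for an illustrative table
		seed ^= seed << 13
		seed ^= seed >> 17
		seed ^= seed << 5
		table[i] = seed
	}
}

// Split cuts data into variable-size chunks and returns the SHA-256 hash of
// each chunk: the cheap rolling hash runs once per byte, while the strong
// hash runs only once per chunk.
func Split(data []byte) [][32]byte {
	var hashes [][32]byte
	var rolling uint32
	start := 0
	for i, b := range data {
		rolling = bits.RotateLeft32(rolling, 1) ^ table[b]
		if i-start >= windowSize {
			// remove the contribution of the byte sliding out of the window
			rolling ^= bits.RotateLeft32(table[data[i-windowSize]], windowSize)
		}
		if i-start+1 >= minSize && rolling&mask == 0 {
			hashes = append(hashes, sha256.Sum256(data[start:i+1]))
			start, rolling = i+1, 0
		}
	}
	if start < len(data) {
		// the final chunk ends at the end of the input even without a boundary
		hashes = append(hashes, sha256.Sum256(data[start:]))
	}
	return hashes
}
```
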
What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing chunks and for determining which chunks are no longer needed. Instead, to check whether a chunk has already been uploaded, one can simply perform a file lookup via the file storage API, using the file name derived from the hash of the chunk. This effectively turns a cloud storage offering only a very limited set of basic file operations into a powerful modern backup backend capable of both block-level and file-level deduplication. More importantly, the absence of a centralized indexing database means that there is no need to implement a distributed locking mechanism on top of the file storage.

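The check-before-upload step can be sketched against a minimal storage abstraction. The `Storage` interface below (reused by the later sketches) and the flat `chunks/` path layout are assumptions for illustration, not Duplicacy's actual API:

```go
package backup

import (
	"crypto/sha256"
	"encoding/hex"
)

// Storage is the small set of basic file operations assumed to be available
// from the backend; the later sketches reuse this interface.
type Storage interface {
	Exists(path string) (bool, error)
	Upload(path string, data []byte) error
	Download(path string) ([]byte, error)
	Rename(from, to string) error
	Delete(path string) error
}

// uploadChunk stores a chunk under a file name derived from its hash and
// skips the upload when a file with that name already exists.
func uploadChunk(storage Storage, chunk []byte) (string, error) {
	hash := sha256.Sum256(chunk)
	path := "chunks/" + hex.EncodeToString(hash[:])

	exists, err := storage.Exists(path)
	if err != nil {
		return "", err
	}
	if exists {
		// An identical chunk was already uploaded, possibly by another
		// client; deduplication happens without any shared index.
		return path, nil
	}
	return path, storage.Upload(path, chunk)
}
```
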
There is one problem, though. Deletion of snapshots without an indexing database, when concurrent access is permitted, turns out to be a hard problem. If exclusive access to a file storage by a single client can be guaranteed, the deletion procedure can simply search for chunks not referenced by any backup and delete them. However, if concurrent access is required, an unreferenced chunk can't be trivially removed, because of the possibility that a backup procedure in progress may reference the same chunk. The ongoing backup procedure, still unknown to the deletion procedure, may have already encountered that chunk during its file scanning phase, but decided not to upload the chunk again since it already exists on the file storage.

Fortunately, there is a solution to address the deletion problem and make lock-free deduplication practical. The solution is a *two-step fossil collection* algorithm that deletes unreferenced chunks in two steps: identify and collect them in the first step, and then permanently remove them once certain conditions are met.

## Two-Step Fossil Collection

Interestingly, the two-step fossil collection algorithm hinges on a basic file operation supported almost universally: *file renaming*. When the deletion procedure identifies a chunk not referenced by any known snapshot, instead of deleting the chunk file immediately, it changes the name of the chunk file (and possibly moves it to a different directory). A chunk that has been renamed in this way is called a *fossil*.

The fossil still exists on the file storage. Two rules govern access to fossils:

* A restore, list, or check procedure that reads existing backups can read a fossil if the original chunk cannot be found.
* A backup procedure does not check the existence of fossils; that is, it must upload a chunk if it cannot find the chunk, even if an equivalent fossil exists.

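These two rules can be sketched on top of the illustrative `Storage` interface introduced earlier, assuming a hypothetical layout in which a fossil is simply the chunk file renamed under a `fossils/` prefix:

```go
// downloadChunk is used by restore, list, and check: if the original chunk
// file is gone, it falls back to the fossil (first rule).
func downloadChunk(storage Storage, chunkID string) ([]byte, error) {
	data, err := storage.Download("chunks/" + chunkID)
	if err == nil {
		return data, nil
	}
	return storage.Download("fossils/" + chunkID)
}

// chunkExists is used by backup: it looks only at the chunk path and ignores
// fossils, so a chunk whose only copy is a fossil will be uploaded again
// (second rule).
func chunkExists(storage Storage, chunkID string) (bool, error) {
	return storage.Exists("chunks/" + chunkID)
}
```
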
In the first step of the deletion procedure, called the *fossil collection* step, the names of all identified fossils are saved in a fossil collection file. The deletion procedure then exits without performing further actions. Because of the first fossil access rule, this step does not effectively change any chunk references. If a backup procedure references a chunk after it has been marked as a fossil, a new chunk will be uploaded because of the second fossil access rule, as shown in Figure 1.

<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_1.png?raw=true"
     alt="Reference after Rename"/>
</p>

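Here is a minimal sketch of the fossil collection step, continuing the illustrative `backup` package (with `encoding/json` and `time` added to its imports). The `FossilCollection` structure and the collection file path are assumptions, not Duplicacy's actual format:

```go
// FossilCollection records what the fossil collection step saw, so the fossil
// deletion step can later decide whether it is safe to proceed.
type FossilCollection struct {
	EndTime       int64          `json:"end_time"`       // when the collection step finished
	SeenSnapshots map[string]int `json:"seen_snapshots"` // snapshot id -> latest revision seen
	Fossils       []string       `json:"fossils"`        // chunk ids renamed into fossils
}

// collectFossils renames each unreferenced chunk into a fossil and saves the
// collection file; nothing is deleted in this step.
func collectFossils(storage Storage, unreferenced []string, seen map[string]int) error {
	collection := FossilCollection{SeenSnapshots: seen}
	for _, chunkID := range unreferenced {
		// Only a rename: the chunk is still recoverable if a concurrent
		// backup turns out to reference it.
		if err := storage.Rename("chunks/"+chunkID, "fossils/"+chunkID); err != nil {
			return err
		}
		collection.Fossils = append(collection.Fossils, chunkID)
	}
	collection.EndTime = time.Now().Unix()
	data, err := json.Marshal(collection)
	if err != nil {
		return err
	}
	// A hypothetical location for the collection file.
	return storage.Upload("collections/1.json", data)
}
```
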
The second step, called the *fossil deletion* step, will permanently delete fossils, but only when two conditions are met:

* For each snapshot id, there is a new snapshot that was not seen by the fossil collection step
* The new snapshot must finish after the fossil collection step

The first condition guarantees that if a backup procedure references a chunk before the deletion procedure turns it into a fossil, the reference will be detected in the fossil deletion step, which will then turn the fossil back into a normal chunk.

The second condition guarantees that any backup procedure unknown to the fossil deletion step can start only after the fossil collection step finishes. Therefore, if it references a chunk that was identified as a fossil in the fossil collection step, it will observe the fossil, not the chunk, and will upload a new chunk, according to the second fossil access rule.

Therefore, if a backup procedure references a chunk before the chunk is marked as a fossil, the fossil deletion step will not delete the chunk until it sees that the backup procedure has finished (as indicated by the appearance of a new snapshot file uploaded to the storage). This ensures that the scenario depicted in Figure 2 can never happen.

<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_2.png?raw=true"
     alt="Reference before Rename"/>
</p>

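The two conditions can be sketched as a check performed before the fossil deletion step proceeds, continuing the same illustrative package. The `Snapshot` fields are assumptions made for illustration:

```go
// Snapshot describes a completed backup; only the fields needed for the check
// are shown here, and their exact names are assumptions.
type Snapshot struct {
	ID       string
	Revision int
	EndTime  int64 // unix time at which the backup finished
}

// canDeleteFossils reports whether the fossil deletion step may proceed: every
// snapshot id recorded by the collection step must have a newer revision
// (condition 1) that finished after the collection step did (condition 2).
func canDeleteFossils(collection FossilCollection, snapshots []Snapshot) bool {
	newest := make(map[string]Snapshot)
	for _, s := range snapshots {
		if current, seen := newest[s.ID]; !seen || s.Revision > current.Revision {
			newest[s.ID] = s
		}
	}
	for id, seenRevision := range collection.SeenSnapshots {
		latest, ok := newest[id]
		if !ok || latest.Revision <= seenRevision || latest.EndTime <= collection.EndTime {
			return false
		}
	}
	return true
}
```

If the check fails, the fossils are simply kept for a later run; and if a new snapshot turns out to reference a fossil, the fossil is renamed back into a normal chunk rather than deleted, as described above.
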
## Encryption

When encryption is enabled, Duplicacy will generate four random 256-bit keys:

* *Hash Key*: for generating a chunk hash from the content of a chunk
* *ID Key*: for generating a chunk id from a chunk hash
* *Chunk Key*: for encrypting chunk files
* *File Key*: for encrypting non-chunk files such as snapshot files

Here is a diagram showing how these keys are used:

<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/duplicacy_encryption.png?raw=true"
     alt="encryption"/>
</p>

Chunk hashes are used internally and saved in the snapshot file. Chunk ids are used as the names of the chunk files and are therefore exposed. Chunk content is encrypted by AES-GCM, with an encryption key that is the HMAC-SHA256 of the chunk hash, using the Chunk Key as the secret key.

The snapshot file is encrypted by AES-GCM too, using an encryption key that is the HMAC-SHA256 of the file path, with the File Key as the secret key.

These four random keys are saved in a file named 'config' in the file storage, encrypted with a master key derived via the PBKDF2 function from the storage password selected by the user.

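The chunk-side key usage can be sketched as follows. The AES-GCM key derivation (HMAC-SHA256 of the chunk hash, keyed with the Chunk Key) is stated above; using HMAC-SHA256 for the chunk hash and chunk id derivations as well, and prefixing the ciphertext with a random nonce, are assumptions made for illustration:

```go
package backup

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
)

// Keys holds the three chunk-related random keys described above.
type Keys struct {
	HashKey  []byte // chunk hash from chunk content
	IDKey    []byte // chunk id (file name) from chunk hash
	ChunkKey []byte // per-chunk AES-GCM keys
}

func hmacSHA256(key, data []byte) []byte {
	h := hmac.New(sha256.New, key)
	h.Write(data)
	return h.Sum(nil)
}

// encryptChunk derives the chunk hash, the exposed chunk id used as the file
// name, and encrypts the chunk content with AES-GCM.
func encryptChunk(keys Keys, content []byte) (id string, sealed []byte, err error) {
	chunkHash := hmacSHA256(keys.HashKey, content)              // stored in the snapshot, not exposed
	id = hex.EncodeToString(hmacSHA256(keys.IDKey, chunkHash))  // exposed as the chunk file name

	encryptionKey := hmacSHA256(keys.ChunkKey, chunkHash) // 32 bytes -> AES-256
	block, err := aes.NewCipher(encryptionKey)
	if err != nil {
		return "", nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", nil, err
	}
	// Prepend the nonce so the chunk can be decrypted later.
	sealed = gcm.Seal(nonce, nonce, content, nil)
	return id, sealed, nil
}
```
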
## Snapshot Format

A snapshot file is a file that the backup procedure uploads to the file storage after it finishes splitting files into chunks and uploading all new chunks. It mainly contains metadata for the backup overall, metadata for all the files, and chunk references for each file. Here is an example snapshot file for a repository containing 3 files (file1, file2, and dir1/file3):

```json
{
    "id": "host1",
    "revision": 1,
    "tag": "first",
    "start_time": 1455590487,
    "end_time": 1455590487,
    "files": [
        {
            "path": "file1",
            "content": "0:0:2:6108",
            "hash": "a533c0398194f93b90bd945381ea4f2adb0ad50bd99fd3585b9ec809da395b51",
            "size": 151901,
            "time": 1455590487,
            "mode": 420
        },
        {
            "path": "file2",
            "content": "2:6108:3:7586",
            "hash": "f6111c1562fde4df9c0bafe2cf665778c6e25b49bcab5fec63675571293ed644",
            "size": 172071,
            "time": 1455590487,
            "mode": 420
        },
        {
            "path": "dir1/",
            "size": 102,
            "time": 1455590487,
            "mode": 2147484096
        },
        {
            "path": "dir1/file3",
            "content": "3:7586:4:1734",
            "hash": "6bf9150424169006388146908d83d07de413de05d1809884c38011b2a74d9d3f",
            "size": 118457,
            "time": 1455590487,
            "mode": 420
        }
    ],
    "chunks": [
        "9f25db00881a10a8e7bcaa5a12b2659c2358a579118ea45a73c2582681f12919",
        "6e903aace6cd05e26212fcec1939bb951611c4179c926351f3b20365ef2c212f",
        "4b0d017bce5491dbb0558c518734429ec19b8a0d7c616f68ddf1b477916621f7",
        "41841c98800d3b9faa01b1007d1afaf702000da182df89793c327f88a9aba698",
        "7c11ee13ea32e9bb21a694c5418658b39e8894bbfecd9344927020a9e3129718"
    ],
    "lengths": [
        64638,
        81155,
        170593,
        124309,
        1734
    ]
}
```

When Duplicacy splits a file into chunks using the variable-size chunking algorithm, if the end of the file is reached before a chunk-terminating boundary has been found, the next file, if there is one, is read in and the chunking algorithm continues. It is as if all files were packed into a big tar file which is then split into chunks.

The *content* field of a file indicates the indexes of its starting and ending chunks and the corresponding offsets. For instance, *file1* starts at chunk 0, offset 0, and ends at chunk 2, offset 6108, immediately followed by *file2*.

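A small sketch of decoding this field, continuing the illustrative `backup` package (with `fmt`, `strconv`, and `strings` added to its imports); the `FileRange` type and function name are made up for illustration, while the `start:offset:end:offset` layout follows directly from the example above:

```go
// FileRange is the decoded form of a content field such as "0:0:2:6108":
// the file starts at StartChunk/StartOffset and ends at EndChunk/EndOffset.
type FileRange struct {
	StartChunk, StartOffset int
	EndChunk, EndOffset     int
}

// parseContent decodes the "startChunk:startOffset:endChunk:endOffset" string.
func parseContent(content string) (FileRange, error) {
	parts := strings.Split(content, ":")
	if len(parts) != 4 {
		return FileRange{}, fmt.Errorf("malformed content field: %q", content)
	}
	var values [4]int
	for i, part := range parts {
		v, err := strconv.Atoi(part)
		if err != nil {
			return FileRange{}, err
		}
		values[i] = v
	}
	return FileRange{values[0], values[1], values[2], values[3]}, nil
}
```

For example, `parseContent("0:0:2:6108")` yields `{StartChunk: 0, StartOffset: 0, EndChunk: 2, EndOffset: 6108}`, matching *file1* in the snapshot above.
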
The backup procedure can run in one of two modes. In the default quick mode, only modified or new files are scanned. Chunks referenced only by old files that have been modified are removed from the chunk sequence, and then chunks referenced by new files are appended. Indices for unchanged files need to be updated too.

In the safe mode (enabled by the -hash option), all files are scanned and the chunk sequence is regenerated.

The length sequence stores the lengths of all chunks, which are needed when calculating statistics such as the total length of chunks. For a repository containing a large number of files, the snapshot file can become tremendous. To make the situation worse, a big snapshot file would have to be uploaded for every backup, even if only a few files have changed since the last backup. To save space, the variable-size chunking algorithm is therefore also applied to the three dynamic fields of a snapshot file: *files*, *chunks*, and *lengths*.

Chunks produced during this step are deduplicated and uploaded in the same way as regular file chunks. The final snapshot file then contains sequences of chunk hashes along with the other fixed-size fields:

```json
{
    "id": "host1",
    "revision": 1,
    "start_time": 1455590487,
    "tag": "first",
    "end_time": 1455590487,
    "file_sequence": [
        "21e4c69f3832e32349f653f31f13cefc7c52d52f5f3417ae21f2ef5a479c3437"
    ],
    "chunk_sequence": [
        "8a36ffb8f4959394fd39bba4f4a464545ff3dd6eed642ad4ccaa522253f2d5d6"
    ],
    "length_sequence": [
        "fc2758ae60a441c244dae05f035136e6dd33d3f3a0c5eb4b9025a9bed1d0c328"
    ]
}
```

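As a sketch of how a reader consumes these sequences, continuing the same illustrative package: each metadata chunk listed in `file_sequence` is downloaded and the pieces are concatenated before decoding. That the concatenation decodes back to the files array shown in the first example is an assumption here (the text does not spell out the exact encoding), and decryption is omitted:

```go
// loadFileList rebuilds the file metadata list from the file_sequence field
// by downloading each metadata chunk and concatenating the pieces.
func loadFileList(storage Storage, fileSequence []string) ([]map[string]interface{}, error) {
	var buffer []byte
	for _, chunkID := range fileSequence {
		data, err := storage.Download("chunks/" + chunkID)
		if err != nil {
			return nil, err
		}
		buffer = append(buffer, data...)
	}
	var files []map[string]interface{}
	if err := json.Unmarshal(buffer, &files); err != nil {
		return nil, err
	}
	return files, nil
}
```
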
In the extreme case where the repository has not been modified since the last backup, a new backup procedure will not create any new chunks, as shown by the following output from a real use case:

```
$ duplicacy backup -stats
Storage set to sftp://gchen@192.168.1.100/Duplicacy
Last backup at revision 260 found
Backup for /Users/gchen/duplicacy at revision 261 completed
Files: 42367 total, 2,204M bytes; 0 new, 0 bytes
File chunks: 447 total, 2,238M bytes; 0 new, 0 bytes, 0 bytes uploaded
Metadata chunks: 6 total, 11,753K bytes; 0 new, 0 bytes, 0 bytes uploaded
All chunks: 453 total, 2,249M bytes; 0 new, 0 bytes, 0 bytes uploaded
Total running time: 00:00:05
```