duplicacy/DESIGN.md

## Lock-Free Deduplication

## Snapshot Format

A snapshot file is a file that the backup procedure uploads to the file storage after it finishes breaking files into
chunks and uploading all new chunks. It mainly contains metadata for the backup overall, metadata for all the files,
and chunk references for each file. Here is an example snapshot file for a repository containing 3 files (file1, file2,
and dir1/file3):

```json
{
  "id": "host1",
  "revision": 1,
  "tag": "first",
  "start_time": 1455590487,
  "end_time": 1455590487,
  "files": [
    {
      "path": "file1",
      "content": "0:0:2:6108",
      "hash": "a533c0398194f93b90bd945381ea4f2adb0ad50bd99fd3585b9ec809da395b51",
      "size": 151901,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "file2",
      "content": "2:6108:3:7586",
      "hash": "f6111c1562fde4df9c0bafe2cf665778c6e25b49bcab5fec63675571293ed644",
      "size": 172071,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "dir1/",
      "size": 102,
      "time": 1455590487,
      "mode": 2147484096
    },
    {
      "path": "dir1/file3",
      "content": "3:7586:4:1734",
      "hash": "6bf9150424169006388146908d83d07de413de05d1809884c38011b2a74d9d3f",
      "size": 118457,
      "time": 1455590487,
      "mode": 420
    }
  ],
  "chunks": [
    "9f25db00881a10a8e7bcaa5a12b2659c2358a579118ea45a73c2582681f12919",
    "6e903aace6cd05e26212fcec1939bb951611c4179c926351f3b20365ef2c212f",
    "4b0d017bce5491dbb0558c518734429ec19b8a0d7c616f68ddf1b477916621f7",
    "41841c98800d3b9faa01b1007d1afaf702000da182df89793c327f88a9aba698",
    "7c11ee13ea32e9bb21a694c5418658b39e8894bbfecd9344927020a9e3129718"
  ],
  "lengths": [
    64638,
    81155,
    170593,
    124309,
    1734
  ]
}
```

When Duplicacy splits a file in chunks, if the end of a file is reached and yet the boundary marker for terminating a chunk
hasn't been found, the next file, if there is one, will be read in and the chunking algorithm continues. It is as if all
files were packed into a big zip file which is then split into chunks.

The *content* field of a file indicates the indexes of starting and ending chunks and the corresponding offsets. For
instance, *fiel1* starts at chunk 0 offset 0 while ends at chunk 2 offset 6108, immediately followed by *file2*.

The backup procedure can run in one of two modes. In the default quick mode, only modified or new files are scanned. Chunks only
referenced by old files that have been modified are removed from the chunk sequence, and then chunks referenced by new
files are appended. Indices for unchanged files need to be updated too.

In the safe mode (enabled by the -hash option), all files are scanned and the chunk sequence is regenerated.

The length sequence stores the lengths for all chunks, which are needed when calculating some statistics such as the total
length of chunks. For a repository containing a large number of files, the size of the snapshot file can be tremendous.
To make the situation worse, every time a big snapshot file would have been uploaded even if only a few files have been changed since
last backup. To save space, the variable-size chunking algorithm is also applied to the three dynamic fields of a snapshot
file, *files*, *chunks*, and *lengths*.

Chunks produced during this step are deduplicated and uploaded in the same way as regular file chunks. The final snapshot file
contains sequences of chunk hashes and other fixed size fields:

```json
{
  "id": "host1",
  "revision": 1,
  "start_time": 1455590487,
  "tag": "first",
  "end_time": 1455590487,
  "file_sequence": [
    "21e4c69f3832e32349f653f31f13cefc7c52d52f5f3417ae21f2ef5a479c3437",
  ],
  "chunk_sequence": [
    "8a36ffb8f4959394fd39bba4f4a464545ff3dd6eed642ad4ccaa522253f2d5d6"
  ],
  "length_sequence": [
    "fc2758ae60a441c244dae05f035136e6dd33d3f3a0c5eb4b9025a9bed1d0c328"
  ]
}
```

If the respository has not been touched since last backup, a new backup procedure will not create any new chunks,
as shown by the following output from a real use case:

```
$ duplicacy backup -stats
Storage set to sftp://gchen@192.168.1.100/Duplicacy
Last backup at revision 260 found
Backup for /Users/gchen/duplicacy at revision 261 completed
Files: 42367 total, 2,204M bytes; 0 new, 0 bytes
File chunks: 447 total, 2,238M bytes; 0 new, 0 bytes, 0 bytes uploaded
Metadata chunks: 6 total, 11,753K bytes; 0 new, 0 bytes, 0 bytes uploaded
All chunks: 453 total, 2,249M bytes; 0 new, 0 bytes, 0 bytes uploaded
Total running time: 00:00:05
```