Overview

When we compare a file with its last version, we read all the manber-hashes into memory. More...

When we compare a file with its last version, we read all the manber-hashes into memory.

When we use them on commit for constructing a delta stream, we'll have to have them sorted and/or indexed for fast access; then we can't read them from disk or something like that.

Performance considerations

Count of records, memory requirements

We need about 16+4+8 (28, with alignment 32) bytes per hash value, and that's for approx. 128kB. So a file of 1M needs 8*32 => 512 bytes, 1G needs 8k*32 => 512 kB, 1T needs 8M*32 => 512 MB. If this is too much, you'll have to increase CS__APPROX_BLOCKSIZE_BITS and use bigger blocks.

(Although, if you've got files greater than 1TB, you'll have other problems than getting >512MB RAM) And still, there's (nearly) always a swap space ...

Performance considerations

To avoid the costs of the unused 4 bytes (which accumulate to 32MB on a 1TB file) and to get the manber-hashes better into L2 cache (only the 32bit value - the rest is looked up after we found the correct hash) we allocate 3 memory regions - one for each data.

for big files

It's a tiny bit disturbing that we read the whole file at once and not as-needed. On the hypothetical 1TB file we'll be reading 512MB before seeing that the first block had changed... Perhaps this should be expanded later, to say something like "already open, return next N entries" - file handle, last N, etc. should be stored in struct cs__manber_hashes. For now I'll just ignore that, as a (practical) 4GB file (a DVD) will lead to 2MB read - and on average we'll find a difference after 2GB more reading.

A better way than read()/lseek() would be mmap of binary files. But that wouldn't allow to have the data compressed. I don't know if that matters; the 1TB file has 8M lines * 60 Bytes => 480MB on ASCII-data.

If we assume that the CRCs and lengths can be compressed away, we still need the offsets and MD5s, so we would still end with 8M * 24 Bytes, ie. at least 196MB. I don't think that's a necessity.

on big files

Please note, that on 1TB files you'll have 8M of hash blocks, which have a significant collision-chance on 32bit hash values! (look up discussion about rsync and its bugs on block change detection). We'd either have to use bigger blocks or a bigger hash value - the second might be easier and better, esp. as files this big should be on a 64bit platform, where a 64bit hash won't be slow.

last block

The last block in a file ends per definition *not* on a manber-block- border (or only per chance). This block is not written into the md5s file. The data is verified by the full-file MD5 that we've been calculating.

Todo:
When we do a rsync-copy from the repository, we'll have to look at that again! Either we write the last block too, or we'll have to ask for the last few bytes extra.

Generated for fsvs by  doxygen 1.5.9