When we compare a file with its last version, we read all the manber-hashes into memory.
When we use them on commit for constructing a delta stream, they have to be sorted and/or indexed for fast access; so we can't simply read them from disk piece by piece as needed.
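For illustration only - a minimal sketch of such a lookup over a sorted array of the 32bit hash values (the names here are made up, this is not the real interface):

    #include <stdint.h>
    #include <stdlib.h>

    /* Ordering function for 32bit manber hash values. */
    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t*)a, y = *(const uint32_t*)b;
        return (x > y) - (x < y);
    }

    /* Returns the index of 'hash' in the sorted array, or -1 if not found. */
    static long find_hash(const uint32_t *sorted, size_t count, uint32_t hash)
    {
        const uint32_t *p = bsearch(&hash, sorted, count,
                                    sizeof(*sorted), cmp_u32);
        return p ? (long)(p - sorted) : -1;
    }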
We need about 16+4+8 (28, with alignment 32) bytes per hash value, and that's per approx. 128kB of file data. So a file of 1MB needs 8*32 => 256 bytes, 1GB needs 8k*32 => 256 kB, and 1TB needs 8M*32 => 256 MB. If this is too much, you'll have to increase CS__APPROX_BLOCKSIZE_BITS and use bigger blocks.
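As a back-of-the-envelope helper, the estimate boils down to filesize/4096 bytes; roughly like this (the value 17, ie. 128kB blocks, is only assumed here for illustration - the real CS__APPROX_BLOCKSIZE_BITS is defined elsewhere):

    #include <stdint.h>

    #define CS__APPROX_BLOCKSIZE_BITS 17   /* assumed: 128kB blocks */

    /* Rough RAM estimate: one 32-byte entry (16+4+8, padded) per block. */
    static uint64_t manber_mem_estimate(uint64_t filesize)
    {
        uint64_t blocks = (filesize + (1ULL << CS__APPROX_BLOCKSIZE_BITS) - 1)
                          >> CS__APPROX_BLOCKSIZE_BITS;
        return blocks * 32;
    }
    /* 1GB => 8k blocks => 256kB; 1TB => 8M blocks => 256MB. */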
(Although, if you've got files larger than 1TB, you'll have bigger problems than finding >256MB of RAM.) And there's (nearly) always swap space ...
To avoid the cost of the 4 unused padding bytes per entry (which accumulate to 32MB on a 1TB file), and to get the manber-hashes to fit better into the L2 cache (only the 32bit values are searched; the rest is looked up only after the matching hash has been found), we allocate 3 memory regions - one per kind of data.
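A sketch of that split allocation - three parallel arrays instead of one array of padded 32-byte records; the names are illustrative, not necessarily the fields of struct cs__manber_hashes:

    #include <stdint.h>
    #include <stdlib.h>

    /* Parallel arrays: the 32bit hash values stay densely packed (good for
     * the L2 cache during the search); MD5 and offset are consulted only
     * after a matching hash has been found. */
    struct manber_hashes_mem
    {
        uint32_t *hash;       /* 4 bytes per block              */
        uint8_t (*md5)[16];   /* 16 bytes per block             */
        uint64_t *end;        /* 8 bytes per block: end offset  */
        size_t count;
    };

    static int alloc_hashes(struct manber_hashes_mem *m, size_t count)
    {
        m->hash  = malloc(count * sizeof(*m->hash));
        m->md5   = malloc(count * sizeof(*m->md5));
        m->end   = malloc(count * sizeof(*m->end));
        m->count = count;
        return (m->hash && m->md5 && m->end) ? 0 : -1;
    }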
It's a tiny bit disturbing that we read the whole hash file at once instead of as needed. On the hypothetical 1TB file we'd read a few hundred MB of hash data before even seeing that the first block has changed... Perhaps this should be extended later into something like "already open, return the next N entries" - the file handle, the last index, etc. would be stored in struct cs__manber_hashes. For now I'll just ignore that, as a (practical) 4GB file (a DVD) leads to only about 2MB of reading - and on average we'll find a difference after another 2GB of reading anyway.
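Such an incremental interface might look roughly like this - purely hypothetical, nothing of it exists yet:

    #include <stdint.h>

    /* Hypothetical cursor kept alongside struct cs__manber_hashes:
     * the md5s file stays open and entries are handed out on demand. */
    struct manber_hash_cursor
    {
        int fd;            /* open md5s file          */
        uint64_t next;     /* index of the next entry */
    };

    struct manber_entry
    {
        uint32_t hash;
        uint8_t  md5[16];
        uint64_t end;      /* end offset of the block */
    };

    /* Would return the number of entries stored in dst[],
     * 0 at EOF, or -1 on error. */
    int manber_next_entries(struct manber_hash_cursor *cur,
                            struct manber_entry *dst, int max_entries);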
A better way than read()/lseek() would be to mmap() the binary files. But that wouldn't allow the data to be stored compressed. I don't know whether that matters; the 1TB file has 8M lines * 60 bytes => 480MB of ASCII data.
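For reference, the mmap() variant would be simple enough (a sketch with minimal error handling, not the current code):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map an (uncompressed, binary) hash file read-only; returns NULL on error. */
    static void *map_md5s(const char *path, size_t *len)
    {
        struct stat st;
        void *mem;

        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0)
        {
            close(fd);
            return NULL;
        }

        mem = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                 /* the mapping stays valid after close() */
        if (mem == MAP_FAILED)
            return NULL;

        *len = st.st_size;
        return mem;
    }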
If we assume that the CRCs and lengths can be compressed away, we still need the offsets and MD5s, so we'd still end up with 8M * 24 bytes, ie. at least 192MB. So I don't think compression is a necessity.
Please note that on 1TB files you'll have 8M hash blocks, which gives a significant collision chance with 32bit hash values! (See the discussions about rsync and its bugs in block-change detection.) We'd either have to use bigger blocks or a bigger hash value - the latter is probably easier and better, especially as files this big should live on a 64bit platform anyway, where a 64bit hash won't be slow.
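A rough birthday-paradox estimate (assuming the 32bit values behave like uniform random numbers) shows how bad it gets:

    expected colliding pairs ~ n*(n-1) / (2 * 2^32)
                             ~ (2^23)^2 / 2^33  =  2^13  ~ 8000    (n = 8M blocks)

So a matching 32bit value alone can't be trusted for such files; with a 64bit hash the same estimate drops to about 2^-19 expected collisions.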
The last block in a file by definition does *not* end on a manber-block border (or only by chance). This block is not written into the md5s file; its data is verified by the full-file MD5 that we've been calculating anyway.