Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-04 Thread Julian Andres Klode
On Sun, Mar 04, 2012 at 03:18:50AM +0100, Timo Weingärtner wrote:
 Hello Julian Andres,
 
 On 2012-03-04 at 01:07:42 you wrote:
  On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
 
   The initial comparison was with hardlink, which got OOM killed with a
   hundred backups of my home directory. Last night I compared it to duff
   and rdfind which would have happily linked files with different st_mtime
   and st_mode.
  
  You might want to try hardlink 0.2~rc1. In any case, I don't think we need
  yet another such tool in the archive. If you want that algorithm, we can
  implement it in hardlink 0.2 using probably about 10 lines. I had that
  locally and it works, so if you want it, we can add it and avoid the
  need for one more hack in that space.
 
 And why is lighttpd in the archive? Apache can do the same ...

Can it? Apache uses threads; lighttpd uses an event loop for
processing. That's a different topic though.

But in any case, avoiding yet another tool with the same security
issues (CVE-2011-3632) and bugs (and more bugs) as we currently
have would be a good idea.

hadori bugs:
  - Race, possible data loss: Calls unlink() before link(). If
link() fails the data might be lost. (The best solution appears
to be to link to a temporary file in the target directory and
then rename it to the target name, making the replacement
atomic; see the sketch below.)
  - Error checking: Errors when opening or reading files are
not checked (by default ifstream reports errors via failbit
and badbit instead of throwing, so they have to be checked
explicitly).
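
Roughly like this (just a sketch; the helper name and the temporary-name
scheme are made up, and this is not hadori's or hardlink's actual code):

// Sketch: replace `target` with a hard link to `source` without a window in
// which `target` is missing. The temporary-name scheme is simplified; a real
// version would retry on EEXIST.
#include <cerrno>
#include <cstdio>      // rename()
#include <string>
#include <unistd.h>    // link(), unlink(), getpid()

bool replace_with_link(std::string const &source, std::string const &target)
{
    // Create the new link under a temporary name in the target's directory,
    // then rename() it over the target. rename() is atomic, so the target
    // always refers either to the old data or to the new link.
    std::string tmp = target + ".tmp-link." + std::to_string(getpid());
    if (link(source.c_str(), tmp.c_str()) != 0)
        return false;                   // target is untouched on failure
    if (rename(tmp.c_str(), target.c_str()) != 0) {
        int saved = errno;
        unlink(tmp.c_str());            // clean up the temporary link
        errno = saved;
        return false;
    }
    return true;
}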

Common security issue, same as CVE-2011-3632 for Fedora's hardlink:
[Unsafe operations on changing trees]
  - If a regular file is replaced by a non-regular one before an
open() for reading, the program reads from a non-regular file
  - A source file is replaced by a file with a different owner
or permissions after the stat() and before the link()
  - A component of the path is replaced by a symbolic link after
the initial stat()ing and readdir()ing. An attacker may use
that to write outside of the intended directory.

(Fixed in Fedora's hardlink, and my hardlink by adding a section
 to the manual page stating that it is not safe to run the
 program on changing trees).

Possibly hardlink-only bugs:
   - Exaggeration of sizes. hardlink currently counts every
 replaced link as saving st_size bytes, even if st_nlink > 1
 (replacing one of several links to an inode does not free
 its data). I don't know what hadori does there. (And it
 requires more work to fix that in hardlink, as we currently
 do not combine links to the same inode in any way, and
 likely want --dry-run to work correctly as well.)

You can also drop your race check. The tool is unsafe on
changing trees anyway, so you don't need to check whether
someone else deleted the file, especially if you're then
linking to it anyway.

   
  hardlink 0.2 is written in C, and uses a binary tree to map
  (dev_t, off_t) to a struct file which contains the stat information
  plus name for linking. It requires two allocations per file, one for
  the struct file with the filename, and one for the node in the tree
  (well, actually we only need the node for the first file with a
   specific (dev_t, off_t) tuple). A node has 3 pointers.
 
 The hardlink I used at that time was written in Python and definitely
 didn't do it the way I want.

I know. I knew that there were problems on large trees in 2009, but got
nowhere with a fix in Python. We still have the two passes in hardlink
and thus need to keep all the files currently, as I did not carry the
link-first mode over from my temporary C++ rewrite, as memory usage
was not much different in my test case. But as my test case was just
running on /, the whole thing may not be representative. If there
are lots of duplicates, link-first can definitely help.

The one that works exactly as you want is most likely Fedora's hardlink.

 
 hadori is written in C++11 which IMHO makes it look a little more readable. 

Yes. It looks readable, but also has far fewer features than hardlink (which
were added to hardlink because of user requests). You can get it even shorter
and more efficient if you use POSIX's nftw() to walk the directory tree
instead of your recurse()/recurse_start() pair (you just need to write
one callback that takes the complete file path, a stat buffer, a type flag,
and a struct FTW with the offset of the basename). Does not work on Haiku
though, or BSDs prior to 2002. Alternatively, there's also fts() if you don't
care about the System V line of UNIXes (e.g. Solaris).

You can tell both not to cross mount points and not to follow symbolic
links, which makes the whole thing very easy.
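
For illustration, a minimal nftw() walk with FTW_PHYS and FTW_MOUNT; the
callback here only prints regular files and is not the traversal code of
either tool:

// Sketch of an nftw() walk; the callback only prints regular files.
#include <ftw.h>
#include <sys/stat.h>
#include <cstdio>

static int visit(const char *fpath, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    if (typeflag == FTW_F)              // regular file
        std::printf("%s (%lld bytes, basename at offset %d)\n",
                    fpath, (long long)sb->st_size, ftwbuf->base);
    return 0;                           // returning non-zero stops the walk
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i)
        // FTW_PHYS: do not follow symlinks, FTW_MOUNT: do not cross mount points
        nftw(argv[i], visit, 64, FTW_PHYS | FTW_MOUNT);
    return 0;
}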

 It started with a tree-based map and multimap; now it uses the unordered_
 (hash-based) versions, which made it twice as fast in a typical workload.

That's strange. In my (not published) C++ version of hardlink, unordered
(multi) maps were only slightly faster than ordered ones. I then rewrote
the code in C to make it more readable to the common DD who does not
want to work with C++, and more portable.

Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-04 Thread Julian Andres Klode
On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
 Timo Weingärtner t...@tiwe.de writes:
 
  Package: wnpp
  Severity: wishlist
  X-Debbugs-CC: debian-de...@lists.debian.org
 
 Package name: hadori
  Version: 0.2
  Upstream Author: Timo Weingärtner t...@tiwe.de
  URL: https://github.com/tiwe-de/hadori
  License: GPL3+
  Description: Hardlinks identical files
   This might look like yet another hardlinking tool, but it is the only one
   which only memorizes one filename per inode. That results in less memory
   consumption and faster execution compared to its alternatives. Therefore
   (and because all the other names are already taken) it's called
   HArdlinking DOne RIght.
   .
   Advantages over other hardlinking tools:
* predictability: arguments are scanned in order, each first version is 
  kept
* much lower CPU and memory consumption
* hashing option: speedup on many equal-sized, mostly identical files
 
  The initial comparison was with hardlink, which got OOM killed with a 
  hundred 
  backups of my home directory. Last night I compared it to duff and rdfind 
  which would have happily linked files with different st_mtime and st_mode.
 
  I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
  number.
 
 
  Greetings
  Timo
 
 I've been thinking about the problem of memory consumption too. But I've
 come to a different solution. One that doesn't need memory at all.

I know yet another solution. For each file you visit, you simply visit
the complete tree. Then you have n + 1 passes over the tree, but get
constant space usage.

 
 Instead of remembering inodes, filenames and checksums, create a global
 cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>)
 and hardlink every file to there. If you want/need to include uid, gid,
 mtime, mode in there then make that part of the .cache path.
 
 Garbage collection in the cache would be removing all files with a link
 count of 1.
 
 Going one step further, link files with unique size [uid, gid, mtime,
 ...] to .cache/<size> and change that into .cache/<size>/<start of
 hash>/<hash> when you find a second file with the same size that isn't
 identical. That would save on the expensive hashing of clearly unique
 files.

So implement an object store and replace files outside the object
store with hardlinks to the store. Yes, this is guaranteed to work
for some cases, but also has problems. If you create files first, and
then move them to the store, you still need to check every file with
link count != 1 and check whether it is in the cache already. And for
this, you need a lookup by inode if you want to avoid hashing.

And this is basically the same hierarchy as git has:
.git/objects/<first 2 hex digits of sha1sum>/<remaining sha1sum>
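
For illustration, building such a sharded path could look like this (the
cache root and the digest below are made-up examples):

// Sketch: build a git-objects-style path "<root>/<first 2 hex digits>/<rest>"
// from a hex digest. Root and digest are made-up examples.
#include <iostream>
#include <string>

std::string sharded_path(std::string const &root, std::string const &hexdigest)
{
    return root + "/" + hexdigest.substr(0, 2) + "/" + hexdigest.substr(2);
}

int main()
{
    // prints ".cache/da/39a3ee5e6b4b0d3255bfef95601890afd80709"
    std::cout << sharded_path(".cache", "da39a3ee5e6b4b0d3255bfef95601890afd80709")
              << "\n";
}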

 
 You could also use a hash that computes the first byte from the first
 4k, the second byte from 64k, the third from 1 MB and so on. That way you can
 check if the beginnings of 2 files match without having to checksum the
 whole file or literally compare the two.

Only whether the beginnings can match. They're not guaranteed to match
just because the hashes match.

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.




Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-04 Thread Timo Weingärtner
Hello Julian Andres,

On 2012-03-04 at 12:31:39 you wrote:
 But in any case, avoiding yet another tool with the same security
 issues (CVE-2011-3632) and bugs (and more bugs) as we currently
 have would be a good idea.
 
 hadori bugs:
   - Race, possible data loss: Calls unlink() before link(). If
 link() fails the data might be lost (best solution appears
 to be to link to a temporary file in the target directory
 and then rename to target name, making the replacement
 atomic)

I copied that from ln -f, which has the same bug then.

   - Error checking: Errors when opening files or reading
 files are not checked (ifstream uses the failbit and
 stuff).

If only one of the files fails nothing bad happens. If both fail bad things 
might happen, that's right.

 Common security issue, same as CVE-2011-3632 for Fedora's hardlink:
   [Unsafe operations on changing trees]
   - If a regular file is replaced by a non-regular one before an
 open() for reading, the program reads from a non-regular file
   - A source file is replaced by one file with different owner
 or permissions after the stat() and before the link()
   - A component of the path is replaced by a symbolic link after
 the initial stat()ing and readdir()ing. An attacker may use
 that to write outside of the intended directory.
 
 (Fixed in Fedora's hardlink, and my hardlink by adding a section
  to the manual page stating that it is not safe to run the
  program on changing trees).

I think that kind of bug will stay until it is possible to open/link by inode
number. Perhaps the *at() functions can help with the file currently examined.
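
Something along these lines, as a sketch only (the helper is made up, and it
merely narrows the window, it does not make the check-and-link pair atomic):

// Sketch: dirfd-relative checks with the *at() family. This narrows the race
// on the directory components, but the stat/link pair is still not atomic.
#include <fcntl.h>      // AT_SYMLINK_NOFOLLOW
#include <sys/stat.h>
#include <unistd.h>

// Re-check that `name` (an entry of the directory referred to by dirfd) is
// still a regular file, then create `linkname` in the same directory.
bool link_in_dir(int dirfd, const char *name, const char *linkname)
{
    struct stat st;
    if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) != 0 || !S_ISREG(st.st_mode))
        return false;   // entry vanished or was swapped for a non-regular file
    return linkat(dirfd, name, dirfd, linkname, 0) == 0;
}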

Right now I only used it for my backups which are only accessible by me (and 
root).

 Possibly hardlink only bugs:
- Exaggeration of sizes. hardlink currently counts every
  replaced link as saving st_size bytes, even if st_nlink > 1.
  I don't know what hadori does there.

hadori does not have statistics. They should be easy to add, but I had no use 
for them.

 You can also drop your race check. The tool is unsafe on
 changing trees anyway, so you don't need to check whether
 someone else deleted the file, especially if you're then
 linking to it anyway.

I wanted it to exit when something unexpected happens.

 I knew that there were problems on large trees in 2009, but got nowhere with
 a fix in Python. We still have the two passes in hardlink and thus need to
 keep all the files currently, as I did not carry the link-first mode over
 from my temporary C++ rewrite, as memory usage was not much different in my
 test case. But as my test case was just running on /, the whole thing may
 not be representative. If there are lots of duplicates, link-first can
 definitely help.
 
 The one that works exactly as you want is most likely Fedora's hardlink.

I've searched for other implementations and all the others do two passes when 
one is obviously enough.

 Yes. It looks readable, but also has far less features than hardlink (which
 were added to hardlink because of user requests).

I still don't get what --maximize (and --minimize) are needed for. In my 
incremental full backup scenario I get best results with keep-first. When 
hardlinking only $last and $dest (see below) even --maximize can disconnect 
files from older backups.

  It
  started with tree based map and multimap, now it uses the unordered_
  (hash based) versions which made it twice as fast in a typical workload.
 
 That's strange. In my (not published) C++ version of hardlink, unordered
 (multi) maps were only slightly faster than ordered ones. I then rewrote
 the code in C to make it more readable to the common DD who does not
 want to work with C++, and more portable.
 
 And it does not seem correct if you spend so much time in the map, at
 least not without caching. And normally, you most likely do not have
 the tree(s) you're hardlinking on cached.

I have, because I usually run:
$ rsync -aH $source $dest --link-dest $last
$ hadori $last $dest


Greetings
Timo




Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-04 Thread Julian Andres Klode
On Sun, Mar 04, 2012 at 07:26:13PM +0100, Timo Weingärtner wrote:
 Hello Julian Andres,
 
 On 2012-03-04 at 12:31:39 you wrote:
  But in any case, avoiding yet another tool with the same security
  issues (CVE-2011-3632) and bugs (and more bugs) as we currently
  have would be a good idea.
  
  hadori bugs:
- Race, possible data loss: Calls unlink() before link(). If
  link() fails the data might be lost (best solution appears
  to be to link to a temporary file in the target directory
  and then rename to target name, making the replacement
  atomic)
 
 I copied that from ln -f, which has the same bug then.

Could be. The other way is definitely safer.

 
- Error checking: Errors when opening files or reading
  files are not checked (ifstream uses the failbit and
  stuff).
 
 If only one of the files fails nothing bad happens. If both fail bad things 
 might happen, that's right.

 
  Common security issue, same as CVE-2011-3632 for Fedora's hardlink:
  [Unsafe operations on changing trees]
- If a regular file is replaced by a non-regular one before an
  open() for reading, the program reads from a non-regular file
- A source file is replaced by one file with different owner
  or permissions after the stat() and before the link()
- A component of the path is replaced by a symbolic link after
  the initial stat()ing and readdir()ing. An attacker may use
  that to write outside of the intended directory.
  
  (Fixed in Fedora's hardlink, and my hardlink by adding a section
   to the manual page stating that it is not safe to run the
   program on changing trees).
 
 I think that kind of bug will stay until it is possible to open/link by inode
 number. Perhaps the *at() functions can help with the file currently examined.

Nobody said they have to be fixed. As I wrote, the fix was to mention
it in the manpage.

 
 Right now I only used it for my backups which are only accessible by me (and 
 root).
 
  Possibly hardlink only bugs:
 - Exaggeration of sizes. hardlink currently counts every
   replaced link as saving st_size bytes, even if st_nlink > 1.
   I don't know what hadori does there.
 
 hadori does not have statistics. They should be easy to add, but I had no use 
 for them.
 
  You can also drop your race check. The tool is unsafe on
  changing trees anyway, so you don't need to check whether
  someone else deleted the file, especially if you're then
  linking to it anyway.
 
 I wanted it to exit when something unexpected happens.

But then there are just other cases where you don't, like
opening files. And also, users will complain if you exit
just because one file has a problem. They expect the
program to continue with the remaining ones (at least they
expected this in my audio conversion script, so I do this
in hardlink as well).

 
  I knew that there were problems on large trees in 2009, but got nowhere with
  a fix in Python. We still have the two passes in hardlink and thus need to
  keep all the files currently, as I did not carry the link-first mode over
  from my temporary C++ rewrite, as memory usage was not much different in my
  test case. But as my test case was just running on /, the whole thing may
  not be representative. If there are lots of duplicates, link-first can
  definitely help.
  
  The one that works exactly as you want is most likely Fedora's hardlink.
 
 I've searched for other implementations and all the others do two passes when 
 one is obviously enough.

Fedora's hardlink should do only one pass and link later files to the
first one. It's fairly simple code. But nobody apart from Fedora and some
other RPM distributions uses it.

 
  Yes. It looks readable, but also has far fewer features than hardlink (which
  were added to hardlink because of user requests).
 
 I still don't get what --maximize (and --minimize) are needed for. In my 
 incremental full backup scenario I get best results with keep-first. When 
 hardlinking only $last and $dest (see below) even --maximize can disconnect 
 files from older backups.

That's to be expected. I don't know the reasons either. I thought
they came from hardlinkpy (of which hardlink was a rewrite with
the goal of increasing object-orientedness, and readability). But
it seems they didn't. It turns out that this was a user request,
or actually a feature of that user's fork of hardlinkpy:

   LOG OPENED Fri Dec 19 15:48:39 2008

  Dec 19 18:54:56 chf If need be, it should be merged, because
the modification that I need also compares
the link count of the files and always keeps
the one with the higher link count, replacing
the one with the lower count with a hard link
to the other.

(Translated from the original German IRC log.)

I think he probably ran this on all of his backups at once. And it
makes sense if 

Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-04 Thread Goswin von Brederlow
Julian Andres Klode j...@debian.org writes:

 On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
 Timo Weingärtner t...@tiwe.de writes:
 
  Package: wnpp
  Severity: wishlist
  X-Debbugs-CC: debian-de...@lists.debian.org
 
 Package name: hadori
  Version: 0.2
  Upstream Author: Timo Weingärtner t...@tiwe.de
  URL: https://github.com/tiwe-de/hadori
  License: GPL3+
  Description: Hardlinks identical files
   This might look like yet another hardlinking tool, but it is the only one
   which only memorizes one filename per inode. That results in less memory
   consumption and faster execution compared to its alternatives. Therefore
   (and because all the other names are already taken) it's called
   HArdlinking DOne RIght.
   .
   Advantages over other hardlinking tools:
* predictability: arguments are scanned in order, each first version is 
  kept
* much lower CPU and memory consumption
* hashing option: speedup on many equal-sized, mostly identical files
 
  The initial comparison was with hardlink, which got OOM killed with a 
  hundred 
  backups of my home directory. Last night I compared it to duff and rdfind 
  which would have happily linked files with different st_mtime and st_mode.
 
  I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
  number.
 
 
  Greetings
  Timo
 
 I've been thinking about the problem of memory consumption too. But I've
 come to a different solution. One that doesn't need memory at all.

 I know yet another solution. For each file you visit, you simply visit
 the complete tree. Then you have n + 1 passes over the tree, but get
 constant space usage.

 
 Instead of remembering inodes, filenames and checksums, create a global
 cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>)
 and hardlink every file to there. If you want/need to include uid, gid,
 mtime, mode in there then make that part of the .cache path.
 
 Garbage collection in the cache would be removing all files with a link
 count of 1.
 
 Going one step further, link files with unique size [uid, gid, mtime,
 ...] to .cache/<size> and change that into .cache/<size>/<start of
 hash>/<hash> when you find a second file with the same size that isn't
 identical. That would save on the expensive hashing of clearly unique
 files.

 So implement an object store and replace files outside the object
 store with hardlinks to the store. Yes, this is guaranteed to work
 for some cases, but also has problems. If you create files first, and
 then move them to the store, you still need to check every file with
 link count != 1 and check whether it is in the cache already. And for
 this, you need a lookup by inode if you want to avoid hashing.

 And this is basically the same hierarchy as git has:
   .git/objects/<first 2 hex digits of sha1sum>/<remaining sha1sum>

In the above every file is in the cache. A link count of 1 would
indicate a new file that hasn't been processed yet. Unfortunately you
can also have files with link count != 1 that aren't processed yet,
e.g. 2 new files that are hardlinked to each other.

Detecting whether a file is already in the cache or not actually needs to
check 2 things:

1) link count == 1
   => new file, add to cache
2) link count != 1 but hash of file not known (e.g. extended attribute
   not set)
   => new set of files that are hardlinks to each other

Actually the link count can be completely ignored if you always add a
flag when you've processed a file.

Note: The above wastes time in the 2nd case since it would checksum all
the files that are hardlinks one by one and replace them with hardlinks
into the cache. But you could remember the inode and name of the first
occurrence. This would only use up memory proportional to the number of
new inodes.
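
As a sketch, the "flag when processed" part could use a user extended
attribute (the attribute name is made up, and the filesystem has to support
user xattrs):

// Sketch: flag processed files with a user extended attribute (Linux).
// The attribute name "user.dedup.hash" is a made-up example.
#include <string>
#include <sys/xattr.h>

bool is_processed(std::string const &path)
{
    // A present attribute (whatever its size) means the file was already handled.
    return getxattr(path.c_str(), "user.dedup.hash", nullptr, 0) >= 0;
}

bool mark_processed(std::string const &path, std::string const &hexdigest)
{
    return setxattr(path.c_str(), "user.dedup.hash",
                    hexdigest.data(), hexdigest.size(), 0) == 0;
}

Since extended attributes live on the inode, marking one name marks every
hard link to it as well.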

 You could also use a hash that computes the first byte from the first
 4k, the second byte from 64k, the third from 1 MB and so on. That way you can
 check if the beginnings of 2 files match without having to checksum the
 whole file or literally compare the two.

 Only whether the beginnings can match. They're not guaranteed to match
 just because the hashes match.

This wouldn't be to prove identity but to quickly prove difference. If
the first 4k differ then the files will not match. It only makes sense if
you have a lot of big files of equal size.
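
As a sketch, that boils down to an early-exit prefix comparison (the block
sizes are arbitrary and this is not the code of any of the tools discussed):

// Sketch: cheap early-difference check. Compare prefixes of growing size
// (4 KiB, 64 KiB, 1 MiB) and stop at the first mismatch. Returning true
// only means "could still be identical"; a full comparison or hash is
// still needed to prove it.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <vector>

bool prefixes_may_match(const char *a, const char *b)
{
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    if (!fa || !fb)
        return false;                       // treat unreadable files as different
    for (std::size_t chunk : {4096u, 65536u, 1u << 20}) {
        std::vector<char> ba(chunk), bb(chunk);
        fa.read(ba.data(), chunk);
        fb.read(bb.data(), chunk);
        if (fa.gcount() != fb.gcount() ||
            !std::equal(ba.begin(), ba.begin() + fa.gcount(), bb.begin()))
            return false;                   // a differing prefix proves inequality
        if (fa.eof())
            return true;                    // compared the whole (short) file
    }
    return true;
}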

Regards
Goswin






Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-03 Thread Timo Weingärtner
Package: wnpp
Severity: wishlist
X-Debbugs-CC: debian-de...@lists.debian.org

   Package name: hadori
Version: 0.2
Upstream Author: Timo Weingärtner t...@tiwe.de
URL: https://github.com/tiwe-de/hadori
License: GPL3+
Description: Hardlinks identical files
 This might look like yet another hardlinking tool, but it is the only one
 which only memorizes one filename per inode. That results in less memory
 consumption and faster execution compared to its alternatives. Therefore
 (and because all the other names are already taken) it's called
 HArdlinking DOne RIght.
 .
 Advantages over other hardlinking tools:
  * predictability: arguments are scanned in order, each first version is kept
  * much lower CPU and memory consumption
  * hashing option: speedup on many equal-sized, mostly identical files

The initial comparison was with hardlink, which got OOM killed with a hundred 
backups of my home directory. Last night I compared it to duff and rdfind 
which would have happily linked files with different st_mtime and st_mode.

I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
number.


Greetings
Timo




Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-03 Thread Julian Andres Klode
On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
 Package: wnpp
 Severity: wishlist
 X-Debbugs-CC: debian-de...@lists.debian.org
 
Package name: hadori
 Version: 0.2
 Upstream Author: Timo Weingärtner t...@tiwe.de
 URL: https://github.com/tiwe-de/hadori
 License: GPL3+
 Description: Hardlinks identical files
  This might look like yet another hardlinking tool, but it is the only one
  which only memorizes one filename per inode. That results in less memory
  consumption and faster execution compared to its alternatives. Therefore
  (and because all the other names are already taken) it's called
  HArdlinking DOne RIght.
  .
  Advantages over other hardlinking tools:
   * predictability: arguments are scanned in order, each first version is kept
   * much lower CPU and memory consumption
   * hashing option: speedup on many equal-sized, mostly identical files
 
 The initial comparison was with hardlink, which got OOM killed with a hundred 
 backups of my home directory. Last night I compared it to duff and rdfind 
 which would have happily linked files with different st_mtime and st_mode.

You might want to try hardlink 0.2~rc1. In any case, I don't think we need
yet another such tool in the archive. If you want that algorithm, we can
implement it in hardlink 0.2 using probably about 10 lines. I had that
locally and it works, so if you want it, we can add it and avoid the
need for one more hack in that space.

hardlink 0.2 is written in C, and uses a binary tree to map
(dev_t, off_t) to a struct file which contains the stat information
plus name for linking. It requires two allocations per file, one for
the struct file with the filename, and one for the node in the tree
(well, actually we only need the node for the first file with a
 specific (dev_t, off_t) tuple). A node has 3 pointers.
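
Not the actual C source, but roughly this layout, rendered as C++ for
illustration:

// Rough C++ rendering of the layout described above, for illustration only;
// it is not the actual C source of hardlink 0.2.
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

struct file {
    struct stat st;         // stat information used to decide whether to link
    std::string name;       // in the C version the name is stored in the same
                            // allocation as the struct (one allocation per file)
    file *next = nullptr;   // further files sharing the same key
};

struct node {               // binary search tree keyed by the (dev_t, off_t) tuple
    node *left = nullptr;   // the three pointers mentioned above
    node *right = nullptr;
    file *files = nullptr;  // chain of files for this key; only the first file
                            // with a given key needs a node
};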

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.




Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-03 Thread Timo Weingärtner
Hello Julian Andres,

On 2012-03-04 at 01:07:42 you wrote:
 On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:

  The initial comparison was with hardlink, which got OOM killed with a
  hundred backups of my home directory. Last night I compared it to duff
  and rdfind which would have happily linked files with different st_mtime
  and st_mode.
 
 You might want to try hardlink 0.2~rc1. In any case, I don't think we need
 yet another such tool in the archive. If you want that algorithm, we can
 implement it in hardlink 0.2 using probably about 10 lines. I had that
 locally and it works, so if you want it, we can add it and avoid the
 need for one more hack in that space.

And why is lighttpd in the archive? Apache can do the same ...

 hardlink 0.2 is written in C, and uses a binary tree to map
 (dev_t, off_t) to a struct file which contains the stat information
 plus name for linking. It requires two allocations per file, one for
 the struct file with the filename, and one for the node in the tree
 (well, actually we only need the node for the first file with a
  specific (dev_t, off_t) tuple). A node has 3 pointers.

The hardlink I used at that time was written in Python and definitely didn't
do it the way I want.

hadori is written in C++11, which IMHO makes it look a little more readable.
It started with a tree-based map and multimap; now it uses the unordered_
(hash-based) versions, which made it twice as fast in a typical workload.

The main logic is in hadori.C, in handle_file, and it uses:

std::unordered_map<ino_t, inode const> kept;
std::unordered_map<ino_t, ino_t> to_link;
std::unordered_multimap<off_t, ino_t> sizes;

class inode contains a struct stat, a file name and an Adler checksum, but I
plan to drop the last one because I don't think the hashing option is a big gain.
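
Roughly, handle_file uses them like this (a much simplified sketch, not the
actual code; compare() and do_link() are stand-ins and the checksum handling
is left out):

// Much simplified sketch of a handle_file-style function using those maps;
// not the actual hadori code.
#include <string>
#include <sys/stat.h>
#include <unordered_map>

struct inode {                      // stand-in for hadori's inode class
    struct stat st;
    std::string filename;
};

static std::unordered_map<ino_t, inode> kept;        // inode we keep -> its info
static std::unordered_map<ino_t, ino_t> to_link;     // duplicate inode -> kept inode
static std::unordered_multimap<off_t, ino_t> sizes;  // file size -> kept inodes of that size

static bool compare(inode const &, inode const &) { return false; }  // stand-in: content + metadata
static void do_link(inode const &, std::string const &) {}           // stand-in: create the hard link

void handle_file(std::string const &path, struct stat const &st)
{
    if (kept.count(st.st_ino))                  // another name of an inode we keep
        return;
    auto dup = to_link.find(st.st_ino);
    if (dup != to_link.end()) {                 // inode already known to be a duplicate
        do_link(kept.at(dup->second), path);
        return;
    }
    inode candidate{st, path};
    auto range = sizes.equal_range(st.st_size);
    for (auto it = range.first; it != range.second; ++it) {
        inode const &other = kept.at(it->second);
        if (compare(other, candidate)) {        // identical: link and remember the mapping
            to_link[st.st_ino] = other.st.st_ino;
            do_link(other, path);
            return;
        }
    }
    kept.emplace(st.st_ino, candidate);         // first inode with this content
    sizes.emplace(st.st_size, st.st_ino);
}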


Greetings
Timo




Bug#662080: ITP: hadori -- Hardlinks identical files

2012-03-03 Thread Goswin von Brederlow
Timo Weingärtner t...@tiwe.de writes:

 Package: wnpp
 Severity: wishlist
 X-Debbugs-CC: debian-de...@lists.debian.org

Package name: hadori
 Version: 0.2
 Upstream Author: Timo Weingärtner t...@tiwe.de
 URL: https://github.com/tiwe-de/hadori
 License: GPL3+
 Description: Hardlinks identical files
  This might look like yet another hardlinking tool, but it is the only one
  which only memorizes one filename per inode. That results in less memory
  consumption and faster execution compared to its alternatives. Therefore
  (and because all the other names are already taken) it's called
  HArdlinking DOne RIght.
  .
  Advantages over other hardlinking tools:
   * predictability: arguments are scanned in order, each first version is kept
   * much lower CPU and memory consumption
   * hashing option: speedup on many equal-sized, mostly identical files

 The initial comparison was with hardlink, which got OOM killed with a hundred 
 backups of my home directory. Last night I compared it to duff and rdfind 
 which would have happily linked files with different st_mtime and st_mode.

 I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
 number.


 Greetings
 Timo

I've been thinking about the problem of memory consumption too. But I've
come to a different solution. One that doesn't need memory at all.

Instead of remembering inodes, filenames and checksums, create a global
cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>)
and hardlink every file to there. If you want/need to include uid, gid,
mtime, mode in there then make that part of the .cache path.

Garbage collection in the cache would be removing all files with a link
count of 1.

Going one step further, link files with unique size [uid, gid, mtime,
...] to .cache/<size> and change that into .cache/<size>/<start of
hash>/<hash> when you find a second file with the same size that isn't
identical. That would save on the expensive hashing of clearly unique
files.

You could also use a hash that computes the first byte from the first
4k, the second byte from 64k, the third from 1 MB and so on. That way you can
check if the beginnings of 2 files match without having to checksum the
whole file or literally compare the two.

Regards
Goswin


