Bug#662080: ITP: hadori -- Hardlinks identical files
On Sun, Mar 04, 2012 at 03:18:50AM +0100, Timo Weingärtner wrote:
> Hello Julian Andres,
>
> on 2012-03-04 at 01:07:42 you wrote:
>> On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
>>> The initial comparison was with hardlink, which got OOM killed with a hundred backups of my home directory. Last night I compared it to duff and rdfind, which would have happily linked files with different st_mtime and st_mode.
>>
>> You might want to try hardlink 0.2~rc1. In any case, I don't think we need yet another such tool in the archive. If you want that algorithm, we can implement it in hardlink 0.2 using probably about 10 lines. I had that locally and it works, so if you want it, we can add it and avoid the need for one more hack in that space.
>
> And why is lighttpd in the archive? Apache can do the same ...

Can it? Apache uses threads, lighttpd uses an event loop for processing. That's a different topic, though.

But in any case, avoiding yet another tool with the same security issues (CVE-2011-3632) and bugs (and more bugs) as we currently have would be a good idea.

hadori bugs:

- Race, possible data loss: calls unlink() before link(). If link() fails, the data might be lost. (The best solution appears to be to link to a temporary file in the target directory and then rename it to the target name, making the replacement atomic; a sketch follows below.)
- Error checking: errors when opening or reading files are not checked (ifstream uses the failbit and such).

Common security issue, the same as CVE-2011-3632 for Fedora's hardlink (unsafe operations on changing trees):

- If a regular file is replaced by a non-regular one before an open() for reading, the program reads from a non-regular file.
- A source file is replaced by a file with different owner or permissions after the stat() and before the link().
- A component of the path is replaced by a symbolic link after the initial stat()ing and readdir()ing. An attacker may use that to write outside of the intended directory. (Fixed in Fedora's hardlink, and in my hardlink, by adding a section to the manual page stating that it is not safe to run the program on changing trees.)

Possibly hardlink-only bugs:

- Exaggeration of sizes: hardlink currently counts every replaced link as -st_size, even if st_nlink > 1. I don't know what hadori does there. (It requires more work to fix that in hardlink, as we currently do not combine links to the same inode in any way, and likely want --dry-run to work correctly as well.)

You can also drop your race check. The tool is unsafe on changing trees anyway, so you don't need to check whether someone else deleted the file, especially if you're then linking to it anyway.

>> hardlink 0.2 is written in C, and uses a binary tree to map (dev_t, off_t) to a struct file which contains the stat information plus the name for linking. It requires two allocations per file: one for the struct file with the filename, and one for the node in the tree (well, actually we only need the node for the first file with a specific (dev_t, off_t) tuple). A node has 3 pointers.
>
> The hardlink I used at that time was written in python and definitely didn't do it the way I want.

I know. I knew that there were problems on large trees in 2009, but got nowhere with a fix in Python. We still have the two passes in hardlink and thus currently need to keep all the files, as I did not carry the link-first mode over from my temporary C++ rewrite; memory usage was not much different in my test case. But as my test case was just running on /, the whole thing may not be representative.
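A minimal sketch of the link-to-temporary-then-rename fix from the race bullet above. The helper name and the temporary-name scheme are illustrative, not hadori's or hardlink's actual code; rename() within one directory replaces the target atomically on POSIX systems, and a real tool would want an unpredictable temporary name.

#include <cerrno>
#include <cstdio>
#include <string>
#include <unistd.h>

// Replace target with a hardlink to source without a window in which
// target is missing: link into a temporary name on the same filesystem,
// then rename() over the target, which replaces it atomically.
static bool replace_with_link(const std::string &source, const std::string &target)
{
    std::string tmp = target + ".hardlink-tmp";  // illustrative name
    if (link(source.c_str(), tmp.c_str()) == -1)
        return false;                  // nothing changed, target intact
    if (std::rename(tmp.c_str(), target.c_str()) == -1) {
        int saved = errno;
        unlink(tmp.c_str());           // clean up, target still intact
        errno = saved;
        return false;
    }
    return true;
}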
If there are lots of duplicates, link-first can definitely help. The one that works exactly as you want is most likely Fedora's hardlink.

> hadori is written in C++11 which IMHO makes it look a little more readable.

Yes. It looks readable, but it also has far fewer features than hardlink (which were added to hardlink because of user requests).

You can get it even shorter and more efficient if you use POSIX's nftw() to walk the directory tree instead of your recurse()/recurse_start() pair; you just need to write one callback that takes a complete file path, type info, a stat buffer, and the basename of the file (see the sketch at the end of this message). It does not work on Haiku, though, or on BSDs prior to 2002. Alternatively, there is also fts() if you don't care about the System V line of UNIXes (e.g. Solaris). You can tell both not to cross mount points and not to follow symbolic links, which makes the whole thing very easy.

> It started with tree based map and multimap, now it uses the unordered_ (hash based) versions which made it twice as fast in a typical workload.

That's strange. In my (not published) C++ version of hardlink, unordered (multi) maps were only slightly faster than ordered ones. I then rewrote the code in C to make it more readable to the common DD who does not want to work with C++, and more portable. And it does not seem correct if you spend so much time in the map, at least not without caching. And normally, you most likely do not have the tree(s) you're hardlinking on cached.
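A minimal sketch of the nftw() walk described above; the per-file logic is reduced to a print statement. FTW_PHYS keeps symbolic links unfollowed and FTW_MOUNT stays on one filesystem, as mentioned.

#include <ftw.h>
#include <sys/stat.h>
#include <iostream>

// nftw() hands the callback the complete path, a stat buffer, a type
// flag, and (via ftwbuf->base) the offset of the basename in the path.
static int visit(const char *path, const struct stat *sb, int typeflag, struct FTW *ftwbuf)
{
    if (typeflag == FTW_F && S_ISREG(sb->st_mode))
        std::cout << path << " (basename: " << path + ftwbuf->base
                  << ", size " << sb->st_size << ")\n";
    return 0;  // returning non-zero would abort the walk
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i)
        if (nftw(argv[i], visit, 64, FTW_PHYS | FTW_MOUNT) == -1)
            return 1;  // 64: max concurrently open directory fds
    return 0;
}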
Bug#662080: ITP: hadori -- Hardlinks identical files
On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
> Timo Weingärtner <t...@tiwe.de> writes:
>> Package: wnpp
>> Severity: wishlist
>> X-Debbugs-CC: debian-de...@lists.debian.org
>>
>> Package name: hadori
>> Version: 0.2
>> Upstream Author: Timo Weingärtner <t...@tiwe.de>
>> URL: https://github.com/tiwe-de/hadori
>> License: GPL3+
>> Description: Hardlinks identical files
>>  This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it's called HArdlinking DOne RIght.
>>  .
>>  Advantages over other hardlinking tools:
>>  * predictability: arguments are scanned in order, each first version is kept
>>  * much lower CPU and memory consumption
>>  * hashing option: speedup on many equal-sized, mostly identical files
>>
>> The initial comparison was with hardlink, which got OOM killed with a hundred backups of my home directory. Last night I compared it to duff and rdfind, which would have happily linked files with different st_mtime and st_mode.
>>
>> I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug number.
>>
>> Greetings
>> Timo
>
> I've been thinking about the problem of memory consumption too. But I've come to a different solution. One that doesn't need memory at all.

I know yet another solution: for each file you visit, you simply visit the complete tree. Then you have n + 1 visits, but get constant space usage.

> Instead of remembering inodes, filenames and checksums, create a global cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>) and hardlink every file to there. If you want/need to include uid, gid, mtime, mode in there, then make that part of the .cache path. Garbage collection in the cache would be removing all files with a link count of 1.
>
> Going one step further, link files with unique size [uid, gid, mtime, ...] to .cache/<size> and change that into .cache/<size>/<start of hash>/<hash> when you find a second file with the same size that isn't identical. That would save on the expensive hashing of clearly unique files.

So implement an object store and replace files outside the object store with hardlinks to the store. Yes, this is guaranteed to work for some cases, but it also has problems. If you create files first and then move them to the store, you still need to check every file with link count != 1 and check whether it is in the cache already. And for this, you need a lookup by inode if you want to avoid hashing.

And this is basically the same hierarchy as git has: .git/objects/<first 2 hex digits of sha1sum>/<remaining sha1sum>

> You could also use a hash that computes the first byte from the first 4k, the second byte from 64k, the third from 1 MB and so on. That way you can check if the beginnings of 2 files match without having to checksum the whole file or literally compare the two.

If the beginnings can match. They're not guaranteed to match just because the hashes match.

-- 
Julian Andres Klode - Debian Developer, Ubuntu Member
See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.
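For illustration, a sketch of the content-addressed layout being discussed, in the C++ of this thread. Everything here is hypothetical: std::hash over the whole contents stands in for a real digest such as SHA-1, and a real implementation would have to verify contents on a hash hit before relinking.

#include <cstdio>
#include <fstream>
#include <functional>
#include <iomanip>
#include <iterator>
#include <sstream>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Map a file's contents to .cache/<first 2 hex digits>/<rest>, like
// git's .git/objects layout. std::hash is only a stand-in digest.
static std::string cache_path(const std::string &file)
{
    std::ifstream in(file, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    std::ostringstream hex;
    hex << std::hex << std::setfill('0') << std::setw(16)
        << std::hash<std::string>()(data);
    std::string h = hex.str();
    return ".cache/" + h.substr(0, 2) + "/" + h.substr(2);
}

static void store(const std::string &file)
{
    std::string obj = cache_path(file);
    mkdir(".cache", 0700);
    mkdir(obj.substr(0, obj.rfind('/')).c_str(), 0700);
    if (link(file.c_str(), obj.c_str()) == 0)
        return;  // first occurrence: the file now backs the cache object
    // EEXIST: an (assumed) identical file is already in the cache, so
    // relink this file to it -- atomically, via the temporary-name trick
    // from earlier in the thread. A real tool must compare contents
    // first, since different files can share a hash.
    std::string tmp = file + ".tmp";
    if (link(obj.c_str(), tmp.c_str()) == 0)
        std::rename(tmp.c_str(), file.c_str());
}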
Bug#662080: ITP: hadori -- Hardlinks identical files
Hello Julian Andres,

on 2012-03-04 at 12:31:39 you wrote:
> But in any case, avoiding yet another tool with the same security issues (CVE-2011-3632) and bugs (and more bugs) as we currently have would be a good idea.
>
> hadori bugs:
>
> - Race, possible data loss: calls unlink() before link(). If link() fails, the data might be lost. (The best solution appears to be to link to a temporary file in the target directory and then rename it to the target name, making the replacement atomic.)

I copied that from ln -f, which then has the same bug.

> - Error checking: errors when opening or reading files are not checked (ifstream uses the failbit and such).

If only one of the files fails, nothing bad happens. If both fail, bad things might happen, that's right.

> Common security issue, the same as CVE-2011-3632 for Fedora's hardlink (unsafe operations on changing trees):
>
> - If a regular file is replaced by a non-regular one before an open() for reading, the program reads from a non-regular file.
> - A source file is replaced by a file with different owner or permissions after the stat() and before the link().
> - A component of the path is replaced by a symbolic link after the initial stat()ing and readdir()ing. An attacker may use that to write outside of the intended directory. (Fixed in Fedora's hardlink, and in my hardlink, by adding a section to the manual page stating that it is not safe to run the program on changing trees.)

I think that kind of bug will stay until it is possible to open/link by inode number. Perhaps the *at() functions can help for the file currently examined. Right now I have only used it for my backups, which are only accessible by me (and root).

> Possibly hardlink-only bugs:
>
> - Exaggeration of sizes: hardlink currently counts every replaced link as -st_size, even if st_nlink > 1. I don't know what hadori does there.

hadori does not have statistics. They should be easy to add, but I had no use for them.

> You can also drop your race check. The tool is unsafe on changing trees anyway, so you don't need to check whether someone else deleted the file, especially if you're then linking to it anyway.

I wanted it to exit when something unexpected happens.

> I knew that there were problems on large trees in 2009, but got nowhere with a fix in Python. We still have the two passes in hardlink and thus currently need to keep all the files, as I did not carry the link-first mode over from my temporary C++ rewrite; memory usage was not much different in my test case. But as my test case was just running on /, the whole thing may not be representative. If there are lots of duplicates, link-first can definitely help. The one that works exactly as you want is most likely Fedora's hardlink.

I've searched for other implementations and all the others do two passes when one is obviously enough.

> Yes. It looks readable, but it also has far fewer features than hardlink (which were added to hardlink because of user requests).

I still don't get what --maximize (and --minimize) are needed for. In my incremental full backup scenario I get the best results with keep-first. When hardlinking only $last and $dest (see below), even --maximize can disconnect files from older backups.

>> It started with tree based map and multimap, now it uses the unordered_ (hash based) versions which made it twice as fast in a typical workload.
>
> That's strange. In my (not published) C++ version of hardlink, unordered (multi) maps were only slightly faster than ordered ones.
> I then rewrote the code in C to make it more readable to the common DD who does not want to work with C++, and more portable. And it does not seem correct if you spend so much time in the map, at least not without caching. And normally, you most likely do not have the tree(s) you're hardlinking on cached.

I have, because I usually run:

$ rsync -aH $source $dest --link-dest $last
$ hadori $last $dest

Greetings
Timo
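A sketch of the *at() idea from this message: hold a descriptor for the directory being processed and address the file relative to it, so a path component swapped for a symlink after the walk can no longer redirect the operation. This narrows, but does not close, the changing-tree races discussed above; the helper is hypothetical.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Create dir/target_tmp as a hardlink to dir/source, re-checking the
// source relative to a held directory fd instead of a full path.
static int link_within(const char *dir, const char *source, const char *target_tmp)
{
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd == -1)
        return -1;
    struct stat sb;
    // AT_SYMLINK_NOFOLLOW: do not let a swapped-in symlink redirect us
    if (fstatat(dfd, source, &sb, AT_SYMLINK_NOFOLLOW) == -1
        || !S_ISREG(sb.st_mode)) {
        close(dfd);
        return -1;
    }
    int r = linkat(dfd, source, dfd, target_tmp, 0);
    close(dfd);
    return r;
}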
Bug#662080: ITP: hadori -- Hardlinks identical files
On Sun, Mar 04, 2012 at 07:26:13PM +0100, Timo Weingärtner wrote:
> Hello Julian Andres,
>
> on 2012-03-04 at 12:31:39 you wrote:
>> But in any case, avoiding yet another tool with the same security issues (CVE-2011-3632) and bugs (and more bugs) as we currently have would be a good idea.
>>
>> hadori bugs:
>>
>> - Race, possible data loss: calls unlink() before link(). If link() fails, the data might be lost. (The best solution appears to be to link to a temporary file in the target directory and then rename it to the target name, making the replacement atomic.)
>
> I copied that from ln -f, which then has the same bug.

Could be. The other way is definitely safer.

>> - Error checking: errors when opening or reading files are not checked (ifstream uses the failbit and such).
>
> If only one of the files fails, nothing bad happens. If both fail, bad things might happen, that's right.
>
>> Common security issue, the same as CVE-2011-3632 for Fedora's hardlink (unsafe operations on changing trees):
>>
>> - If a regular file is replaced by a non-regular one before an open() for reading, the program reads from a non-regular file.
>> - A source file is replaced by a file with different owner or permissions after the stat() and before the link().
>> - A component of the path is replaced by a symbolic link after the initial stat()ing and readdir()ing. An attacker may use that to write outside of the intended directory. (Fixed in Fedora's hardlink, and in my hardlink, by adding a section to the manual page stating that it is not safe to run the program on changing trees.)
>
> I think that kind of bug will stay until it is possible to open/link by inode number. Perhaps the *at() functions can help for the file currently examined.

Nobody said they have to be fixed. As I wrote, the fix was to mention it in the manpage.

> Right now I have only used it for my backups, which are only accessible by me (and root).
>
>> Possibly hardlink-only bugs:
>>
>> - Exaggeration of sizes: hardlink currently counts every replaced link as -st_size, even if st_nlink > 1. I don't know what hadori does there.
>
> hadori does not have statistics. They should be easy to add, but I had no use for them.
>
>> You can also drop your race check. The tool is unsafe on changing trees anyway, so you don't need to check whether someone else deleted the file, especially if you're then linking to it anyway.
>
> I wanted it to exit when something unexpected happens.

But then there are just other cases where you don't, like opening files. And also, users will complain if you exit just because one file has a problem. They expect the program to continue with the remaining ones (at least they expected this in my audio conversion script, so I do this in hardlink as well).

>> I knew that there were problems on large trees in 2009, but got nowhere with a fix in Python. We still have the two passes in hardlink and thus currently need to keep all the files, as I did not carry the link-first mode over from my temporary C++ rewrite; memory usage was not much different in my test case. But as my test case was just running on /, the whole thing may not be representative. If there are lots of duplicates, link-first can definitely help. The one that works exactly as you want is most likely Fedora's hardlink.
>
> I've searched for other implementations and all the others do two passes when one is obviously enough.

Fedora's hardlink should do one pass only and link the first file to later files. It's fairly simple code. But nobody apart from Fedora and some other RPM distributions uses it.

>> Yes.
>> It looks readable, but it also has far fewer features than hardlink (which were added to hardlink because of user requests).
>
> I still don't get what --maximize (and --minimize) are needed for. In my incremental full backup scenario I get the best results with keep-first. When hardlinking only $last and $dest (see below), even --maximize can disconnect files from older backups.

That's to be expected. I don't know the reasons either. I thought they came from hardlinkpy (of which hardlink was a rewrite with the goal of increasing object-orientedness and readability). But it seems they didn't. It turns out that this was a user request, or actually a feature of that user's fork of hardlinkpy:

BEGIN LOG AT Fri Dec 19 15:48:39 2008
Dec 19 18:54:56 <chf> If need be, the two should be merged, because the modification I need also compares the files' link counts and always keeps the one with the higher count, replacing the one with the lower count by a hardlink to the other.
(translated from the German original)

I think he probably ran this on all of his backups at once. And it makes sense if
Bug#662080: ITP: hadori -- Hardlinks identical files
Julian Andres Klode <j...@debian.org> writes:
> On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
>> Timo Weingärtner <t...@tiwe.de> writes:
>>> Package: wnpp
>>> Severity: wishlist
>>> X-Debbugs-CC: debian-de...@lists.debian.org
>>>
>>> Package name: hadori
>>> Version: 0.2
>>> Upstream Author: Timo Weingärtner <t...@tiwe.de>
>>> URL: https://github.com/tiwe-de/hadori
>>> License: GPL3+
>>> Description: Hardlinks identical files
>>>  This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it's called HArdlinking DOne RIght.
>>>  .
>>>  Advantages over other hardlinking tools:
>>>  * predictability: arguments are scanned in order, each first version is kept
>>>  * much lower CPU and memory consumption
>>>  * hashing option: speedup on many equal-sized, mostly identical files
>>>
>>> The initial comparison was with hardlink, which got OOM killed with a hundred backups of my home directory. Last night I compared it to duff and rdfind, which would have happily linked files with different st_mtime and st_mode.
>>>
>>> I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug number.
>>>
>>> Greetings
>>> Timo
>>
>> I've been thinking about the problem of memory consumption too. But I've come to a different solution. One that doesn't need memory at all.
>
> I know yet another solution: for each file you visit, you simply visit the complete tree. Then you have n + 1 visits, but get constant space usage.
>
>> Instead of remembering inodes, filenames and checksums, create a global cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>) and hardlink every file to there. If you want/need to include uid, gid, mtime, mode in there, then make that part of the .cache path. Garbage collection in the cache would be removing all files with a link count of 1.
>>
>> Going one step further, link files with unique size [uid, gid, mtime, ...] to .cache/<size> and change that into .cache/<size>/<start of hash>/<hash> when you find a second file with the same size that isn't identical. That would save on the expensive hashing of clearly unique files.
>
> So implement an object store and replace files outside the object store with hardlinks to the store. Yes, this is guaranteed to work for some cases, but it also has problems. If you create files first and then move them to the store, you still need to check every file with link count != 1 and check whether it is in the cache already. And for this, you need a lookup by inode if you want to avoid hashing.
>
> And this is basically the same hierarchy as git has: .git/objects/<first 2 hex digits of sha1sum>/<remaining sha1sum>

In the above, every file is in the cache. A link count of 1 would indicate a new file that hasn't been processed yet. Unfortunately you can also have files with link count != 1 that aren't processed yet, e.g. 2 new files that are hardlinked to each other. Detecting whether a file is already in the cache or not actually needs to check 2 things:

1) link count == 1 => new file, add to cache
2) link count != 1 but hash of file not known (e.g. extended attribute not set) => new set of files that are hardlinks to each other

Actually, the link count can be completely ignored if you always add a flag when you've processed a file.

Note: The above wastes time in the 2nd case, since it would checksum all the files that are hardlinks one by one and replace them with hardlinks into the cache. But you could remember the inode and name of the first occurrence.
This would only use up memory proportionally to the number of new inodes.

>> You could also use a hash that computes the first byte from the first 4k, the second byte from 64k, the third from 1 MB and so on. That way you can check if the beginnings of 2 files match without having to checksum the whole file or literally compare the two.
>
> If the beginnings can match. They're not guaranteed to match just because the hashes match.

This wouldn't be to prove identity but to quickly prove difference. If the first 4k differ, then the files will not match. It only makes sense if you have a lot of big files of equal size.

Regards,
Goswin
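A sketch of the staged hash described above, with std::hash as a stand-in digest: byte 0 summarizes the first 4k, byte 1 the first 64k, byte 2 the first 1 MB, and so on. Comparing two digests byte by byte, the first mismatch proves the files differ after reading only a prefix; equal bytes prove nothing, as noted above. The function name is made up for illustration.

#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// One digest byte per prefix length (4k, 64k, 1M, ...). A mismatch at
// position i means the files already differ within that prefix.
static std::vector<unsigned char> prefix_digest(const std::string &path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> digest;
    std::string prefix;
    for (std::size_t len = 4096; in; len *= 16) {
        std::string chunk(len - prefix.size(), '\0');
        in.read(&chunk[0], chunk.size());
        chunk.resize(in.gcount());
        prefix += chunk;               // the hash always covers the whole prefix
        if (chunk.empty())
            break;                     // end of file: nothing new to summarize
        digest.push_back(static_cast<unsigned char>(
            std::hash<std::string>()(prefix) & 0xff));
    }
    return digest;
}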
Bug#662080: ITP: hadori -- Hardlinks identical files
Package: wnpp
Severity: wishlist
X-Debbugs-CC: debian-de...@lists.debian.org

Package name: hadori
Version: 0.2
Upstream Author: Timo Weingärtner <t...@tiwe.de>
URL: https://github.com/tiwe-de/hadori
License: GPL3+
Description: Hardlinks identical files
 This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it's called HArdlinking DOne RIght.
 .
 Advantages over other hardlinking tools:
 * predictability: arguments are scanned in order, each first version is kept
 * much lower CPU and memory consumption
 * hashing option: speedup on many equal-sized, mostly identical files

The initial comparison was with hardlink, which got OOM killed with a hundred backups of my home directory. Last night I compared it to duff and rdfind, which would have happily linked files with different st_mtime and st_mode.

I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug number.

Greetings
Timo
Bug#662080: ITP: hadori -- Hardlinks identical files
On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
> Package: wnpp
> Severity: wishlist
> X-Debbugs-CC: debian-de...@lists.debian.org
>
> Package name: hadori
> Version: 0.2
> Upstream Author: Timo Weingärtner <t...@tiwe.de>
> URL: https://github.com/tiwe-de/hadori
> License: GPL3+
> Description: Hardlinks identical files
>  This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it's called HArdlinking DOne RIght.
>  .
>  Advantages over other hardlinking tools:
>  * predictability: arguments are scanned in order, each first version is kept
>  * much lower CPU and memory consumption
>  * hashing option: speedup on many equal-sized, mostly identical files
>
> The initial comparison was with hardlink, which got OOM killed with a hundred backups of my home directory. Last night I compared it to duff and rdfind, which would have happily linked files with different st_mtime and st_mode.

You might want to try hardlink 0.2~rc1. In any case, I don't think we need yet another such tool in the archive. If you want that algorithm, we can implement it in hardlink 0.2 using probably about 10 lines. I had that locally and it works, so if you want it, we can add it and avoid the need for one more hack in that space.

hardlink 0.2 is written in C, and uses a binary tree to map (dev_t, off_t) to a struct file which contains the stat information plus the name for linking. It requires two allocations per file: one for the struct file with the filename, and one for the node in the tree (well, actually we only need the node for the first file with a specific (dev_t, off_t) tuple). A node has 3 pointers.

-- 
Julian Andres Klode - Debian Developer, Ubuntu Member
See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.
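The layout described in the last paragraph, sketched out. This is a guess at the shape, not the actual hardlink 0.2 source; the "3 pointers" are read here as the file list plus the two children.

#include <sys/stat.h>

// One allocation per file: stat data plus the name inlined behind it.
struct file {
    struct stat st;      // stat information used to compare candidates
    struct file *next;   // later files in the same (dev, size) class
    char name[1];        // filename, allocated inline after the struct
};

// One node per distinct (st_dev, st_size) tuple.
struct node {
    struct file *files;  // first file with this (dev_t, off_t) key --
    struct node *left;   // together with the two children, the
    struct node *right;  // "3 pointers" of a node
};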
Bug#662080: ITP: hadori -- Hardlinks identical files
Hello Julian Andres,

on 2012-03-04 at 01:07:42 you wrote:
> On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
>> The initial comparison was with hardlink, which got OOM killed with a hundred backups of my home directory. Last night I compared it to duff and rdfind, which would have happily linked files with different st_mtime and st_mode.
>
> You might want to try hardlink 0.2~rc1. In any case, I don't think we need yet another such tool in the archive. If you want that algorithm, we can implement it in hardlink 0.2 using probably about 10 lines. I had that locally and it works, so if you want it, we can add it and avoid the need for one more hack in that space.

And why is lighttpd in the archive? Apache can do the same ...

> hardlink 0.2 is written in C, and uses a binary tree to map (dev_t, off_t) to a struct file which contains the stat information plus the name for linking. It requires two allocations per file: one for the struct file with the filename, and one for the node in the tree (well, actually we only need the node for the first file with a specific (dev_t, off_t) tuple). A node has 3 pointers.

The hardlink I used at that time was written in python and definitely didn't do it the way I want.

hadori is written in C++11, which IMHO makes it look a little more readable. It started with a tree-based map and multimap; now it uses the unordered_ (hash-based) versions, which made it twice as fast in a typical workload.

The main logic is in hadori.C, handle_file, and uses:

std::unordered_map<ino_t, inode const> kept;
std::unordered_map<ino_t, ino_t> to_link;
std::unordered_multimap<off_t, ino_t> sizes;

class inode contains a struct stat, a file name and an Adler checksum, but I plan to drop the last one because I think the hashing option is no great gain.

Greetings
Timo
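A hypothetical reading of how a one-pass, keep-first handle_file could use those three maps -- not hadori's actual code. st_dev, owner and permission checks are left out, and contents_equal() is an assumed helper, defined naively here.

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <sys/stat.h>
#include <unordered_map>

struct inode { struct stat st; std::string filename; };

std::unordered_map<ino_t, inode> kept;        // first file seen per inode
std::unordered_map<ino_t, ino_t> to_link;     // inode -> inode it duplicates
std::unordered_multimap<off_t, ino_t> sizes;  // st_size -> kept inodes

static bool contents_equal(const std::string &a, const std::string &b)
{
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    std::istreambuf_iterator<char> ia(fa), ib(fb), end;
    return std::equal(ia, end, ib);  // lengths already match via sizes
}

void handle_file(const std::string &path, const struct stat &st)
{
    if (kept.count(st.st_ino) || to_link.count(st.st_ino))
        return;  // inode already decided: the first name seen wins
    auto range = sizes.equal_range(st.st_size);
    for (auto it = range.first; it != range.second; ++it) {
        const inode &cand = kept.at(it->second);
        if (cand.st.st_mtime == st.st_mtime && cand.st.st_mode == st.st_mode
            && contents_equal(cand.filename, path)) {
            to_link[st.st_ino] = cand.st.st_ino;  // link path to cand later
            return;
        }
    }
    kept.emplace(st.st_ino, inode{st, path});  // first of its kind: keep it
    sizes.emplace(st.st_size, st.st_ino);
}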
Bug#662080: ITP: hadori -- Hardlinks identical files
Timo Weingärtner <t...@tiwe.de> writes:
> Package: wnpp
> Severity: wishlist
> X-Debbugs-CC: debian-de...@lists.debian.org
>
> Package name: hadori
> Version: 0.2
> Upstream Author: Timo Weingärtner <t...@tiwe.de>
> URL: https://github.com/tiwe-de/hadori
> License: GPL3+
> Description: Hardlinks identical files
>  This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it's called HArdlinking DOne RIght.
>  .
>  Advantages over other hardlinking tools:
>  * predictability: arguments are scanned in order, each first version is kept
>  * much lower CPU and memory consumption
>  * hashing option: speedup on many equal-sized, mostly identical files
>
> The initial comparison was with hardlink, which got OOM killed with a hundred backups of my home directory. Last night I compared it to duff and rdfind, which would have happily linked files with different st_mtime and st_mode.
>
> I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug number.
>
> Greetings
> Timo

I've been thinking about the problem of memory consumption too. But I've come to a different solution. One that doesn't need memory at all.

Instead of remembering inodes, filenames and checksums, create a global cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>) and hardlink every file to there. If you want/need to include uid, gid, mtime, mode in there, then make that part of the .cache path. Garbage collection in the cache would be removing all files with a link count of 1.

Going one step further, link files with unique size [uid, gid, mtime, ...] to .cache/<size> and change that into .cache/<size>/<start of hash>/<hash> when you find a second file with the same size that isn't identical. That would save on the expensive hashing of clearly unique files.

You could also use a hash that computes the first byte from the first 4k, the second byte from 64k, the third from 1 MB and so on. That way you can check if the beginnings of 2 files match without having to checksum the whole file or literally compare the two.

Regards,
Goswin