On 2011-05-02, Justin Sherrill <[email protected]> wrote:

Hi Justin,

> You could dump out the B-tree information. I don't know how clear a
> picture would come from that, and it may require some massaging of
> data anyway since nonduplicated files may have some degree of
> matching, duplicated data anyway, especially when dealing with larger
> image file.

That's a bit beyond my current C programming skills, I guess, and a
little too much effort for this little cleanup project. Anyway, thanks
for the idea.

> If you are sure that the corruption lies at the end of the files, you
> could loop over the files, read the first x bytes of each, then MD5
> that data. Matching MD5 = matching file.

It mostly is at the end. This suggestion (partitioning files into
chunks) is what I had done so far (on Linux), first with a few lines of
shell (adapted from an existing script), then, due to inherent
inefficiencies, in Python: a handful of lines that output "inode,
chunkId, hash" to a file or SQL, then go from there.

I had hoped HAMMER, as a deduplicating filesystem, had tools that could
easily give me that information without "hacks" like the above.

Regards
Thomas

> On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch
> <[email protected]> wrote:
>> Hello,
>>
>> now that Dragonfly's HAMMER has got deduplication I ask myself if there
>> is a simple way to identify "pairs" or groups of files which share a lot
>> of data, i.e. are mostly identical.
>>
>> I have a rather large repository of downloaded pictures, which contain
>> a lot of dupes in multiple locations. I have no problems finding those
>> given some time and a shell prompt.
>>
>> I'm interested in identifying broken files. Broken in the sense that
>> A is an incomplete version of B (some bytes missing), or B a damaged
>> version of A (some additional bytes at the end).
>>
>> Is there a way to get to something like this:
>>
>> "File A shares 1234 (98.3%) data blocks with file B"
>> "File A shares xxxx (xx.x%) data blocks with file C"
>>
>> Getting a step closer helps too.
>>
>> Thanks for any insights.
>>
>>
>> Regards
>> Thomas
>>
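
For reference, the chunk-hashing approach mentioned in the reply (emit
"inode, chunkId, hash" per file, then compare) could look roughly like
the sketch below. This is not the script from the thread. The function
names, the 64 KiB chunk size, and the position-by-position comparison
are all assumptions made for illustration.

```python
import hashlib
import os

CHUNK_SIZE = 64 * 1024  # assumed chunk size; not taken from the thread


def chunk_hashes(path, chunk_size=CHUNK_SIZE):
    """Return the MD5 hex digest of each fixed-size chunk of a file, in order."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def chunk_records(root, chunk_size=CHUNK_SIZE):
    """Yield (inode, chunk_id, hash) for every regular file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            inode = os.stat(path).st_ino
            for chunk_id, digest in enumerate(chunk_hashes(path, chunk_size)):
                yield inode, chunk_id, digest


def shared_chunks(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Count chunks of A equal to the chunk at the same index in B.

    Returns (count, percent of A's chunks). Only same-position chunks are
    compared, which fits damage at the end of a file but not data that
    has shifted by an offset.
    """
    a = chunk_hashes(path_a, chunk_size)
    b = chunk_hashes(path_b, chunk_size)
    same = sum(1 for x, y in zip(a, b) if x == y)
    percent = 100.0 * same / len(a) if a else 0.0
    return same, percent
```

Calling `shared_chunks("a.jpg", "b.jpg")` (hypothetical file names)
returns a count and a percentage that can be formatted as "File A
shares N (P%) data blocks with file B". A truncated file's final,
partial chunk will not match unless the cut falls exactly on a chunk
boundary, so a smaller chunk size gives a finer-grained result at the
cost of more hashes to store.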
