Re: diff or deduplicate two volumes with different folder structures
On Thu, Sep 22, 2016 at 12:56 PM, Matthew Miller wrote:
> On Thu, Sep 22, 2016 at 07:57:48PM +0200, Roberto Ragusa wrote:
>> > Don't use MD5. You will get unintentional file collisions. (SHA-256 is
>> > good. It depends on just how much you are comparing.)
>> MD5 unintentional collisions?
>> It is 128 bit, so you will have a collision after about 2^64 files,
>> according to the birthday theorem.
>
> It's pretty unlikely in the real world, but...
>
> ONE="d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
> TWO="d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
> echo $ONE | xxd -r -p | md5sum
> echo $TWO | xxd -r -p | md5sum
> echo $ONE | xxd -r -p | sha256sum
> echo $TWO | xxd -r -p | sha256sum
>

Right, this use case doesn't require a cryptographic function. It's
just over 120,000 files. More likely than a collision is that a file
copy has a bit flip, so the copies end up with different md5sums, and
therefore I end up storing both good and bad copies.

--
Chris Murphy
___
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-le...@lists.fedoraproject.org
Re: diff or deduplicate two volumes with different folder structures
On Thu, Sep 22, 2016 at 07:57:48PM +0200, Roberto Ragusa wrote:
> > Don't use MD5. You will get unintentional file collisions. (SHA-256 is
> > good. It depends on just how much you are comparing.)
> MD5 unintentional collisions?
> It is 128 bit, so you will have a collision after about 2^64 files,
> according to the birthday theorem.

It's pretty unlikely in the real world, but...

ONE="d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
TWO="d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
echo $ONE | xxd -r -p | md5sum
echo $TWO | xxd -r -p | md5sum
echo $ONE | xxd -r -p | sha256sum
echo $TWO | xxd -r -p | sha256sum

--
Matthew Miller
Fedora Project Leader
Re: diff or deduplicate two volumes with different folder structures
On 09/21/2016 01:01 AM, a...@clueserver.org wrote:
> Don't use MD5. You will get unintentional file collisions. (SHA-256 is
> good. It depends on just how much you are comparing.)

MD5 unintentional collisions?
It is 128 bit, so you will have a collision after about 2^64 files,
according to the birthday theorem.

--
Roberto Ragusa    mail at robertoragusa.it
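The birthday estimate can be turned into a rough number. The sketch below assumes the usual approximation p ~= n^2 / 2^(b+1) for n uniformly random b-bit hashes, which only holds while p is tiny, and plugs in the figure of roughly 120,000 files mentioned elsewhere in the thread. It covers accidental collisions only, not deliberately constructed ones.

```shell
#!/bin/sh
# Birthday approximation: p ~= n^2 / 2^(b+1) for n random b-bit hashes.
# Assumption: hashes behave as uniformly random values (accidental collisions only).
n=120000   # approximate file count mentioned in the thread
bits=128   # MD5 digest size in bits
p=$(awk -v n="$n" -v b="$bits" 'BEGIN { printf "%.1e", (n * n) / (2 ^ (b + 1)) }')
echo "approximate accidental-collision probability: $p"
```

Changing bits to 256 gives the corresponding estimate for SHA-256.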
Re: diff or deduplicate two volumes with different folder structures
What I ended up doing:

$ find /brickA -type f -exec md5sum "{}" + > brickA.txt
$ find /brickB -type f -exec md5sum "{}" + > brickB.txt
$ cut -c 1-32 brickA.txt > brickA_md5.txt
$ grep -v -F -f brickA_md5.txt brickB.txt > onbrickB_notonbrickA.txt

Thanks for the help everyone.

Chris Murphy
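One possible variation on the commands above, offered as a sketch and not as what was actually run: grep -F -f matches a hash anywhere on the line (including inside a path) and the commands only produce the B-not-A direction. The awk version below compares the hash field only and is run once per direction. The demo listings, placeholder hashes, output file names, and the only_in helper are all made up for illustration.

```shell
#!/bin/sh
# Demo data in md5sum output format: "<hash>  <path>". Hashes are placeholders.
work=$(mktemp -d)
cat > "$work/brickA.txt" <<'EOF'
11111111111111111111111111111111  /brickA/photos/a.jpg
22222222222222222222222222222222  /brickA/docs/b.txt
EOF
cat > "$work/brickB.txt" <<'EOF'
22222222222222222222222222222222  /brickB/misc/b-renamed.txt
33333333333333333333333333333333  /brickB/music/c.ogg
EOF

# First pass (NR==FNR) records every hash from the first file; the second
# pass prints lines of the second file whose hash was never recorded.
# Caveat: if the first file is empty, the NR==FNR test misfires.
only_in() {  # usage: only_in OTHER_LIST THIS_LIST
    awk 'NR==FNR { seen[$1] = 1; next } !($1 in seen)' "$1" "$2"
}

only_in "$work/brickA.txt" "$work/brickB.txt" > "$work/onB_notonA.txt"
only_in "$work/brickB.txt" "$work/brickA.txt" > "$work/onA_notonB.txt"

cat "$work/onB_notonA.txt" "$work/onA_notonB.txt"
```

With real data, the two demo listings would be replaced by the brickA.txt and brickB.txt files produced by find.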
Re: diff or deduplicate two volumes with different folder structures
> On Tue, Sep 20, 2016 at 10:52:10PM +0200, Ahmad Samir wrote:
>> One last try (sometimes an issue nags):
>> $ find A -exec md5sum '{}' + > a-md5
>> $ find B -exec md5sum '{}' + > b-md5
>> $ cat a-md5 b-md5 > All
>> $ sort -u -k 1,1 All > dupes
>>
>> Now, (I hopefully got my head around it this time...), the dupes file
>> should contain a list of files that exist in _both_ A and B; but every
>> two files that have the same md5sum will have _only one_ of them
>> listed (either in A OR B). So if you delete that list of files you
>> should end up with only unique files in both locations.
>
> At the start ISTR you said the two directory trees were different.
> I took that to mean that two files with identical contents could
> be in different directories within the two trees.
>
> If I was wrong in that assumption and each pair of identical
> files would be in the same relative path I have two suggestions.
>
> 1. Sort a-md5 and b-md5
>    Use the comm(1) command. It will give lines in both files,
>    in file a-md5 only and in b-md5 only with 0, 1, or 2 tabs.
>    You can also use options to get the 3 columns individually.
>    To do this you would have cd to A or B and run the find cmds
>    as "find .", not "find A or B".
>
> 2. Get a copy of an old program called dircmp* and run it on the
>    two trees directly. It will output files only in tree A,
>    only in tree B, then output files in both noting whether
>    they are the same or different contents.
>
> I don't have the compiled version of dircmp, but I have a ksh
> shell script version that is quite similar.

Don't use MD5. You will get unintentional file collisions. (SHA-256 is
good. It depends on just how much you are comparing.)

What I use is a perl script that takes the directories I want to dedupe
and builds a hash table of all the file sizes. I then go through that
set of hashes and ignore anything that has only one element for a
particular file size.

Once I have a list of files with the same size, I then build a hash
table of the SHA-256 sums for those files. (I plan on adding a
preprocess to only hash the first 16k or so as a first pass to weed out
large files that are actually different.)

Any place where I find a match on both file size and SHA-256 hash I add
to a queue to process later.

Sounds a bit complex, but it works pretty well. Depending on the number
of actual matches, you can go through a few terabytes in a short period
of time.

I hope that makes sense.
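The size-first idea described above can be approximated in shell. This is only a sketch of the technique, not the perl script, which was not posted. It assumes GNU find, xargs, and uniq (as shipped on Fedora) and paths without tabs or newlines; the demo tree and file names are invented.

```shell
#!/bin/sh
# Demo tree: two identical files, one same-size file with different
# content, and one file whose size is unique.
work=$(mktemp -d)
mkdir -p "$work/A/x" "$work/B/y"
printf 'same content\n' > "$work/A/x/one.txt"
printf 'same content\n' > "$work/B/y/uno.txt"
printf 'different!!!\n' > "$work/B/y/other.txt"
printf 'unique size file\n' > "$work/A/solo.txt"

# Pass 1: list "size<TAB>path", then keep only paths whose size occurs
# more than once (the same file is read twice: count, then filter).
find "$work/A" "$work/B" -type f -printf '%s\t%p\n' > "$work/sizes.tsv"
awk -F '\t' 'NR==FNR { count[$1]++; next } count[$1] > 1 { print $2 }' \
    "$work/sizes.tsv" "$work/sizes.tsv" > "$work/candidates.txt"

# Pass 2: hash only the candidates and keep groups of lines whose first
# 64 characters (the SHA-256 digest) repeat.
tr '\n' '\0' < "$work/candidates.txt" | xargs -0 -r sha256sum \
    | LC_ALL=C sort | uniq -w 64 --all-repeated=separate > "$work/dupes.txt"

cat "$work/dupes.txt"
```

Files with a unique size are never read at all, which is where the time saving comes from on large trees.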
Re: diff or deduplicate two volumes with different folder structures
On Tue, Sep 20, 2016 at 10:52:10PM +0200, Ahmad Samir wrote:
> One last try (sometimes an issue nags):
> $ find A -exec md5sum '{}' + > a-md5
> $ find B -exec md5sum '{}' + > b-md5
> $ cat a-md5 b-md5 > All
> $ sort -u -k 1,1 All > dupes
>
> Now, (I hopefully got my head around it this time...), the dupes file
> should contain a list of files that exist in _both_ A and B; but every
> two files that have the same md5sum will have _only one_ of them
> listed (either in A OR B). So if you delete that list of files you
> should end up with only unique files in both locations.

At the start ISTR you said the two directory trees were different.
I took that to mean that two files with identical contents could
be in different directories within the two trees.

If I was wrong in that assumption and each pair of identical
files would be in the same relative path I have two suggestions.

1. Sort a-md5 and b-md5
   Use the comm(1) command. It will give lines in both files,
   in file a-md5 only and in b-md5 only with 0, 1, or 2 tabs.
   You can also use options to get the 3 columns individually.
   To do this you would have cd to A or B and run the find cmds
   as "find .", not "find A or B".

2. Get a copy of an old program called dircmp* and run it on the
   two trees directly. It will output files only in tree A,
   only in tree B, then output files in both noting whether
   they are the same or different contents.

I don't have the compiled version of dircmp, but I have a ksh
shell script version that is quite similar.

Jon
--
Jon H. LaBadie jo...@jgcomp.com
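As a sketch of the comm(1) suggestion, adapted to the case where paths differ: comparing only the hash column sidesteps the need for matching relative paths. The hash lists below are short placeholders, and the file names a-hashes and b-hashes are assumptions (on real data they could come from something like cut -c 1-32 a-md5).

```shell
#!/bin/sh
work=$(mktemp -d)
# Placeholder hash lists, one hash per line.
printf '%s\n' aaa bbb ccc > "$work/a-hashes"
printf '%s\n' bbb ccc ddd > "$work/b-hashes"

# comm needs both inputs sorted in the same collation order.
LC_ALL=C sort -u "$work/a-hashes" > "$work/a.sorted"
LC_ALL=C sort -u "$work/b-hashes" > "$work/b.sorted"

only_a=$(LC_ALL=C comm -23 "$work/a.sorted" "$work/b.sorted")  # column 1: only in A
only_b=$(LC_ALL=C comm -13 "$work/a.sorted" "$work/b.sorted")  # column 2: only in B
both=$(LC_ALL=C comm -12 "$work/a.sorted" "$work/b.sorted")    # column 3: in both

printf 'only in A:\n%s\nonly in B:\n%s\nin both:\n%s\n' "$only_a" "$only_b" "$both"
```

The resulting hashes still have to be mapped back to paths, for example by grepping the original listings.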
Re: diff or deduplicate two volumes with different folder structures
One last try (sometimes an issue nags):
$ find A -exec md5sum '{}' + > a-md5
$ find B -exec md5sum '{}' + > b-md5
$ cat a-md5 b-md5 > All
$ sort -u -k 1,1 All > dupes

Now, (I hopefully got my head around it this time...), the dupes file
should contain a list of files that exist in _both_ A and B; but every
two files that have the same md5sum will have _only one_ of them
listed (either in A OR B). So if you delete that list of files you
should end up with only unique files in both locations.

--
Ahmad Samir
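Note that sort -u -k 1,1 keeps one line for every distinct hash, whether or not that hash was shared, so its output alone cannot separate duplicated files from unique ones. A sketch of one way to split the two cases, assuming GNU uniq (for -w and -D) and made-up listings with 32-character placeholder hashes:

```shell
#!/bin/sh
work=$(mktemp -d)
# Placeholder listings in md5sum format: "<hash>  <path>".
cat > "$work/a-md5" <<'EOF'
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  A/one
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb  A/two
EOF
cat > "$work/b-md5" <<'EOF'
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb  B/dos
cccccccccccccccccccccccccccccccc  B/tres
EOF

# Sort so equal hashes are adjacent, then compare only the first 32 characters.
cat "$work/a-md5" "$work/b-md5" | LC_ALL=C sort > "$work/All"
uniq -w 32 -u "$work/All" > "$work/unique"   # hash occurs exactly once overall
uniq -w 32 -D "$work/All" > "$work/dupes"    # every line whose hash repeats

cat "$work/unique"
```

One caveat: a file duplicated only within A (and absent from B) would also land in dupes, so the paths in each group still need checking before anything is deleted.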
Re: diff or deduplicate two volumes with different folder structures
On Mon, 19 Sep 2016 17:23:39 -0600 Chris Murphy wrote:
> Drives A and B have many overlapping files but I want to find out what
> files don't exist on each. Thwarting this is directory structure
> differs between the two drives, and I'm fairly certain some of the
> file names differ on the two drives also.
>
> Therefore I need something hash based. I started with this:
>
>
> $ find /brickA -type f -exec md5sum "{}" + > brickA.txt
> $ find /brickB -type f -exec md5sum "{}" + > brickB.txt
>
> What I need next is to:
>
> Make a copy of the files, brickAcopy.txt and brickBcopy.txt
> Loop: Extract each md5sum in brickA.txt, grep for it in brickAcopy.txt
> and brickBcopy.txt, and if it's found in both, delete the line in both
> files.
>
> What remains in each file are paths to files that don't exist on the
> other drive. This must be a solved problem, so I'm open to alternative
> approaches.
> Ideas?

Here are some Linux utilities a quick search turned up.

http://www.howtogeek.com/201140/how-to-find-and-remove-duplicate-files-on-linux/
http://askubuntu.com/questions/3865/how-to-find-and-delete-duplicate-files

At least fslint and fdupes are in the Fedora repositories, maybe others.
Re: diff or deduplicate two volumes with different folder structures
On Tue, Sep 20, 2016 at 11:55 AM, Ahmad Samir wrote:
> On 20 September 2016 at 13:00, Ahmad Samir wrote:
>> On 20 September 2016 at 12:34, Ahmad Samir wrote:
>>> On 20 September 2016 at 10:33, Ahmad Samir wrote:
>>>>
>>>> Here's a crude way:
>>>> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
>>>> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
>>>> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>>>>
>>>> Ignoring lines beginning with @@, +++ or --- , the lines beginning
>>>> with - are in A but not B ... etc
>>>>
>>>
>>> Please disregard that, it won't work...
>>>
>>
>> More experimenting:
>> $ find A -exec md5sum '{}' + > a-md5
>> $ find B -exec md5sum '{}' + > b-md5
>> $ cat a-md5 b-md5 > All
>> $ sort -u -k 1,1 All
>>
>> that should output a list of files that are in one dir but not the other.
>>
>
> Doesn't work either, sorry for the noise.

I appreciate the effort. Maybe I'm overestimating how common this
situation must be, or underestimating the difficulty.

Anyway it's not super urgent. Btrfs gets in-band deduplication pretty
soon so the older volume can just have both path structures with
deduped data. The volume is too small to do out-of-band dedup which
requires copying all the data over first, and then deduping it.

--
Chris Murphy
Re: diff or deduplicate two volumes with different folder structures
On 20 September 2016 at 13:00, Ahmad Samir wrote:
> On 20 September 2016 at 12:34, Ahmad Samir wrote:
>> On 20 September 2016 at 10:33, Ahmad Samir wrote:
>>>
>>> Here's a crude way:
>>> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
>>> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
>>> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>>>
>>> Ignoring lines beginning with @@, +++ or --- , the lines beginning
>>> with - are in A but not B ... etc
>>>
>>
>> Please disregard that, it won't work...
>>
>
> More experimenting:
> $ find A -exec md5sum '{}' + > a-md5
> $ find B -exec md5sum '{}' + > b-md5
> $ cat a-md5 b-md5 > All
> $ sort -u -k 1,1 All
>
> that should output a list of files that are in one dir but not the other.
>

Doesn't work either, sorry for the noise.

> --
> Ahmad Samir

--
Ahmad Samir
Re: diff or deduplicate two volumes with different folder structures
On 20 September 2016 at 12:34, Ahmad Samir wrote:
> On 20 September 2016 at 10:33, Ahmad Samir wrote:
>>
>> Here's a crude way:
>> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
>> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
>> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>>
>> Ignoring lines beginning with @@, +++ or --- , the lines beginning
>> with - are in A but not B ... etc
>>
>
> Please disregard that, it won't work...
>

More experimenting:
$ find A -exec md5sum '{}' + > a-md5
$ find B -exec md5sum '{}' + > b-md5
$ cat a-md5 b-md5 > All
$ sort -u -k 1,1 All

that should output a list of files that are in one dir but not the other.

--
Ahmad Samir
Re: diff or deduplicate two volumes with different folder structures
On 20 September 2016 at 10:33, Ahmad Samir wrote:
> On 20 September 2016 at 01:23, Chris Murphy wrote:
>> Drives A and B have many overlapping files but I want to find out what
>> files don't exist on each. Thwarting this is directory structure
>> differs between the two drives, and I'm fairly certain some of the
>> file names differ on the two drives also.
>>
>> Therefore I need something hash based. I started with this:
>>
>>
>> $ find /brickA -type f -exec md5sum "{}" + > brickA.txt
>> $ find /brickB -type f -exec md5sum "{}" + > brickB.txt
>>
>
> Here's a crude way:
> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>
> Ignoring lines beginning with @@, +++ or --- , the lines beginning
> with - are in A but not B ... etc
>

Please disregard that, it won't work...

> [...]
>
> --
> Ahmad Samir

--
Ahmad Samir
Re: diff or deduplicate two volumes with different folder structures
On 20 September 2016 at 01:23, Chris Murphy wrote:
> Drives A and B have many overlapping files but I want to find out what
> files don't exist on each. Thwarting this is directory structure
> differs between the two drives, and I'm fairly certain some of the
> file names differ on the two drives also.
>
> Therefore I need something hash based. I started with this:
>
>
> $ find /brickA -type f -exec md5sum "{}" + > brickA.txt
> $ find /brickB -type f -exec md5sum "{}" + > brickB.txt
>

Here's a crude way:
$ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
$ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
$ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff

Ignoring lines beginning with @@, +++ or --- , the lines beginning
with - are in A but not B ... etc

[...]

--
Ahmad Samir
Re: diff or deduplicate two volumes with different folder structures
On 09/19/2016 06:23 PM, Chris Murphy wrote:
> Drives A and B have many overlapping files but I want to find out what
> files don't exist on each.

you might consider;

  rsync -avh /brickA/ /brickB/

then

  rsync -avh /brickB/ /brickA/

to dupe files on both drives.

read 'man rsync' for arguments '-a', '-v', '-h' and for the
'-c, --checksum' feature.

to find same file with different names, you will still need to run
'find' with '-exec md5sum', but only on 1 drive as both drives will
have same dupes of diff names.

--
peace out.
CentOS GNU/Linux 6.8
tc,hago.
g
.
=+=
Tired of having your microsoft os hacked? Change to Linux os, used by microsoft hackers.
=+=
in a world with out fences, who needs gates.
=+=
diff or deduplicate two volumes with different folder structures
Drives A and B have many overlapping files but I want to find out what
files don't exist on each. Thwarting this is directory structure
differs between the two drives, and I'm fairly certain some of the
file names differ on the two drives also.

Therefore I need something hash based. I started with this:

$ find /brickA -type f -exec md5sum "{}" + > brickA.txt
$ find /brickB -type f -exec md5sum "{}" + > brickB.txt

What I need next is to:

Make a copy of the files, brickAcopy.txt and brickBcopy.txt
Loop: Extract each md5sum in brickA.txt, grep for it in brickAcopy.txt
and brickBcopy.txt, and if it's found in both, delete the line in both
files.

What remains in each file are paths to files that don't exist on the
other drive. This must be a solved problem, so I'm open to alternative
approaches.

Both drives use Btrfs, I can create snapshots and perform a "dedup"
operation on those snapshots directly. Ideally the dedup would delete
the files in both snapshots (i.e. it'd be considered data loss if it
weren't for the snapshots) just to save time. But if necessary I'll
just do a one way dedup with the two operations reversed and suffer
the extra processing time.

Ideas?

--
Chris Murphy