Re: diff or deduplicate two volumes with different folder structures

2016-09-22 Thread Chris Murphy
On Thu, Sep 22, 2016 at 12:56 PM, Matthew Miller
 wrote:
> On Thu, Sep 22, 2016 at 07:57:48PM +0200, Roberto Ragusa wrote:
>> > Don't use MD5. You will get unintentional file collisions. (SHA-256 is
>> > good. It depends on just how much you are comparing.)
>> MD5 unintentional collisions?
>> It is 128 bit, so you will have a collision after about 2^64 files,
>> according to the birthday theorem.
>
> It's pretty unlikely in the real world, but...
>
> ONE="d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
> TWO="d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
> echo $ONE | xxd -r -p | md5sum
> echo $TWO | xxd -r -p | md5sum
> echo $ONE | xxd -r -p | sha256sum
> echo $TWO | xxd -r -p | sha256sum
>

Right, this use case doesn't require a cryptographic hash function. It's
just over 120,000 files. A collision is much less likely than a bit flip
in one of the file copies, in which case the copies end up with different
md5sums and I end up storing both the good and the bad copy.
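
For a rough sense of scale, here is a small sketch of the birthday
approximation p ~ n(n-1)/2 / 2^bits for n files and a hash of the given
width. The function name collision_probability is made up for
illustration, and the formula is only the usual approximation for small
probabilities:

```shell
# collision_probability N BITS
# Approximate chance that at least two of N random BITS-wide hashes collide.
collision_probability() {
    awk -v n="$1" -v bits="$2" \
        'BEGIN { printf "%.3e\n", n * (n - 1) / 2 / 2 ^ bits }'
}

# 120,000 files hashed with a 128-bit function such as MD5
collision_probability 120000 128
```

For 120,000 files and 128 bits the result is on the order of 1e-29. That
covers accidental collisions only, not deliberately constructed ones.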


-- 
Chris Murphy
___
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-le...@lists.fedoraproject.org


Re: diff or deduplicate two volumes with different folder structures

2016-09-22 Thread Matthew Miller
On Thu, Sep 22, 2016 at 07:57:48PM +0200, Roberto Ragusa wrote:
> > Don't use MD5. You will get unintentional file collisions. (SHA-256 is
> > good. It depends on just how much you are comparing.)
> MD5 unintentional collisions?
> It is 128 bit, so you will have a collision after about 2^64 files,
> according to the birthday theorem.

It's pretty unlikely in the real world, but...

ONE="d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
TWO="d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
echo $ONE | xxd -r -p | md5sum
echo $TWO | xxd -r -p | md5sum
echo $ONE | xxd -r -p | sha256sum
echo $TWO | xxd -r -p | sha256sum


-- 
Matthew Miller

Fedora Project Leader


Re: diff or deduplicate two volumes with different folder structures

2016-09-22 Thread Roberto Ragusa
On 09/21/2016 01:01 AM, a...@clueserver.org wrote:

> Don't use MD5. You will get unintentional file collisions. (SHA-256 is
> good. It depends on just how much you are comparing.)

MD5 unintentional collisions?
It is 128 bit, so you will have a collision after about 2^64 files,
according to the birthday theorem.

-- 
   Roberto Ragusa    mail at robertoragusa.it


Re: diff or deduplicate two volumes with different folder structures

2016-09-21 Thread Chris Murphy
What I ended up doing:

$ find /brickA -type f -exec md5sum "{}" + > brickA.txt
$ find /brickB -type f -exec md5sum "{}" + > brickB.txt
$ cut -c 1-32 brickA.txt > brickA_md5.txt
$ grep -v -F -f brickA_md5.txt brickB.txt > onbrickB_notonbrickA.txt
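
A sketch of the same comparison done in one pass with awk, matching on
the hash field only rather than looking for the hash anywhere in the
line. The function name only_in_second is hypothetical, and it assumes
md5sum-style listings where the first field of each line is the hash:

```shell
# only_in_second LIST_A LIST_B
# Print the lines of LIST_B whose hash (field 1) never appears in LIST_A.
only_in_second() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }  # first file: remember hashes
         !($1 in seen)                               # second file: keep unseen ones
        ' "$1" "$2"
}
```

Usage would be something like
"only_in_second brickA.txt brickB.txt > onbrickB_notonbrickA.txt";
swapping the two arguments gives the files on brickA that are missing
from brickB.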

Thanks for the help everyone.

Chris Murphy


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread alan
> On Tue, Sep 20, 2016 at 10:52:10PM +0200, Ahmad Samir wrote:
>> One last try (sometimes an issue nags):
>> $ find A -exec md5sum '{}' + > a-md5
>> $ find B -exec md5sum '{}' + > b-md5
>> $ cat a-md5 b-md5 > All
>> $ sort -u -k 1,1 All > dupes
>>
>> Now, (I hopefully got my head around it this time...), the dupes file
>> should contain a list of files that exist in _both_ A and B; but every
>> two files that have the same md5sum will have _only one_ of them
>> listed (either in A OR B). So if you delete that list of files you
>> should end up with only unique files in both locations.
>
> At the start ISTR you said the two directory trees were different.
> I took that to mean that two files with identical contents could
> be in different directories within the two trees.
>
> If I was wrong in that assumption and each pair of identical
> files would be in the same relative path I have two suggestions.
>
> 1. Sort a-md5 and b-md5
>    Use the comm(1) command.  It will give lines in file a-md5 only,
>    in b-md5 only, and in both files, with 0, 1, or 2 leading tabs.
>    You can also use options to get the 3 columns individually.
>    To do this you would have to cd to A or B and run the find cmds
>    as "find .", not "find A" or "find B".
>
> 2. Get a copy of an old program called dircmp* and run it on the
>    two trees directly.  It will output files only in tree A,
>    only in tree B, then output files in both noting whether
>    they are the same or different contents.
>
> I don't have the compiled version of dircmp, but I have a ksh
> shell script version that is quite similar.

Don't use MD5. You will get unintentional file collisions. (SHA-256 is
good. It depends on just how much you are comparing.)

What I use is a perl script that takes the directories I want to dedupe
and builds a hash table of all the file sizes. I then go through that
table and ignore any file size that has only one entry. Once I have a
list of files with the same size, I build a hash table of the SHA-256
sums for those files. (I plan on adding a preprocessing step that hashes
only the first 16k or so as a first pass, to weed out large files that
are actually different.) Any place where I find a match on both file
size and SHA-256 hash I add to a queue to process later.

Sounds a bit complex, but it works pretty well. Depending on the number of
actual matches, you can go through a few terabytes in a short period of
time.
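
The perl script itself isn't shown here, but the size-then-hash idea can
be sketched in shell roughly as follows. This is an approximation under
stated assumptions, not the script described above: find_dupes is a
made-up name, it relies on GNU find, xargs and uniq, and it assumes no
file name contains a newline.

```shell
# find_dupes DIR...
# Print groups of "sha256  path" lines for files that share both size and
# SHA-256 hash; groups are separated by a blank line.
find_dupes() {
    find "$@" -type f -printf '%s\t%p\n' |
    awk -F '\t' '
        { size[NR] = $1; path[NR] = substr($0, length($1) + 2); count[$1]++ }
        END {
            # Only files whose size occurs more than once can be duplicates.
            for (i = 1; i <= NR; i++)
                if (count[size[i]] > 1) print path[i]
        }' |
    tr '\n' '\0' |
    xargs -0 -r sha256sum |
    sort |
    uniq -w 64 --all-repeated=separate
}
```

Something like "find_dupes /brickA /brickB" would then hash only the
files that have at least one same-sized sibling, which is where the time
saving comes from.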

I hope that makes sense.






Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread Jon LaBadie
On Tue, Sep 20, 2016 at 10:52:10PM +0200, Ahmad Samir wrote:
> One last try (sometimes an issue nags):
> $ find A -exec md5sum '{}' + > a-md5
> $ find B -exec md5sum '{}' + > b-md5
> $ cat a-md5 b-md5 > All
> $ sort -u -k 1,1 All > dupes
> 
> Now, (I hopefully got my head around it this time...), the dupes file
> should contain a list of files that exist in _both_ A and B; but every
> two files that have the same md5sum will have _only one_ of them
> listed (either in A OR B). So if you delete that list of files you
> should end up with only unique files in both locations.

At the start ISTR you said the two directory trees were different.
I took that to mean that two files with identical contents could
be in different directories within the two trees.

If I was wrong in that assumption and each pair of identical
files would be in the same relative path I have two suggestions.

1. Sort a-md5 and b-md5
   Use the comm(1) command.  It will give lines in file a-md5 only,
   in b-md5 only, and in both files, with 0, 1, or 2 leading tabs.
   You can also use options to get the 3 columns individually.
   To do this you would have to cd to A or B and run the find cmds
   as "find .", not "find A" or "find B".

2. Get a copy of an old program called dircmp* and run it on the
   two trees directly.  It will output files only in tree A,
   only in tree B, then output files in both noting whether
   they are the same or different contents.

I don't have the compiled version of dircmp, but I have a ksh
shell script version that is quite similar.

Jon
-- 
Jon H. LaBadie  jo...@jgcomp.com


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread Ahmad Samir
One last try (sometimes an issue nags):
$ find A -exec md5sum '{}' + > a-md5
$ find B -exec md5sum '{}' + > b-md5
$ cat a-md5 b-md5 > All
$ sort -u -k 1,1 All > dupes

Now, (I hopefully got my head around it this time...), the dupes file
should contain a list of files that exist in _both_ A and B; but every
two files that have the same md5sum will have _only one_ of them
listed (either in A OR B). So if you delete that list of files you
should end up with only unique files in both locations.
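
One caveat with "sort -u -k 1,1": it keeps one line for every distinct
hash, including hashes that occur only once, so the result also lists
files that have no duplicate at all. A sketch that prints only the
second and later occurrences of each hash, i.e. the redundant copies,
could look like this (redundant_copies is a made-up name; it assumes
md5sum-style listings such as a-md5 and b-md5):

```shell
# redundant_copies LISTING...
# Print every line whose hash (field 1) has already been seen earlier in
# the input, so the first occurrence of each hash is never listed.
redundant_copies() {
    awk 'seen[$1]++' "$@"
}
```

With "redundant_copies a-md5 b-md5 > dupes", files in B that duplicate
something in A end up in the list while the copy in A is kept. The list
is worth reviewing by hand before deleting anything.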

-- 
Ahmad Samir


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread stan
On Mon, 19 Sep 2016 17:23:39 -0600
Chris Murphy  wrote:

> Drives A and B have many overlapping files but I want to find out what
> files don't exist on each. Thwarting this is directory structure
> differs between the two drives, and I'm fairly certain some of the
> file names differ on the two drives also.
> 
> Therefore I need something hash based. I started with this:
> 
> 
> $ find /brickA -type f -exec md5sum "{}" + > brickA.txt
> $ find /brickB -type f -exec md5sum "{}" + > brickB.txt
> 
> What I need next is to:
> 
> Make a copy of the files, brickAcopy.txt and brickBcopy.txt
> Loop: Extract each md5sum in brickA.txt, grep for it in brickAcopy.txt
> and brickBcopy.txt, and if it's found in both, delete the line in both
> files.
> 
> What remains in each file are paths to files that don't exist on the
> other drive. This must be a solved problem, so I'm open to alternative
> approaches.

> Ideas?
 
Here are some Linux utilities a quick search turned up.
http://www.howtogeek.com/201140/how-to-find-and-remove-duplicate-files-on-linux/
http://askubuntu.com/questions/3865/how-to-find-and-delete-duplicate-files

At least fslint and fdupes are in the Fedora repositories, maybe others.


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 11:55 AM, Ahmad Samir  wrote:
> On 20 September 2016 at 13:00, Ahmad Samir  wrote:
>> On 20 September 2016 at 12:34, Ahmad Samir  wrote:
>>> On 20 September 2016 at 10:33, Ahmad Samir  wrote:

>>>> Here's a crude way:
>>>> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
>>>> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
>>>> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>>>>
>>>> Ignoring lines beginning with @@, +++ or --- , the lines beginning
>>>> with - are in A but not B ... etc

>>>
>>> Please disregard that, it won't work...
>>>
>>
>> More experimenting:
>> $ find A -exec md5sum '{}' + > a-md5
>> $ find B -exec md5sum '{}' + > b-md5
>> $ cat a-md5 b-md5 > All
>> $ sort -u -k 1,1 All
>>
>> that should output a list of files that are in one dir but not the other.
>>
>
> Doesn't work either, sorry for the noise.

I appreciate the effort. Maybe I'm overestimating how common this
situation must be, or underestimating the difficulty.

Anyway it's not super urgent. Btrfs gets in-band deduplication pretty
soon so the older volume can just have both path structures with
deduped data. The volume is too small to do out-of-band dedup which
requires copying all the data over first, and then deduping it.


-- 
Chris Murphy


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread Ahmad Samir
On 20 September 2016 at 13:00, Ahmad Samir  wrote:
> On 20 September 2016 at 12:34, Ahmad Samir  wrote:
>> On 20 September 2016 at 10:33, Ahmad Samir  wrote:
>>>
>>> Here's a crude way:
>>> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
>>> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
>>> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>>>
>>> Ignoring lines beginning with @@, +++ or --- , the lines beginning
>>> with - are in A but not B ... etc
>>>
>>
>> Please disregard that, it won't work...
>>
>
> More experimenting:
> $ find A -exec md5sum '{}' + > a-md5
> $ find B -exec md5sum '{}' + > b-md5
> $ cat a-md5 b-md5 > All
> $ sort -u -k 1,1 All
>
> that should output a list of files that are in one dir but not the other.
>

Doesn't work either, sorry for the noise.

> --
> Ahmad Samir



-- 
Ahmad Samir


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread Ahmad Samir
On 20 September 2016 at 12:34, Ahmad Samir  wrote:
> On 20 September 2016 at 10:33, Ahmad Samir  wrote:
>>
>> Here's a crude way:
>> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
>> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
>> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>>
>> Ignoring lines beginning with @@, +++ or --- , the lines beginning
>> with - are in A but not B ... etc
>>
>
> Please disregard that, it won't work...
>

More experimenting:
$ find A -exec md5sum '{}' + > a-md5
$ find B -exec md5sum '{}' + > b-md5
$ cat a-md5 b-md5 > All
$ sort -u -k 1,1 All

that should output a list of files that are in one dir but not the other.

-- 
Ahmad Samir


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread Ahmad Samir
On 20 September 2016 at 10:33, Ahmad Samir  wrote:
> On 20 September 2016 at 01:23, Chris Murphy  wrote:
>> Drives A and B have many overlapping files but I want to find out what
>> files don't exist on each. Thwarting this is directory structure
>> differs between the two drives, and I'm fairly certain some of the
>> file names differ on the two drives also.
>>
>> Therefore I need something hash based. I started with this:
>>
>>
>> $ find /brickA -type f -exec md5sum "{}" + > brickA.txt
>> $ find /brickB -type f -exec md5sum "{}" + > brickB.txt
>>
>
> Here's a crude way:
> $ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
> $ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
> $ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff
>
> Ignoring lines beginning with @@, +++ or --- , the lines beginning
> with - are in A but not B ... etc
>

Please disregard that, it won't work...

> [...]
>
> --
> Ahmad Samir



-- 
Ahmad Samir


Re: diff or deduplicate two volumes with different folder structures

2016-09-20 Thread Ahmad Samir
On 20 September 2016 at 01:23, Chris Murphy  wrote:
> Drives A and B have many overlapping files but I want to find out what
> files don't exist on each. Thwarting this is directory structure
> differs between the two drives, and I'm fairly certain some of the
> file names differ on the two drives also.
>
> Therefore I need something hash based. I started with this:
>
>
> $ find /brickA -type f -exec md5sum "{}" + > brickA.txt
> $ find /brickB -type f -exec md5sum "{}" + > brickB.txt
>

Here's a crude way:
$ find /brickA -type f -exec md5sum "{}" + | sort > brickA.txt
$ find /brickB -type f -exec md5sum "{}" + | sort > brickB.txt
$ diff -U 0 brickA.txt brickB.txt | sort -k 1.1,1.1 > A-B.diff

Ignoring lines beginning with @@, +++ or --- , the lines beginning
with - are in A but not B ... etc
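
The diff above compares whole lines, and every line carries a path, so
identical files stored under different paths still show up as
differences. A sketch that compares only the hash column instead
(hashes_only_in_first is a hypothetical name; it assumes md5sum-style
listings with a 32-character hash at the start of each line):

```shell
# hashes_only_in_first LIST_A LIST_B
# Print the hashes (first 32 characters of each line) that are present in
# LIST_A but absent from LIST_B.
hashes_only_in_first() {
    a_hashes=$(mktemp) && b_hashes=$(mktemp) || return 1
    cut -c 1-32 "$1" | LC_ALL=C sort -u > "$a_hashes"
    cut -c 1-32 "$2" | LC_ALL=C sort -u > "$b_hashes"
    LC_ALL=C comm -23 "$a_hashes" "$b_hashes"
    rm -f "$a_hashes" "$b_hashes"
}
```

The resulting hashes can be fed to "grep -F -f" against brickA.txt to
get the matching paths back.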

[...]

-- 
Ahmad Samir


Re: diff or deduplicate two volumes with different folder structures

2016-09-19 Thread geo.inbox.ignored


On 09/19/2016 06:23 PM, Chris Murphy wrote:
> Drives A and B have many overlapping files but I want to find out what
> files don't exist on each.

you might consider;

  rsync -avh /brickA/ /brickB/
 then
  rsync -avh /brickB/ /brickA/

to dupe files on both drives.

read 'man rsync' for arguments '-a', '-v', '-h' and for the
'-c, --checksum' feature.

to find same file with different names, you will still need to run
'find' with '-exec md5sum', but only on 1 drive as both drives will
have same dupes of diff names.


-- 

peace out.

CentOS GNU/Linux 6.8

tc,hago.

g
.

=+=
Tired of having your microsoft os hacked?
Change to Linux os, used by microsoft hackers.
=+=
in a world with out fences, who needs gates.
=+=