Re: [BackupPC-users] Storing Identical Files?

2023-02-12 Thread backuppc
You basically need to read in all the attrib files and create a list
of pairs of file --> md5sum; then you can 'grep' to find the missing
ones.

It's slow...
I created and posted a script to do this a while ago...
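In case it helps while that script is dug up, here is a rough sketch of the same idea in shell. The BackupPC_attribPrint path, its output format, and the TopDir are assumptions for a typical Debian install; treat it as a starting point, not the posted script.

```shell
# Dump every attrib file in the pc tree, then search the dump for the
# digest that BackupPC_refCountUpdate reported missing. Slow, as noted.
TOPDIR=${TOPDIR:-/var/lib/backuppc}
ATTRIB_PRINT=${ATTRIB_PRINT:-/usr/share/backuppc/bin/BackupPC_attribPrint}
DIGEST=${DIGEST:-bf24efdde8d7b6021e71eed312869c2a}   # digest from the log line

if [ -x "$ATTRIB_PRINT" ]; then
    find "$TOPDIR/pc" -name 'attrib_*' 2>/dev/null | while read -r a; do
        printf '== %s ==\n' "$a"      # header so a hit can be traced to a path
        "$ATTRIB_PRINT" "$a"
    done | grep -B 20 "$DIGEST" || true
fi
```

The `-B 20` is a crude way to keep the `== path ==` header near a matching digest; a real script would parse the dump properly.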

Samual Flossie wrote at about 11:30:28 -0500 on Sunday, February 12, 2023:
 > I would like to know the inverse of Kenneth's question:
 > How might one identify the real file associated with a pool file?
 > I am trying to figure out what real files are missing or lost forever, now
 > that I see hundreds of "BackupPC_refCountUpdate: missing pool file
 > bf24efdde8d7b6021e71eed312869c2a count 4" in nightly logs.
 > 
 > On Sun, Feb 12, 2023 at 9:06 AM Kenneth Porter 
 > wrote:
 > 
 > > How might one identify a pool file associated with a real file? One
 > > could then verify that the same pool file represents both clients'
 > > copies of the same file. Should a simple md5sum of the files on the
 > > clients match that of the pool file?
 > >
 > >
 > >
 > >
 > > ___
 > > BackupPC-users mailing list
 > > BackupPC-users@lists.sourceforge.net
 > > List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
 > > Wiki: https://github.com/backuppc/backuppc/wiki
 > > Project: https://backuppc.github.io/backuppc/
 > >


Re: [BackupPC-users] Storing Identical Files?

2023-02-12 Thread backuppc
Two options:
- Read the associated attrib file in the pc tree to determine the
  pool file corresponding to any given file
- Calculate the md5sum of the file and look up the corresponding
  md5sum in the pool tree (assuming there are no md5sum collisions,
  which are extremely unlikely; if there are collisions, find the
  version whose contents match)
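A minimal shell sketch of the second option, with hedges: the TopDir and the sample client path are placeholders, and this relies on v4 naming pool files by the full-file md5 digest (compressed pool files under cpool/ keep the digest of the *uncompressed* contents in their names):

```shell
# Hash a client file and look for a pool file of that name.
TOPDIR=${TOPDIR:-/var/lib/backuppc}
FILE=${FILE:-/srv/pics/photo.jpg}    # hypothetical client-side path

d=$(md5sum "$FILE" 2>/dev/null | cut -d' ' -f1)
if [ -n "$d" ]; then
    # The trailing * also catches collision-chain names like <md5>_0
    find "$TOPDIR/pool" "$TOPDIR/cpool" -type f -name "${d}*" 2>/dev/null
fi
```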
 
Kenneth Porter wrote at about 06:03:46 -0800 on Sunday, February 12, 2023:
 > How might one identify a pool file associated with a real file? One 
 > could then verify that the same pool file represents both clients' 
 > copies of the same file. Should a simple md5sum of the files on the 
 > clients match that of the pool file?
 > 
 > 
 > 
 > 


Re: [BackupPC-users] Storing Identical Files?

2023-02-12 Thread Samual Flossie
I would like to know the inverse of Kenneth's question:
How might one identify the real file associated with a pool file?
I am trying to figure out what real files are missing or lost forever, now
that I see hundreds of "BackupPC_refCountUpdate: missing pool file
bf24efdde8d7b6021e71eed312869c2a count 4" in nightly logs.

On Sun, Feb 12, 2023 at 9:06 AM Kenneth Porter 
wrote:

> How might one identify a pool file associated with a real file? One
> could then verify that the same pool file represents both clients'
> copies of the same file. Should a simple md5sum of the files on the
> clients match that of the pool file?
>
>
>
>


Re: [BackupPC-users] Storing Identical Files?

2023-02-12 Thread Guillermo Rozas
On Sun, Feb 12, 2023, 04:36 Christian Völker  wrote:

> Meanwhile I realized I had a different issue. The share was backed up one
> day, on the other day (for different reasons) the clientB.pl was
> overwritten by a previous version and the share did not get backed up.
> However, in the graphs I notice a drop in the red line:
> [image: PoolUsage]
>
> So it is doing deduplication and I am just not patient enough?
>

When you say "I see an increase in pool usage that doesn't come down", do
you refer to the green area or to the red line? The latter, as stated in the
legend, is "Prior to pooling and compression": it is not affected by
deduplication or compression, being the size the pool would have if both
features were disabled. The real pool size is the green area, which
should be similar to running "du -s" on the pool folder.
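A quick way to check this (the path is the Debian default TopDir; adjust for your install):

```shell
# The totals here should roughly match the green area in the graph;
# the red line is the hypothetical pre-pooling, pre-compression size.
du -sh /var/lib/backuppc/pool /var/lib/backuppc/cpool 2>/dev/null || true
```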

Best regards,
Guillermo



Re: [BackupPC-users] Storing Identical Files?

2023-02-12 Thread Kenneth Porter
How might one identify a pool file associated with a real file? One 
could then verify that the same pool file represents both clients' 
copies of the same file. Should a simple md5sum of the files on the 
clients match that of the pool file?







Re: [BackupPC-users] Storing Identical Files?

2023-02-12 Thread G.W. Haywood via BackupPC-users

Hi there,

Today I realized I hadn't sent this.  It may have been overtaken by
events but here it is anyway...

On Sat, 11 Feb 2023, Christian Völker wrote:

I have two clients which have a large share. These two (Debian) clients 
sync this share on a daily basis through rsync (through a third clientC, 
but this should not make a difference). On clientA there is a cron job 
doing rsync to clientC and on clientB there is a cron job doing rsync 
from clientC. So in the end all three hosts have identical data. ...


BackupPC itself is only backing up host clientA so far (for months 
now). So the data is stored in /var/lib/backuppc.


Now I added the clientB share to BackupPC ... usage of the pool
increased by approximately the size of the share ...


You have missed some important information.

1. May we see your BackupPC configuration files?

2. What is 'large' in 'large share'?  Obviously adding an extra client
to the backup will produce a requirement for storage of a large amount
of metadata.  Perhaps that's what you're seeing, although without more
information about data volume it's difficult to guess what's going on.

3. Do the files in the shares change?  I presume that they do, or there
would be no need to sync them, which raises the next two questions:

4. When do the files change? and

5. When do the backups take place?

Obviously if large numbers of the files change between backups and the
first backup takes place before the changes while the second backup
takes place after it, then you cannot expect deduplication to help.


*  Is there dupe detection on BackupPC?


Yes.  We routinely back up just under 20 Terabytes of data from 12
hosts.  After pooling and compression the pool size is 640 Gigabytes.


* If so, why does my pool size not decrease after a while?

* If by default it has to decrease, is there an explanation why it
does not on my host?


I do not know the answers to these questions.  More information is needed.

Faced with this kind of situation I would investigate, in order to

(1) justify my trust in the numbers on which I base any conclusions and

(2) verify (if possible for a few, hopefully large, sample duplicated
files) that the physical storage location for the duplicated files on
the storage medium was the same - thus demonstrating deduplication.
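On a v4 pool, where files are stored once and referenced by digest rather than hard-linked, one hedged way to do (2) is to dump the relevant attrib files for both hosts and check that the same file carries the same pool digest in each. All paths below are placeholders, and BackupPC_attribPrint's location and output format may differ on your install:

```shell
# Compare the digests two hosts' backups record for the same file.
ATTRIB_PRINT=${ATTRIB_PRINT:-/usr/share/backuppc/bin/BackupPC_attribPrint}
A=${A:-/var/lib/backuppc/pc/clientA/100/fshare/attrib_XXX}   # placeholder
B=${B:-/var/lib/backuppc/pc/clientB/1/fshare/attrib_XXX}     # placeholder

if [ -x "$ATTRIB_PRINT" ] && [ -f "$A" ] && [ -f "$B" ]; then
    "$ATTRIB_PRINT" "$A" > /tmp/attrib_a.txt
    "$ATTRIB_PRINT" "$B" > /tmp/attrib_b.txt
    # Matching digests for the same file name mean both backups point
    # at a single pool file, i.e. deduplication is working.
    diff /tmp/attrib_a.txt /tmp/attrib_b.txt || true
fi
```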

--

73,
Ged.




Re: [BackupPC-users] Storing Identical Files?

2023-02-11 Thread Christian Völker

Hi,

thanks for your ideas.

So unless you have either an md5sum collision (extremely unlikely
unless creating them intentionally -- as in (number of files)*2^-128
unlikely), you shouldn't have any files in your pool with an
underscore in them.
Under pool/ I do not have any files with an underscore. Under pc/ there 
are loads of them, all named "attrib_*". I guess this is OK so far.



If the contents are the same, then indeed for some reason
de-duplication isn't working.
The contents are the same - I rsync'ed them (with -avH). Some 
attributes (last access time and so on) might be different. The paths 
might differ as well: /srv/pics on clientA vs. /srv/share/pics on clientB. 
But that is for sure the only difference.



The only thing that I could think of
that could possibly cause duplicates is if the compression is set
differently on the different backups -- but I'm not sure that would
even create a problem.

They are all uncompressed (cpool is empty and compression is disabled):
*$Conf{CompressLevel} = 0;*


Also, confirm that all your backups are in a v4 pool...

Yes, they are all in v4, v3 is disabled:
*$Conf{PoolV3Enabled} = 0;*


Meanwhile I realized I had a different issue. The share was backed up 
one day; the other day (for different reasons) the clientB.pl was 
overwritten by a previous version and the share did not get backed up. 
However, in the graphs I notice a drop in the red line:

[image: PoolUsage]

So it is doing deduplication and I am just not patient enough?


Really unsure here...
Thanks for all hints!

Greetings
/KNEBB



Christian Völker wrote at about 09:12:22 +0100 on Saturday, February 11, 2023:
  > Hi,
  >
  > I am using BackupPC now for years. It is really great. Meanwhile I use
  > v4.4.0 on Debian.
  >
  > As far as I understood, it is very efficient in storing identical data.
  > Now I noticed something which let me doubt this. I guess there is an
  > explanation. So what do I have?
  >
  > I have two clients which have a large share. These two (Debian) clients
  > sync this share on a daily basis through rsync (through a third clientC,
  > but this should not make a difference). On clientA there is a cron job
  > doing rsync to clientC and on clientB there is a cron job doing rsync
  > from clientC. So in the end all three hosts have identical data. rsync
  > command runs through ssh and use "-avH".
  >
  > BackupPC itself is only backing up host clientA so far (for months
  > now).  So the data is stored in /var/lib/backuppc.
  >
  > Now I added the clientB share to BackupPC and expected the filesystem
  > usage on /var/lib/backuppc to stay more or less equal after the backup
  > of clientB as the data is already stored from clientA. At least after a
  > while when doing some cleanups.
  >
  > Unfortunately, the usage of the pool increased by approximately the
  > size of the share and has not dropped since (more than a week now).
  >
  > So my questions are:
  >
  >   *   Is there dupe detection on BackupPC?
  >   * If so, why does my pool size not decrease after a while?
  >   * If by default it has to decrease, is there an explanation why it
  > does not on my host?
  >
  > Thanks a lot!
  >
  >
  > /KNEBB
  >
  >
  >


Re: [BackupPC-users] Storing Identical Files?

2023-02-11 Thread Greg Harris
Dumb question.  They aren’t encrypted on the drive are they?

Thanks,

Greg Harris

On Feb 11, 2023, at 8:49 PM, 
backu...@kosowsky.org wrote:

It does and in fact almost has to since pool files are stored
according to their md5sum.
So unless you have either an md5sum collision (extremely unlikely
unless creating them intentionally -- as in (number of files)*2^-128
unlikely), you shouldn't have any files in your pool with an
underscore in them.

If you have any such files, use BackupPC_zcat to compare their
contents. If they are different, then congrats, you have
(unintentionally) created a blue moon md5sum collision.

If the contents are the same, then indeed for some reason
de-duplication isn't working. The only thing that I could think of
that could possibly cause duplicates is if the compression is set
differently on the different backups -- but I'm not sure that would
even create a problem.

Also, confirm that all your backups are in a v4 pool...

Christian Völker wrote at about 09:12:22 +0100 on Saturday, February 11, 2023:
Hi,

I am using BackupPC now for years. It is really great. Meanwhile I use
v4.4.0 on Debian.

As far as I understood, it is very efficient in storing identical data.
Now I noticed something which let me doubt this. I guess there is an
explanation. So what do I have?

I have two clients which have a large share. These two (Debian) clients
sync this share on a daily basis through rsync (through a third clientC,
but this should not make a difference). On clientA there is a cron job
doing rsync to clientC and on clientB there is a cron job doing rsync
from clientC. So in the end all three hosts have identical data. rsync
command runs through ssh and use "-avH".

BackupPC itself is only backing up host clientA so far (for months
now).  So the data is stored in /var/lib/backuppc.

Now I added the clientB share to BackupPC and expected the filesystem
usage on /var/lib/backuppc to stay more or less equal after the backup
of clientB as the data is already stored from clientA. At least after a
while when doing some cleanups.

Unfortunately, the usage of the pool increased by approximately the
size of the share and has not dropped since (more than a week now).

So my questions are:

 *   Is there dupe detection on BackupPC?
 * If so, why does my pool size not decrease after a while?
 * If by default it has to decrease, is there an explanation why it
   does not on my host?

Thanks a lot!


/KNEBB





Re: [BackupPC-users] Storing Identical Files?

2023-02-11 Thread backuppc
It does and in fact almost has to since pool files are stored
according to their md5sum.
So unless you have either an md5sum collision (extremely unlikely
unless creating them intentionally -- as in (number of files)*2^-128
unlikely), you shouldn't have any files in your pool with an
underscore in them.

If you have any such files, use BackupPC_zcat to compare their
contents. If they are different, then congrats, you have
(unintentionally) created a blue moon md5sum collision.
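A hedged sketch of that check (TopDir is the Debian default; the underscore suffix for collision chains is as described above):

```shell
# List pool files whose names contain an underscore -- on a healthy v4
# pool this should print nothing.
TOPDIR=${TOPDIR:-/var/lib/backuppc}
find "$TOPDIR/pool" "$TOPDIR/cpool" -type f -name '*_*' 2>/dev/null || true

# For any suspect pair, compare decompressed contents, e.g.:
#   BackupPC_zcat "$f1" > /tmp/a && BackupPC_zcat "$f2" > /tmp/b
#   cmp -s /tmp/a /tmp/b && echo "identical: dedup problem, not a collision"
```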

If the contents are the same, then indeed for some reason
de-duplication isn't working. The only thing that I could think of
that could possibly cause duplicates is if the compression is set
differently on the different backups -- but I'm not sure that would
even create a problem.

Also, confirm that all your backups are in a v4 pool...

Christian Völker wrote at about 09:12:22 +0100 on Saturday, February 11, 2023:
 > Hi,
 > 
 > I am using BackupPC now for years. It is really great. Meanwhile I use 
 > v4.4.0 on Debian.
 > 
 > As far as I understood, it is very efficient in storing identical data. 
 > Now I noticed something which let me doubt this. I guess there is an 
 > explanation. So what do I have?
 > 
 > I have two clients which have a large share. These two (Debian) clients 
 > sync this share on a daily basis through rsync (through a third clientC, 
 > but this should not make a difference). On clientA there is a cron job 
 > doing rsync to clientC and on clientB there is a cron job doing rsync 
 > from clientC. So in the end all three hosts have identical data. rsync 
 > command runs through ssh and use "-avH".
 > 
 > BackupPC itself is only backing up host clientA so far (for months 
 > now).  So the data is stored in /var/lib/backuppc.
 > 
 > Now I added the clientB share to BackupPC and expected the filesystem 
 > usage on /var/lib/backuppc to stay more or less equal after the backup 
 > of clientB as the data is already stored from clientA. At least after a 
 > while when doing some cleanups.
 > 
 > Unfortunately, the usage of the pool increased by approximately the 
 > size of the share and has not dropped since (more than a week now).
 > 
 > So my questions are:
 > 
 >   *   Is there dupe detection on BackupPC?
 >   * If so, why does my pool size not decrease after a while?
 >   * If by default it has to decrease, is there an explanation why it
 > does not on my host?
 > 
 > Thanks a lot!
 > 
 > 
 > /KNEBB
 > 
 > 
 > 