Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
> Either there is some rare edge case or condition in BackupPC causing de-duplication to fail (which I think is unlikely), or perhaps at some point your backuppc instance crashed in the middle of saving one of the above pool files in some indeterminate state that caused it to create a duplicate in the next backup. Did you by any chance change the compression level? (though I don't think this will break de-duplication) What is the creation date of each of the chain elements? My Occam's razor guess would be that all 5 were created as part of the same backup.

We did change the compression level (but before doing that we checked the documentation, and different compression levels are supposed to be fully compatible).

Yes, all those cpool files seem to be from the same wake-up/run; here are their dates:

-rw-r- 1 backuppc backuppc 396 Jan 6 01:05 cpool/e8/c4/e9c4eb5dbc31a6b94c8227921fe7f91001
-rw-r- 1 backuppc backuppc 332 Jan 6 01:05 cpool/80/ac/80ad24bf7a216782015f19937ca2e0c801
-rw-r- 1 backuppc backuppc 337 Jan 6 01:21 cpool/f4/5a/f45b2ef773ddec67f7de7a8097ec636901
-rw-r- 1 backuppc backuppc 283 Jan 6 01:05 cpool/e2/78/e2797a05cc323d392bec223f4fc3d1c501
-rw-r- 1 backuppc backuppc 292 Jan 6 01:21 cpool/52/20/52214a481a6dad3ec037c73bae5518bd01

Actually this server first started on 5-Jan, so 5-Jan and 6-Jan were among the first backups that this server took.

> As you suggest, it's really hard to know whether there is an issue when you have not yet run a full BackupPC_nightly. That being said, why do you divide it up over so many days? Is your backup set that *large* that it can't complete in fewer nights?

Yes, I will give it a few days/weeks and check again, waiting first for the first refCount pass to complete. The 2nd backup server's sizing, once in a "steady" state, will be much like that of the 1st server: about 70-80 hosts using about 10 TB (BackupPC data folder size), keeping about 1300 backups in total (fulls and incrementals).

We spread the nightly over so many days because we run the nightly during work hours and the actual backups at night. We have one wakeup in the middle of the day that allows only the nightly to run; all real backups are at night. When the nightly is running, BackupPC gets much slower, so a restore at that time can be a pain: restoring even a few GB while the nightly is running (during work hours) can be really slow due to the heavy I/O. We expect our storage, even at steady state, to be about 50-60% full, so there is no rush to remove files. Having fast restores on demand during business hours is definitely the priority here, over freeing disk space a few days earlier.

Thank you.

___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
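For anyone who wants to double-check whether two members of one of the chains above really hold identical data, something along these lines should work. This is a rough sketch only: the BackupPC_zcat location and the cpool path are assumptions that vary by installation, and it assumes the 32-character base entry sits alongside the 34-character chain member shown in the listing.

  # Decompress both members of one chain and compare the raw bytes.
  # Run as the backuppc user; adjust the assumed paths for your install.
  BPC_BIN=/usr/share/backuppc/bin      # assumed location of BackupPC_zcat
  CPOOL=/var/lib/backuppc/cpool        # assumed pool location
  sudo -u backuppc $BPC_BIN/BackupPC_zcat \
      $CPOOL/e8/c4/e9c4eb5dbc31a6b94c8227921fe7f910   > /tmp/chain_base    # base entry (assumed name, digest only)
  sudo -u backuppc $BPC_BIN/BackupPC_zcat \
      $CPOOL/e8/c4/e9c4eb5dbc31a6b94c8227921fe7f91001 > /tmp/chain_member  # chain member from the listing above
  cmp /tmp/chain_base /tmp/chain_member \
      && echo "identical content: a missed de-duplication" \
      || echo "different content: a genuine md5 collision"

Note that comparing md5sums of the decompressed output would not distinguish the two cases (both chain members have the same digest by definition), which is why the sketch uses cmp on the actual bytes.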
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
Giannis Economou wrote at about 23:18:25 +0200 on Thursday, March 16, 2023: > Thank you for the extra useful "find" command. > I see from your command: those extra collisions are not 32chars files > but 35 due to the "ext" (extension), but without _ separator. > > Based on that, I did my investigation. > > Some findings below... > > A. First BackupPC server: > This is a backup server running since Jan 2023 (replaced other bpc > servers running for years). > Status now: > - Pool is 10311.16GiB comprising 50892587 files and 16512 directories > (as of 2023-03-16 13:23), > - Pool hashing gives 5 repeated files with longest chain 1, > - Nightly cleanup removed 3100543 files of size 1206.32GiB (around > 2023-03-16 13:23), > > From those 5 cpool files, I investigated two of them (using the nice > wiki page here > https://github.com/backuppc/backuppc/wiki/How-to-find-which-backups-reference-a-particular-pool-file) > Each file of those two under investigation, was referenced in different > hosts and different individual backups. > > For both cpool files investigated, the actual file on disk was *exactly* > the same file. > So we talk about a file in different hosts, in different paths, > different dates, but having the same filename and same content. > So practically I understand that we have deduplications that did not > happen for some reason. Either there is some rare edge or condition in BackupPC causing de-duplication to fail (which I think is unlikely) or perhaps at some point your backuppc instance crashed in the middle of saving one of the above pool files in some indeterminate state that caused it to create a duplicate in the next backup. Did you by any chance change the compression level? (though I don't think this will break de-duplication) What is the creation date of each of the chain elements? My Occam's razor guess would be that all 5 were created as part of the same backup. > > B. Second BackupPC server: > This is a brand new backup server, running only for 3 days until today, > not even a week youg (also replaced other bpc servers running for years). > Therefore we only have a few backups there until today, more will add up > as days go by. > Status now: > - Pool is 2207.69GiB comprising 26161662 files and 12384 directories (as > of 2023-03-16 13:11), > - Pool hashing gives 275 repeated files with longest chain 1, > - Nightly cleanup removed 0 files of size 0.00GiB (around 2023-03-16 13:11), > > What I see here is a different situation... > > 1. Count of files found by "find" is different that the collisions > reported in status page (find counts 231, bpc status page counts 275) > 2. Many pool files among the list from "find", are not referenced at all > in poolCnt files (give a count of 0 in their "poolCnt") > 3. Even worse, many cpool/*/poolCnt files are completely missing from > the cpool/. > For example I have a colission file: > cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701 > but file: cpool/ce/poolCnt > does not even exist. > > Maybe refCount is not completed yet (because server is still new)? > > One guess is that this might happen because I have: > BackupPCNightlyPeriod: 4 > PoolSizeNightlyUpdatePeriod: 16 > PoolNightlyDigestCheckPercent: 1 > and the server is still new. > > I will wait for a few days. > (BTW, storage is local and everything seems error free from day-1 on the > system) > > If such inconsistencies persist, I guess I will have to investigate > "BackupPC_fsck" for this server. 
> > As you suggest, it's really hard to know whether there is an issue when you have not yet run a full BackupPC_nightly. That being said, why do you divide it up over so many days? Is your backup set that *large* that it can't complete in fewer nights? > > > On 16/3/2023 8:31 μ.μ., backu...@kosowsky.org wrote: > > Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023: > > > In my v4 pool are collisions still generating _0, _1, _2 etc filenames > > > in the pool/ ? > > > > According to the code in Lib.pm, it appears that unlike v3, there is > > no underscore -- it's just an (unsigned) long added to the end of the > > 16 byte digest. > > > > > > > > (as in the example from the docs mentions: > > > __TOPDIR__/pool/1/2/3/123456789abcdef0 > > > __TOPDIR__/pool/1/2/3/123456789abcdef0_0 > > > __TOPDIR__/pool/1/2/3/123456789abcdef0_1 > > > ) > > > > That is for v3 as indicated by the 3-layer pool. > > > > > > > > I am using compression (I only have cpool/ dir) and I am asking because > > > on both servers running: > > > find cpool/ -name "*_0" -print > > > find cpool/ -name "*_*" -print > > > > > > brings zero results. > > > > Try: > > > > find /var/lib/backuppc/cpool/ -type f -regextype grep ! -regex > > ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt"
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
Thank you for the extra useful "find" command. I see from your command that those extra collision files do not have 32-character names; they carry a chain "ext" (extension) appended, without a _ separator.

Based on that, I did my investigation. Some findings below...

A. First BackupPC server:
This is a backup server running since Jan 2023 (it replaced other bpc servers running for years).
Status now:
- Pool is 10311.16GiB comprising 50892587 files and 16512 directories (as of 2023-03-16 13:23),
- Pool hashing gives 5 repeated files with longest chain 1,
- Nightly cleanup removed 3100543 files of size 1206.32GiB (around 2023-03-16 13:23),

From those 5 cpool files, I investigated two of them (using the nice wiki page here https://github.com/backuppc/backuppc/wiki/How-to-find-which-backups-reference-a-particular-pool-file). Each of the two files under investigation was referenced by different hosts and different individual backups.

For both cpool files investigated, the actual file on disk was *exactly* the same file. So we are talking about a file on different hosts, in different paths, with different dates, but with the same filename and the same content. So practically I understand that we have de-duplications that did not happen for some reason.

B. Second BackupPC server:
This is a brand new backup server, running for only 3 days until today, not even a week old (it also replaced other bpc servers running for years). Therefore we only have a few backups there so far; more will add up as days go by.
Status now:
- Pool is 2207.69GiB comprising 26161662 files and 12384 directories (as of 2023-03-16 13:11),
- Pool hashing gives 275 repeated files with longest chain 1,
- Nightly cleanup removed 0 files of size 0.00GiB (around 2023-03-16 13:11),

What I see here is a different situation...
1. The count of files found by "find" is different from the collisions reported on the status page (find counts 231, the bpc status page counts 275).
2. Many pool files in the list from "find" are not referenced at all in poolCnt files (they give a count of 0 in their "poolCnt").
3. Even worse, many cpool/*/poolCnt files are completely missing from the cpool/. For example I have a collision file cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701, but the file cpool/ce/poolCnt does not even exist.

Maybe refCount has not completed yet (because the server is still new)? One guess is that this might happen because I have:
BackupPCNightlyPeriod: 4
PoolSizeNightlyUpdatePeriod: 16
PoolNightlyDigestCheckPercent: 1
and the server is still new.

I will wait for a few days. (BTW, storage is local and everything seems error free from day-1 on the system.)

If such inconsistencies persist, I guess I will have to investigate "BackupPC_fsck" for this server.

Thank you.

On 16/3/2023 8:31 μ.μ., backu...@kosowsky.org wrote: Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023: > In my v4 pool are collisions still generating _0, _1, _2 etc filenames > in the pool/ ? According to the code in Lib.pm, it appears that unlike v3, there is no underscore -- it's just an (unsigned) long added to the end of the 16 byte digest. > > (as in the example from the docs mentions: > __TOPDIR__/pool/1/2/3/123456789abcdef0 > __TOPDIR__/pool/1/2/3/123456789abcdef0_0 > __TOPDIR__/pool/1/2/3/123456789abcdef0_1 > ) That is for v3 as indicated by the 3-layer pool. > > I am using compression (I only have cpool/ dir) and I am asking because > on both servers running: > find cpool/ -name "*_0" -print > find cpool/ -name "*_*" -print > > brings zero results. 
Try: find /var/lib/backuppc/cpool/ -type f -regextype grep ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt" > > > Thank you. > > > On 16/3/2023 6:30 μ.μ., backu...@kosowsky.org wrote: > > Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023: > > > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote: > > > > > > > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023: > > > > > There is no reason to be concerned. This is normal. > > > > > > > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an > > > > md5sum collision -- as in 1.47*10^-29 > > > > > > Why are you assuming this is "randomly" happening? Any time an identical file exists in more than one place on the client filesystem, there will be a collision. This is common in lots of cases. Desktop environments frequently have duplicated files scattered around. I used BackupPC for website backups; my chain length was approximately equal to the number of WordPress sites I was hosting. > > > > You are simply not understanding how file de-duplication and pool > > chains work in v4. > > > > Identical files contribute only a single chain instance -- no matter > > how many clients you are backing up and no matter how many backups you
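Coming back to findings 1 and 3 on the second server above, a quick sketch to put numbers on them. It assumes GNU find and the usual /var/lib/backuppc layout; adjust TOPDIR to your installation.

  TOPDIR=/var/lib/backuppc
  # Finding 1: count the chain-extension pool files and compare with the
  # "repeated files" number on the status page.
  find "$TOPDIR/cpool" -type f -regextype grep \
       ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt" | wc -l
  # Finding 3: flag chain-extension files whose top-level directory has no poolCnt
  # (e.g. cpool/ce/b0/ceb1... present but cpool/ce/poolCnt absent).
  find "$TOPDIR/cpool" -type f -regextype grep \
       ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt" |
  while read -r f; do
      d=$(dirname "$(dirname "$f")")            # cpool/XX for a file cpool/XX/YY/digest
      [ -e "$d/poolCnt" ] || echo "no poolCnt for: $f"
  done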
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023: > In my v4 pool are collisions still generating _0, _1, _2 etc filenames > in the pool/ ? According to the code in Lib.pm, it appears that unlike v3, there is no underscore -- it's just an (unsigned) long added to the end of the 16 byte digest. > > (as in the example from the docs mentions: > __TOPDIR__/pool/1/2/3/123456789abcdef0 > __TOPDIR__/pool/1/2/3/123456789abcdef0_0 > __TOPDIR__/pool/1/2/3/123456789abcdef0_1 > ) That is for v3 as indicated by the 3-layer pool. > > I am using compression (I only have cpool/ dir) and I am asking because > on both servers running: > find cpool/ -name "*_0" -print > find cpool/ -name "*_*" -print > > brings zero results. Try: find /var/lib/backuppc/cpool/ -type f -regextype grep ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt" > > > Thank you. > > > On 16/3/2023 6:30 μ.μ., backu...@kosowsky.org wrote: > > Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023: > > > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote: > > > > > > > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, > > 2023: > > > > > There is no reason to be concerned. This is normal. > > > > > > > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have > > an > > > > md5sum collision -- as in 1.47*10^-29 > > > > > > Why are you assuming this is "randomly" happening? Any time an > > identical file exists in more than one place on the client filesystem, > > there will be a collision. This is common in lots of cases. Desktop > > environments frequently have duplicated files scattered around. I used > > BackupPC for website backups; my chain length was approximately equal to > > the number of WordPress sites I was hosting. > > > > You are simply not understanding how file de-duplication and pool > > chains work in v4. > > > > Identical files contribute only a single chain instance -- no matter > > how many clients you are backing up and no matter how many backups you > > save of each client. This is what de-duplication does. > > > > The fact that they appear on different clients and/or in different > > parts of the filesystem is reflected in the attrib files in the pc > > subdirectories for each client. This is where the metadata is stored. > > > > Chain lengths have to do with pool storage of the file contents > > (ignoring metadata). Lengths greater than 1 only occur if you have > > md5sum hash collisions -- i.e., two files (no matter on what client or > > where in the filesystem) with non-identical contents but the same > > md5sum hash. > > > > Such collisions are statistically exceedingly unlikely to occur on > > normal data where you haven't worked hard to create such collisions. > > > > For example, on my backup server: > >Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 > > directories (as of 2023-03-16 01:11), > >Pool hashing gives 0+0 repeated files with longest chain 0+0, > > > > I strongly suggest you read the documentation on BackupPC before > > making wildly erroneous assumptions about chains. You can also look at > > the code in BackupPC_refCountUpdate which defines how $fileCntRep and > > $fileCntRepMax are calculated. > > > > Also, if what you said were true, the OP would have multiple chains - > > presumably one for each distinct file that is "scattered around" > > > > If you are using v4.x and have pool hashing with such collisions, it > > would be great to see them. 
I suspect you are either using v3 or you > > are using v4 with a legacy v3 pool > > > > > > You would have to work hard to artificially create such collisions. > > > > > > $ echo 'hello world' > ~/file_a > > > $ cp ~/file_a ~/file_b > > > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo > > "MATCH" > > > > > > __ > > > Rob Sheldon > > > Contract software developer, devops, security, technical lead > > > > > > > > > ___ > > > BackupPC-users mailing list > > > BackupPC-users@lists.sourceforge.net > > > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > > > Wiki:https://github.com/backuppc/backuppc/wiki > > > Project: https://backuppc.github.io/backuppc/ > > > > > > ___ > > BackupPC-users mailing list > > BackupPC-users@lists.sourceforge.net > > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > > Wiki:https://github.com/backuppc/backuppc/wiki > > Project: https://backuppc.github.io/backuppc/ > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-
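The suggested find can also answer the creation-date question raised earlier in the thread, by printing each matching file's mtime and size directly. A sketch, assuming GNU find and the default /var/lib/backuppc path:

  find /var/lib/backuppc/cpool/ -type f -regextype grep \
       ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt" \
       -printf '%TY-%Tm-%Td %TH:%TM  %s bytes  %p\n' | sort

Sorting on the timestamp makes it easy to see whether all the chain members were created in the same run, as suspected later in the thread.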
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
Rob Sheldon wrote at about 09:56:51 -0700 on Thursday, March 16, 2023: > The mailing list tyrant's reply however made me realize I don't actually > need to be chasing this down just right now, so I won't be following up on > this further. > I was dormant on this list for quite a few years -- which probably > contributed to my misinformation -- and I suddenly have remembered why. :-) Whatever. - I help people but have high expectations that if I am going to invest time in helping people with answers then I in turn expect those seeking help to invest time and effort in troubleshooting the problem in advance and formulating a specific question. This is a sysadmin-level tool and I don't get paid to coddle users. - You have no standards for how questions are formulated and then proceed repeatedly to give diametrically wrong advice. Garbage in, garbage out. Similarly, - I spent significant time today reviewing the code and verifying my understanding of BackupPC before answering - You seemingly spent zero time researching the problem and just cited your vague understanding of how BackupPC works based on information that you admit is from "quite a few years" ago Who is a more helpful contributor to the list? Please put on some big boy pants... > > __ > Rob Sheldon > Contract software developer, devops, security, technical lead > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > Wiki:https://github.com/backuppc/backuppc/wiki > Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
In my v4 pool are collisions still generating _0, _1, _2 etc filenames in the pool/ ? (as in the example from the docs mentions: __TOPDIR__/pool/1/2/3/123456789abcdef0 __TOPDIR__/pool/1/2/3/123456789abcdef0_0 __TOPDIR__/pool/1/2/3/123456789abcdef0_1 ) I am using compression (I only have cpool/ dir) and I am asking because on both servers running: find cpool/ -name "*_0" -print find cpool/ -name "*_*" -print brings zero results. Thank you. On 16/3/2023 6:30 μ.μ., backu...@kosowsky.org wrote: Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023: > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote: > > > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023: > > > There is no reason to be concerned. This is normal. > > > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an > > md5sum collision -- as in 1.47*10^-29 > > Why are you assuming this is "randomly" happening? Any time an identical file exists in more than one place on the client filesystem, there will be a collision. This is common in lots of cases. Desktop environments frequently have duplicated files scattered around. I used BackupPC for website backups; my chain length was approximately equal to the number of WordPress sites I was hosting. You are simply not understanding how file de-duplication and pool chains work in v4. Identical files contribute only a single chain instance -- no matter how many clients you are backing up and no matter how many backups you save of each client. This is what de-duplication does. The fact that they appear on different clients and/or in different parts of the filesystem is reflected in the attrib files in the pc subdirectories for each client. This is where the metadata is stored. Chain lengths have to do with pool storage of the file contents (ignoring metadata). Lengths greater than 1 only occur if you have md5sum hash collisions -- i.e., two files (no matter on what client or where in the filesystem) with non-identical contents but the same md5sum hash. Such collisions are statistically exceedingly unlikely to occur on normal data where you haven't worked hard to create such collisions. For example, on my backup server: Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 directories (as of 2023-03-16 01:11), Pool hashing gives 0+0 repeated files with longest chain 0+0, I strongly suggest you read the documentation on BackupPC before making wildly erroneous assumptions about chains. You can also look at the code in BackupPC_refCountUpdate which defines how $fileCntRep and $fileCntRepMax are calculated. Also, if what you said were true, the OP would have multiple chains - presumably one for each distinct file that is "scattered around" If you are using v4.x and have pool hashing with such collisions, it would be great to see them. I suspect you are either using v3 or you are using v4 with a legacy v3 pool > > You would have to work hard to artificially create such collisions. 
> > $ echo 'hello world' > ~/file_a > $ cp ~/file_a ~/file_b > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH" > > __ > Rob Sheldon > Contract software developer, devops, security, technical lead > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > Wiki:https://github.com/backuppc/backuppc/wiki > Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
On Thu, Mar 16, 2023, at 8:59 AM, Les Mikesell wrote: > On Thu, Mar 16, 2023 at 10:53 AM Rob Sheldon wrote: > > > > Why are you assuming this is "randomly" happening? Any time an identical > > file exists in more than one place on the client filesystem, there will be > > a collision. This is common in lots of cases. Desktop environments > > frequently have duplicated files scattered around. I used BackupPC for > > website backups; my chain length was approximately equal to the number of > > WordPress sites I was hosting. > > > Identical files are not collisions to backuppc - they are de-duplicated. Hey Les, Thanks for the heads-up. Your message prompted me to pull the current codebase and do some grepping around in it to better understand this. I believed that the MD5 hash was the mechanism used for deduplication, but that could have been wrong at any point, V3 or otherwise. The mailing list tyrant's reply however made me realize I don't actually need to be chasing this down just right now, so I won't be following up on this further. I was dormant on this list for quite a few years -- which probably contributed to my misinformation -- and I suddenly have remembered why. :-) __ Rob Sheldon Contract software developer, devops, security, technical lead ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023: > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote: > > > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023: > > > There is no reason to be concerned. This is normal. > > > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an > > md5sum collision -- as in 1.47*10^-29 > > Why are you assuming this is "randomly" happening? Any time an identical > file exists in more than one place on the client filesystem, there will be a > collision. This is common in lots of cases. Desktop environments frequently > have duplicated files scattered around. I used BackupPC for website backups; > my chain length was approximately equal to the number of WordPress sites I > was hosting. You are simply not understanding how file de-duplication and pool chains work in v4. Identical files contribute only a single chain instance -- no matter how many clients you are backing up and no matter how many backups you save of each client. This is what de-duplication does. The fact that they appear on different clients and/or in different parts of the filesystem is reflected in the attrib files in the pc subdirectories for each client. This is where the metadata is stored. Chain lengths have to do with pool storage of the file contents (ignoring metadata). Lengths greater than 1 only occur if you have md5sum hash collisions -- i.e., two files (no matter on what client or where in the filesystem) with non-identical contents but the same md5sum hash. Such collisions are statistically exceedingly unlikely to occur on normal data where you haven't worked hard to create such collisions. For example, on my backup server: Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 directories (as of 2023-03-16 01:11), Pool hashing gives 0+0 repeated files with longest chain 0+0, I strongly suggest you read the documentation on BackupPC before making wildly erroneous assumptions about chains. You can also look at the code in BackupPC_refCountUpdate which defines how $fileCntRep and $fileCntRepMax are calculated. Also, if what you said were true, the OP would have multiple chains - presumably one for each distinct file that is "scattered around" If you are using v4.x and have pool hashing with such collisions, it would be great to see them. I suspect you are either using v3 or you are using v4 with a legacy v3 pool > > You would have to work hard to artificially create such collisions. > > $ echo 'hello world' > ~/file_a > $ cp ~/file_a ~/file_b > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH" > > __ > Rob Sheldon > Contract software developer, devops, security, technical lead > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > Wiki:https://github.com/backuppc/backuppc/wiki > Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
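A tiny illustration of the distinction being drawn here -- plain shell, not a BackupPC command:

  printf 'same content\n'  > /tmp/a
  printf 'same content\n'  > /tmp/b
  printf 'other content\n' > /tmp/c
  md5sum /tmp/a /tmp/b /tmp/c
  # /tmp/a and /tmp/b share one digest, so BackupPC stores that content once
  # (de-duplication, no chain). A chain entry would require a file whose content
  # *differs* from /tmp/a yet hashes to the same digest -- an md5 collision,
  # which essentially never happens by chance.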
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
On Thu, Mar 16, 2023 at 10:53 AM Rob Sheldon wrote: > > Why are you assuming this is "randomly" happening? Any time an identical file > exists in more than one place on the client filesystem, there will be a > collision. This is common in lots of cases. Desktop environments frequently > have duplicated files scattered around. I used BackupPC for website backups; > my chain length was approximately equal to the number of WordPress sites I > was hosting. > Identical files are not collisions to backuppc - they are de-duplicated. -- Les Mikesell lesmikes...@gmail.com ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote: > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023: > > There is no reason to be concerned. This is normal. > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an > md5sum collision -- as in 1.47*10^-29 Why are you assuming this is "randomly" happening? Any time an identical file exists in more than one place on the client filesystem, there will be a collision. This is common in lots of cases. Desktop environments frequently have duplicated files scattered around. I used BackupPC for website backups; my chain length was approximately equal to the number of WordPress sites I was hosting. > You would have to work hard to artificially create such collisions. $ echo 'hello world' > ~/file_a $ cp ~/file_a ~/file_b $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH" __ Rob Sheldon Contract software developer, devops, security, technical lead ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023: > There is no reason to be concerned. This is normal. Not really! The OP claimed he is using v4 which uses full file md5sum checksums as the file name hash. It *should* be extremely, once-in-a-blue-moon, rare to randomly have an md5sum collision -- as in 1.47*10^-29 Randomly having 165 hash collisions is several orders of magnitude more unlikely. You would have to work hard to artificially create such collisions. > Searching for "backuppc pool hashing chain" finds > https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. > The documentation could perhaps be more specific about this, but BackupPC > "chains" hash collisions, so no data is lost or damaged. This is a pretty > standard compsci data structure. If you search the BackupPC repository for > the message you're seeing, you find > https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311, > which gives you a couple of variable names that you can use to search the > codebase and satisfy your curiosity. > This OLD (ca 2008) reference is totally IRRELEVANT for v4.x. Hash collisions on v3.x were EXTREMELY common and it was not unusual to have even long chains of collisions since the md5sum was computed using only the first and last 128KB (I think) of the file plus the file length. So any long file whose middle section changed by even just one byte would create a collision chain. > "165 repeated files with longest chain 1" just means that there are 165 > different files that had a digest (hash) that matched another file. This can > happen for a number of different reasons. A common one is that there are > identical files on the backuppc client system that have different paths. > > I've been running BackupPC for a *long* time and have always had chained > hashes in my server status. It's an informational message, not an error. > Are you running v3 or v4? You are right though it's technically not an error even in v4 in that md5sum collisions do of course exist -- just that they are extremely rare. It would REALLY REALLY REALLY help if the OP would do some WORK to help HIMSELF diagnose the problem. Since it seems like only a single md5sum chain is involved, it would seem blindingly obvious that the first step in troubleshooting would be to determine what is the nature of these files with the same md5sum hash. Specifically, - Do they indeed all have identical md5sum? - Are they indeed distinct files (despite having the same md5sum)? - How if at all are these files special? > On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote: > > The message appears in "Server Information" section, inside server > > status, in the web interface of BackupPC (page "BackupPC Server Status"). > > > > Both servers are running BackupPC version 4.4.0. > > 1st server says: "Pool hashing gives 165 repeated files with longest > > chain 1" > > 2nd server says: "Pool hashing gives 5 repeated files with longest chain > > 1" > > > > I could not find more info/details about this message in the documentation. > > Only a comment in github is mentioning this message, and to my > > understanding it is related to hash collisions in the pool. > > > > Since hash collisions might be worrying (in theory), my question is if > > this is alarming and in that case if there is something that we can do > > about it (for example actions to zero out any collisions if needed). > > > > > > Thank you very much. 
> > __ > Rob Sheldon > Contract software developer, devops, security, technical lead > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > Wiki:https://github.com/backuppc/backuppc/wiki > Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
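As a side note on the 1.47*10^-29 figure quoted above: it is consistent with a standard birthday-bound estimate if one assumes roughly 10^5 files (the file count here is my assumption, chosen only to show where such a number can come from):

  p \approx \frac{n(n-1)}{2^{129}}, \qquad
  n = 10^{5} \;\Rightarrow\; p \approx \frac{10^{10}}{6.8 \times 10^{38}} \approx 1.5 \times 10^{-29}

With the roughly 5*10^7 files reported on the first server, the same bound is on the order of 10^-24 -- larger, but still vanishingly small, which supports the argument that chance md5 collisions are not a realistic explanation.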
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
All is clear now. Thank you so much for your detailed reply. On 16/3/2023 8:54 π.μ., Rob Sheldon wrote: There is no reason to be concerned. This is normal. Searching for "backuppc pool hashing chain" finds https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. The documentation could perhaps be more specific about this, but BackupPC "chains" hash collisions, so no data is lost or damaged. This is a pretty standard compsci data structure. If you search the BackupPC repository for the message you're seeing, you find https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311, which gives you a couple of variable names that you can use to search the codebase and satisfy your curiosity. "165 repeated files with longest chain 1" just means that there are 165 different files that had a digest (hash) that matched another file. This can happen for a number of different reasons. A common one is that there are identical files on the backuppc client system that have different paths. I've been running BackupPC for a *long* time and have always had chained hashes in my server status. It's an informational message, not an error. On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote: The message appears in "Server Information" section, inside server status, in the web interface of BackupPC (page "BackupPC Server Status"). Both servers are running BackupPC version 4.4.0. 1st server says: "Pool hashing gives 165 repeated files with longest chain 1" 2nd server says: "Pool hashing gives 5 repeated files with longest chain 1" I could not find more info/details about this message in the documentation. Only a comment in github is mentioning this message, and to my understanding it is related to hash collisions in the pool. Since hash collisions might be worrying (in theory), my question is if this is alarming and in that case if there is something that we can do about it (for example actions to zero out any collisions if needed). Thank you very much. __ Rob Sheldon Contract software developer, devops, security, technical lead ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
There is no reason to be concerned. This is normal. Searching for "backuppc pool hashing chain" finds https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. The documentation could perhaps be more specific about this, but BackupPC "chains" hash collisions, so no data is lost or damaged. This is a pretty standard compsci data structure. If you search the BackupPC repository for the message you're seeing, you find https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311, which gives you a couple of variable names that you can use to search the codebase and satisfy your curiosity. "165 repeated files with longest chain 1" just means that there are 165 different files that had a digest (hash) that matched another file. This can happen for a number of different reasons. A common one is that there are identical files on the backuppc client system that have different paths. I've been running BackupPC for a *long* time and have always had chained hashes in my server status. It's an informational message, not an error. On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote: > The message appears in "Server Information" section, inside server > status, in the web interface of BackupPC (page "BackupPC Server Status"). > > Both servers are running BackupPC version 4.4.0. > 1st server says: "Pool hashing gives 165 repeated files with longest > chain 1" > 2nd server says: "Pool hashing gives 5 repeated files with longest chain 1" > > I could not find more info/details about this message in the documentation. > Only a comment in github is mentioning this message, and to my > understanding it is related to hash collisions in the pool. > > Since hash collisions might be worrying (in theory), my question is if > this is alarming and in that case if there is something that we can do > about it (for example actions to zero out any collisions if needed). > > > Thank you very much. __ Rob Sheldon Contract software developer, devops, security, technical lead ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
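For anyone who wants to follow the suggestion of tracing those counters through the source, something like this works against a checkout of the project repository linked above (the checkout path is an assumption; the variable names $fileCntRep and $fileCntRepMax are the ones cited elsewhere in the thread):

  # Locate where the "repeated files" / "longest chain" counters are computed.
  git clone https://github.com/backuppc/backuppc.git /tmp/backuppc-src
  grep -rn 'fileCntRep' /tmp/backuppc-src/lib /tmp/backuppc-src/bin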
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
That's the sort of attitude that puts genuine people off asking for help on mailing lists. Rather than posting a cutting reply, why not simply ask for a little more information if that's what's needed? Please try to remember that English is not everyone's first language and at the end of the day we are all human and have feelings... On 14/03/2023 19:17, backu...@kosowsky.org wrote: Can you please invest some minimal effort in asking a specific question and supplying relevant data? Not just a random line pulled presumably from a log file or web page and saying "I guess it's a problem". Giannis Economou wrote at about 15:59:27 +0200 on Tuesday, March 14, 2023: > Hello, > > running 2 backuppc servers and: > > * on 1st one we get: > Pool hashing gives 165 repeated files with longest chain 1 > > * on 2nd one we get: > Pool hashing gives 5 repeated files with longest chain 1 > > > I guess this might be a problem? If yes, what can we do about this? > > > > Thank you. > > > > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > Wiki:https://github.com/backuppc/backuppc/wiki > Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
The message appears in "Server Information" section, inside server status, in the web interface of BackupPC (page "BackupPC Server Status"). Both servers are running BackupPC version 4.4.0. 1st server says: "Pool hashing gives 165 repeated files with longest chain 1" 2nd server says: "Pool hashing gives 5 repeated files with longest chain 1" I could not find more info/details about this message in the documentation. Only a comment in github is mentioning this message, and to my understanding it is related to hash collisions in the pool. Since hash collisions might be worrying (in theory), my question is if this is alarming and in that case if there is something that we can do about it (for example actions to zero out any collisions if needed). Thank you very much. On 14/3/2023 9:17 μ.μ., backu...@kosowsky.org wrote: Can you please invest some minimal effort in asking a specific question and supplying relevant data? Not just a random line pulled presumably from a log file or web page and saying "I guess it's a problem". Giannis Economou wrote at about 15:59:27 +0200 on Tuesday, March 14, 2023: > Hello, > > running 2 backuppc servers and: > > * on 1st one we get: > Pool hashing gives 165 repeated files with longest chain 1 > > * on 2nd one we get: > Pool hashing gives 5 repeated files with longest chain 1 > > > I guess this might be a problem? If yes, what can we do about this? > > > > Thank you. > > > > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > Wiki:https://github.com/backuppc/backuppc/wiki > Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/
Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1
Can you please invest some minimal effort in asking a specific question and supplying relevant data? Not just a random line pulled presumably from a log file or web page and saying "I guess it's a problem". Giannis Economou wrote at about 15:59:27 +0200 on Tuesday, March 14, 2023: > Hello, > > running 2 backuppc servers and: > > * on 1st one we get: > Pool hashing gives 165 repeated files with longest chain 1 > > * on 2nd one we get: > Pool hashing gives 5 repeated files with longest chain 1 > > > I guess this might be a problem? If yes, what can we do about this? > > > > Thank you. > > > > > > ___ > BackupPC-users mailing list > BackupPC-users@lists.sourceforge.net > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users > Wiki:https://github.com/backuppc/backuppc/wiki > Project: https://backuppc.github.io/backuppc/ ___ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List:https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki:https://github.com/backuppc/backuppc/wiki Project: https://backuppc.github.io/backuppc/