Re: Client-side deduplication during extraction

2017-11-20 Thread James Cass
+1 for me.  This sounds like a good idea.
That's my 2 satoshis.  :-)

On Sun, Nov 19, 2017 at 8:03 PM, Colin Percival wrote:

> On 11/19/17 12:37, Robie Basak wrote:
> > On Sat, Apr 08, 2017 at 07:52:54PM -0700, Colin Percival wrote:
> >> On 04/04/17 13:06, Robie Basak wrote:
> >>> Since the redundancy is there and my client has all the details,
> >>> is there any way I can take advantage of this?
> >>
> >> Not right now.  This is something I've been thinking about implementing,
> >> but it's rather complicated (the tarsnap "read" path would need to look
> >> at data on disk to see what it can "reuse", and normally it doesn't read
> >> any files from disk).
> >
> > In case it helps others, I hacked together a client-side cache for this
> > one task. It appears to have worked. Patch below.
>
> Ah yes, I was thinking in terms of "notice that we're extracting the file
> 'foo' and there is already a file 'foo', then read that file in and split
> it into blocks in case any can be reused" -- the case you've covered here
> of keeping a cache of downloaded blocks is much simpler (but only covers
> the "multiple downloads of the same data" case, not the more general case
> of "synchronizing" a system with an archive).
>
> > This is absolutely a hack and not production ready (no concurrency, bad
> > error handling, hardcoded cache path whose directory must be created in
> > advance and permissions set manually, etc), but for a one-off task it
> > was enough for me to get my data out.
> > [snip patch]
>
> Yes, this patch definitely looks like it does what you want.  I'd consider
> including it (well, with details tidied up) but I'm not sure if anyone else
> would want to use this functionality... anyone else on the list interested?
>


Re: Client-side deduplication during extraction

2017-11-19 Thread Colin Percival
On 11/19/17 12:37, Robie Basak wrote:
> On Sat, Apr 08, 2017 at 07:52:54PM -0700, Colin Percival wrote:
>> On 04/04/17 13:06, Robie Basak wrote:
>>> Since the redundancy is there and my client has all the details,
>>> is there any way I can take advantage of this?
>>
>> Not right now.  This is something I've been thinking about implementing,
>> but it's rather complicated (the tarsnap "read" path would need to look at
>> data on disk to see what it can "reuse", and normally it doesn't read any
>> files from disk).
> 
> In case it helps others, I hacked together a client-side cache for this
> one task. It appears to have worked. Patch below.

Ah yes, I was thinking in terms of "notice that we're extracting the file
'foo' and there is already a file 'foo', then read that file in and split
it into blocks in case any can be reused" -- the case you've covered here
of keeping a cache of downloaded blocks is much simpler (but only covers
the "multiple downloads of the same data" case, not the more general case
of "synchronizing" a system with an archive).

> This is absolutely a hack and not production ready (no concurrency, bad
> error handling, hardcoded cache path whose directory must be created in
> advance and permissions set manually, etc), but for a one-off task it
> was enough for me to get my data out.
> [snip patch]

Yes, this patch definitely looks like it does what you want.  I'd consider
including it (well, with details tidied up) but I'm not sure if anyone else
would want to use this functionality... anyone else on the list interested?

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid


Re: Client-side deduplication during extraction

2017-11-19 Thread Robie Basak
On Sat, Apr 08, 2017 at 07:52:54PM -0700, Colin Percival wrote:
> On 04/04/17 13:06, Robie Basak wrote:
> > I'd like to retrieve and permanently archive (offline) a full set of
> > archives stored with one particular key using Tarsnap.
> > 
> > These are of course deduplicated at Tarsnap's end. But if I download
> > them one at a time (using something like "tarsnap --list-archives|xargs
> > tarsnap -r ..." for example), it'll cost me a ton of bandwidth - both at
> > my end which is metered, and in Tarsnap's bandwidth charges.
> > 
> > I'd like my bandwidth bill to be the "Compressed size/(unique data)"
> > figure from --print-stats, not the "Compressed size/All archives"
> > figure. Since the redundancy is there and my client has all the details,
> > is there any way I can take advantage of this?
> 
> Not right now.  This is something I've been thinking about implementing,
> but it's rather complicated (the tarsnap "read" path would need to look at
> data on disk to see what it can "reuse", and normally it doesn't read any
> files from disk).

In case it helps others, I hacked together a client-side cache for this
one task. It appears to have worked. Patch below.

This is absolutely a hack and not production ready (no concurrency, bad
error handling, hardcoded cache path whose directory must be created in
advance and permissions set manually, etc), but for a one-off task it
was enough for me to get my data out.

---
 tar/storage/storage_read.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tar/storage/storage_read.c b/tar/storage/storage_read.c
index 2c19650..62bf6b7 100644
--- a/tar/storage/storage_read.c
+++ b/tar/storage/storage_read.c
@@ -13,6 +13,7 @@
 #include "storage_internal.h"
 #include "sysendian.h"
 #include "warnp.h"
+#include "hexify.h"
 
 #include "storage.h"
 
@@ -313,6 +314,20 @@ storage_read_file(STORAGE_R * S, uint8_t * buf, size_t buflen,
}
}
 
+   int old_errno = errno;
+   char hashbuf[65];
+   hexify(name, hashbuf, 32);
+   char *cache_path;
+   if (asprintf(&cache_path, "/tmp/tarsnap-cache/%c-%s", class, hashbuf) < 0) abort();
+   FILE *fp = fopen(cache_path, "r");
+   if (fp) {
+           if (fread(buf, buflen, 1, fp) != 1) abort();
+           if (fclose(fp)) abort();
+           free(cache_path);
+           return 0;
+   } else {
+           errno = old_errno;
+   }
/* Initialize structure. */
C.buf = buf;
C.buflen = buflen;
@@ -326,6 +341,13 @@ storage_read_file(STORAGE_R * S, uint8_t * buf, size_t buflen,
goto err0;
 
 done:
+   if (!C.status) {
+           FILE *fp = fopen(cache_path, "w");
+           if (!fp) abort();
+           if (fwrite(buf, buflen, 1, fp) != 1) abort();
+           if (fclose(fp)) abort();
+   }
+   free(cache_path);
/* Return status code from server. */
return (C.status);
 
-- 
2.7.4
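
For anyone trying this: as noted above, the hardcoded cache directory has to exist before the first run. A minimal one-time setup sketch (the /tmp/tarsnap-cache path matches the one hardcoded in the patch; mode 0700 is my choice, subject to umask):

/*
 * One-time setup the patch assumes: create the hardcoded cache
 * directory with owner-only permissions before running tarsnap -r.
 */
#include <sys/stat.h>
#include <sys/types.h>

#include <stdio.h>

int
main(void)
{

	if (mkdir("/tmp/tarsnap-cache", 0700) != 0) {
		perror("mkdir /tmp/tarsnap-cache");
		return (1);
	}
	return (0);
}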


Re: Client-side deduplication during extraction

2017-04-08 Thread Colin Percival
Hi Robie,

On 04/04/17 13:06, Robie Basak wrote:
> I'd like to retrieve and permanently archive (offline) a full set of
> archives stored with one particular key using Tarsnap.
> 
> These are of course deduplicated at Tarsnap's end. But if I download
> them one at a time (using something like "tarsnap --list-archives|xargs
> tarsnap -r ..." for example), it'll cost me a ton of bandwidth - both at
> my end which is metered, and in Tarsnap's bandwidth charges.
> 
> I'd like my bandwidth bill to be the "Compressed size/(unique data)"
> figure from --print-stats, not the "Compressed size/All archives"
> figure. Since the redundancy is there and my client has all the details,
> is there any way I can take advantage of this?

Not right now.  This is something I've been thinking about implementing,
but it's rather complicated (the tarsnap "read" path would need to look at
data on disk to see what it can "reuse", and normally it doesn't read any
files from disk).

> If not, then I am planning to use an us-east-1 EC2 instance so that at
> least the Tarsnap server<->client bandwidth is in one place. I can then
> use that machine to deduplicate and then the download to my machine here
> can at least be efficient. In this case, will I still end up being
> billed by Tarsnap for the "Compressed size/All archives" figure?

If you extract all of the archives, yes.
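
(Concretely, with made-up numbers laid out in the shape of --print-stats
output: extracting everything downloads on the order of the first
compressed figure, while the deduplicated lower bound is the second.

                                       Total size  Compressed size
All archives                          40000000000      20000000000
  (unique data)                        5000000000       2500000000

So the ratio between those two rows is roughly the bandwidth you'd
save.)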

How are you planning on storing your data after you extract all of the
archives?  Something like ZFS, which provides filesystem-level deduplication?

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid