[Rpm-ecosystem] Proposed zchunk file format - V4

2018-04-16 Thread Jonathan Dieter
Here's version four, which swaps fixed-length integers for
variable-length compressed integers.  This allows us to skip compression
of the index (since the non-integer data is all incompressible
checksums).
I've also added the uncompressed size of each chunk to the index to
make it easier to figure out how much space to allocate for the
uncompressed chunk.

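+-+-+-+-+-+====================+=================+=======================+
|   ID    | Checksum type (ci) | Header checksum | Compression type (ci) |
+-+-+-+-+-+====================+=================+=======================+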

+=================+=======+=================+
| Index size (ci) | Index | Compressed Dict |
+=================+=======+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+

(ci)
 Compressed (unsigned) integer - A variable-length little-endian
 integer where the first seven bits of the number are stored in the
 first byte, followed by the next seven bits in the next byte, and so
 on.  The top bit of all bytes except the final byte must be zero, and
 the top bit of the final byte must be one, indicating the end of the
 number.
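
A minimal Python sketch of the (ci) encoding as described above.  The
function names are made up for illustration, and error handling is kept
to the bare minimum.

```python
def encode_ci(value: int) -> bytes:
    """Encode a non-negative integer as a compressed integer (ci)."""
    if value < 0:
        raise ValueError("ci values are unsigned")
    out = bytearray()
    while True:
        low = value & 0x7F          # next seven bits, least significant first
        value >>= 7
        if value == 0:
            out.append(low | 0x80)  # final byte: top bit set
            return bytes(out)
        out.append(low)             # non-final byte: top bit clear


def decode_ci(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one ci from buf starting at offset; return (value, next_offset)."""
    value = 0
    shift = 0
    while offset < len(buf):
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:             # top bit set marks the last byte
            return value, offset
    raise ValueError("truncated compressed integer")
```

For example, 300 is encoded as the two bytes 0x2c 0x82, and any value
below 128 fits in a single byte with the top bit set.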

ID
 '\0ZCK1', identifies file as zchunk version 1 file

Checksum type
 This is an integer containing the type of checksum
 used to generate the header checksum and the total data checksum, but
 *not* the chunk checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Header checksum
 This is the checksum of everything from the beginning of the file
 until the end of the index, calculated with the header checksum field
 itself set to all \0's.
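
A rough sketch of how such a self-excluding checksum could be computed
and verified.  It assumes the caller already knows where the checksum
field starts (field_offset is a hypothetical parameter, as are the
function names); the spec above does not define any API.

```python
import hashlib


def header_checksum(header: bytes, field_offset: int, algo: str = "sha256") -> bytes:
    """Hash the header with the checksum field treated as all zero bytes."""
    size = hashlib.new(algo).digest_size
    if field_offset < 0 or field_offset + size > len(header):
        raise ValueError("checksum field does not fit inside the header")
    # Replace the stored checksum with zeros before hashing.
    zeroed = header[:field_offset] + bytes(size) + header[field_offset + size:]
    return hashlib.new(algo, zeroed).digest()


def verify_header(header: bytes, field_offset: int, algo: str = "sha256") -> bool:
    """Compare the stored checksum against a freshly computed one."""
    size = hashlib.new(algo).digest_size
    stored = header[field_offset:field_offset + size]
    return stored == header_checksum(header, field_offset, algo)
```

Here header is everything from the start of the file to the end of the
index, so the writer fills the field with zeros, hashes, and then
writes the digest into the field.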

Compression type
 This is an integer containing the type of compression used to
 compress dict and chunks.

 Current values:
   0 - Uncompressed
   2 - zstd

Index size
 This is an integer containing the size of the index.

Index
 This is the index, which is described in the next section.

Compressed Dict (optional)
 This is a custom dictionary used when compressing each chunk.
 Because each chunk is compressed completely separately from the
 others, the custom dictionary gives us much better overall
 compression.  The custom dictionary is compressed without a custom
 dictionary (for obvious reasons).

Chunk
 This is a chunk of data, compressed with the custom dictionary
 provided above.
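
To illustrate why a shared dictionary helps when every chunk is
compressed on its own, here is a small sketch.  It uses Python's zlib
with a preset dictionary as a stand-in for zstd (an assumption made
only so the example runs with the standard library); the sample
dictionary, the sample chunk, and the function names are invented.

```python
import zlib


def compress_chunk(chunk: bytes, dictionary: bytes = b"") -> bytes:
    """Compress one chunk independently, optionally priming with a dictionary."""
    comp = zlib.compressobj(zdict=dictionary) if dictionary else zlib.compressobj()
    return comp.compress(chunk) + comp.flush()


def decompress_chunk(blob: bytes, dictionary: bytes = b"") -> bytes:
    """Reverse compress_chunk; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=dictionary) if dictionary else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


# Invented sample data: boilerplate that every chunk is likely to repeat.
DICTIONARY = (b'<package type="rpm"><name></name><arch>x86_64</arch>'
              b'<version epoch="0" ver="" rel=""/></package>')
CHUNK = (b'<package type="rpm"><name>dbus</name><arch>x86_64</arch>'
         b'<version epoch="0" ver="1.12.8" rel="1"/></package>')

with_dict = compress_chunk(CHUNK, DICTIONARY)
without_dict = compress_chunk(CHUNK)
print(len(CHUNK), len(without_dict), len(with_dict))
```

A small chunk has little internal repetition to exploit, so most of the
saving comes from matches against the dictionary.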


The index:

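+==========================+==================+===============+
| Chunk checksum type (ci) | Chunk count (ci) | Data checksum |
+==========================+==================+===============+

+===============+==================+===============================+
| Dict checksum | Dict length (ci) | Uncompressed dict length (ci) |
+===============+==================+===============================+

+================+===================+==========================+
| Chunk checksum | Chunk length (ci) | Uncompressed length (ci) | ...
+================+===================+==========================+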

Chunk checksum type
 This is an integer containing the type of checksum used to generate
 the chunk checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Chunk count
 This is a count of the number of chunks in the zchunk file.

Data checksum
 This is the checksum of everything after the index, including the
 compressed dict and all the compressed chunks.  This checksum is
 generated using the overall checksum type, *not* the chunk checksum
 type.

Dict checksum
 This is the checksum of the compressed dict, used to detect whether
 two dicts are identical.  If there is no dict, the checksum must be
 all zeros.

Dict length
 This is an integer containing the length of the dict.  If there is no
 dict, this must be a zero.

Uncompressed dict length
 This is an integer containing the length of the dict after it has
 been decompressed.  If there is no dict, this must be a zero.

Chunk checksum
 This is the checksum of the compressed chunk, used to detect whether
 any two chunks are identical.

Chunk length
 This is an integer containing the length of the chunk.

Uncompressed length
 This is an integer containing the length of the chunk after it has
 been decompressed.
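
A sketch of how a reader might walk the index fields described above.
Two things are assumptions rather than something the text states: that
the dict checksum uses the chunk checksum type, and that the chunk
count does not include the dict entry.  The function names and the
returned structure are hypothetical.

```python
import hashlib

# Checksum type values as listed in the spec text.
CHECKSUM_ALGOS = {0: "sha1", 1: "sha256"}


def decode_ci(buf: bytes, offset: int) -> tuple[int, int]:
    """Decode one compressed integer; return (value, next_offset)."""
    value = shift = 0
    while offset < len(buf):
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:  # top bit set marks the final byte
            return value, offset
    raise ValueError("truncated compressed integer")


def parse_index(index: bytes, overall_type: int) -> dict:
    """Parse a V4-style index into its fields (illustrative only)."""
    pos = 0
    chunk_type, pos = decode_ci(index, pos)
    chunk_count, pos = decode_ci(index, pos)

    # The data checksum uses the overall checksum type, not the chunk type.
    data_len = hashlib.new(CHECKSUM_ALGOS[overall_type]).digest_size
    data_checksum = index[pos:pos + data_len]
    pos += data_len

    # ASSUMPTION: dict and chunk checksums both use the chunk checksum type.
    digest_len = hashlib.new(CHECKSUM_ALGOS[chunk_type]).digest_size

    def read_entry(pos: int):
        checksum = index[pos:pos + digest_len]
        if len(checksum) != digest_len:
            raise ValueError("truncated index entry")
        pos += digest_len
        length, pos = decode_ci(index, pos)
        uncompressed, pos = decode_ci(index, pos)
        return (checksum, length, uncompressed), pos

    dict_entry, pos = read_entry(pos)
    # ASSUMPTION: the chunk count does not include the dict entry.
    chunks = []
    for _ in range(chunk_count):
        entry, pos = read_entry(pos)
        chunks.append(entry)
    return {"chunk_type": chunk_type, "data_checksum": data_checksum,
            "dict": dict_entry, "chunks": chunks}
```

Each entry comes back as (checksum, compressed length, uncompressed
length), which is enough to locate a chunk and size the output buffer.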

The index is designed to be able to be extracted from the file on the
server and downloaded separately, to facilitate downloading only the
parts of the file that are needed, but must then be re-embedded when
assembling the file so the user only needs to keep one file.
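
As a sketch of what "downloading only the parts of the file that are
needed" could look like once the index is in hand: compare the remote
chunk checksums with the ones already held locally and collect the byte
spans that are missing.  The function name and the input shape are
assumptions; offsets are relative to the first byte after the index.

```python
def missing_ranges(remote_chunks, local_checksums):
    """Return (start, end_exclusive) spans for chunks not available locally.

    remote_chunks: list of (checksum, compressed_length) in file order.
    local_checksums: set of checksums for chunks we already have.
    """
    ranges = []
    offset = 0
    for checksum, length in remote_chunks:
        if length and checksum not in local_checksums:
            if ranges and ranges[-1][1] == offset:
                # Adjacent to the previous missing span: extend it.
                ranges[-1] = (ranges[-1][0], offset + length)
            else:
                ranges.append((offset, offset + length))
        offset += length
    return ranges
```

With four remote chunks of 10, 20, 5 and 7 bytes, of which only the
second is held locally, this yields the spans (0, 10) and (30, 42).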
___
Rpm-ecosystem mailing list
Rpm-ecosystem@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-ecosystem


Re: [Rpm-ecosystem] Proposed zchunk file format - V3

2018-03-12 Thread Jonathan Dieter
On Mon, 2018-03-12 at 15:42 +0100, Michal Domonkos wrote:
> Hi Jonathan,
> 
> To me, the zchunk idea looks good.
> 
> Incidentally, for the last couple of months, I have been trying to
> rethink the way we cache metadata on the clients, as part of the
> libdnf (re)design efforts. My goal was to de-duplicate the data
> between similar repos in the cache as well as decrease the size that
> needs to be downloaded every time (inevitably leading to this topic).
> 
> I came up with two different strategies:
> 
> 1) Chunking

> That made me think that either using git (libgit2) directly or doing a
> small, lightweight implementation of the core concepts might be the
> way to go. I even played with the latter a bit (I didn't get to
> breaking down primary.xml, though):
> https://github.com/dmnks/rhs-proto
> 
> In the context of this thread, this is basically what you do with
> zchunk (just much better) :)

Yes, I guess. ;)  The git concept sounds interesting, but it will be a
lot of work and will require some huge changes in how we deal with
metadata.  For the moment, I think I'll focus on more evolutionary
changes.

> 2) Deltas
> 
> Later, during this year's devconf, I had a few "brainstorming"
> sessions with Florian Festi who pointed out that the differences in
> metadata updates might often be on the sub-package level (e.g. NEVRA
> in the version tag) so chunking on the package boundaries might not
> give us the best results possible. Instead perhaps, we could generate
> deltas on the binary level.

My current tests were chunked on srpm boundaries, not package
boundaries (not sure if we're using the same terminology here).  The
problem with chunking on the package boundary in the xml is that it
creates many more chunks than grouping by srpm, and the smaller chunks
hurt compression (even with a dictionary) and the larger number
increases the size of the index.  In Fedora, I think it's impossible to
only change the metadata for one sub-package belonging to an srpm, and,
even if it *is* possible, it's very rare.

> An alternative would be to pre-generate (compressed) binary deltas for
> the last N versions and let clients download an index file that will
> tell them what deltas they're missing and should download. This is
> basically what debian's pdiff format does. One downside to this
> approach is that it doesn't give us the de-duplication on clients
> consuming multiple repos with similar content (probably quite common
> with RHEL subscriptions at least).

I'm not a huge fan of pre-generated binary deltas.  They might give
smaller deltas than a chunking solution, but at the cost of generating
and maintaining the deltas.  It's been enough of a pain for rpms that
(at least on an individual basis) change infrequently.  For metadata
that changes every day, I think it would be too much.

The beauty of the zchunk format (or zsync, or any other chunked format)
is that we don't have to download different files based on what we
have, but rather, we download either fewer or more parts of the same
file based on what we have.  From the server side, we don't have to
worry about the deltas, and the clients just get what they need.

Having said that, the current zchunk proposal doesn't really address
deduplication from multiple repos with similar content.  I suppose some
way of extending zchunk to allow you to specify multiple local sources
would be a way of fixing that, but I'm still working on getting it to
build from *one* local source (getting very close, but not quite
there).

> Then I stumbled upon casync which combines the benefits of both
> strategies; it chunks based on the shape of the data (arguably giving
> better results than chunking on the package boundaries), and it
> doesn't require a smart protocol. However, it involves a lot of HTTP
> requests as you already mentioned.

Just to be clear, zchunk is casync with smart boundaries (or, better
put, application-specified boundaries) and a single file backend so
requests can be sent as http range requests rather than hundreds of
individual http requests.
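
A small sketch of the range-request side: turning a list of missing
byte spans into a single HTTP Range header value.  The function name
and the base_offset parameter (where the chunk data starts inside the
file) are assumptions, and servers are not obliged to honour
multi-range requests, so a real client would need a fallback.

```python
def range_header(ranges, base_offset: int = 0) -> str:
    """Format (start, end_exclusive) spans as an HTTP Range header value."""
    parts = []
    for start, end in ranges:
        if end <= start:
            raise ValueError("empty or inverted range")
        # HTTP byte ranges are inclusive at both ends.
        parts.append(f"{base_offset + start}-{base_offset + end - 1}")
    return "bytes=" + ",".join(parts)


print(range_header([(0, 10), (30, 42)], base_offset=100))
```

This prints "bytes=100-109,130-141", which asks for two spans of the
same file in one request.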
 
> Despite that, I'm still leaning towards chunking as being the better
> solution of the two. The question is, how much granularity we want.
> You made a good point: the repodata format is fixed (be it xml or
> solv), so we might as well take advantage of it to detect boundaries
> for chunking, rather than using a rolling hash (but I have no data to
> back it up). I'm not sure how to approach the many-GET-requests (or
> the lack of range support) problem, though.

If you've been following the conversation between myself and Michael
Schroeder on rpm-ecosystem (starting with
http://lists.rpm.org/pipermail/rpm-ecosystem/2018-March/000553.html),
I did some comparisons between zsync (which uses a rolling hash) and
zchunk.  There's also the fact
that zsync uses gzip compression, while zchunk uses zstd by default,
but quite a bit of the difference is due to the larger window size a
rolling hash provides.

With zchunk, if

Re: [Rpm-ecosystem] Proposed zchunk file format

2018-03-03 Thread Jonathan Dieter
On Fri, 2018-03-02 at 12:44 +, Michael Schroeder wrote:
> On Fri, Mar 02, 2018 at 02:33:09PM +0200, Jonathan Dieter wrote:
> > No, I didn't expect it to have much effect.  Since openSUSE's xml
> > files are (presumably) ordered so new packages come last, do you have
> > any old primary.xml files lying around that I can test?
> > 
> > If not, I'll grab them from the next few updates.
> 
> They are ordered for the update channels of Leap, but Tumbleweed
> is a rolling release distro and thus not ordered. (This also means
> that delta repo downloads currently don't work that well for
> Tumbleweed, so I'm eager to find something better).
> 
> How about using the Fedora metadata but reorder the entries with
> the buildtime as sort key?

That works.  Here are the numbers.  They are closer, but only by a few
percentage points, which surprised me.  Zsync does beat zchunk in a few
cases, but they're all when the delta is very small (< 50k).  Any time
the delta is larger than 100k, zchunk wins by a minimum of 20%.

Interestingly, zchunk's numbers also generally got better when the
metadata was sorted by build date, but I think that's because my
current "chunk by srpm" algorithm only puts two packages with the same
srpm in the same chunk if they're next to each other.  When sorted by
build date, they're guaranteed to be next to each other, while if
sorted by name, some packages might be far away from each other (i.e.
dbus and python3-dbus won't be next to each other if sorted by name).  
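
The adjacency effect described above can be shown with a few lines of
Python.  This is only a sketch of "group neighbouring packages that
share an srpm", not the actual chunking code, and the package and srpm
names are invented.

```python
from itertools import groupby


def chunk_by_srpm(packages):
    """Group *adjacent* packages that share an srpm into one chunk.

    packages: list of (package_name, srpm_name) in file order.
    """
    return [[name for name, _ in group]
            for _, group in groupby(packages, key=lambda pkg: pkg[1])]


# Invented example: foo and python3-foo are built from the same srpm.
by_name = [("foo", "foo"), ("gcc", "gcc"), ("python3-foo", "foo")]
by_build_time = [("foo", "foo"), ("python3-foo", "foo"), ("gcc", "gcc")]

print(chunk_by_srpm(by_name))        # three chunks
print(chunk_by_srpm(by_build_time))  # two chunks
```

Sorted by name, the two packages from the same srpm are separated and
end up in different chunks; sorted by build time they sit together and
share one.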

zsync - sorted by build date
1->2 - 1457710
2->3 - 1051405
3->4 - 489221
4->5 - 33851
5->6 - 41331
6->7 - 1607445
7->8 - 26625
1->4 - 2206614
3->6 - 544855
6->8 - 1612897

zchunk - sorted by build date - chunked by srpm
1->2 - 1108238 - 24% smaller
2->3 - 768845 - 27% smaller
3->4 - 340866 - 30% smaller
4->5 - 36576 - 8% larger
5->6 - 41412 - < 1% larger
6->7 - 1208562 - 25% smaller
7->8 - 12083 - 55% smaller
1->4 - 1714803 - 22% smaller
3->6 - 370844 - 32% smaller
6->8 - 1214039 - 25% smaller


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-03-02 Thread Michael Schroeder
On Fri, Mar 02, 2018 at 02:33:09PM +0200, Jonathan Dieter wrote:
> No, I didn't expect it to have much effect.  Since openSUSE's xml files
> are (presumably) ordered so new packages come last, do you have any old
> primary.xml files lying around that I can test?
> 
> If not, I'll grab them from the next few updates.

They are ordered for the update channels of Leap, but Tumbleweed
is a rolling release distro and thus not ordered. (This also means
that delta repo downloads currently don't work that well for Tumbleweed,
so I'm eager to find something better).

How about using the Fedora metadata but reorder the entries with
the buildtime as sort key?

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-03-02 Thread Jonathan Dieter
On Thu, 2018-03-01 at 10:12 +, Michael Schroeder wrote:
> On Wed, Feb 28, 2018 at 09:31:39AM +0200, Jonathan Dieter wrote:
> > Ok, here are some numbers comparing zsync and zchunk.  For testing, I
> > have eight f27-updates primary.xml files dating from Dec 7 until Feb
> > 12.  3-6 are on consecutive days in mid-January, while 7 and 8 are two
> > days apart in mid-February.
> 
> The gzip --rsyncable + zsync approach only makes sense when the repository
> metadata is ordered so that new packages come last. Did you do that in
> your tests?

No, I didn't expect it to have much effect.  Since openSUSE's xml files
are (presumably) ordered so new packages come last, do you have any old
primary.xml files lying around that I can test?

If not, I'll grab them from the next few updates.

Jonathan


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-03-01 Thread Michael Schroeder
On Wed, Feb 28, 2018 at 09:31:39AM +0200, Jonathan Dieter wrote:
> On Fri, 2018-02-23 at 14:14 +, Michael Schroeder wrote:
> > This may be an unfair question, but how does it compare to the
> > 'gzip --rsyncable' + zsync approach that we (openSUSE) have been
> > using for almost eight years? I guess it's better, but how much?
> 
> Ok, here are some numbers comparing zsync and zchunk.  For testing, I
> have eight f27-updates primary.xml files dating from Dec 7 until Feb
> 12.  3-6 are on consecutive days in mid-January, while 7 and 8 are two
> days apart in mid-February.

The gzip --rsyncable + zsync approach only makes sense when the repository
metadata is ordered so that new packages come last. Did you do that in
your tests?

Thanks,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Proposed zchunk file format - V3

2018-02-28 Thread Jonathan Dieter
I've been working on a C implementation of this spec, and came up with
a few other changes.  I think it's important to have a checksum of the
index as well as the data as we want to be able to verify that the
index is as expected before trying to parse it.

I've also added in the ability to use a different hash type for the
chunk checksums versus the index checksum and overall checksum.  The
idea is that a weaker checksum may be chosen for the chunks to reduce
the size of the index without weakening overall verification.
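
A back-of-the-envelope sketch of the saving, assuming the V3 layout
where each chunk entry is a checksum followed by a 64-bit offset.  The
chunk count of 10000 is an invented figure, and the sizes are before
the index is compressed.

```python
import hashlib


def chunk_entries_size(chunk_count: int, algo: str, offset_bytes: int = 8) -> int:
    """Uncompressed size of the per-chunk index entries, in bytes."""
    return chunk_count * (hashlib.new(algo).digest_size + offset_bytes)


sha256_size = chunk_entries_size(10000, "sha256")  # 32-byte digests
sha1_size = chunk_entries_size(10000, "sha1")      # 20-byte digests
print(sha256_size, sha1_size, sha256_size - sha1_size)
```

With these assumptions the entries take 400000 bytes with SHA-256 and
280000 bytes with SHA-1, a difference of 120000 bytes.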

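+-+-+-+-+-+---------------+================+------------------+
|   ID    | Checksum type | Index checksum | Compression type |
+-+-+-+-+-+---------------+================+------------------+

+-+-+-+-+-+-+-+-+==================+=================+
|  Index size   | Compressed Index | Compressed Dict |
+-+-+-+-+-+-+-+-+==================+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+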

ID
 '\0ZCK1', identifies file as zchunk version 1 file

Checksum type
 This is an 8-bit unsigned integer containing the type of checksum
 used to generate the index checksum and the total data checksum, but
 *not* the chunk checksums

 Current values:
   0 = SHA-1
   1 = SHA-256

Index checksum
 This is the checksum of everything from this point until the end of
 the index.  It includes the compression type, the index size, and the
 compressed index.

Compression type
 This is an 8-bit unsigned integer containing the type of compression
 used to compress dict and chunks.

 Current values:
   0 - Uncompressed
   2 - zstd

Index size
 This is a 64-bit LE unsigned integer containing the size of the
 compressed index.

Compressed Index
 This is the index, which is described in the next section.  The index
 is compressed without a custom dictionary.

Compressed Dict (optional)
 This is a custom dictionary used when compressing each chunk.
 Because each chunk is compressed completely separately from the
 others, the custom dictionary gives us much better overall
 compression.  The custom dictionary is compressed without a custom
 dictionary (for obvious reasons).

Chunk
 This is a chunk of data, compressed with the custom dictionary
 provided above.


The index:

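+---------------------+-+-+-+-+-+-+-+-+======================+
| Chunk checksum type |  Chunk count  | Checksum of all data |
+---------------------+-+-+-+-+-+-+-+-+======================+

+================+-+-+-+-+-+-+-+-+
| Dict checksum  |  End of dict  |
+================+-+-+-+-+-+-+-+-+

+================+-+-+-+-+-+-+-+-+
| Chunk checksum | End of chunk  |  ==> More
+================+-+-+-+-+-+-+-+-+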

Chunk checksum type
 This is an 8-bit unsigned integer containing the type of checksum used
 to generate the chunk checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Chunk count
 This is a count of the number of chunks in the zchunk file.

Checksum of all data
 This is the checksum of everything after the index, including the
 compressed dict and all the compressed chunks.  This checksum is
 generated using the overall checksum type, *not* the chunk checksum
 type.

Dict checksum
 This is the checksum of the compressed dict, used to detect whether
 two dicts are identical.  If there is no dict, the checksum must be
 all zeros.

End of dict
 This is a 64-bit LE unsigned integer containing the location of the
 end of the dict starting from the end of the index.  This gives us the
 information we need to find and decompress the dict.  If there is no
 dict, this must be a zero.

Chunk checksum
 This is the checksum of the compressed chunk, used to detect whether
 any two chunks are identical.

End of chunk
 This is the location of the end of the chunk starting from the end of
 the index.  This gives us the information we need to find and
 decompress each chunk.
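
Because this version stores end offsets rather than lengths, a reader
has to subtract neighbouring entries to find each item.  A minimal
sketch, with a made-up function name, where the first end offset is
the dict's and the rest are the chunks' in file order:

```python
def spans_from_ends(end_offsets, start: int = 0):
    """Convert cumulative end offsets into (start, length) pairs.

    Offsets are measured from the end of the index, as in the text above.
    """
    spans = []
    for end in end_offsets:
        if end < start:
            raise ValueError("end offsets must not decrease")
        spans.append((start, end - start))
        start = end
    return spans


# No dict (end offset 0), then two chunks ending at bytes 100 and 250.
print(spans_from_ends([0, 100, 250]))
```

This gives a zero-length dict at offset 0, a 100-byte chunk at offset
0, and a 150-byte chunk at offset 100.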


The index is designed to be able to be extracted from the file on the
server and downloaded separately, to facilitate downloading only the
parts of the file that are needed, but must then be re-embedded when
assembling the file so the user only needs to keep one file.


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-27 Thread Jonathan Dieter
On Fri, 2018-02-23 at 14:14 +, Michael Schroeder wrote:
> This may be an unfair question, but how does it compare to the
> 'gzip --rsyncable' + zsync approach that we (openSUSE) have been
> using for almost eight years? I guess it's better, but how much?

Ok, here are some numbers comparing zsync and zchunk.  For testing, I
have eight f27-updates primary.xml files dating from Dec 7 until Feb
12.  3-6 are on consecutive days in mid-January, while 7 and 8 are two
days apart in mid-February.  The numbers show the delta that would be
downloaded, not including the index/control file.

zsync
1->2 - 1620590
2->3 - 1229518
3->4 - 561777
4->5 - 35304
5->6 - 51649
6->7 - 1869237
7->8 - 20386
1->4 - 2354936
3->6 - 626871
6->8 - Failure

zchunk - chunked by srpm
1->2 - 1157234 - 29% smaller
2->3 - 788071 - 36% smaller
3->4 - 332596 - 41% smaller
4->5 - 34982 - 1% smaller
5->6 - 23236 - 55% smaller
6->7 - 1289018 - 31% smaller
7->8 - 10816 - 47% smaller
1->4 - 1796044 - 24% smaller
3->6 - 387162 - 38% smaller
6->8 - 1295027

As you can see, zchunk ranges from 1%-55% smaller than zsync, with an
average of around 30% (closer to 25% for larger differences).
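
The percentages can be reproduced from the byte counts listed above.
This sketch simply recomputes them; 6->8 is left out because zsync
failed there, and the variable names are made up.

```python
ZSYNC = {"1->2": 1620590, "2->3": 1229518, "3->4": 561777, "4->5": 35304,
         "5->6": 51649, "6->7": 1869237, "7->8": 20386, "1->4": 2354936,
         "3->6": 626871}
ZCHUNK = {"1->2": 1157234, "2->3": 788071, "3->4": 332596, "4->5": 34982,
          "5->6": 23236, "6->7": 1289018, "7->8": 10816, "1->4": 1796044,
          "3->6": 387162}


def percent_smaller(zsync_bytes: int, zchunk_bytes: int) -> float:
    """How much smaller the zchunk download is, as a percentage of zsync's."""
    return 100.0 * (zsync_bytes - zchunk_bytes) / zsync_bytes


savings = {step: percent_smaller(ZSYNC[step], ZCHUNK[step]) for step in ZSYNC}
mean_saving = sum(savings.values()) / len(savings)

for step, saving in savings.items():
    print(f"{step}: {saving:.0f}% smaller")
print(f"unweighted mean: {mean_saving:.1f}%")
```

The unweighted mean over the nine comparable steps comes out a little
above 33%.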

The zchunk files are roughly 3% smaller than the equivalent rsyncable
gzip'd file, even with the embedded index.

Jonathan


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-26 Thread Michael Schroeder
On Fri, Feb 23, 2018 at 11:15:40PM +0200, Jonathan Dieter wrote:
> On Fri, 2018-02-23 at 14:14 +, Michael Schroeder wrote:
> > Hi Jonathan!
> > 
> > On Fri, Feb 16, 2018 at 08:52:23PM +0200, Jonathan Dieter wrote:
> > > > So here's my proposed file format for the zchunk file.  Should I add
> > > > some flags to facilitate possible different compression formats?
> > > >
> > > > +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> > > > |  ID   |  Index size   | Compressed Index | Compressed Dict |
> > > > +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> > > >
> > > > +===========+===========+
> > > > |   Chunk   |   Chunk   | ==> More chunks
> > > > +===========+===========+
> > > [...]
> > 
> > This may be an unfair question, but how does it compare to the
> > > 'gzip --rsyncable' + zsync approach that we (openSUSE) have been
> > > using for almost eight years? I guess it's better, but how much?
> > 
> > Cheers,
> >   Michael.
> 
> I've run some tests with zsync (since it's not in Fedora, I rebuilt the
> latest Tumbleweed source rpm), but ran into problems (which is probably
> unsurprising, given that upstream hasn't released an update in eight
> years).

Oh, I didn't propose to use the zsync tool itself, but just the
file format. I.e. --rsyncable compressed files that are accompanied
by .zsync files.

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-26 Thread Michael Schroeder
On Fri, Feb 23, 2018 at 03:23:00PM -0500, Colin Walters wrote:
> 
> 
> On Fri, Feb 23, 2018, at 9:14 AM, Michael Schroeder wrote:
> 
> > This may be an unfair question, but how does it compare to the
> > 'gzip --rsyncable' + zsync approach that we (openSUSE) have been
> > using for almost eight years? I guess it's better, but how much?
> 
> Where is that code?  `git grep zsync` in zypper git master has zero
> hits.  I don't see any obvious library dependencies like librepo, it
> isn't obvious to me that it's in repos.cc in the source (that's what
> fetches metadata right?).

You need to check libzypp, not zypper.

> And I don't see any zsync files in e.g.:
> http://download.opensuse.org/distribution/leap/42.3/repo/oss/suse/

Well, it's because we do it a bit differently. openSUSE uses metalinks
for all files on download.opensuse.org. A metalink file consists of
a list of mirrors plus block checksums.

Now if you look at how zsync works, it's a strong checksum and a
rolling checksum for every block. So what we've decided to do is
just add the rolling checksums to the metalink files we generate.
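
For readers unfamiliar with rolling checksums, here is a sketch of the
rsync-style weak checksum as it is commonly described.  It is not
claimed to match the exact parameters zsync or the openSUSE metalink
files use; the function names and the 16-bit modulus are assumptions
for illustration.

```python
MOD = 1 << 16


def weak_checksum(block: bytes):
    """Compute the two running sums (a, b) for a whole block."""
    n = len(block)
    a = sum(block) % MOD
    # Earlier bytes get larger weights: n for the first, 1 for the last.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a: int, b: int, block_len: int, out_byte: int, in_byte: int):
    """Slide the window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b
```

The point of the rolling form is that a client can test every byte
offset of a local file against the published block checksums without
rehashing a full block each time; the strong checksum is only needed
to confirm a weak match.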

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-23 Thread Jonathan Dieter
On Fri, 2018-02-23 at 14:14 +, Michael Schroeder wrote:
> Hi Jonathan!
> 
> On Fri, Feb 16, 2018 at 08:52:23PM +0200, Jonathan Dieter wrote:
> > So here's my proposed file format for the zchunk file.  Should I add
> > some flags to facilitate possible different compression formats?
> >
> > +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> > |  ID   |  Index size   | Compressed Index | Compressed Dict |
> > +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> >
> > +===========+===========+
> > |   Chunk   |   Chunk   | ==> More chunks
> > +===========+===========+
> > [...]
> 
> This may be an unfair question, but how does it compare to the
> 'gzip --rsyncable' + zsync approach that we (openSUSE) have been
> using for almost eight years? I guess it's better, but how much?
> 
> Cheers,
>   Michael.

I've run some tests with zsync (since it's not in Fedora, I rebuilt the
latest Tumbleweed source rpm), but ran into problems (which is probably
unsurprising, given that upstream hasn't released an update in eight
years).

When testing the difference between two subsequent gzip --rsyncable
primary.xml's, zsync worked perfectly and only downloaded the 20k delta
(plus the 192k zsync control file).

When testing between two gzip --rsyncable primary.xml's that were about
four weeks apart, zsync was unable to build the new primary.xml, so I
was unable to get better numbers.

I do see zchunk as a new compression format that allows for easy deltas
as opposed to the add-on to existing files that zsync is.

Zsync also doesn't seem to support https, and uses crcs and MD4 hashes
to identify whether a block has changed, while I'd prefer SHA-256 or
better.

I do like the idea of using rsync's rolling sum to figure out where a
new chunk starts, and I'm going to see whether it might give us better
results than my current manual method.

Jonathan
___
Rpm-ecosystem mailing list
Rpm-ecosystem@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-ecosystem


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-23 Thread Colin Walters


On Fri, Feb 23, 2018, at 3:54 PM, Jonathan Dieter wrote:
> On Fri, 2018-02-23 at 15:23 -0500, Colin Walters wrote:
> 
> > And I don't see any zsync files in e.g.:
> > http://download.opensuse.org/distribution/leap/42.3/repo/oss/suse/
> 
> I found a copy of zsync at https://download.opensuse.org/repositories/n
> etwork/openSUSE_Tumbleweed/src/zsync-0.6.2-35.23.src.rpm that I was
> able to rebuild in Fedora.

I mean the zsync checksum metadata itself that zypper would
need to find in the binary rpm-md repos to perform delta metadata downloads.



Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-23 Thread Jonathan Dieter
On Fri, 2018-02-23 at 15:23 -0500, Colin Walters wrote:

> And I don't see any zsync files in e.g.:
> http://download.opensuse.org/distribution/leap/42.3/repo/oss/suse/

I found a copy of zsync at https://download.opensuse.org/repositories/n
etwork/openSUSE_Tumbleweed/src/zsync-0.6.2-35.23.src.rpm that I was
able to rebuild in Fedora.

Jonathan


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-23 Thread Colin Walters


On Fri, Feb 23, 2018, at 9:14 AM, Michael Schroeder wrote:

> This may be an unfair question, but how does it compare to the
> 'gzip --rsyncable' + zsync approach that we (openSUSE) have been
> using for almost eight years? I guess it's better, but how much?

Where is that code?  `git grep zsync` in zypper git master has zero
hits.  I don't see any obvious library dependencies like librepo, it
isn't obvious to me that it's in repos.cc in the source (that's what
fetches metadata right?).

And I don't see any zsync files in e.g.:
http://download.opensuse.org/distribution/leap/42.3/repo/oss/suse/


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-23 Thread Michael Schroeder

Hi Jonathan!

On Fri, Feb 16, 2018 at 08:52:23PM +0200, Jonathan Dieter wrote:
> So here's my proposed file format for the zchunk file.  Should I add
> some flags to facilitate possible different compression formats?
> 
> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> |  ID   |  Index size   | Compressed Index | Compressed Dict |
> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
>
> +===========+===========+
> |   Chunk   |   Chunk   | ==> More chunks
> +===========+===========+
> [...]

This may be an unfair question, but how does it compare to the
'gzip --rsyncable' + zsync approach that we (openSUSE) have been
using for almost eight years? I guess it's better, but how much?

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


[Rpm-ecosystem] Proposed zchunk file format - V2

2018-02-19 Thread Jonathan Dieter
Neal, thanks for the feedback.  After taking your comments into
consideration, here's version 2.  

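+-+-+-+-+-+------------------+-+-+-+-+-+-+-+-+
|   ID    | Compression type |  Index size   |
+-+-+-+-+-+------------------+-+-+-+-+-+-+-+-+

+==================+=================+
| Compressed Index | Compressed Dict |
+==================+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+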

ID
 '\0ZCK1', identifies file as zchunk version 1 file

Compression type
 Type of compression used to compress dict and chunks

 Current values:
   0 - Uncompressed
   2 - zstd

Index size
 This is a 64-bit unsigned integer containing the size of the
 compressed index.

Compressed Index
 This is the index, which is described in the next section.  The index 
 is compressed without a custom dictionary.

Compressed Dict (optional)
 This is a custom dictionary used when compressing each chunk.
 Because each chunk is compressed completely separately from the
 others, the custom dictionary gives us much better overall
 compression.  The custom dictionary is compressed without a custom
 dictionary (for obvious reasons).

Chunk
 This is a chunk of data, compressed with the custom dictionary
 provided above.


The index:

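+---------------+======================+
| Checksum type | Checksum of all data |
+---------------+======================+

+================+-+-+-+-+-+-+-+-+
| Dict checksum  |  End of dict  |
+================+-+-+-+-+-+-+-+-+

+================+-+-+-+-+-+-+-+-+
| Chunk checksum | End of chunk  |  ==> More
+================+-+-+-+-+-+-+-+-+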

Checksum type
 This is the type of checksum used to generate the checksums in the 
 index.

 Current values:
   0 = SHA-256

Checksum of all data
 This is the checksum of the compressed dict and all the compressed 
 chunks, used to verify that the file is actually the same, even in 
 the unlikely event of a hash collision for one of the chunks

Dict checksum
 This is the checksum of the compressed dict, used to detect whether 
 two dicts are identical.  If there is no dict, the checksum must be
 all zeros.

End of dict
 This is the location of the end of the dict starting from the end of 
 the index.  This gives us the information we need to find and 
 decompress the dict.  If there is no dict, this must be a zero.

Chunk checksum
 This is the checksum of the compressed chunk, used to detect whether 
 any two chunks are identical.

End of chunk
 This is the location of the end of the chunk starting from the end of 
 the index.  This gives us the information we need to find and 
 decompress each chunk.


The index is designed to be able to be extracted from the file on the
server and downloaded separately, to facilitate downloading only the
parts of the file that are needed, but must then be re-embedded when
assembling the file so the user only needs to keep one file.


Re: [Rpm-ecosystem] Proposed zchunk file format

2018-02-17 Thread Neal Gompa
On Fri, Feb 16, 2018 at 1:52 PM, Jonathan Dieter wrote:
> So here's my proposed file format for the zchunk file.  Should I add
> some flags to facilitate possible different compression formats?
>

I think it'd be smart to make sure that if other compression formats
were needed, it would be easy to implement. So flags for facilitating
that would be a good idea.

> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> |  ID   |  Index size   | Compressed Index | Compressed Dict |
> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
>
> +===========+===========+
> |   Chunk   |   Chunk   | ==> More chunks
> +===========+===========+
>
> ID
>  '\0ZCK', identifies file as zchunk file
>
> Index size
>  This is a 64-bit unsigned integer containing the size of compressed
>  index.
>
> Compressed Index
>  This is the index, which is described in the next section.  The index
>  is compressed using standard zstd compression without a custom
>  dictionary.
>
> Compressed Dict
>  This is a custom dictionary used when compressing each chunk.
>  Because each chunk is compressed completely separately from the
>  others, the custom dictionary gives us much better overall
>  compression.  The custom dictionary is compressed using standard zstd
>  compression without using a separate custom dictionary (for obvious
>  reasons).
>
> Chunk
>  This is a chunk of data, compressed using zstd with the custom
>  dictionary provided above.
>
>
> The index:
>
> +=============+-+-+-+-+-+-+-+-+
> |  sha256sum  |  End of dict  |
> +=============+-+-+-+-+-+-+-+-+
>
> +=============+-+-+-+-+-+-+-+-+
> |  sha256sum  | End of chunk  |  ==> More
> +=============+-+-+-+-+-+-+-+-+
>
> sha256sum of compressed dict
>  This is a binary sha256sum of the compressed dict, used to detect
>  whether two dicts are identical.
>
> End of dict
>  This is the location of the end of the dict with 0 being the end of
>  the index.  This gives us the information we need to find and
>  decompress the dict.
>
> sha256sum of compressed chunk
>  This is a binary sha256sum of the compressed chunk, used to detect
>  whether any two chunks are identical.
>

I suggest you add something to indicate what kind of checksum it is,
because when it has to be changed for whatever reason, we need a way
to make the format obvious for checksums.
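
One way such an indicator could work is a small integer tag that selects the hash function.  The sketch below is only illustrative: the numeric values mirror the ones listed in the V4 proposal (0 = SHA-1, 1 = SHA-256), while the table and function names are hypothetical.

```python
import hashlib

# Type values as listed in the V4 proposal; the names here are illustrative.
CHECKSUM_TYPES = {
    0: hashlib.sha1,
    1: hashlib.sha256,
}


def compute_checksum(checksum_type: int, data: bytes) -> bytes:
    """Return the binary digest of data for the given checksum type."""
    try:
        factory = CHECKSUM_TYPES[checksum_type]
    except KeyError:
        raise ValueError(f"unknown checksum type {checksum_type}") from None
    return factory(data).digest()


def digest_size(checksum_type: int) -> int:
    """Number of bytes a reader must consume for this checksum type."""
    return CHECKSUM_TYPES[checksum_type]().digest_size
```

With a tag like this, a reader knows how many checksum bytes to expect before it parses them, and a new hash can be added later without changing the layout.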

> End of chunk
>  This is the location of the end of the chunk with 0 being the end of
>  the index.  This gives us the information we need to find and
>  decompress each chunk.
>
>
> The index is designed so that it can be extracted from the file on the
> server and downloaded separately, which makes it possible to download
> only the parts of the file that are needed.  It must then be re-embedded
> when assembling the file so the user only needs to keep one file.

Overall, it looks pretty good to me.


-- 
真実はいつも一つ!/ Always, there's only one truth!


[Rpm-ecosystem] Proposed zchunk file format

2018-02-16 Thread Jonathan Dieter
So here's my proposed file format for the zchunk file.  Should I add
some flags to facilitate possible different compression formats?

+-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
|  ID   |  Index size   | Compressed Index | Compressed Dict |
+-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+

ID
 '\0ZCK', identifies file as zchunk file

Index size
 This is a 64-bit unsigned integer containing the size of the
 compressed index.

Compressed Index
 This is the index, which is described in the next section.  The index 
 is compressed using standard zstd compression without a custom 
 dictionary.

Compressed Dict
 This is a custom dictionary used when compressing each chunk.  
 Because each chunk is compressed completely separately from the 
 others, the custom dictionary gives us much better overall 
 compression.  The custom dictionary is compressed using standard zstd 
 compression without using a separate custom dictionary (for obvious 
 reasons).

Chunk
 This is a chunk of data, compressed using zstd with the custom 
 dictionary provided above.


The index:

+=============+-+-+-+-+-+-+-+-+
|  sha256sum  |  End of dict  |
+=============+-+-+-+-+-+-+-+-+

+=============+-+-+-+-+-+-+-+-+
|  sha256sum  | End of chunk  |  ==> More
+=============+-+-+-+-+-+-+-+-+

sha256sum of compressed dict
 This is a binary sha256sum of the compressed dict, used to detect
 whether two dicts are identical.

End of dict
 This is the location of the end of the dict with 0 being the end of
 the index.  This gives us the information we need to find and
 decompress the dict.

sha256sum of compressed chunk
 This is a binary sha256sum of the compressed chunk, used to detect
 whether any two chunks are identical.

End of chunk
 This is the location of the end of the chunk with 0 being the end of 
 the index.  This gives us the information we need to find and 
 decompress each chunk.
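
To make the entry layout concrete, here is a rough sketch of reading the index once it has been decompressed (the zstd step is left out).  Each entry is taken to be a 32-byte binary sha256sum followed by a 64-bit unsigned end offset.  The byte order is not stated above, so little-endian is purely an assumption here, and the names are hypothetical.

```python
import struct
from typing import List, Tuple

# 32-byte sha256sum + 64-bit unsigned end offset.
# "<" (little-endian, no padding) is an assumption, not part of the proposal.
ENTRY = struct.Struct("<32sQ")


def parse_index(raw: bytes) -> List[Tuple[bytes, int]]:
    """Split a decompressed index into (sha256sum, end offset) entries.

    The first entry describes the dict; the rest describe the chunks.
    """
    if len(raw) % ENTRY.size:
        raise ValueError("index length is not a multiple of the entry size")
    return [ENTRY.unpack_from(raw, off)
            for off in range(0, len(raw), ENTRY.size)]
```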


The index is designed so that it can be extracted from the file on the
server and downloaded separately, which makes it possible to download
only the parts of the file that are needed.  It must then be re-embedded
when assembling the file so the user only needs to keep one file.
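
A client could use a separately downloaded index roughly as follows: compare its checksums against the index of the file it already has, copy the matching pieces locally, and request only the remaining byte ranges.  This is a minimal sketch under that assumption; the function names are hypothetical, and offsets are measured from the end of each file's index.

```python
from typing import Dict, List, Tuple

Entry = Tuple[bytes, int]   # (checksum, end offset from end of index)
Span = Tuple[int, int]      # half-open (start, end), same origin


def spans(entries: List[Entry]) -> List[Tuple[bytes, int, int]]:
    """Attach a start offset to each entry; items are stored back to back."""
    out, start = [], 0
    for checksum, end in entries:
        out.append((checksum, start, end))
        start = end
    return out


def plan(local: List[Entry], remote: List[Entry]):
    """Return (copy, fetch) for rebuilding the remote file.

    copy:  list of (local span, remote span) pairs that can be reused.
    fetch: list of remote spans that still have to be downloaded.
    """
    have: Dict[bytes, Span] = {}
    for checksum, start, end in spans(local):
        have.setdefault(checksum, (start, end))

    copy, fetch = [], []
    for checksum, start, end in spans(remote):
        if checksum in have:
            copy.append((have[checksum], (start, end)))
        else:
            fetch.append((start, end))
    return copy, fetch
```

The spans in fetch map naturally onto ranged requests to the server, although how those requests are issued is outside this sketch.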