Re: [Rpm-ecosystem] Some points about zchunk

2018-09-27 Thread Jonathan Dieter
On Thu, 2018-09-27 at 14:55 -0400, Neal Gompa wrote:
> On Thu, Sep 27, 2018 at 2:45 PM Jonathan Dieter  wrote:
> > On Thu, 2018-09-27 at 14:17 -0400, Neal Gompa wrote:
> > > DNF is now using libdnf, so you shouldn't need to repeat it twice.
> > > 
> > > But that said, the path you describe makes sense.
> > 
> > So should I go with the first flow which abstracts zchunk away from
> > librepo clients or the second flow which is a bit less convoluted, but
> > requires librepo clients to understand zchunk?
> > 
> 
> The second flow.

Ok, thanks for the feedback.  That's what I'll do.

Jonathan

___
Rpm-ecosystem mailing list
Rpm-ecosystem@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-ecosystem


Re: [Rpm-ecosystem] Some points about zchunk

2018-09-27 Thread Jonathan Dieter
On Thu, 2018-09-27 at 14:17 -0400, Neal Gompa wrote:
> On Thu, Sep 27, 2018 at 1:49 PM Jonathan Dieter  wrote:
> > Apologies that it's taken so long for me to follow up on this.  So,
> > I've been working on getting librepo and libdnf up-to-date with this
> > change and it's far more difficult than you would expect.  Currently
> > the process is something like this:
> > 
> > ✔ libdnf requests primary
> > ✔ librepo cleverly changes primary to primary_zck if primary_zck exists
> > ✔ librepo downloads primary_zck
> > ✔ libdnf gets path of primary from librepo
> > ✔ libdnf passes primary path to libsolv
> > ✘ libsolv can't open primary path, because we downloaded primary_zck
> > 
> > I think the way forward is to make libdnf more aware of zchunk with the
> > following (simpler) flow:
> > 
> > ✔ libdnf requests primary_zck.  If that fails, it requests primary
> > ✔ librepo downloads primary_zck
> > ✔ libdnf gets path of primary_zck from librepo
> > ✔ libdnf passes primary_zck path to libsolv
> > ✔ libsolv opens primary_zck path
> > 
> > There are three things that I want to verify before I move forward with
> > this:
> > >    1. dnf is now using libdnf, so I'm not going to have to repeat the code
> > >       twice, right?
> > >    2. Are the librepo consumers happy with me moving some logic to libdnf?
> > >       Is there anybody who's losing zchunk support with this move?
> > >    3. Is there a better way to do this?
> > 
> 
> The problem if librepo can't do it is that pure-librepo consumers are
> probably going to have issues. The main upcoming pure-librepo consumer
> is Koji to replace its yum.repoMDObject stuff and other repomd centric
> actions using the YUM API.

If this is the case, maybe we should keep it in pure librepo.  It means
doing some fancy footwork between step 2 and step 4 in the current
example, and it means that we're basically lying to our clients,
telling them that they're getting foo when they're really getting
foo_zck, but I don't think that matters as long as we're consistent.

> DNF is now using libdnf, so you shouldn't need to repeat it twice.
> 
> But that said, the path you describe makes sense.

So should I go with the first flow which abstracts zchunk away from
librepo clients or the second flow which is a bit less convoluted, but
requires librepo clients to understand zchunk?

Jonathan



Re: [Rpm-ecosystem] Some points about zchunk

2018-09-27 Thread Jonathan Dieter
On Thu, 2018-08-09 at 10:49 +0200, Jonathan Dieter wrote:
> On Sun, 2018-07-15 at 11:25 -0400, Neal Gompa wrote:
> > On Thu, Jul 12, 2018 at 3:27 PM Jonathan Dieter wrote:
> > > I'd go with _zck since  is, by default, xml, but, other than
> > > that, I think what you (and Michael) are suggesting makes sense.
> > > 
> > > Michael, I agree that there should be some way to manually enable
> > > zchunk (with a method that will work with future algorithms) in
> > > librepo.
> > > 
> > > I don't mind writing the code, but I'd like to hear some consensus on
> > > the method and element name.
> > 
> > I'm okay with this, but does this mean we could also have (cheaply)
> > zck of arbitrary xml files appended using tools like modifyrepo_c,
> > then?
> 
> Ok, I've gone with creating a new record _zck, that uses all the
> same attributes available for  plus  and
>  size>.  My updated pull request for createrepo_c has these changes
> now.

Apologies that it's taken so long for me to follow up on this.  So,
I've been working on getting librepo and libdnf up-to-date with this
change and it's far more difficult than you would expect.  Currently
the process is something like this:

✔ libdnf requests primary
✔ librepo cleverly changes primary to primary_zck if primary_zck exists
✔ librepo downloads primary_zck
✔ libdnf gets path of primary from librepo
✔ libdnf passes primary path to libsolv
✘ libsolv can't open primary path, because we downloaded primary_zck

I think the way forward is to make libdnf more aware of zchunk with the
following (simpler) flow:

✔ libdnf requests primary_zck.  If that fails, it requests primary
✔ librepo downloads primary_zck
✔ libdnf gets path of primary_zck from librepo
✔ libdnf passes primary_zck path to libsolv
✔ libsolv opens primary_zck path
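
The fallback in the first step of this flow can be sketched roughly as follows. This is only an illustration under stated assumptions: `pick_metadata_type` and the `available` array are hypothetical names, not part of librepo or libdnf, and the `_zck` suffix follows the record naming discussed in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of librepo or libdnf: given the record
 * types listed in repomd.xml, prefer "<base>_zck" and fall back to
 * "<base>".  Returns a pointer into `available`, or NULL if neither
 * record exists. */
const char *
pick_metadata_type(const char *const *available, size_t n, const char *base)
{
    const char *plain = NULL;
    size_t base_len = strlen(base);

    for (size_t i = 0; i < n; i++) {
        const char *t = available[i];
        if (strncmp(t, base, base_len) != 0)
            continue;               /* some other record type */
        if (strcmp(t + base_len, "_zck") == 0)
            return t;               /* zchunk record wins immediately */
        if (t[base_len] == '\0')
            plain = t;              /* remember the legacy record */
    }
    return plain;
}
```

With a helper like this, the caller would pass whatever name comes back to the download step and then hand the resulting path to libsolv unchanged.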

There are three things that I want to verify before I move forward with
this:
   1. dnf is now using libdnf, so I'm not going to have to repeat the code
      twice, right?
   2. Are the librepo consumers happy with me moving some logic to libdnf?
      Is there anybody who's losing zchunk support with this move?
   3. Is there a better way to do this?

Jonathan



Re: [Rpm-ecosystem] Some points about zchunk

2018-07-16 Thread Jonathan Dieter
On Sun, 2018-07-15 at 11:25 -0400, Neal Gompa wrote:
> On Thu, Jul 12, 2018 at 3:27 PM Jonathan Dieter  wrote:
> > I'd go with _zck since  is, by default, xml, but, other than
> > that, I think what you (and Michael) are suggesting makes sense.
> > 
> > Michael, I agree that there should be some way to manually enable
> > zchunk (with a method that will work with future algorithms) in
> > librepo.
> > 
> > I don't mind writing the code, but I'd like to hear some consensus on
> > the method and element name.
> 
> I'm okay with this, but does this mean we could also have (cheaply)
> zck of arbitrary xml files appended using tools like modifyrepo_c,
> then?

Yes.  The createrepo_c code hasn't been written yet, but zchunk-0.7.6
supports automatic chunking using buzhash in libzck (as opposed to the
zck utility), so, all that's left is to have modifyrepo_c use it.

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-15 Thread Neal Gompa
On Thu, Jul 12, 2018 at 3:27 PM Jonathan Dieter  wrote:
>
> On Wed, 2018-07-11 at 07:55 -0400, Neal Gompa wrote:
> > On Wed, Jul 11, 2018 at 7:30 AM Michael Schroeder  wrote:
> > >
> > > On Wed, Jul 11, 2018 at 12:23:47PM +0100, Jonathan Dieter wrote:
> > > > That's something I didn't think of, and you're absolutely right.
> > > >
> > > > So, to summarize, we'll leave the gzip entry as "primary", so legacy
> > > > systems will still download without any problems, and create a new
> > > > "primary@zchunk" that librepo will automagically download if zchunk is
> > > > supported?
> > >
> > > I think there should be some librepo option to enable this. Be it
> > > something boolean or a fancy string list that can be used for
> > > some future compression method.
> > >
> > > But I'm just one person. Can someone else on this list please speak
> > > up and tell us if we want to do something stupid or not? Maybe the
> > > librepo maintainer? Or the libdnf maintainer? Anybody?
> > >
> >
> > While I agree that we should have an attribute that declares the
> > compression format for fetching, I wonder how legacy clients would
> > handle it? There must be a reason why we haven't done it so far after
> > going from gzip to bzip2 to lzma to xz?
> >
> > It's not unprecedented for there to be stuff like this (we handle the
> > sqlite data variant with _db). We could, for consistency, do
> > "_zckxml" and have librepo fetch that when it's available.
>
> I'd go with _zck since  is, by default, xml, but, other than
> that, I think what you (and Michael) are suggesting makes sense.
>
> Michael, I agree that there should be some way to manually enable
> zchunk (with a method that will work with future algorithms) in
> librepo.
>
> I don't mind writing the code, but I'd like to hear some consensus on
> the method and element name.

I'm okay with this, but does this mean we could also have (cheaply)
zck of arbitrary xml files appended using tools like modifyrepo_c,
then?



-- 
真実はいつも一つ!/ Always, there's only one truth!


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-11 Thread Jonathan Dieter
On Wed, 2018-07-11 at 11:08 +, Michael Schroeder wrote:
> On Wed, Jul 11, 2018 at 11:20:00AM +0100, Jonathan Dieter wrote:
> > I must be missing something because I don't understand how that
> > follows.  As I understand it, dnf requests the primary metadata. 
> > Librepo then downloads either primary.xml.gz or primary.xml.zck. 
> > Librepo then asks libsolv to decompress the xml file and convert it
> > into a solv file.  dnf then uses the solv file directly.  Why should
> > dnf care whether librepo downloaded primary.xml.gz or primary.xml.zck?
> 
> But it's not librepo that calls libsolv, it's libdnf.

Ah, ok.  For some reason, I had it in my mind that librepo was calling
libsolv directly, but, after looking through the code, you're right. 
It seems that both libdnf and dnf just pass the path from librepo
straight to libsolv and assume that libsolv will be able to open it.

> Anyway, this discussion started because you said:
> 
> > I had originally planned to do something along these lines (I think I
> > used primary-zck rather than primary@zchunk), but realized that this
> > pushed the "choose best format" code into the top-level tools, rather
> > than leaving the decision in librepo.
> 
> So you're kind of contradicting yourself, IMHO.
> 
> Basically all libdnf does is call:
>   path = lr_yum_repo_path(yum_repo, "primary");
> and then:
>   fp_primary = solv_xfopen(hy_repo_get_string(hrepo, path), "r");
>   repo_add_rpmmd(repo, fp_primary, 0, 0);
> 
> I don't see why librepo can't automagically download/return the
> "primary@zchunk" entry instead of "primary".

That's something I didn't think of, and you're absolutely right.

So, to summarize, we'll leave the gzip entry as "primary", so legacy
systems will still download without any problems, and create a new
"primary@zchunk" that librepo will automagically download if zchunk is
supported?

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-11 Thread Michael Schroeder
On Wed, Jul 11, 2018 at 11:20:00AM +0100, Jonathan Dieter wrote:
> I must be missing something because I don't understand how that
> follows.  As I understand it, dnf requests the primary metadata. 
> Librepo then downloads either primary.xml.gz or primary.xml.zck. 
> Librepo then asks libsolv to decompress the xml file and convert it
> into a solv file.  dnf then uses the solv file directly.  Why should
> dnf care whether librepo downloaded primary.xml.gz or primary.xml.zck?

But it's not librepo that calls libsolv, it's libdnf.

Anyway, this discussion started because you said:

> I had originally planned to do something along these lines (I think I
> used primary-zck rather than primary@zchunk), but realized that this
> pushed the "choose best format" code into the top-level tools, rather
> than leaving the decision in librepo.

So you're kind of contradicting yourself, IMHO.

Basically all libdnf does is call:
  path = lr_yum_repo_path(yum_repo, "primary");
and then:
  fp_primary = solv_xfopen(hy_repo_get_string(hrepo, path), "r");
  repo_add_rpmmd(repo, fp_primary, 0, 0);

I don't see why librepo can't automagically download/return the
"primary@zchunk" entry instead of "primary".

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-11 Thread Jonathan Dieter
On Wed, 2018-07-11 at 08:28 +, Michael Schroeder wrote:
> On Tue, Jul 10, 2018 at 02:05:26PM +0100, Jonathan Dieter wrote:
> > The top-level tool only needs to deal with the uncompressed metadata. 
> > dnf/libdnf requests the primary metadata from librepo, which downloads
> > the zchunk version, passes it to libsolv which decompresses it and
> > creates the .solv file usable by the top-level tools.
> 
> Yes, so the selection of the flavor to download should be in dnf/libdnf.

I must be missing something because I don't understand how that
follows.  As I understand it, dnf requests the primary metadata. 
Librepo then downloads either primary.xml.gz or primary.xml.zck. 
Librepo then asks libsolv to decompress the xml file and convert it
into a solv file.  dnf then uses the solv file directly.  Why should
dnf care whether librepo downloaded primary.xml.gz or primary.xml.zck?

> > DNF neither knows, nor cares that librepo downloaded the zchunk metadata
> > rather than gz.
> 
> That's just because libsolv uses the file suffix to autodetect the
> compression.
> Actually dnf/libdnf should ask libsolv if it supports the compression
> (by calling solv_xfopen_iscompressed()) and not blindly assume that
> it will magically work.

Agreed, but I think it should be librepo asking libsolv if it supports
the compression.

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-11 Thread Michael Schroeder
On Tue, Jul 10, 2018 at 02:05:26PM +0100, Jonathan Dieter wrote:
> The top-level tool only needs to deal with the uncompressed metadata. 
> dnf/libdnf requests the primary metadata from librepo, which downloads
> the zchunk version, passes it to libsolv which decompresses it and
> creates the .solv file usable by the top-level tools.

Yes, so the selection of the flavor to download should be in dnf/libdnf.

> DNF neither knows, nor cares that librepo downloaded the zchunk metadata
> rather than gz.

That's just because libsolv uses the file suffix to autodetect the
compression.
Actually dnf/libdnf should ask libsolv if it supports the compression
(by calling solv_xfopen_iscompressed()) and not blindly assume that
it will magically work.
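
A standalone sketch of the suffix-based check described above. It does not call libsolv; `suffix_support`, its list of suffixes, and its return convention (1 supported, -1 recognized but not compiled in, 0 not compressed) are assumptions made for this example only.

```c
#include <string.h>

/* Hypothetical stand-in for a suffix check; this is NOT libsolv code.
 * Returns 1 if the suffix names a compression this build can read,
 * -1 if it names a known compression that was not compiled in,
 * and 0 if the file does not look compressed at all. */
int
suffix_support(const char *path, int have_zchunk)
{
    const char *dot = strrchr(path, '.');

    if (dot == NULL)
        return 0;
    if (strcmp(dot, ".gz") == 0)
        return 1;
    if (strcmp(dot, ".zck") == 0)
        return have_zchunk ? 1 : -1;
    return 0;
}
```

A caller that gets -1 for primary.xml.zck would know to request the gzip record instead of passing an unreadable path on.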

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-10 Thread Jonathan Dieter
On Tue, 2018-07-10 at 11:17 +, Michael Schroeder wrote:
> On Mon, Jul 09, 2018 at 09:32:13PM +0100, Jonathan Dieter wrote:
> > I had originally planned to do something along these lines (I think I
> > used primary-zck rather than primary@zchunk), but realized that this
> > pushed the "choose best format" code into the top-level tools, rather
> > than leaving the decision in librepo.
> 
> But doesn't it belong in the top-level tools? How can librepo decide that
> it's ok to use zchunk if the top-level tool can't deal with it?
> IMHO the top-level tool has to tell librepo what compression/format it
> understands.

The top-level tool only needs to deal with the uncompressed metadata. 
dnf/libdnf requests the primary metadata from librepo, which downloads
the zchunk version, passes it to libsolv which decompresses it and
creates the .solv file usable by the top-level tools.  DNF neither
knows, nor cares that librepo downloaded the zchunk metadata rather
than gz.

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-10 Thread Neal Gompa
On Tue, Jul 10, 2018 at 7:18 AM Michael Schroeder  wrote:
>
> On Mon, Jul 09, 2018 at 09:32:13PM +0100, Jonathan Dieter wrote:
> > I had originally planned to do something along these lines (I think I
> > used primary-zck rather than primary@zchunk), but realized that this
> > pushed the "choose best format" code into the top-level tools, rather
> > than leaving the decision in librepo.
>
> But doesn't it belong in the top-level tools? How can librepo decide that
> it's ok to use zchunk if the top-level tool can't deal with it?
> IMHO the top-level tool has to tell librepo what compression/format it
> understands.
>

Currently, it doesn't work that way. The assumption is that support is
always equal. This is a flawed assumption, but that's what it is.



-- 
真実はいつも一つ!/ Always, there's only one truth!


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-10 Thread Michael Schroeder
On Mon, Jul 09, 2018 at 09:32:13PM +0100, Jonathan Dieter wrote:
> I had originally planned to do something along these lines (I think I
> used primary-zck rather than primary@zchunk), but realized that this
> pushed the "choose best format" code into the top-level tools, rather
> than leaving the decision in librepo.

But doesn't it belong in the top-level tools? How can librepo decide that
it's ok to use zchunk if the top-level tool can't deal with it?
IMHO the top-level tool has to tell librepo what compression/format it
understands.

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-09 Thread Jonathan Dieter
On Mon, 2018-07-09 at 08:59 +, Michael Schroeder wrote:
> I thought about this a bit more over the weekend, and maybe we
> should do this in a bit more general way. Basically zchunk is
> just another compression format, like "xz" or "zstd". If we
> want to support yet another compression format, we probably wouldn't
> want to add new attributes to the existing elements, but instead
> add new elements. E.g.
> 
> 
>   
>   ...
> 
> 
>   
>   ...
> 
> 
> We might also want to add a "format" attribute in case we want
> to switch from "xml" to something that can be parsed faster,
> like "json".
> 
> The zchunk compression format would be the same, but with added
> "header-size" and "header-checksum" elements (so back to what
> you had earlier):
> 
> 
>   
>   ...
>   ...
>   ...
>   ...
>   ...
>   ...
>   ...
> 
> 
> The problem with all this is that we don't know how all the
> repomd.xml parsers behave when there are multiple <data> elements
> with the same type, so we might need to annotate the "type" with
> the compression/format, e.g. "primary@zchunk".

I had originally planned to do something along these lines (I think I
used primary-zck rather than primary@zchunk), but realized that this
pushed the "choose best format" code into the top-level tools, rather
than leaving the decision in librepo.

I suppose if librepo grew the ability to understand that primary@zchunk
matches primary, it could work, but that would take some work, I
think.

What would be worth the effort would be switching back to header-size
and header-checksum, and making sure that createrepo can create zchunk-
only metadata as well as the current plan of zchunk+gz metadata.  In
other words, we only use zck-loc and zck-timestamp if it's zchunk+gz.

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-09 Thread Michael Schroeder
On Sun, Jul 08, 2018 at 07:45:36PM +0100, Jonathan Dieter wrote:
> On Fri, 2018-07-06 at 11:48 +, Michael Schroeder wrote:
> > On Thu, Jul 05, 2018 at 08:07:58PM +0300, Jonathan Dieter wrote:
> > > My proposal is here:
> > > https://www.jdieter.net/downloads/zchunk/repomd.dtd
> > > 
> > > In summary, I'm just adding extra zchunk attributes to the main file
> > > element:
> > > zck-location
> > > header-checksum
> > > header-size
> > > zck-timestamp
> > > 
> > > librepo first downloads header-size of the file and then verifies that
> > > the header checksum matches and is valid.
> > 
> > Please use zck-header-checksum and zck-header-size instead.
> 
> Ok, will do.

I thought about this a bit more over the weekend, and maybe we
should do this in a bit more general way. Basically zchunk is
just another compression format, like "xz" or "zstd". If we
want to support yet another compression format, we probably wouldn't
want to add new attributes to the existing elements, but instead
add new elements. E.g.


  
  ...


  
  ...


We might also want to add a "format" attribute in case we want
to switch from "xml" to something that can be parsed faster,
like "json".

The zchunk compression format would be the same, but with added
"header-size" and "header-checksum" elements (so back to what
you had earlier):


  
  ...
  ...
  ...
  ...
  ...
  ...
  ...


The problem with all this is that we don't know how all the
repomd.xml parsers behave when there are multiple <data> elements
with the same type, so we might need to annotate the "type" with
the compression/format, e.g. "primary@zchunk".
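
To illustrate how a consumer might handle an annotated type such as "primary@zchunk", here is a small sketch. `split_md_type` is a hypothetical helper written for this example, not an existing librepo or libsolv function, and the "@" separator is simply the one floated above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parser for an annotated type such as "primary@zchunk".
 * Copies the base type into `base` and the annotation (empty if there
 * is no '@') into `fmt`.  Returns 0 on success, -1 if either buffer is
 * too small. */
int
split_md_type(const char *type, char *base, size_t base_sz,
              char *fmt, size_t fmt_sz)
{
    const char *at = strchr(type, '@');
    size_t base_len = at ? (size_t)(at - type) : strlen(type);
    const char *suffix = at ? at + 1 : "";

    if (base_len + 1 > base_sz || strlen(suffix) + 1 > fmt_sz)
        return -1;
    memcpy(base, type, base_len);
    base[base_len] = '\0';
    strcpy(fmt, suffix);            /* size was checked above */
    return 0;
}
```

A parser that does not know the annotation would still see "primary@zchunk" as a distinct type string, which is the point of encoding the format in the type.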

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-08 Thread Jonathan Dieter
On Sun, 2018-07-08 at 19:45 +0100, Jonathan Dieter wrote:
> On Fri, 2018-07-06 at 11:48 +, Michael Schroeder wrote:
> > Ah, no, I think you misunderstood. Do *not* add md5 support. In fact,
> > I'd ask you to remove sha1 support as well to make your code smaller.
> > 
> > My point is that you shouldn't use 20 bytes just for chunk identification
> > purposes. As you said, it doesn't need to be cryptographically sound, we
> > don't have to make sure it withstands an attacker.
> > Just use the first 8 bytes of the sha256 sum instead (or sha512, as
> > it's a bit faster than sha256 IIRC).
> 
> Ok, that makes sense.  I'll add it as a new hash type (SHA512_64?) and
> make it the default for the chunk checksum.

Ok, I've added two new hash types, ZCK_HASH_SHA512 and
ZCK_HASH_SHA512_64.  The latter is the new default for chunk hashes.

https://github.com/zchunk/zchunk/commit/abdfa43ea05b1b3d6dbd3b330572abeeb0d8444f

I think I'll leave the SHA1 support in, at least for a while, since
it's been the default chunk hash up until now, and I'd hate to break
any zchunk files that people might have created.

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-08 Thread Jonathan Dieter
On Sun, 2018-07-08 at 19:45 +0100, Jonathan Dieter wrote:
> On Fri, 2018-07-06 at 11:48 +, Michael Schroeder wrote:
> > On Thu, Jul 05, 2018 at 08:07:58PM +0300, Jonathan Dieter wrote:
> > > librepo first downloads header-size of the file and then verifies that
> > > the header checksum matches and is valid.
> > 
> > Please use zck-header-checksum and zck-header-size instead.
> 
> Ok, will do.

And this is done and in the pull requests.

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-08 Thread Jonathan Dieter
On Fri, 2018-07-06 at 11:48 +, Michael Schroeder wrote:
> On Thu, Jul 05, 2018 at 08:07:58PM +0300, Jonathan Dieter wrote:
> > My plan was to just keep the same dictionaries (a different one for
> > each metadata file) for at least a whole release, if not more.  My
> > dictionary generation script
> > (https://www.jdieter.net/downloads/zchunk-dicts/split.py)
> > removes checksums before running zstd -D, so the dictionary should
> > remain effective for a minimum of one release.
> 
> Yes, I guessed that you would say something like this. And it's
> also a reasonable thing to do.
> 
> It's just a shame that we can't generate the dictionary when creating
> the repository metadata, it would be such a nice feature to have.

Yep.

> > At the point where the dictionary changes, everybody just downloads the
> > full metadata again with the new dictionary and gets good deltas from
> > then on.
> > 
> > I'm planning to package up the optimal Fedora dictionaries, make them
> > Recommended: in createrepo_c, and only change them in Rawhide once,
> > somewhere around branching.
> > 
> > By using the same dictionaries, we are able to validate the checksums
> > before decompression, which keeps zchunk from decompressing unverified
> > data, a possible attack vector.
> 
> That depends. Maybe I'll implement dictionary transcoding for zchunk
> just in case the zstd algorithms don't change. Even if that's pretty
> unlikely.

Sure, that would be great!

> > >  2) What to put into repomd.xml? We'll need the old primary.xml.gz for
> > > compatibility reasons. It's a good security practice to minimize the
> > > attack vector, so we should put the zchunk header checksum into
> > > the repomd.xml so that it can be verified before running the zchunk
> > > code. So primary.xml.zck with extra attributes for the header?
> > My proposal is here:
> > https://www.jdieter.net/downloads/zchunk/repomd.dtd
> > 
> > In summary, I'm just adding extra zchunk attributes to the main file
> > element:
> > zck-location
> > header-checksum
> > header-size
> > zck-timestamp
> > 
> > librepo first downloads header-size of the file and then verifies that
> > the header checksum matches and is valid.
> 
> Please use zck-header-checksum and zck-header-size instead.

Ok, will do.

> > librepo then grabs any common chunks from already downloaded metadata,
> > downloads the remaining chunks, and verifies the body checksum that's
> > embedded in the header.
> 
> And libzypp will never use librepo, so I have to implement all this
> myself ;)
> 
> The good thing is that libzypp already supports range downloads from
> multiple mirrors in parallel, because we already support delta
> metadata downloads. So I just need a libzchunk API that "fills"
> the target file with the data from the old metadata and returns a
> list of ranges that need to be downloaded.

There's zck_copy_chunks(zckCtx *src, zckCtx *tgt) that copies any
needed chunks from src to tgt.  You can run it multiple times with
multiple sources, if you so choose.

There's zck_get_missing_range(zckCtx *zck, int max_ranges) which will
return a zckRange* of the needed chunks.

Finally, there's zck_get_range_char(zckRange *range) which returns a
char* of the range returned from zck_get_missing_range().

The download process itself is a bit complex, because we don't actually
do the download in libzck.  There's an example of how to download at:
https://github.com/zchunk/zchunk/blob/master/src/zck_dl.c
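
As a rough picture of what a "missing range" list amounts to, the sketch below turns a chunk table into an HTTP-style byte-range string. It does not use libzck: `struct chunk` and `missing_ranges` are invented for illustration, and the real library's structures and its max_ranges handling differ.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative chunk record; libzck's real structures differ. */
struct chunk {
    size_t start;   /* byte offset of the chunk in the target file */
    size_t length;  /* compressed length in bytes */
    int    have;    /* non-zero if already copied from local data */
};

/* Writes an HTTP-style byte-range list ("0-99,300-399") covering every
 * chunk that is still missing, merging chunks that sit next to each
 * other.  Returns the number of ranges, or -1 if `out` is too small. */
int
missing_ranges(const struct chunk *c, size_t n, char *out, size_t out_sz)
{
    size_t used = 0;
    int ranges = 0;

    if (out_sz == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; ) {
        if (c[i].have || c[i].length == 0) {
            i++;
            continue;
        }
        size_t first = c[i].start;
        size_t end = c[i].start + c[i].length;   /* one past last byte */
        i++;
        /* Extend the range while the next chunk is missing and adjacent. */
        while (i < n && !c[i].have && c[i].start == end) {
            end += c[i].length;
            i++;
        }
        int w = snprintf(out + used, out_sz - used, "%s%zu-%zu",
                         ranges ? "," : "", first, end - 1);
        if (w < 0 || (size_t)w >= out_sz - used)
            return -1;
        used += (size_t)w;
        ranges++;
    }
    return ranges;
}
```

Merging adjacent chunks keeps the number of ranges per request low, which matters because servers commonly limit how many ranges they accept.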


> > >  4) Nitpick: Why does zchunk use sha1 checksums for the chunks? Either
> > > it's something that needs to be cryptographically sound, then sha1 is the
> > > wrong choice. Or it's just meant for identifying chunks, then
> > > md5 is probably faster/smaller. Or some other checksum. But you
> > > really don't need 20 bytes like with sha1.
> > 
> > It doesn't need to be cryptographically sound because we have a body
> > checksum that is sha256.  I'll look at adding MD5 support and
> > defaulting to it for the chunk checksum type.
> 
> Ah, no, I think you misunderstood. Do *not* add md5 support. In fact,
> I'd ask you to remove sha1 support as well to make your code smaller.
> 
> My point is that you shouldn't use 20 bytes just for chunk identification
> purposes. As you said, it doesn't need to be cryptographically sound, we
> don't have to make sure it withstands an attacker.
> Just use the first 8 bytes of the sha256 sum instead (or sha512, as
> it's a bit faster than sha256 IIRC).

Ok, that makes sense.  I'll add it as a new hash type (SHA512_64?) and
make it the default for the chunk checksum.

Jonathan


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-06 Thread Michael Schroeder
On Thu, Jul 05, 2018 at 08:07:58PM +0300, Jonathan Dieter wrote:
> My plan was to just keep the same dictionaries (a different one for
> each metadata file) for at least a whole release, if not more.  My
> dictionary generation script
> (https://www.jdieter.net/downloads/zchunk-dicts/split.py)
> removes checksums before running zstd -D, so the dictionary should
> remain effective for a minimum of one release.

Yes, I guessed that you would say something like this. And it's
also a reasonable thing to do.

It's just a shame that we can't generate the dictionary when creating
the repository metadata, it would be such a nice feature to have.

> At the point where the dictionary changes, everybody just downloads the
> full metadata again with the new dictionary and gets good deltas from
> then on.
> 
> I'm planning to package up the optimal Fedora dictionaries, make them
> Recommended: in createrepo_c, and only change them in Rawhide once,
> somewhere around branching.
> 
> By using the same dictionaries, we are able to validate the checksums
> before decompression, which keeps zchunk from decompressing unverified
> data, a possible attack vector.

That depends. Maybe I'll implement dictionary transcoding for zchunk
just in case the zstd algorithms don't change. Even if that's pretty
unlikely.

> >  2) What to put into repomd.xml? We'll need the old primary.xml.gz for
> > compatibility reasons. It's a good security practice to minimize the
> > attack vector, so we should put the zchunk header checksum into
> > the repomd.xml so that it can be verified before running the zchunk
> > code. So primary.xml.zck with extra attributes for the header? Or an
> > extra element that describes the zchunk header?
> 
> My proposal is here:
> https://www.jdieter.net/downloads/zchunk/repomd.dtd
> 
> In summary, I'm just adding extra zchunk attributes to the main file
> element:
> zck-location
> header-checksum
> header-size
> zck-timestamp
> 
> librepo first downloads header-size of the file and then verifies that
> the header checksum matches and is valid.

Please use zck-header-checksum and zck-header-size instead.

> librepo then grabs any common chunks from already downloaded metadata,
> downloads the remaining chunks, and verifies the body checksum that's
> embedded in the header.

And libzypp will never use librepo, so I have to implement all this
myself ;)

The good thing is that libzypp already supports range downloads from
multiple mirrors in parallel, because we already support delta
metadata downloads. So I just need a libzchunk API that "fills"
the target file with the data from the old metadata and returns a
list of ranges that need to be downloaded.
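
The "fill from old metadata" step boils down to matching chunk identifiers between the old and new index. A minimal sketch, assuming 8-byte identifiers and an invented `struct entry`; none of this is libzck's actual layout or API:

```c
#include <stddef.h>
#include <string.h>

#define ID_LEN 8  /* assumed identifier length for this sketch */

/* Illustrative index entry; not libzck's real layout. */
struct entry {
    unsigned char id[ID_LEN];
    int have;
};

/* Marks every entry of `new_idx` whose id also appears in `old_idx`
 * and returns how many entries still have to be downloaded. */
size_t
mark_reusable(struct entry *new_idx, size_t n_new,
              const struct entry *old_idx, size_t n_old)
{
    size_t missing = 0;

    for (size_t i = 0; i < n_new; i++) {
        new_idx[i].have = 0;
        for (size_t j = 0; j < n_old; j++) {
            if (memcmp(new_idx[i].id, old_idx[j].id, ID_LEN) == 0) {
                new_idx[i].have = 1;
                break;
            }
        }
        if (!new_idx[i].have)
            missing++;
    }
    return missing;
}
```

The nested loop is quadratic and only meant to show the idea; a real implementation would more likely look identifiers up in a hash table.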

> >  3) I don't think signature support in zchunk is useful ;)
> 
> Fair enough.  ;)  It doesn't actually work yet, and I suspect that
> you're right in the librepo context, but I think it could be useful in
> other contexts.
> 
> >  4) Nitpick: Why does zchunk use sha1 checksums for the chunks? Either
> > > it's something that needs to be cryptographically sound, then sha1 is the
> > wrong choice. Or it's just meant for identifying chunks, then
> > md5 is probably faster/smaller. Or some other checksum. But you
> > really don't need 20 bytes like with sha1.
> 
> It doesn't need to be cryptographically sound because we have a body
> checksum that is sha256.  I'll look at adding MD5 support and
> defaulting to it for the chunk checksum type.

Ah, no, I think you misunderstood. Do *not* add md5 support. In fact,
I'd ask you to remove sha1 support as well to make your code smaller.

My point is that you shouldn't use 20 bytes just for chunk identification
purposes. As you said, it doesn't need to be cryptographically sound, we
don't have to make sure it withstands an attacker.
Just use the first 8 bytes of the sha256 sum instead (or sha512, as
it's a bit faster than sha256 IIRC).
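
As a minimal sketch of the truncation idea (the function name and the
constant are made up for illustration, this is not zchunk code):

```python
import hashlib

# Assumption: 8 bytes, as suggested above. Enough to identify chunks,
# not meant to resist a deliberate attacker.
CHUNK_ID_BYTES = 8


def chunk_id(data: bytes) -> bytes:
    """Return the first CHUNK_ID_BYTES bytes of the chunk's sha256 digest."""
    return hashlib.sha256(data).digest()[:CHUNK_ID_BYTES]
```

Integrity would still rest on the full sha256 body checksum; the short
id only decides whether two chunks are treated as the same.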

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] Some points about zchunk

2018-07-05 Thread Jonathan Dieter
Michael, thank you so much for your detailed review!  I really
appreciate the time you took to look at this in such detail!

I'm currently waiting to board a flight, so I'll make this brief and
I'll probably be unavailable until Monday.

Comments inline

On Thu, 2018-07-05 at 14:18 +, Michael Schroeder wrote:
> Hi,
> 
> here are some of my thoughts about Jonathan's zchunk compression:

> 
> Thoughts:
> -
> The basic algorithms and implementation are sound and work nicely.
> Kudos to Jonathan for doing such an amazing job.

Coming from you, this means a lot to me!  Thank you so much!

> Here are some points I have: (Please correct me if I'm wrong anywhere)
> 
>  1) The current implementation can't reuse chunks when the dictionary
> changes. That's a rather big limitation. A dictionary is a must
> if we want to go with small chunks.
> 
> We can also go with no dictionary and large chunks, this is somewhat
> the zchunk default. For the example above the buzhash algorithm would
> split the file into 193 chunks instead of the "package level" 1844
> chunks. Large chunks mean good compression, but the amount of data
> that can get reused will probably be much less. In that case we (SUSE)
> might as well stay with zsync and gzip -9 --rsyncable ;)
> 
> From an algorithmic point of view, having different dictionaries is not a
> problem: you'd just need to store the checksum over the uncompressed
> chunks instead. But there's a big drawback: you can't reconstruct the
> identical file. That's because you need to re-compress the chunks
> you reuse with the new dictionary, and this may lead to different
> data if the zstd algorithm is different than the one used when
> creating the repository.
> 
> We have the same problem with deltarpms, the recompression is the weak
> step. Repository creation is usually done on a system that runs a different
> distribution version than the target, which makes this even more likely.
> 
> So we can reconstruct a zchunk file that gets the same data when
> uncompressed, but it might not be the identical zchunk file. But this
> may not be a problem at all, we just need to be sure that the
> verification step works.

My plan was to just keep the same dictionaries (a different one for
each metadata file) for at least a whole release, if not more.  My
dictionary generation script
(https://www.jdieter.net/downloads/zchunk-dicts/split.py)
removes checksums before running zstd -D, so the dictionary should
remain effective for a minimum of one release.

At the point where the dictionary changes, everybody just downloads the
full metadata again with the new dictionary and gets good deltas from
then on.

I'm planning to package up the optimal Fedora dictionaries, make them
Recommended: in createrepo_c, and only change them in Rawhide once,
somewhere around branching.

By using the same dictionaries, we are able to validate the checksums
before decompression, which keeps zchunk from decompressing unverified
data, a possible attack vector.

>  2) What to put into repomd.xml? We'll need the old primary.xml.gz for
> compatibility reasons. It's a good security practice to minimize the
> attack vector, so we should put the zchunk header checksum into
> the repomd.xml so that it can be verified before running the zchunk
> code. So primary.xml.zck with extra attributes for the header? Or an
> extra element that describes the zchunk header?

My proposal is here:
https://www.jdieter.net/downloads/zchunk/repomd.dtd

In summary, I'm just adding extra zchunk attributes to the main file
element:
zck-location
header-checksum
header-size
zck-timestamp

librepo first downloads header-size of the file and then verifies that
the header checksum matches and is valid.

librepo then grabs any common chunks from already downloaded metadata,
downloads the remaining chunks, and verifies the body checksum that's
embedded in the header.
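
Roughly, the first step could look like the sketch below. This is not
librepo code: verify_header is a made-up name, and sha256 as the header
checksum type is an assumption for the example. The two values would
come from the header-size and header-checksum attributes in repomd.xml.

```python
import hashlib
import hmac


def verify_header(prefix: bytes, header_size: int,
                  header_checksum_hex: str) -> bytes:
    """Check the first header_size bytes of a download against the
    checksum from repomd.xml before any zchunk parsing happens."""
    if len(prefix) < header_size:
        raise ValueError("short read: header not fully downloaded")
    header = prefix[:header_size]
    # Assumption: the checksum is a hex-encoded sha256 digest.
    digest = hashlib.sha256(header).hexdigest()
    if not hmac.compare_digest(digest, header_checksum_hex.lower()):
        raise ValueError("header checksum mismatch")
    return header
```

Only after this passes would the header be handed to the zchunk parser
to read the chunk index and the embedded body checksum.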

>  3) I don't think signature support in zchunk is useful ;)

Fair enough.  ;)  It doesn't actually work yet, and I suspect that
you're right in the librepo context, but I think it could be useful in
other contexts.

>  4) Nitpick: Why does zchunk use sha1 checksums for the chunks? Either
> it's something that needs to be cryptographically sound, then sha1 is the
> wrong choice. Or it's just meant for identifying chunks, then
> md5 is probably faster/smaller. Or some other checksum. But you
> really don't need 20 bytes like with sha1.

It doesn't need to be cryptographically sound because we have a body
checksum that is sha256.  I'll look at adding MD5 support and
defaulting to it for the chunk checksum type.

> Ok, that's enough for now.

Thanks again for looking at this!

Jonathan