Re: No progress from push when using bitmaps
On Fri, Mar 14, 2014 at 10:29 PM, Jeff King wrote:
>> If an object is reused, we already know its compressed size. If it's
>> not reused and is a loose object, we could use on-disk size. It's a
>> lot harder to estimate a not-reused, deltified object. All we have is
>> the uncompressed size, and the size of each delta in the delta chain.
>> Neither gives a good hint of what the compressed size would be.
>
> Hmm. I think we do have the compressed delta size after having run the
> compression phase (because that is ultimately what we compare to find
> the best delta).

There are cases when we try not to find deltas (large blobs, files too
small, or the -delta attribute). The large blob case is especially
interesting because the progress bar crawls slowly while these objects
are written.

> Loose objects are probably the hardest here, as we
> actually recompress them (IIRC, because packfiles encode the type/size
> info outside of the compressed bit, whereas it is inside for loose
> objects; the "experimental loose" format harmonized this, but it never
> caught on).
>
> Without doing that recompression, any value you came up with would be an
> estimate, though it would be pretty close (not off by more than a few
> bytes per object).

That's my hope. Although if the compression level is tweaked, the
estimate could be off (gzip -9 and gzip -1 produce a big difference in
size).

> However, you can't just run through the packing list
> and add up the object sizes; you'd need to do a real "dry-run" through
> the writing phase. There are probably more I'm missing, but you need at
> least to figure out:
>
>   1. The actual compressed size of a full loose object, as described
>      above.
>
>   2. The variable-length headers for each object based on its type and
>      size.

We could run through a "typical" repo, calculate the average header
length, then use it for all objects?

>   3. The final form that the object will take based on what has come
>      before. For example, if there is a max pack size, we may split an
>      object from its delta base, in which case we have to throw away the
>      delta. We don't know where those breaks will be until we walk
>      through the whole list.

Ah, this could probably be avoided. max pack size does not apply to
streaming pack-objects, which is where the progress bar is most often
shown. Falling back to the object count in this case does not sound too
bad.

>   4. If an object we attempt to reuse turns out to be corrupted, we
>      fall back to the non-reuse code path, which will have a different
>      size. So you'd need to actually check the reused object CRCs during
>      the dry-run (and for local repacks, not transfers, we actually
>      inflate and check the zlib, too, for safety).

Ugh..

> So I think it's _possible_. But it's definitely not trivial. For now, I
> think it makes sense to go with something like the patch I posted
> earlier (which I'll re-roll in a few minutes). That fixes what is IMHO a
> regression in the bitmaps case. And it does not make it any harder for
> somebody to later convert us to a true byte-counter (i.e., it is the
> easy half already).

Agreed.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
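[Editorial note: the "average header length" idea above can be made concrete. A pack entry header stores the object's type and size in a variable-length encoding: the first byte carries the type plus the low 4 bits of the size, and each continuation byte carries 7 more size bits. A minimal sketch of the resulting header length, as a hypothetical standalone helper rather than git's actual code:]

```c
#include <stdint.h>

/*
 * Hypothetical helper (not git's actual code): number of bytes the
 * variable-length pack entry header occupies for an object of the
 * given uncompressed size. The first byte holds the type and 4 size
 * bits; each continuation byte holds 7 more size bits.
 */
static int pack_header_len(uint64_t size)
{
	int len = 1;
	size >>= 4;		/* first byte consumed 4 bits of size */
	while (size) {
		size >>= 7;	/* each extra byte consumes 7 bits */
		len++;
	}
	return len;
}
```

[So objects up to 15 bytes need one header byte, up to 2047 bytes two, and so on; averaging this over a typical repo's object-size distribution would yield the constant proposed above.]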
Re: No progress from push when using bitmaps
On Fri, Mar 14, 2014 at 05:21:59PM +0700, Duy Nguyen wrote:

> On Fri, Mar 14, 2014 at 4:43 PM, Michael Haggerty wrote:
> > Would it be practical to change it to a percentage of bytes written?
> > Then we'd have progress info that is both convenient *and* truthful.
>
> I agreed for a second, then remembered that we don't know the final
> pack size until we finish writing it. Not sure if we could estimate
> (cheaply) with good accuracy though.

Right. I'm not sure what Michael meant by "it". We can send a percentage
of bytes written for the reused pack (my option 2), but we do not know
the total bytes for the rest of the objects. So we'd end up with two
progress meters (one for the reused pack, and one for everything else),
both counting up to different endpoints. And it would require quite a
few changes to the progress code.

> If an object is reused, we already know its compressed size. If it's
> not reused and is a loose object, we could use on-disk size. It's a
> lot harder to estimate a not-reused, deltified object. All we have is
> the uncompressed size, and the size of each delta in the delta chain.
> Neither gives a good hint of what the compressed size would be.

Hmm. I think we do have the compressed delta size after having run the
compression phase (because that is ultimately what we compare to find
the best delta). Loose objects are probably the hardest here, as we
actually recompress them (IIRC, because packfiles encode the type/size
info outside of the compressed bit, whereas it is inside for loose
objects; the "experimental loose" format harmonized this, but it never
caught on).

Without doing that recompression, any value you came up with would be an
estimate, though it would be pretty close (not off by more than a few
bytes per object). However, you can't just run through the packing list
and add up the object sizes; you'd need to do a real "dry-run" through
the writing phase. There are probably more I'm missing, but you need at
least to figure out:

  1. The actual compressed size of a full loose object, as described
     above.

  2. The variable-length headers for each object based on its type and
     size.

  3. The final form that the object will take based on what has come
     before. For example, if there is a max pack size, we may split an
     object from its delta base, in which case we have to throw away the
     delta. We don't know where those breaks will be until we walk
     through the whole list.

  4. If an object we attempt to reuse turns out to be corrupted, we
     fall back to the non-reuse code path, which will have a different
     size. So you'd need to actually check the reused object CRCs during
     the dry-run (and for local repacks, not transfers, we actually
     inflate and check the zlib, too, for safety).

So I think it's _possible_. But it's definitely not trivial. For now, I
think it makes sense to go with something like the patch I posted
earlier (which I'll re-roll in a few minutes). That fixes what is IMHO a
regression in the bitmaps case. And it does not make it any harder for
somebody to later convert us to a true byte-counter (i.e., it is the
easy half already).

-Peff
Re: No progress from push when using bitmaps
On Fri, Mar 14, 2014 at 4:43 PM, Michael Haggerty wrote:
> Would it be practical to change it to a percentage of bytes written?
> Then we'd have progress info that is both convenient *and* truthful.

I agreed for a second, then remembered that we don't know the final
pack size until we finish writing it. Not sure if we could estimate
(cheaply) with good accuracy though.

If an object is reused, we already know its compressed size. If it's
not reused and is a loose object, we could use on-disk size. It's a
lot harder to estimate a not-reused, deltified object. All we have is
the uncompressed size, and the size of each delta in the delta chain.
Neither gives a good hint of what the compressed size would be.
-- 
Duy
Re: No progress from push when using bitmaps
On 03/13/2014 11:07 PM, Jeff King wrote:
> On Thu, Mar 13, 2014 at 03:01:09PM -0700, Shawn Pearce wrote:
>
>>> It would definitely be good to have throughput measurements while
>>> writing out the pack. However, I'm not sure we have anything useful to
>>> count. We know the total number of objects we're reusing, but we're not
>>> actually parsing the data; we're just blitting it out as a stream. I
>>> think the progress code will need some refactoring to handle a
>>> throughput-only case.
>>
>> Yes. I think JGit suffers from this same bug, and again we never
>> noticed it because usually only the servers are bitmapped, not the
>> clients.
>>
>> pack-objects writes a throughput meter when it's writing objects.
>> Really just the bytes out/second would be enough to let the user know
>> the client is working. Unfortunately I think that is still tied to the
>> overall progress system having some other counter?
>
> Yes, I'm looking at it right now. The throughput meter is actually
> connected to the sha1fd output. So really we just need to call
> display_progress periodically as we loop through the data. It's a
> one-liner fix.
>
> _But_ it still looks ugly, because, as you mention, it's tied to the
> progress meter, which is counting up to N objects. So we basically sit
> there at "0", pumping data, and then after the pack is done, we can say
> we sent N. :)
>
> There are a few ways around this:
>
>   1. Add a new phase "Writing packs" which counts from 0 to 1. Even
>      though it's more accurate, moving from 0 to 1 really isn't that
>      useful (the throughput is, but the 0/1 just looks like noise).
>
>   2. Add a new phase "Writing reused objects" that counts from 0 bytes
>      up to N bytes. This looks stupid, though, because we are repeating
>      the current byte count both here and in the throughput.
>
>   3. Use the regular "Writing objects" progress, but fake the object
>      count. We know we are writing M bytes with N objects. Bump the
>      counter by 1 for every M/N bytes we write.

Would it be practical to change it to a percentage of bytes written?
Then we'd have progress info that is both convenient *and* truthful.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
Re: No progress from push when using bitmaps
On Thu, Mar 13, 2014 at 06:07:54PM -0400, Jeff King wrote:

>   3. Use the regular "Writing objects" progress, but fake the object
>      count. We know we are writing M bytes with N objects. Bump the
>      counter by 1 for every M/N bytes we write.

Here is that strategy. I think it looks pretty nice, and it seamlessly
handles the case where you have extra objects to send on top of the
reused pack (we just keep the same progress meter counting up).

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 831dd05..f187859 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -709,7 +709,7 @@ static struct object_entry **compute_write_order(void)
 static off_t write_reused_pack(struct sha1file *f)
 {
 	unsigned char buffer[8192];
-	off_t to_write;
+	off_t to_write, total;
 	int fd;
 
 	if (!is_pack_valid(reuse_packfile))
@@ -726,7 +726,7 @@ static off_t write_reused_pack(struct sha1file *f)
 	if (reuse_packfile_offset < 0)
 		reuse_packfile_offset = reuse_packfile->pack_size - 20;
 
-	to_write = reuse_packfile_offset - sizeof(struct pack_header);
+	total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
 
 	while (to_write) {
 		int read_pack = xread(fd, buffer, sizeof(buffer));
@@ -739,10 +739,23 @@ static off_t write_reused_pack(struct sha1file *f)
 
 		sha1write(f, buffer, read_pack);
 		to_write -= read_pack;
+
+		/*
+		 * We don't know the actual number of objects written,
+		 * only how many bytes written, how many bytes total, and
+		 * how many objects total. So we can fake it by pretending all
+		 * objects we are writing are the same size. This gives us a
+		 * smooth progress meter, and at the end it matches the true
+		 * answer.
+		 */
+		written = reuse_packfile_objects *
+			(((double)(total - to_write)) / total);
+		display_progress(progress_state, written);
 	}
 
 	close(fd);
-	written += reuse_packfile_objects;
+	written = reuse_packfile_objects;
+	display_progress(progress_state, written);
 
 	return reuse_packfile_offset - sizeof(struct pack_header);
 }
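[Editorial note: the interpolation in the patch above is easy to check in isolation. A standalone sketch of the same arithmetic, with names mirroring the patch but detached from the git code around it:]

```c
#include <stdint.h>
#include <sys/types.h>

/*
 * Same arithmetic as the patch: pretend every object is the same size
 * and scale the object counter by the fraction of bytes written so far.
 * to_write counts down from total to 0 as bytes go out.
 */
static uint32_t fake_object_count(uint32_t nr_objects, off_t total,
				  off_t to_write)
{
	return (uint32_t)(nr_objects *
			  (((double)(total - to_write)) / total));
}
```

[The counter starts at 0, rises monotonically as to_write shrinks, and lands exactly on nr_objects when to_write reaches 0, so the final display matches the true object count.]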
Re: No progress from push when using bitmaps
Jeff King writes:

> There are a few ways around this:
>
>   1. Add a new phase "Writing packs" which counts from 0 to 1. Even
>      though it's more accurate, moving from 0 to 1 really isn't that
>      useful (the throughput is, but the 0/1 just looks like noise).
>
>   2. Add a new phase "Writing reused objects" that counts from 0 bytes
>      up to N bytes. This looks stupid, though, because we are repeating
>      the current byte count both here and in the throughput.
>
>   3. Use the regular "Writing objects" progress, but fake the object
>      count. We know we are writing M bytes with N objects. Bump the
>      counter by 1 for every M/N bytes we write.
>
> The first two require some non-trivial surgery to the progress code. I
> am leaning towards the third. Not just because it's easy, but because I
> think it actually shows the most intuitive display. Yes, it's fudging
> the object numbers, but those are largely meaningless anyway (in fact,
> it makes them _better_ because now they're even, instead of getting 95%
> done and then hitting some blob that is as big as the rest of the repo
> combined).

I think the above argument, especially the "fudging but largely
meaningless anyway" part, makes perfect sense. Thanks for looking into
this.
Re: No progress from push when using bitmaps
On Thu, Mar 13, 2014 at 03:01:09PM -0700, Shawn Pearce wrote:

> > It would definitely be good to have throughput measurements while
> > writing out the pack. However, I'm not sure we have anything useful to
> > count. We know the total number of objects we're reusing, but we're not
> > actually parsing the data; we're just blitting it out as a stream. I
> > think the progress code will need some refactoring to handle a
> > throughput-only case.
>
> Yes. I think JGit suffers from this same bug, and again we never
> noticed it because usually only the servers are bitmapped, not the
> clients.
>
> pack-objects writes a throughput meter when it's writing objects.
> Really just the bytes out/second would be enough to let the user know
> the client is working. Unfortunately I think that is still tied to the
> overall progress system having some other counter?

Yes, I'm looking at it right now. The throughput meter is actually
connected to the sha1fd output. So really we just need to call
display_progress periodically as we loop through the data. It's a
one-liner fix.

_But_ it still looks ugly, because, as you mention, it's tied to the
progress meter, which is counting up to N objects. So we basically sit
there at "0", pumping data, and then after the pack is done, we can say
we sent N. :)

There are a few ways around this:

  1. Add a new phase "Writing packs" which counts from 0 to 1. Even
     though it's more accurate, moving from 0 to 1 really isn't that
     useful (the throughput is, but the 0/1 just looks like noise).

  2. Add a new phase "Writing reused objects" that counts from 0 bytes
     up to N bytes. This looks stupid, though, because we are repeating
     the current byte count both here and in the throughput.

  3. Use the regular "Writing objects" progress, but fake the object
     count. We know we are writing M bytes with N objects. Bump the
     counter by 1 for every M/N bytes we write.

The first two require some non-trivial surgery to the progress code. I
am leaning towards the third. Not just because it's easy, but because I
think it actually shows the most intuitive display. Yes, it's fudging
the object numbers, but those are largely meaningless anyway (in fact,
it makes them _better_ because now they're even, instead of getting 95%
done and then hitting some blob that is as big as the rest of the repo
combined).

-Peff
Re: No progress from push when using bitmaps
On Thu, Mar 13, 2014 at 2:26 PM, Jeff King wrote:
> On Wed, Mar 12, 2014 at 05:21:21PM -0700, Shawn Pearce wrote:
>
>> Today I tried pushing a copy of linux.git from a client that had
>> bitmaps into a JGit server. The client stalled for a long time with no
>> progress, because it reused the existing pack. No progress appeared
>> while it was sending the existing file on the wire:
>>
>>   $ git push git://localhost/linux.git master
>>   Reusing existing pack: 2938117, done.
>>   Total 2938117 (delta 0), reused 0 (delta 0)
>>   remote: Resolving deltas:  66% (1637269/2455727)
>>
>> This is not the best user experience. :-(
>
> Yeah, I agree that sucks. I hadn't noticed it, as I don't typically have
> my client repos bitmapped (and on fetch, the interesting progress is
> coming from the local index-pack).
>
> It would definitely be good to have throughput measurements while
> writing out the pack. However, I'm not sure we have anything useful to
> count. We know the total number of objects we're reusing, but we're not
> actually parsing the data; we're just blitting it out as a stream. I
> think the progress code will need some refactoring to handle a
> throughput-only case.

Yes. I think JGit suffers from this same bug, and again we never
noticed it because usually only the servers are bitmapped, not the
clients.

pack-objects writes a throughput meter when it's writing objects.
Really just the bytes out/second would be enough to let the user know
the client is working. Unfortunately I think that is still tied to the
overall progress system having some other counter?
Re: No progress from push when using bitmaps
On Wed, Mar 12, 2014 at 05:21:21PM -0700, Shawn Pearce wrote:

> Today I tried pushing a copy of linux.git from a client that had
> bitmaps into a JGit server. The client stalled for a long time with no
> progress, because it reused the existing pack. No progress appeared
> while it was sending the existing file on the wire:
>
>   $ git push git://localhost/linux.git master
>   Reusing existing pack: 2938117, done.
>   Total 2938117 (delta 0), reused 0 (delta 0)
>   remote: Resolving deltas:  66% (1637269/2455727)
>
> This is not the best user experience. :-(

Yeah, I agree that sucks. I hadn't noticed it, as I don't typically have
my client repos bitmapped (and on fetch, the interesting progress is
coming from the local index-pack).

It would definitely be good to have throughput measurements while
writing out the pack. However, I'm not sure we have anything useful to
count. We know the total number of objects we're reusing, but we're not
actually parsing the data; we're just blitting it out as a stream. I
think the progress code will need some refactoring to handle a
throughput-only case.

-Peff
No progress from push when using bitmaps
Today I tried pushing a copy of linux.git from a client that had
bitmaps into a JGit server. The client stalled for a long time with no
progress, because it reused the existing pack. No progress appeared
while it was sending the existing file on the wire:

  $ git push git://localhost/linux.git master
  Reusing existing pack: 2938117, done.
  Total 2938117 (delta 0), reused 0 (delta 0)
  remote: Resolving deltas:  66% (1637269/2455727)

This is not the best user experience. :-(