Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
On Mon, 17 Jul 2017 16:09:17 -0400 Ben Peart wrote:
> > Is this change meant to ensure that Git code that operates on loose
> > objects directly (bypassing storage-agnostic functions such as
> > sha1_object_info_extended() and has_sha1_file()) still work? If yes,
> > this patch appears incomplete (for example, read_loose_object() needs
> > to be changed too), and this seems like a difficult task - in my patch
> > set [1], I ended up deciding to create a separate type of storage and
> > instead looked at the code that operates on *packed* objects directly
> > (because there were fewer such methods) to ensure that they would work
> > correctly in the presence of a separate type of storage.
>
> Yes, with this set of patches, we've been running successfully on
> completely sparse clones (no commits, trees, or blobs) for several
> months. read_loose_object() is only called by fsck when it is
> enumerating existing loose objects so does not need to be updated.

Ah, that's good to know. I think such an analysis (of the other
loose-related functions) in the commit message would be useful, like I
did for the packed-related functions [1].

[1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/

> We have a few thousand developers making ~100K commits per week so in
> our particular usage, I'm fairly confident it works correctly. That
> said, it is possible there is some code path I've missed. :)

I think that code paths like the history-searching ones ("git log -S",
for example) should still work offline if possible - one of my ideas is
to have these commands take a size threshold parameter so that we do not
need to fetch large blobs during the invocation. (Hence my preference
for size information to be already available to the repo.)

Aside from that, does fsck of a partial repo work? Ability to fsck seems
quite important (or, at least, useful).
I tried updating fsck to support omitting commits and trees (in addition
to blobs), and it seems relatively involved (although I haven't looked
into it deeply yet).

(That investigation also made me realize that, in my patch set, I didn't
handle the case where a tag references a blob - fsck doesn't work when
the blob is missing, even if it is promised. That's something for me to
look into.)
Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
On 7/17/2017 2:06 PM, Jonathan Tan wrote:
> About the difference between this patch and my patch set [1], besides
> the fact that this patch does not spawn separate processes for each
> missing object, which does seem like an improvement to me, this patch
> (i) does not use a list of promised objects (but instead communicates
> with the hook for each missing object), and (ii) provides backwards
> compatibility with other Git code (that does not know about promised
> objects) in a different way. The costs and benefits of (i) are being
> discussed here [2]. As for (ii), I still think that my approach is
> better - I have commented more about this below. Maybe the best
> approach is a combination of both our approaches.

Yes, in the context of the promised objects model patch series, the
value of this patch series is primarily as a sample of how to use the
sub-process mechanism to create a versioned interface for retrieving
objects.

> [1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/
> [2] https://public-inbox.org/git/20170713123951.5cab1...@twelve2.svl.corp.google.com/
>
> On Fri, 14 Jul 2017 09:26:51 -0400 Ben Peart wrote:
> > +
> > +packet: git> command=get
> > +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> > +packet: git>
> > +
>
> It would be useful to have this command support more than one SHA-1, so
> that hooks that know how to batch can do so.

I agree. Since nothing was using that capability yet, I decided to keep
it simple and not add support for a feature that wasn't being used. The
reason the interface is versioned is so that if/when something does need
that capability, it can be added.

> > +static int subprocess_map_initialized;
> > +static struct hashmap subprocess_map;
>
> The documentation of "tablesize" in "struct hashmap" states that it can
> be used to check if the hashmap is initialized, so
> subprocess_map_initialized is probably unnecessary.

Nice. That will make things a little simpler.
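The point about "tablesize" is that a zero-valued field the structure
already carries can double as the "not yet initialized" marker (in C,
static storage is zero-initialized). A minimal sketch of that pattern in
Python - the class below is a stand-in for illustration, not git's
actual hashmap:

```python
class Hashmap:
    """Stand-in for git's "struct hashmap"; tablesize == 0 means uninitialized."""

    def __init__(self):
        # Mirrors a zero-initialized C static: no table allocated yet.
        self.tablesize = 0
        self.table = None

    def init(self, size=64):
        self.tablesize = size
        self.table = [None] * size


subprocess_map = Hashmap()


def get_subprocess_map():
    # No separate subprocess_map_initialized flag is needed:
    # the tablesize field itself records whether init() has run.
    if not subprocess_map.tablesize:
        subprocess_map.init()
    return subprocess_map
```

The second state variable disappears because "tablesize != 0" and
"initialized" are always true or false together.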
> > static int check_and_freshen(const unsigned char *sha1, int freshen)
> > {
> > -	return check_and_freshen_local(sha1, freshen) ||
> > -	       check_and_freshen_nonlocal(sha1, freshen);
> > +	int ret;
> > +	int already_retried = 0;
> > +
> > +retry:
> > +	ret = check_and_freshen_local(sha1, freshen) ||
> > +	      check_and_freshen_nonlocal(sha1, freshen);
> > +	if (!ret && core_virtualize_objects && !already_retried) {
> > +		already_retried = 1;
> > +		if (!read_object_process(sha1))
> > +			goto retry;
> > +	}
> > +
> > +	return ret;
> > }
>
> Is this change meant to ensure that Git code that operates on loose
> objects directly (bypassing storage-agnostic functions such as
> sha1_object_info_extended() and has_sha1_file()) still work? If yes,
> this patch appears incomplete (for example, read_loose_object() needs
> to be changed too), and this seems like a difficult task - in my patch
> set [1], I ended up deciding to create a separate type of storage and
> instead looked at the code that operates on *packed* objects directly
> (because there were fewer such methods) to ensure that they would work
> correctly in the presence of a separate type of storage.

Yes, with this set of patches, we've been running successfully on
completely sparse clones (no commits, trees, or blobs) for several
months. read_loose_object() is only called by fsck when it is
enumerating existing loose objects so does not need to be updated.

We have a few thousand developers making ~100K commits per week so in
our particular usage, I'm fairly confident it works correctly. That
said, it is possible there is some code path I've missed. :)

[1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/
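The shape of the check_and_freshen() change - check locally, fault in
via the hook at most once, then retry - can be sketched in a few lines
of Python (the names and the callback are invented for illustration;
this is not git code):

```python
def make_checker(local_store, read_object_process):
    """Mimics the patched check_and_freshen(): on a miss, run the
    read-object hook once and re-check the object store."""
    def check_and_freshen(sha1):
        already_retried = False
        while True:
            if sha1 in local_store:   # check_and_freshen_local/_nonlocal
                return True
            if already_retried:       # the hook already had its one chance
                return False
            already_retried = True
            # The hook returns 0 on success, like read_object_process()
            # in the patch; on failure we give up immediately.
            if read_object_process(sha1) != 0:
                return False
    return check_and_freshen
```

On success the hook is expected to have written the object into the
store, so the retry finds it; a second miss is reported as a plain
failure rather than looping forever.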
Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
About the difference between this patch and my patch set [1], besides
the fact that this patch does not spawn separate processes for each
missing object, which does seem like an improvement to me, this patch
(i) does not use a list of promised objects (but instead communicates
with the hook for each missing object), and (ii) provides backwards
compatibility with other Git code (that does not know about promised
objects) in a different way. The costs and benefits of (i) are being
discussed here [2]. As for (ii), I still think that my approach is
better - I have commented more about this below. Maybe the best approach
is a combination of both our approaches.

[1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/
[2] https://public-inbox.org/git/20170713123951.5cab1...@twelve2.svl.corp.google.com/

On Fri, 14 Jul 2017 09:26:51 -0400 Ben Peart wrote:
> +
> +packet: git> command=get
> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> +packet: git>
> +

It would be useful to have this command support more than one SHA-1, so
that hooks that know how to batch can do so.

> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;

The documentation of "tablesize" in "struct hashmap" states that it can
be used to check if the hashmap is initialized, so
subprocess_map_initialized is probably unnecessary.
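To make the batching suggestion concrete: since requests are pkt-line
framed, a multi-object "get" could simply carry several sha1 lines
before the flush packet. The encoder below sketches that hypothetical
wire format (version 1 as proposed supports only one sha1 per request):

```python
def pkt_line(text):
    # pkt-line framing: 4 hex digits giving the total packet length
    # (including the 4 length bytes themselves), then payload + LF.
    payload = text.encode("ascii") + b"\n"
    return b"%04x" % (len(payload) + 4) + payload

FLUSH = b"0000"  # flush packet terminating the request

def encode_get(sha1s):
    # Hypothetical batched form: one command line, then one
    # "sha1=..." line per requested object.
    out = pkt_line("command=get")
    for sha1 in sha1s:
        out += pkt_line("sha1=%s" % sha1)
    return out + FLUSH
```

A hook that understands this form could fetch all of the listed objects
in one round trip to its server before answering.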
> static int check_and_freshen(const unsigned char *sha1, int freshen)
> {
> -	return check_and_freshen_local(sha1, freshen) ||
> -	       check_and_freshen_nonlocal(sha1, freshen);
> +	int ret;
> +	int already_retried = 0;
> +
> +retry:
> +	ret = check_and_freshen_local(sha1, freshen) ||
> +	      check_and_freshen_nonlocal(sha1, freshen);
> +	if (!ret && core_virtualize_objects && !already_retried) {
> +		already_retried = 1;
> +		if (!read_object_process(sha1))
> +			goto retry;
> +	}
> +
> +	return ret;
> }

Is this change meant to ensure that Git code that operates on loose
objects directly (bypassing storage-agnostic functions such as
sha1_object_info_extended() and has_sha1_file()) still work? If yes,
this patch appears incomplete (for example, read_loose_object() needs to
be changed too), and this seems like a difficult task - in my patch set
[1], I ended up deciding to create a separate type of storage and
instead looked at the code that operates on *packed* objects directly
(because there were fewer such methods) to ensure that they would work
correctly in the presence of a separate type of storage.

[1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/
Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
On Fri, Jul 14, 2017 at 3:26 PM, Ben Peart wrote:
> diff --git a/contrib/long-running-read-object/example.pl b/contrib/long-running-read-object/example.pl
> new file mode 100755
> index 00..b8f37f836a
> --- /dev/null
> +++ b/contrib/long-running-read-object/example.pl
[...]
> +sub packet_bin_read {
> +	my $buffer;
> +	my $bytes_read = read STDIN, $buffer, 4;
> +	if ( $bytes_read == 0 ) {
> +
> +		# EOF - Git stopped talking to us!
> +		exit();
> +	}
> +	elsif ( $bytes_read != 4 ) {
> +		die "invalid packet: '$buffer'";
> +	}
> +	my $pkt_size = hex($buffer);
> +	if ( $pkt_size == 0 ) {
> +		return ( 1, "" );
> +	}
> +	elsif ( $pkt_size > 4 ) {
> +		my $content_size = $pkt_size - 4;
> +		$bytes_read = read STDIN, $buffer, $content_size;
> +		if ( $bytes_read != $content_size ) {
> +			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +		}
> +		return ( 0, $buffer );
> +	}
> +	else {
> +		die "invalid packet size: $pkt_size";
> +	}
> +}
> +
> +sub packet_txt_read {
> +	my ( $res, $buf ) = packet_bin_read();
> +	unless ( $buf =~ s/\n$// ) {
> +		die "A non-binary line MUST be terminated by an LF.";
> +	}
> +	return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +	my $buf = shift;
> +	print STDOUT sprintf( "%04x", length($buf) + 4 );
> +	print STDOUT $buf;
> +	STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +	packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +	print STDOUT sprintf( "%04x", 0 );
> +	STDOUT->flush();
> +}

The above could reuse the refactoring of t0021/rot13-filter.pl into
perl/Git/Packet.pm that is at the beginning of my patch series.

> diff --git a/t/t0410/read-object b/t/t0410/read-object
> new file mode 100755
> index 00..85e997c930
> --- /dev/null
> +++ b/t/t0410/read-object
[...]
> +sub packet_bin_read {
> +	my $buffer;
> +	my $bytes_read = read STDIN, $buffer, 4;
> +	if ( $bytes_read == 0 ) {
> +
> +		# EOF - Git stopped talking to us!
> +		exit();
> +	}
> +	elsif ( $bytes_read != 4 ) {
> +		die "invalid packet: '$buffer'";
> +	}
> +	my $pkt_size = hex($buffer);
> +	if ( $pkt_size == 0 ) {
> +		return ( 1, "" );
> +	}
> +	elsif ( $pkt_size > 4 ) {
> +		my $content_size = $pkt_size - 4;
> +		$bytes_read = read STDIN, $buffer, $content_size;
> +		if ( $bytes_read != $content_size ) {
> +			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +		}
> +		return ( 0, $buffer );
> +	}
> +	else {
> +		die "invalid packet size: $pkt_size";
> +	}
> +}
> +
> +sub packet_txt_read {
> +	my ( $res, $buf ) = packet_bin_read();
> +	unless ( $buf =~ s/\n$// ) {
> +		die "A non-binary line MUST be terminated by an LF.";
> +	}
> +	return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +	my $buf = shift;
> +	print STDOUT sprintf( "%04x", length($buf) + 4 );
> +	print STDOUT $buf;
> +	STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +	packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +	print STDOUT sprintf( "%04x", 0 );
> +	STDOUT->flush();
> +}

Same thing about the above and perl/Git/Packet.pm.

Otherwise thanks for updating this!
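For readers who haven't met this framing before: each packet starts with
4 hex digits giving the total length (which counts the 4 length bytes
themselves), and "0000" is the flush packet. A rough Python equivalent
of the packet_bin_read/packet_txt_read pair quoted above - a sketch of
the same logic, not the proposed perl/Git/Packet.pm API:

```python
import io  # only needed for the in-memory stream in the usage example

def packet_bin_read(stream):
    """Read one pkt-line; return (flush_seen, payload_bytes)."""
    header = stream.read(4)
    if len(header) == 0:
        raise EOFError("Git stopped talking to us")  # EOF
    if len(header) != 4:
        raise ValueError("invalid packet: %r" % header)
    pkt_size = int(header, 16)
    if pkt_size == 0:
        return (True, b"")                 # "0000" flush packet
    if pkt_size <= 4:
        raise ValueError("invalid packet size: %d" % pkt_size)
    content_size = pkt_size - 4            # length field counts itself
    payload = stream.read(content_size)
    if len(payload) != content_size:
        raise ValueError("invalid packet (%d bytes expected; %d bytes read)"
                         % (content_size, len(payload)))
    return (False, payload)

def packet_txt_read(stream):
    """Like packet_bin_read, but text packets must end with LF."""
    flush, payload = packet_bin_read(stream)
    if flush:
        return (True, "")
    if not payload.endswith(b"\n"):
        raise ValueError("A non-binary line MUST be terminated by an LF.")
    return (False, payload[:-1].decode("ascii"))
```

For example, feeding it `b"0010command=get\n0000"` yields a text packet
"command=get" followed by a flush, matching the protocol trace earlier
in the thread.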
[PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
This patch series enables git to request missing objects when they are
not found in the object store. This is a fault-in model where the
"read-object" sub-process will fetch the missing object and store it in
the object store as a loose, alternate, or pack file. On success, git
will retry the operation and find the requested object.

It utilizes the recent sub-process refactoring to spawn a "read-object"
hook as a sub-process on the first request; all subsequent requests are
then made to that existing sub-process. This significantly reduces the
cost of making multiple requests within a single git command.

Signed-off-by: Ben Peart
---
 Documentation/technical/read-object-protocol.txt | 102
 cache.h                                          |   1 +
 config.c                                         |   5 +
 contrib/long-running-read-object/example.pl      | 114 +
 environment.c                                    |   1 +
 sha1_file.c                                      | 193 ++-
 t/t0410-read-object.sh                           |  27
 t/t0410/read-object                              | 114 +
 8 files changed, 550 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100755 contrib/long-running-read-object/example.pl
 create mode 100755 t/t0410-read-object.sh
 create mode 100755 t/t0410/read-object

diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
new file mode 100644
index 00..a893b46e7c
--- /dev/null
+++ b/Documentation/technical/read-object-protocol.txt
@@ -0,0 +1,102 @@
+Read Object Process
+^^^^^^^^^^^^^^^^^^^
+
+The read-object process enables Git to read all missing blobs with a
+single process invocation for the entire life of a single Git command.
+This is achieved by using a packet format (pkt-line, see technical/
+protocol-common.txt) based protocol over standard input and standard
+output as follows. All packets, except for the "*CONTENT" packets and
+the "0000" flush packet, are considered text and therefore are
+terminated by a LF.
+
+Git starts the process when it encounters the first missing object that
+needs to be retrieved.
+After the process is started, Git sends a welcome
+message ("git-read-object-client"), a list of supported protocol version
+numbers, and a flush packet. Git expects to read a welcome response
+message ("git-read-object-server"), exactly one protocol version number
+from the previously sent list, and a flush packet. All further
+communication will be based on the selected version.
+
+The remaining protocol description below documents "version=1". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+
+packet: git> git-read-object-client
+packet: git> version=1
+packet: git> version=42
+packet: git> 0000
+packet: git< git-read-object-server
+packet: git< version=1
+packet: git< 0000
+packet: git> capability=get
+packet: git> capability=have
+packet: git> capability=put
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=get
+packet: git< 0000
+
+The only supported capability in version 1 is "get".
+
+Afterwards Git sends a list of "key=value" pairs terminated with a flush
+packet. The list will contain at least the command (based on the
+supported capabilities) and the sha1 of the object to retrieve. Please
+note that the process must not send any response before it has received
+the final flush packet.
+
+When the process receives the "get" command, it should make the requested
+object available in the git object store and then return success. Git will
+then check the object store again and this time find it and proceed.
+
+packet: git> command=get
+packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
+packet: git> 0000
+
+
+The process is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the process does not experience
+problems then the list must contain a "success" status.
+
+packet: git< status=success
+packet: git< 0000
+
+
+In case the process cannot or does not want to process the content, it
+is expected to respond with an "error" status.
+
+packet: git< status=error
+packet: git< 0000
+
+
+In case the process cannot or does not want to process the content as
+well as any future content for the lifetime of the Git process, then