Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

2017-07-17 Thread Jonathan Tan
On Mon, 17 Jul 2017 16:09:17 -0400
Ben Peart  wrote:

> > Is this change meant to ensure that Git code that operates on loose
> > objects directly (bypassing storage-agnostic functions such as
> > sha1_object_info_extended() and has_sha1_file()) still work? If yes,
> > this patch appears incomplete (for example, read_loose_object() needs to
> > be changed too), and this seems like a difficult task - in my patch set
> > [1], I ended up deciding to create a separate type of storage and
> > instead looked at the code that operates on *packed* objects directly
> > (because there were fewer such methods) to ensure that they would work
> > correctly in the presence of a separate type of storage.
> > 
> 
> Yes, with this set of patches, we've been running successfully on 
> completely sparse clones (no commits, trees, or blobs) for several 
> months.  read_loose_object() is only called by fsck when it is 
> enumerating existing loose objects so does not need to be updated.

Ah, that's good to know. I think such an analysis (of the other
loose-related functions) in the commit message would be useful, like I
did for the packed-related functions [1].

[1] 
https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/

> We have a few thousand developers making ~100K commits per week so in 
> our particular usage, I'm fairly confident it works correctly.  That 
> said, it is possible there is some code path I've missed. :)

I think that code paths like the history-searching ones ("git log -S",
for example) should still work offline if possible - one of my ideas is
to have these commands take a size threshold parameter so that we do not
need to fetch large blobs during the invocation. (Hence my preference
for size information to be already available to the repo.)

Aside from that, does fsck of a partial repo work? Ability to fsck seems
quite important (or, at least, useful). I tried updating fsck to support
omitting commits and trees (in addition to blobs), and it seems
relatively involved (although I didn't look into it deeply yet).

(That investigation also made me realize that, in my patch set, I
didn't handle the case where a tag references a blob - fsck doesn't work
when the blob is missing, even if it is promised. That's something for
me to look into.)


Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

2017-07-17 Thread Ben Peart



On 7/17/2017 2:06 PM, Jonathan Tan wrote:

> About the difference between this patch and my patch set [1], besides
> the fact that this patch does not spawn separate processes for each
> missing object, which does seem like an improvement to me, this patch
> (i) does not use a list of promised objects (but instead communicates
> with the hook for each missing object), and (ii) provides backwards
> compatibility with other Git code (that does not know about promised
> objects) in a different way.
> 
> The costs and benefits of (i) are being discussed here [2]. As for (ii),
> I still think that my approach is better - I have commented more about
> this below.
> 
> Maybe the best approach is a combination of both our approaches.


Yes, in the context of the promised objects model patch series, the 
value of this patch series is primarily as a sample of how to use the 
sub-process mechanism to create a versioned interface for retrieving 
objects.




> [1] 
> https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/
> 
> [2] 
> https://public-inbox.org/git/20170713123951.5cab1...@twelve2.svl.corp.google.com/

> On Fri, 14 Jul 2017 09:26:51 -0400
> Ben Peart  wrote:
> 
> > +
> > +packet: git> command=get
> > +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> > +packet: git> 0000
> > +
> 
> It would be useful to have this command support more than one SHA-1, so
> that hooks that know how to batch can do so.



I agree.  Since nothing was using that capability yet, I decided to keep 
it simple and not add support for a feature that wasn't being used. The 
reason the interface is versioned is so that if/when something does need 
that capability, it can be added.



> > +static int subprocess_map_initialized;
> > +static struct hashmap subprocess_map;
> 
> The documentation of "tablesize" in "struct hashmap" states that it can
> be used to check if the hashmap is initialized, so
> subprocess_map_initialized is probably unnecessary.



Nice.  That will make things a little simpler.


> >  static int check_and_freshen(const unsigned char *sha1, int freshen)
> >  {
> > -   return check_and_freshen_local(sha1, freshen) ||
> > -  check_and_freshen_nonlocal(sha1, freshen);
> > +   int ret;
> > +   int already_retried = 0;
> > +
> > +retry:
> > +   ret = check_and_freshen_local(sha1, freshen) ||
> > +   check_and_freshen_nonlocal(sha1, freshen);
> > +   if (!ret && core_virtualize_objects && !already_retried) {
> > +   already_retried = 1;
> > +   if (!read_object_process(sha1))
> > +   goto retry;
> > +   }
> > +
> > +   return ret;
> >  }
> 
> Is this change meant to ensure that Git code that operates on loose
> objects directly (bypassing storage-agnostic functions such as
> sha1_object_info_extended() and has_sha1_file()) still work? If yes,
> this patch appears incomplete (for example, read_loose_object() needs to
> be changed too), and this seems like a difficult task - in my patch set
> [1], I ended up deciding to create a separate type of storage and
> instead looked at the code that operates on *packed* objects directly
> (because there were fewer such methods) to ensure that they would work
> correctly in the presence of a separate type of storage.



Yes, with this set of patches, we've been running successfully on 
completely sparse clones (no commits, trees, or blobs) for several 
months.  read_loose_object() is only called by fsck when it is 
enumerating existing loose objects so does not need to be updated.


We have a few thousand developers making ~100K commits per week so in 
our particular usage, I'm fairly confident it works correctly.  That 
said, it is possible there is some code path I've missed. :)



> [1] 
> https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/



Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

2017-07-17 Thread Jonathan Tan
About the difference between this patch and my patch set [1], besides
the fact that this patch does not spawn separate processes for each
missing object, which does seem like an improvement to me, this patch
(i) does not use a list of promised objects (but instead communicates
with the hook for each missing object), and (ii) provides backwards
compatibility with other Git code (that does not know about promised
objects) in a different way.

The costs and benefits of (i) are being discussed here [2]. As for (ii),
I still think that my approach is better - I have commented more about
this below.

Maybe the best approach is a combination of both our approaches.

[1] 
https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/

[2] 
https://public-inbox.org/git/20170713123951.5cab1...@twelve2.svl.corp.google.com/

On Fri, 14 Jul 2017 09:26:51 -0400
Ben Peart  wrote:

> +
> +packet: git> command=get
> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> +packet: git> 0000
> +

It would be useful to have this command support more than one SHA-1, so
that hooks that know how to batch can do so.
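
Batching is not part of the version 1 protocol as posted; purely as an illustration of what a batched "get" might look like on the wire, here is a client-side sketch in Python (the pkt-line framing follows technical/protocol-common.txt; `send_batched_get` and its shape are hypothetical, not anything in the patch):

```python
import io

def pkt_write(out, text):
    # One pkt-line: 4 hex digits of total length (incl. header), payload, LF.
    payload = text.encode() + b"\n"
    out.write(b"%04x" % (len(payload) + 4) + payload)

def pkt_flush(out):
    # The "0000" flush packet terminates a packet list.
    out.write(b"0000")

def send_batched_get(out, sha1s):
    # Hypothetical extension: several sha1 packets before the final flush.
    pkt_write(out, "command=get")
    for sha1 in sha1s:
        pkt_write(out, "sha1=" + sha1)
    pkt_flush(out)

buf = io.BytesIO()
send_batched_get(buf, ["0a214a649e1b3d5011e14a3dc227753f2bd2be05"])
```

The hook would then answer with one status per requested object (or a single batched status), which is exactly the kind of detail a new protocol version would have to pin down.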

> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;

The documentation of "tablesize" in "struct hashmap" states that it can
be used to check if the hashmap is initialized, so
subprocess_map_initialized is probably unnecessary.

>  static int check_and_freshen(const unsigned char *sha1, int freshen)
>  {
> - return check_and_freshen_local(sha1, freshen) ||
> -check_and_freshen_nonlocal(sha1, freshen);
> + int ret;
> + int already_retried = 0;
> +
> +retry:
> + ret = check_and_freshen_local(sha1, freshen) ||
> + check_and_freshen_nonlocal(sha1, freshen);
> + if (!ret && core_virtualize_objects && !already_retried) {
> + already_retried = 1;
> + if (!read_object_process(sha1))
> + goto retry;
> + }
> +
> + return ret;
>  }

Is this change meant to ensure that Git code that operates on loose
objects directly (bypassing storage-agnostic functions such as
sha1_object_info_extended() and has_sha1_file()) still work? If yes,
this patch appears incomplete (for example, read_loose_object() needs to
be changed too), and this seems like a difficult task - in my patch set
[1], I ended up deciding to create a separate type of storage and
instead looked at the code that operates on *packed* objects directly
(because there were fewer such methods) to ensure that they would work
correctly in the presence of a separate type of storage.

[1] 
https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/
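
For readers following along, the control flow of the quoted check_and_freshen() hunk - check the stores, fault in via the hook at most once, then re-check - can be sketched in Python (the stores and the hook here are stand-in parameters for illustration, not Git APIs):

```python
def check_and_freshen(sha1, local_store, nonlocal_store, read_object_process):
    # Mirror of the quoted hunk: try the local and alternate stores, and on a
    # miss invoke the read-object hook exactly once, then retry the lookup.
    already_retried = False
    while True:
        ret = sha1 in local_store or sha1 in nonlocal_store
        if not ret and not already_retried:
            already_retried = True
            # read_object_process() returns 0 on success, like the C code.
            if read_object_process(sha1) == 0:
                continue  # the "goto retry" in the C version
        return ret
```

The single-retry guard is what keeps a hook that fails (or lies about succeeding) from sending the lookup into an infinite loop.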


Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

2017-07-14 Thread Christian Couder
On Fri, Jul 14, 2017 at 3:26 PM, Ben Peart  wrote:

> diff --git a/contrib/long-running-read-object/example.pl b/contrib/long-running-read-object/example.pl
> new file mode 100755
> index 00..b8f37f836a
> --- /dev/null
> +++ b/contrib/long-running-read-object/example.pl

[...]

> +sub packet_bin_read {
> +   my $buffer;
> +   my $bytes_read = read STDIN, $buffer, 4;
> +   if ( $bytes_read == 0 ) {
> +
> +   # EOF - Git stopped talking to us!
> +   exit();
> +   }
> +   elsif ( $bytes_read != 4 ) {
> +   die "invalid packet: '$buffer'";
> +   }
> +   my $pkt_size = hex($buffer);
> +   if ( $pkt_size == 0 ) {
> +   return ( 1, "" );
> +   }
> +   elsif ( $pkt_size > 4 ) {
> +   my $content_size = $pkt_size - 4;
> +   $bytes_read = read STDIN, $buffer, $content_size;
> +   if ( $bytes_read != $content_size ) {
> +   die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +   }
> +   return ( 0, $buffer );
> +   }
> +   else {
> +   die "invalid packet size: $pkt_size";
> +   }
> +}
> +
> +sub packet_txt_read {
> +   my ( $res, $buf ) = packet_bin_read();
> +   unless ( $buf =~ s/\n$// ) {
> +   die "A non-binary line MUST be terminated by an LF.";
> +   }
> +   return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +   my $buf = shift;
> +   print STDOUT sprintf( "%04x", length($buf) + 4 );
> +   print STDOUT $buf;
> +   STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +   packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +   print STDOUT sprintf( "%04x", 0 );
> +   STDOUT->flush();
> +}

The above could reuse the refactoring of t0021/rot13-filter.pl into
perl/Git/Packet.pm that is at the beginning of my patch series.

> diff --git a/t/t0410/read-object b/t/t0410/read-object
> new file mode 100755
> index 00..85e997c930
> --- /dev/null
> +++ b/t/t0410/read-object

[...]

> +sub packet_bin_read {
> +   my $buffer;
> +   my $bytes_read = read STDIN, $buffer, 4;
> +   if ( $bytes_read == 0 ) {
> +
> +   # EOF - Git stopped talking to us!
> +   exit();
> +   }
> +   elsif ( $bytes_read != 4 ) {
> +   die "invalid packet: '$buffer'";
> +   }
> +   my $pkt_size = hex($buffer);
> +   if ( $pkt_size == 0 ) {
> +   return ( 1, "" );
> +   }
> +   elsif ( $pkt_size > 4 ) {
> +   my $content_size = $pkt_size - 4;
> +   $bytes_read = read STDIN, $buffer, $content_size;
> +   if ( $bytes_read != $content_size ) {
> +   die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +   }
> +   return ( 0, $buffer );
> +   }
> +   else {
> +   die "invalid packet size: $pkt_size";
> +   }
> +}
> +
> +sub packet_txt_read {
> +   my ( $res, $buf ) = packet_bin_read();
> +   unless ( $buf =~ s/\n$// ) {
> +   die "A non-binary line MUST be terminated by an LF.";
> +   }
> +   return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +   my $buf = shift;
> +   print STDOUT sprintf( "%04x", length($buf) + 4 );
> +   print STDOUT $buf;
> +   STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +   packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +   print STDOUT sprintf( "%04x", 0 );
> +   STDOUT->flush();
> +}

Same thing about the above and perl/Git/Packet.pm.

Otherwise thanks for updating this!


[PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

2017-07-14 Thread Ben Peart
This patch series enables git to request missing objects when they are
not found in the object store. This is a fault-in model where the
"read-object" sub-process will fetch the missing object and store it in
the object store as a loose, alternate, or pack file. On success, git
will retry the operation and find the requested object.

It utilizes the recent sub-process refactoring to spawn a "read-object"
hook as a sub-process on the first request and then all subsequent
requests are made to that existing sub-process. This significantly
reduces the cost of making multiple requests within a single git command.

Signed-off-by: Ben Peart 
---
 Documentation/technical/read-object-protocol.txt | 102 
 cache.h  |   1 +
 config.c |   5 +
 contrib/long-running-read-object/example.pl  | 114 +
 environment.c|   1 +
 sha1_file.c  | 193 ++-
 t/t0410-read-object.sh   |  27 
 t/t0410/read-object  | 114 +
 8 files changed, 550 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100755 contrib/long-running-read-object/example.pl
 create mode 100755 t/t0410-read-object.sh
 create mode 100755 t/t0410/read-object

diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
new file mode 100644
index 00..a893b46e7c
--- /dev/null
+++ b/Documentation/technical/read-object-protocol.txt
@@ -0,0 +1,102 @@
+Read Object Process
+^^^^^^^^^^^^^^^^^^^
+
+The read-object process enables Git to read all missing blobs with a
+single process invocation for the entire life of a single Git command.
+This is achieved by using a packet format (pkt-line, see technical/
+protocol-common.txt) based protocol over standard input and standard
+output as follows. All packets, except for the "*CONTENT" packets and
+the "0000" flush packet, are considered text and therefore are
+terminated by a LF.
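+
+For reference, the pkt-line framing used throughout this document can be
+sketched as follows (a Python illustration of technical/protocol-common.txt,
+not code from this patch):
+
+    def pkt_encode(text):
+        # One text packet: 4 hex digits of total length, payload, LF.
+        payload = text.encode() + b"\n"
+        return b"%04x" % (len(payload) + 4) + payload
+
+    FLUSH = b"0000"
+
+    def pkt_decode(stream):
+        # Read one packet from a binary stream; None means a flush packet.
+        size = int(stream.read(4), 16)
+        if size == 0:
+            return None
+        if size <= 4:
+            raise ValueError("invalid packet size: %d" % size)
+        return stream.read(size - 4).rstrip(b"\n").decode()

The block above is rendered with the diff's own "+" prefix so it reads as part of the quoted document; stripped of the prefixes it is runnable:

```python
def pkt_encode(text):
    # One text packet: 4 hex digits of total length, payload, LF.
    payload = text.encode() + b"\n"
    return b"%04x" % (len(payload) + 4) + payload

FLUSH = b"0000"

def pkt_decode(stream):
    # Read one packet from a binary stream; None means a flush packet.
    size = int(stream.read(4), 16)
    if size == 0:
        return None
    if size <= 4:
        raise ValueError("invalid packet size: %d" % size)
    return stream.read(size - 4).rstrip(b"\n").decode()
```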
+
+Git starts the process when it encounters the first missing object that
+needs to be retrieved. After the process is started, Git sends a welcome
+message ("git-read-object-client"), a list of supported protocol version
+numbers, and a flush packet. Git expects to read a welcome response
+message ("git-read-object-server"), exactly one protocol version number
+from the previously sent list, and a flush packet. All further
+communication will be based on the selected version.
+
+The remaining protocol description below documents "version=1". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+
+packet: git> git-read-object-client
+packet: git> version=1
+packet: git> version=42
+packet: git> 0000
+packet: git< git-read-object-server
+packet: git< version=1
+packet: git< 0000
+packet: git> capability=get
+packet: git> capability=have
+packet: git> capability=put
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=get
+packet: git< 0000
+
+The only supported capability in version 1 is "get".
+
+Afterwards Git sends a list of "key=value" pairs terminated with a flush
+packet. The list will contain at least the command (based on the
+supported capabilities) and the sha1 of the object to retrieve. Please
+note that the process must not send any response before it has received
+the final flush packet.
+
+When the process receives the "get" command, it should make the requested
+object available in the git object store and then return success. Git will
+then check the object store again and this time find it and proceed.
+
+packet: git> command=get
+packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
+packet: git> 0000
+
+
+The process is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the process does not experience
+problems then the list must contain a "success" status.
+
+packet: git< status=success
+packet: git< 0000
+
+
+In case the process cannot or does not want to process the content, it
+is expected to respond with an "error" status.
+
+packet: git< status=error
+packet: git< 0000
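
To tie the pieces of this protocol together, a minimal version-1 helper body might look like the following Python sketch (handshake, capability exchange, then a "get" loop; `serve` and `fetch_object` are illustrative names, and `fetch_object` is a placeholder for whatever actually retrieves the object into the object store):

```python
def pkt_write(out, text):
    # One pkt-line: 4 hex digits of total length, payload, LF.
    payload = text.encode() + b"\n"
    out.write(b"%04x" % (len(payload) + 4) + payload)

def pkt_read(inp):
    # One packet's text; None for a "0000" flush; EOFError when Git exits.
    head = inp.read(4)
    if len(head) < 4:
        raise EOFError("Git stopped talking to us")
    size = int(head, 16)
    return None if size == 0 else inp.read(size - 4).rstrip(b"\n").decode()

def read_until_flush(inp):
    pkts = []
    while (p := pkt_read(inp)) is not None:
        pkts.append(p)
    return pkts

def serve(inp, out, fetch_object):
    # Handshake: welcome + version list from Git, our welcome + version back.
    assert read_until_flush(inp)[0] == "git-read-object-client"
    pkt_write(out, "git-read-object-server")
    pkt_write(out, "version=1")
    out.write(b"0000")
    # Capability exchange: only "get" exists in version 1.
    read_until_flush(inp)
    pkt_write(out, "capability=get")
    out.write(b"0000")
    # Request loop: one "get" per flush-terminated key=value list.
    while True:
        try:
            req = read_until_flush(inp)
        except EOFError:
            return  # Git exited; we are done
        assert req[0] == "command=get"
        sha1 = req[1].split("=", 1)[1]
        status = "success" if fetch_object(sha1) else "error"
        pkt_write(out, "status=" + status)
        out.write(b"0000")
```

A real helper would read from stdin and write to stdout in binary mode; taking the streams as parameters just makes the sketch easy to exercise.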
+
+
+In case the process cannot or does not want to process the content as
+well as any future content for the lifetime of the Git process, then