Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-29 Thread Jakub Narębski
W dniu 29.09.2016 o 08:33, Torsten Bögershausen pisze:
> On 15.09.16 22:04, Junio C Hamano wrote:
>> Lars Schneider  writes:
>>
>>> Wouldn't that complicate the pathname parsing on the filter side?
>>> Can't we just define in our filter protocol documentation that our 
>>> "pathname" packet _always_ has a trailing "\n"? That would mean the 
>>> receiver would know a packet "pathname=ABC\n\n" encodes the path
>>> "ABC\n" [1].
>>
>> That's fine, too.  If you declare that pathname over the protocol is
>> a binary thing, you can also define that the packet does not have
>> the terminating \n, i.e. the example encodes the path "ABC\n\n",
>> which is also OK ;-)
>>
>> As long as the rule is clearly documented, easy for filter
>> implementors to follow it, and hard for them to get it wrong, I'd be
>> perfectly happy.
>>
> 
> (Sorry for the late reply)
> 
> In V8 the additional "\n" is clearly documented.
> 
> On the long run,
> I would suggest to be more clear what BINARY is:
> 
> --- a/Documentation/technical/protocol-common.txt
> +++ b/Documentation/technical/protocol-common.txt
> @@ -61,6 +61,9 @@ the length's hexadecimal representation.
>  A pkt-line MAY contain binary data, so implementors MUST ensure
>  pkt-line parsing/formatting routines are 8-bit clean.
>  
> +Each pkt-line that may contain ASCII control characters should
> +be treated as binary.
> +

Well, it is not as clear cut with pathnames.  Sane pathnames should
not contain control characters, even if they are outside US-ASCII,
assuming sane filesystem pathnames charset (like UTF-8).

One thing pathname cannot include is NUL ("\0") character.

So in most cases they are ASCII, but might not be.  Not that 
pkt-line text packets are binary-unsafe... I think the trailing
"\n" is here for easier debugging.

http://www.dwheeler.com/essays/filenames-in-shell.html
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

-- 
Jakub Narębski


Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-29 Thread Torsten Bögershausen
On 15.09.16 22:04, Junio C Hamano wrote:
> Lars Schneider  writes:
> 
>> Wouldn't that complicate the pathname parsing on the filter side?
>> Can't we just define in our filter protocol documentation that our 
>> "pathname" packet _always_ has a trailing "\n"? That would mean the 
>> receiver would know a packet "pathname=ABC\n\n" encodes the path
>> "ABC\n" [1].
> 
> That's fine, too.  If you declare that pathname over the protocol is
> a binary thing, you can also define that the packet does not have
> the terminating \n, i.e. the example encodes the path "ABC\n\n",
> which is also OK ;-)
> 
> As long as the rule is clearly documented, easy for filter
> implementors to follow it, and hard for them to get it wrong, I'd be
> perfectly happy.
>

(Sorry for the late reply)

In V8 the additional "\n" is clearly documented.

On the long run,
I would suggest to be more clear what BINARY is:

--- a/Documentation/technical/protocol-common.txt
+++ b/Documentation/technical/protocol-common.txt
@@ -61,6 +61,9 @@ the length's hexadecimal representation.
 A pkt-line MAY contain binary data, so implementors MUST ensure
 pkt-line parsing/formatting routines are 8-bit clean.
 
+Each pkt-line that may contain ASCII control characters should
+be treated as binary.
+



Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-15 Thread Junio C Hamano
Lars Schneider  writes:

>> On 13 Sep 2016, at 17:22, Junio C Hamano  wrote:
>> 
>> larsxschnei...@gmail.com writes:
>> 
>>> diff --git a/contrib/long-running-filter/example.pl 
>>> b/contrib/long-running-filter/example.pl
>>> ...
>>> +packet_write( "clean=true\n" );
>>> +packet_write( "smudge=true\n" );
>>
>> These extra SP around the contents inside () pair look unfamiliar
>> and somewhat strange to me, but as long as they are consistently
>> done (and I think you are mostly being consistent), it is OK.
>
> Ups. I forgot to run PerlTidy here. I run PerlTidy with the flag 
> "-pbp" (= Perl Best Practices). This seems to add no extra SP for
> functions with one parameter (e.g. `foo("bar")`) and extra SP
> for functions with multiple parameter (e.g. `foo( "bar", 1 )`).
> Is this still OK?

Your choice.  I already said I do not care too much either way as
long as you are consistent.

If you prefer PerlTidy's PBP output over what you wrote, and if you
are resending the patch anyway, then why not? ;-)

> Does anyone have a "Git PerlTidy configuration"?

Not me.

Thanks.


Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-15 Thread Lars Schneider

> On 13 Sep 2016, at 17:22, Junio C Hamano  wrote:
> 
> larsxschnei...@gmail.com writes:
> 
>> diff --git a/contrib/long-running-filter/example.pl 
>> b/contrib/long-running-filter/example.pl
>> ...
>> +sub packet_read {
>> +my $buffer;
>> +my $bytes_read = read STDIN, $buffer, 4;
>> +if ( $bytes_read == 0 ) {
>> +
>> +# EOF - Git stopped talking to us!
>> +exit();
>> +...
>> +packet_write( "clean=true\n" );
>> +packet_write( "smudge=true\n" );
>> +packet_flush();
>> +
>> +while (1) {
> 
> These extra SP around the contents inside () pair look unfamiliar
> and somewhat strange to me, but as long as they are consistently
> done (and I think you are mostly being consistent), it is OK.

Ups. I forgot to run PerlTidy here. I run PerlTidy with the flag 
"-pbp" (= Perl Best Practices). This seems to add no extra SP for
functions with one parameter (e.g. `foo("bar")`) and extra SP
for functions with multiple parameter (e.g. `foo( "bar", 1 )`).
Is this still OK?

Does anyone have a "Git PerlTidy configuration"?


> 
>> +#define CAP_CLEAN(1u<<0)
>> +#define CAP_SMUDGE   (1u<<1)
> 
> As these are meant to be usable together, i.e. bits in a single flag
> word, they are of type "unsigned int", which makes perfect sense.
> 
> Make sure your variables and fields that store them are of the same
> type.  I think I saw "int' used to pass them in at least one place.

Fixed!


>> +static int apply_filter(const char *path, const char *src, size_t len,
>> +int fd, struct strbuf *dst, struct convert_driver 
>> *drv,
>> +const int wanted_capability)
>> +{
>> +const char* cmd = NULL;
> 
> "const char *cmd = NULL;" of course.

Fixed!


>> diff --git a/unpack-trees.c b/unpack-trees.c
>> index 11c37fb..f6798f8 100644
>> --- a/unpack-trees.c
>> +++ b/unpack-trees.c
>> @@ -10,6 +10,7 @@
>> #include "attr.h"
>> #include "split-index.h"
>> #include "dir.h"
>> +#include "convert.h"
>> 
>> /*
>>  * Error messages expected by scripts out of plumbing commands such as
> 
> Why?  The resulting file seems to compile without this addition.

Of course. That shouldn't have been part of this commit.


Thank you,
Lars






Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-15 Thread Junio C Hamano
Lars Schneider  writes:

> Wouldn't that complicate the pathname parsing on the filter side?
> Can't we just define in our filter protocol documentation that our 
> "pathname" packet _always_ has a trailing "\n"? That would mean the 
> receiver would know a packet "pathname=ABC\n\n" encodes the path
> "ABC\n" [1].

That's fine, too.  If you declare that pathname over the protocol is
a binary thing, you can also define that the packet does not have
the terminating \n, i.e. the example encodes the path "ABC\n\n",
which is also OK ;-)

As long as the rule is clearly documented, easy for filter
implementors to follow it, and hard for them to get it wrong, I'd be
perfectly happy.

Thanks.


Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-15 Thread Lars Schneider

> On 13 Sep 2016, at 17:42, Junio C Hamano  wrote:
> 
> Torsten Bögershausen  writes:
> 
>> I would really consider to treat pathnames as binary, and not add a trailing 
>> '\n',
>> are there other opinions ?
> 
> It would be the most consistent if the same format as
> write_name_quoted() is used for this, I would think.

Is that the solution you had in mind?

quote_c_style(path, _path, NULL, 0);
err = packet_write_fmt_gently(process->in, "pathname=%s\n", 
quoted_path.buf);

Wouldn't that complicate the pathname parsing on the filter side?
Can't we just define in our filter protocol documentation that our 
"pathname" packet _always_ has a trailing "\n"? That would mean the 
receiver would know a packet "pathname=ABC\n\n" encodes the path
"ABC\n" [1].

Thanks,
Lars


[1] Following Torsten's example in 
http://public-inbox.org/git/96554f6d-988d-e0b8-7936-8d0f29a75...@web.de )



Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-13 Thread Lars Schneider

> On 10 Sep 2016, at 17:40, Torsten Bögershausen  wrote:
> 
> []
> 
> One general question up here, more comments inline.
> The current order for a clean-filter is like this, I removed the error 
> handling:
> 
> int convert_to_git()
> {
>   ret |= apply_filter(path, src, len, -1, dst, filter);
>   if (ret && dst) {
>   src = dst->buf;
>   len = dst->len;
>   }
>   ret |= crlf_to_git(path, src, len, dst, ca.crlf_action, checksafe);
>   return ret | ident_to_git(path, src, len, dst, ca.ident);
> }
> 
> The first step is the clean filter, the CRLF-LF conversion (if needed),
> then ident.
> The current implementation streams the whole file content to the filter,
> (STDIN of the filter) and reads back STDOUT from the filter into a STRBUF.
> This is to use the UNIX-like STDIN--STDOUT method for writing a filter.
> 
> However, int would_convert_to_git_filter_fd() and convert_to_git_filter_fd()
> offer a sort of short-cut:
> The filter reads from the file directly, and the output of the filter is
> read into a STRBUF.

Are you sure? As far as I understand the code the filter does not read from 
the file in any case today.  The functions would_convert_to_git_filter_fd() and 
convert_to_git_filter_fd() just avoid avoid mapping the file in Git. The 
content 
is still streamed via pipes:
https://github.com/git/git/commit/9035d75a2be9d80d82676504d69553245017f6d4


> It looks as if the multi-filter approach can use this in a similar way:
> Give the pathname to the filter, the filter opens the file for reading
> and stream the result via the pkt-line protocol into Git.
> This needs some more changes, and may be very well go into a separate patch
> series. (and should).
> 
> What I am asking for:
> When a multi-filter is used, the content is handled to the filter via 
> pkt-line,
> and the result is given to Git via pkt-line ?
> Nothing wrong with it, I just wonder, if it should be mentioned somewhere.

That is most certainly a good idea and the main reason I added "capabilities"
to the protocol. Joey Hess worked on this topic (not merged, yet) and I would 
like to make this available to the long-running filter protocol as soon as the
feature is available:
http://public-inbox.org/git/1468277112-9909-1-git-send-email-jo...@joeyh.name/


>> +sub packet_read {
>> +my $buffer;
>> +my $bytes_read = read STDIN, $buffer, 4;
>> +if ( $bytes_read == 0 ) {
>> +
>> +# EOF - Git stopped talking to us!
>> +exit();
>> +}
>> +elsif ( $bytes_read != 4 ) {
>> +die "invalid packet size '$bytes_read' field";
>> +}
> 
> This is half-kosher, I would say,
> (And I really. really would like to see an implementation in C ;-)

Would you be willing to contribute a patch? :-)


> A read function may look like this:
> 
>   ret = read(0, , 4);
>   if (ret < 0) {
> /* Error, nothing we can do */
> exit(1);
>   } else if (ret == 0) {
> /* EOF  */
> exit(0);
>   } else if (ret < 4) {
> /* 
>  * Go and read more, until we have 4 bytes or EOF or Error */
>   } else {
> /* Good case, see below */
>   }

I see. However, my intention was to provide an absolute minimal
example to teach a reader how the protocol works. I consider
all proper error handling an exercise for the reader ;-)


>> +#define CAP_CLEAN(1u<<0)
>> +#define CAP_SMUDGE   (1u<<1)
> 
> Is CAP_ too generic, and GIT_FILTER_CAP (or so) less calling for trouble ?

I had something like that but Junio suggested these names in V4:
http://public-inbox.org/git/xmqq8twd8uld@gitster.mtv.corp.google.com/


>> +
>> +err = (strlen(filter_type) > PKTLINE_DATA_MAXLEN);
> 
> Extra () needed ?
> More () in the code...

I thought it might improve readability, but I will remove them
if you think this would be more consistent with existing Git code.


Thanks,
Lars

Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-13 Thread Junio C Hamano
Torsten Bögershausen  writes:

> I would really consider to treat pathnames as binary, and not add a trailing 
> '\n',
> are there other opinions ?

It would be the most consistent if the same format as
write_name_quoted() is used for this, I would think.


Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-13 Thread Junio C Hamano
larsxschnei...@gmail.com writes:

> diff --git a/contrib/long-running-filter/example.pl 
> b/contrib/long-running-filter/example.pl
> ...
> +sub packet_read {
> +my $buffer;
> +my $bytes_read = read STDIN, $buffer, 4;
> +if ( $bytes_read == 0 ) {
> +
> +# EOF - Git stopped talking to us!
> +exit();
> +...
> +packet_write( "clean=true\n" );
> +packet_write( "smudge=true\n" );
> +packet_flush();
> +
> +while (1) {

These extra SP around the contents inside () pair look unfamiliar
and somewhat strange to me, but as long as they are consistently
done (and I think you are mostly being consistent), it is OK.

> +#define CAP_CLEAN(1u<<0)
> +#define CAP_SMUDGE   (1u<<1)

As these are meant to be usable together, i.e. bits in a single flag
word, they are of type "unsigned int", which makes perfect sense.

Make sure your variables and fields that store them are of the same
type.  I think I saw "int' used to pass them in at least one place.

> +static int apply_filter(const char *path, const char *src, size_t len,
> +int fd, struct strbuf *dst, struct convert_driver 
> *drv,
> +const int wanted_capability)
> +{
> + const char* cmd = NULL;

"const char *cmd = NULL;" of course.

> diff --git a/unpack-trees.c b/unpack-trees.c
> index 11c37fb..f6798f8 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -10,6 +10,7 @@
>  #include "attr.h"
>  #include "split-index.h"
>  #include "dir.h"
> +#include "convert.h"
>  
>  /*
>   * Error messages expected by scripts out of plumbing commands such as

Why?  The resulting file seems to compile without this addition.


Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-13 Thread Torsten Bögershausen
On 12.09.16 11:49, Lars Schneider wrote:

>> How do we send pathnames the have '\n' ?
>> Not really recommended, but allowed.
>> And here I am a little bit lost, is each of the lines packed into
>> a pkt-line ?
>> command=smudge is packet as pkt-line and pathname= is packed into
>> another one ? (The we don't need the '\n' at all)
> 
> Every line is a dedicated packet. That's why '\n' in a path name would
> not be a problem as the receiver is expected to read the entire packet
> when parsing the value (and the receiver knows the packet length, too).
> 
> The '\n' at the end is required by the pkt-line format:
> "A non-binary line SHOULD BE terminated by an LF..."
> (see protocol-common.txt)

That is only the half part of the story:
A non-binary line SHOULD BE terminated by an LF, which if present
MUST be included in the total length. Receivers MUST treat pkt-lines
with non-binary data the same whether or not they contain the trailing
LF (stripping the LF if present, and not complaining when it is
missing).


How do we treat pathnames ?
They can have each byte value except '\0'.
What should a receiver do, which reads a string like "ABC\n\n" ?
Is it "ABC\n" or "ABC\n\n" ?

I would really consider to treat pathnames as binary, and not add a trailing 
'\n',
are there other opinions ?
 







Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-12 Thread Lars Schneider

> On 10 Sep 2016, at 08:29, Torsten Bögershausen  wrote:
> 
> On Thu, Sep 08, 2016 at 08:21:32PM +0200, larsxschnei...@gmail.com wrote:
> []
>> +packet:  git> git-filter-client
>> +packet:  git> version=2
>> +packet:  git> version=42
>> +packet:  git> 
>> +packet:  git< git-filter-server
>> +packet:  git< version=2
>> +packet:  git> clean=true
>> +packet:  git> smudge=true
>> +packet:  git> not-yet-invented=true
>> +packet:  git> 
>> +packet:  git< clean=true
>> +packet:  git< smudge=true
>> +packet:  git< 
> 
> It's probalby only me who has difficulties to distinguish
> '>' from '<'.

I see what you mean. However, this format is used with "GIT_TRACE_PACKET"
as well and therefore I would prefer to keep it.


> packet:  git> git-filter-client
> packet:  git> version=2
> packet:  git> version=42
> packet:  git> 
> packet:   filter> git-filter-server
> packet:   filter> version=2
> 
> (Otherwise the dialoge description is nice)

Thanks!


>> +
>> +Supported filter capabilities in version 2 are "clean" and
>> +"smudge".
>> +
>> +Afterwards Git sends a list of "key=value" pairs terminated with
>> +a flush packet. The list will contain at least the filter command
>> +(based on the supported capabilities) and the pathname of the file
>> +to filter relative to the repository root. Right after these packets
>> +Git sends the content split in zero or more pkt-line packets and a
>> +flush packet to terminate content.
>> +
>> +packet:  git> command=smudge\n
>> +packet:  git> pathname=path/testfile.dat\n
> 
> How do we send pathnames the have '\n' ?
> Not really recommended, but allowed.
> And here I am a little bit lost, is each of the lines packed into
> a pkt-line ?
> command=smudge is packet as pkt-line and pathname= is packed into
> another one ? (The we don't need the '\n' at all)

Every line is a dedicated packet. That's why '\n' in a path name would
not be a problem as the receiver is expected to read the entire packet
when parsing the value (and the receiver knows the packet length, too).

The '\n' at the end is required by the pkt-line format:
"A non-binary line SHOULD BE terminated by an LF..."
(see protocol-common.txt)


> Or go both lines into one pkt-line (thats what I think), then
> we don't need the '\n' afther the pathname.

No (see above).


> And in this case the pathname is always the last element, and a '\n'
> may occur in the pathname, since we know the length of the packet
> we know how long the pathname must be.
> 
> 
> [...]
>> 
>> +In case the filter cannot or does not want to process the content,
> 
> Does not want ? 
> I can see something like "I read through the file, there is nothing
> to do. So Git, I don't send anything back, you know where the file is.

That's right. Isn't that covered with "does not want"?


>> +it is expected to respond with an "error" status. Depending on the
>> +`filter..required` flag Git will interpret that as error
>> +but it will not stop or restart the filter process.
>> +
>> +packet:  git< status=error\n
>> +packet:  git< 
>> +
>> +
>> +If the filter experiences an error during processing, then it can
>> +send the status "error" after the content was (partially or
>> +completely) sent. Depending on the `filter..required` flag
>> +Git will interpret that as error but it will not stop or restart the
>> +filter process.
>> +
>> +packet:  git< status=success\n
>> +packet:  git< 
>> +packet:  git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> +packet:  git< 
>> +packet:  git< status=error\n
>> +packet:  git< 
>> +
>> +
>> +If the filter dies during the communication or does not adhere to
>> +the protocol then Git will stop the filter process and restart it
> 
> My personal comment:
> When a filter is mis-behaving, Git should say so loud and clear, and
> die(). 
> The filter process can be left running, so that it can be debugged.

In the current implementation Git would die already if the filter is 
"required". In this particular case we *could* die, too. However, 
I would prefer to keep it as is because I think the users of a 
"non-required" filter do not expect Git to die if there is *any* 
problem with the filter.


> Here I stopped the review for a moment, 
> I still think that Git shouldn't kill anything, because we loose
> the ability to debug these processes.

Based on my experience debugging the filter in this state is hard
anyways. I think a user would rather try to reproduce the problem
and run Git with the "GIT_TRACE_PACKET" flag enabled for debugging.


Thanks,
Lars

Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-10 Thread Torsten Bögershausen
[]

One general question up here, more comments inline.
The current order for a clean-filter is like this, I removed the error handling:

int convert_to_git()
{
ret |= apply_filter(path, src, len, -1, dst, filter);
if (ret && dst) {
src = dst->buf;
len = dst->len;
}
ret |= crlf_to_git(path, src, len, dst, ca.crlf_action, checksafe);
return ret | ident_to_git(path, src, len, dst, ca.ident);
}

The first step is the clean filter, the CRLF-LF conversion (if needed),
then ident.
The current implementation streams the whole file content to the filter,
(STDIN of the filter) and reads back STDOUT from the filter into a STRBUF.
This is to use the UNIX-like STDIN--STDOUT method for writing a filter.

However, int would_convert_to_git_filter_fd() and convert_to_git_filter_fd()
offer a sort of short-cut:
The filter reads from the file directly, and the output of the filter is
read into a STRBUF.

It looks as if the multi-filter approach can use this in a similar way:
Give the pathname to the filter, the filter opens the file for reading
and stream the result via the pkt-line protocol into Git.
This needs some more changes, and may be very well go into a separate patch
series. (and should).

What I am asking for:
When a multi-filter is used, the content is handled to the filter via pkt-line,
and the result is given to Git via pkt-line ?
Nothing wrong with it, I just wonder, if it should be mentioned somewhere.

> diff --git a/contrib/long-running-filter/example.pl 
> b/contrib/long-running-filter/example.pl
> new file mode 100755
> index 000..279fbfb
> --- /dev/null
> +++ b/contrib/long-running-filter/example.pl
> @@ -0,0 +1,111 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git filter protocol version 2
> +# See Documentation/gitattributes.txt, section "Filter Protocol"
> +#
> +
> +use strict;
> +use warnings;
> +
> +my $MAX_PACKET_CONTENT_SIZE = 65516;
> +
> +sub packet_read {
> +my $buffer;
> +my $bytes_read = read STDIN, $buffer, 4;
> +if ( $bytes_read == 0 ) {
> +
> +# EOF - Git stopped talking to us!
> +exit();
> +}
> +elsif ( $bytes_read != 4 ) {
> +die "invalid packet size '$bytes_read' field";
> +}

This is half-kosher, I would say,
(And I really. really would like to see an implementation in C ;-)

A read function may look like this:

   ret = read(0, , 4);
   if (ret < 0) {
 /* Error, nothing we can do */
 exit(1);
   } else if (ret == 0) {
 /* EOF  */
 exit(0);
   } else if (ret < 4) {
 /* 
  * Go and read more, until we have 4 bytes or EOF or Error */
   } else {
 /* Good case, see below */
   }

> +my $pkt_size = hex($buffer);
> +if ( $pkt_size == 0 ) {
> +return ( 1, "" );
> +}
> +elsif ( $pkt_size > 4 ) {
> +my $content_size = $pkt_size - 4;
> +$bytes_read = read STDIN, $buffer, $content_size;
> +if ( $bytes_read != $content_size ) {
> +die "invalid packet ($content_size expected; $bytes_read read)";
> +}
> +return ( 0, $buffer );
> +}
> +else {
> +die "invalid packet size";
> +}
> +}
> +
> +sub packet_write {
> +my ($packet) = @_;
> +print STDOUT sprintf( "%04x", length($packet) + 4 );
> +print STDOUT $packet;
> +STDOUT->flush();
> +}
> +
> +sub packet_flush {
> +print STDOUT sprintf( "%04x", 0 );
> +STDOUT->flush();
> +}
> +
> +( packet_read() eq ( 0, "git-filter-client" ) ) || die "bad initialization";
> +( packet_read() eq ( 0, "version=2" ) ) || die "bad version";
> +( packet_read() eq ( 1, "" ) )  || die "bad version end";
> +
> +packet_write("git-filter-server\n");
> +packet_write("version=2\n");
> +
> +( packet_read() eq ( 0, "clean=true" ) )  || die "bad capability";
> +( packet_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
> +( packet_read() eq ( 1, "" ) )|| die "bad capability end";
> +
> +packet_write( "clean=true\n" );
> +packet_write( "smudge=true\n" );
> +packet_flush();
> +
> +while (1) {
> +my ($command) = packet_read() =~ /^command=([^=]+)\n$/;
> +my ($pathname) = packet_read() =~ /^pathname=([^=]+)\n$/;
> +
> +packet_read();
> +
> +my $input = "";
> +{
> +binmode(STDIN);
> +my $buffer;
> +my $done = 0;
> +while ( !$done ) {
> +( $done, $buffer ) = packet_read();
> +$input .= $buffer;
> +}
> +}
> +
> +my $output;
> +if ( $command eq "clean" ) {
> +### Perform clean here ###
> +$output = $input;
> +}
> +elsif ( $command eq "smudge" ) {
> +### Perform smudge here ###
> +$output = $input;
> +}
> +else {
> +die "bad command '$command'";
> +}
> +
> +packet_write("status=success\n");
> +packet_flush();
> +while ( length($output) > 0 ) {
> +my $packet = substr( $output, 0, 

Re: [PATCH v7 10/10] convert: add filter..process option

2016-09-10 Thread Torsten Bögershausen
On Thu, Sep 08, 2016 at 08:21:32PM +0200, larsxschnei...@gmail.com wrote:
[]
> +packet:  git> git-filter-client
> +packet:  git> version=2
> +packet:  git> version=42
> +packet:  git> 
> +packet:  git< git-filter-server
> +packet:  git< version=2
> +packet:  git> clean=true
> +packet:  git> smudge=true
> +packet:  git> not-yet-invented=true
> +packet:  git> 
> +packet:  git< clean=true
> +packet:  git< smudge=true
> +packet:  git< 

It's probalby only me who has difficulties to distinguish
'>' from '<'.

packet:  git> git-filter-client
packet:  git> version=2
packet:  git> version=42
packet:  git> 
packet:   filter> git-filter-server
packet:   filter> version=2

(Otherwise the dialoge description is nice)

> +
> +Supported filter capabilities in version 2 are "clean" and
> +"smudge".
> +
> +Afterwards Git sends a list of "key=value" pairs terminated with
> +a flush packet. The list will contain at least the filter command
> +(based on the supported capabilities) and the pathname of the file
> +to filter relative to the repository root. Right after these packets
> +Git sends the content split in zero or more pkt-line packets and a
> +flush packet to terminate content.
> +
> +packet:  git> command=smudge\n
> +packet:  git> pathname=path/testfile.dat\n

How do we send pathnames the have '\n' ?
Not really recommended, but allowed.
And here I am a little bit lost, is each of the lines packed into
a pkt-line ?
command=smudge is packet as pkt-line and pathname= is packed into
another one ? (The we don't need the '\n' at all)
Or go both lines into one pkt-line (thats what I think), then
we don't need the '\n' afther the pathname.
And in this case the pathname is always the last element, and a '\n'
may occur in the pathname, since we know the length of the packet
we know how long the pathname must be.



> +packet:  git> 
> +packet:  git> CONTENT
> +packet:  git> 
> +
> +


> +The filter is expected to respond with a list of "key=value" pairs
> +terminated with a flush packet. If the filter does not experience
> +problems then the list must contain a "success" status. Right after
> +these packets the filter is expected to send the content in zero
> +or more pkt-line packets and a flush packet at the end. Finally, a
> +second list of "key=value" pairs terminated with a flush packet
> +is expected. The filter can change the status in the second list.
> +
> +packet:  git< status=success\n
> +packet:  git< 
> +packet:  git< SMUDGED_CONTENT
> +packet:  git< 
> +packet:  git<   # empty list!
> +
> +
> +If the result content is empty then the filter is expected to respond
> +with a success status and an empty list.
> +
> +packet:  git< status=success\n
> +packet:  git< 
> +packet:  git<   # empty content!
> +packet:  git<   # empty list!
> +
> +
> +In case the filter cannot or does not want to process the content,

Does not want ? 
I can see something like "I read through the file, there is nothing
to do. So Git, I don't send anything back, you know where the file is.

> +it is expected to respond with an "error" status. Depending on the
> +`filter..required` flag Git will interpret that as error
> +but it will not stop or restart the filter process.
> +
> +packet:  git< status=error\n
> +packet:  git< 
> +
> +
> +If the filter experiences an error during processing, then it can
> +send the status "error" after the content was (partially or
> +completely) sent. Depending on the `filter..required` flag
> +Git will interpret that as error but it will not stop or restart the
> +filter process.
> +
> +packet:  git< status=success\n
> +packet:  git< 
> +packet:  git< HALF_WRITTEN_ERRONEOUS_CONTENT
> +packet:  git< 
> +packet:  git< status=error\n
> +packet:  git< 
> +
> +
> +If the filter dies during the communication or does not adhere to
> +the protocol then Git will stop the filter process and restart it

My personal comment:
When a filter is mis-behaving, Git should say so loud and clear, and
die(). 
The filter process can be left running, so that it can be debugged.

Here I stopped the review for a moment, 
I still think that Git shouldn't kill anything, because we loose
the ability to debug these processes.



[PATCH v7 10/10] convert: add filter..process option

2016-09-08 Thread larsxschneider
From: Lars Schneider 

Git's clean/smudge mechanism invokes an external filter process for
every single blob that is affected by a filter. If Git filters a lot of
blobs then the startup time of the external filter processes can become
a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge
filter written in golang to filter 12,000 files. This process took 364s
with the existing filter mechanism and 5s with the new mechanism. See
details here: https://github.com/github/git-lfs/pull/1382

This patch adds the `filter..process` string option which, if
used, keeps the external filter process running and processes all blobs
with the packet format (pkt-line) based protocol over standard input and
standard output. The full protocol is explained in detail in
`Documentation/gitattributes.txt`.

A few key decisions:

* The long running filter process is referred to as filter protocol
  version 2 because the existing single shot filter invocation is
  considered version 1.
* Git sends a welcome message and expects a response right after the
  external filter process has started. This ensures that Git will not
  hang if a version 1 filter is incorrectly used with the
  filter..process option for version 2 filters. In addition,
  Git can detect this kind of error and warn the user.
* The status of a filter operation (e.g. "success" or "error) is set
  before the actual response and (if necessary!) re-set after the
  response. The advantage of this two step status response is that if
  the filter detects an error early, then the filter can communicate
  this and Git does not even need to create structures to read the
  response.
* All status responses are pkt-line lists terminated with a flush
  packet. This allows us to send other status fields with the same
  protocol in the future.

Helped-by: Martin-Louis Bright 
Reviewed-by: Jakub Narebski 
Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt| 154 +-
 contrib/long-running-filter/example.pl | 111 ++
 convert.c  | 349 ---
 pkt-line.h |   1 +
 t/t0021-conversion.sh  | 365 -
 t/t0021/rot13-filter.pl| 179 
 unpack-trees.c |   1 +
 7 files changed, 1129 insertions(+), 31 deletions(-)
 create mode 100755 contrib/long-running-filter/example.pl
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 7aff940..ac000ea 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the 
command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate.  If a long running `process` filter is used
+in place of `clean` and/or `smudge` filters, then Git can process
+all blobs with a single filter command invocation for the entire
+life of a single Git command, for example `git add --all`.  See
+section below for the description of the protocol used to
+communicate with a `process` filter.
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -373,6 +379,152 @@ not exist, or may have different contents. So, smudge and 
clean commands
 should not try to access the file on disk, but only act as filters on the
 content provided to them on standard input.
 
+Long Running Filter Process
+^^^
+
+If the filter command (a string value) is defined via
+`filter..process` then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using the following packet format
+(pkt-line, see technical/protocol-common.txt) based protocol over
+standard input and standard output.
+
+Git starts the filter when it encounters the first file
+that needs to be cleaned or smudged. After the filter started
+Git sends a welcome message ("git-filter-client"), a list of
+supported protocol version numbers, and a flush packet. Git expects
+to read a welcome response message ("git-filter-server") and exactly
+one protocol version number from the previously sent list. All further
+communication will be based on the selected version. The remaining
+protocol description below documents "version=2". Please note that
+"version=42" in the example below does not exist and is only there
+to illustrate how the protocol would look like with more than one
+version.
+