Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-15 Thread Jeff King
On Thu, Sep 15, 2016 at 07:15:33AM +0200, Kevin Daudt wrote:

> > > Another small thing I am not sure about is if the \ quoting can hide
> > > an embedded newline in the author name.  Would we end up turning
> > > 
> > >   From: "Jeff \
> > > King" 
> > > 
> > > or somesuch into
> > > 
> > >   Author: Jeff
> > > King
> > > Email: p...@peff.net
> > > 
> > > ;-)
> > 
> > Heh, yeah. That is another reason to clean up and sanitize as much as
> > possible before stuffing it into another text format that will be
> > parsed.
> 
> A quoted string cannot contain newlines according to the RFC, so I think
> we don't need to care about that.

I wondered how we handled something like:

  From: =?UTF-8?q?J=0Aff=20King?= 

which sticks a newline into the middle of the buffer. We do decode it
that way, but eventually call cleanup_space() which converts a run of 1
or more isspace() characters into a single space (0x20). So you end up
with:

  Author: J ff King

which is probably reasonable.

> Makes sense, the current itteration of my patch already strips exterior
> quotes, no matter where they happen.
> 
> I will send a patch soon.

Great. Thanks for working on this.

-Peff


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Kevin Daudt
On Wed, Sep 14, 2016 at 12:38:20PM -0700, Jeff King wrote:
> On Wed, Sep 14, 2016 at 12:30:06PM -0700, Junio C Hamano wrote:
> 
> > Another small thing I am not sure about is if the \ quoting can hide
> > an embedded newline in the author name.  Would we end up turning
> > 
> > From: "Jeff \
> > King" 
> > 
> > or somesuch into
> > 
> > Author: Jeff
> > King
> > Email: p...@peff.net
> > 
> > ;-)
> 
> Heh, yeah. That is another reason to clean up and sanitize as much as
> possible before stuffing it into another text format that will be
> parsed.

A quoted string cannot contain newlines according to the RFC, so I think
we don't need to care about that.

> 
> > So let's roll the \" -> " into mailinfo.
> > 
> > I am not sure if we also should remove the surrounding "", i.e. we
> > currently do not turn this
> > 
> > From: "Jeff King" 
> > 
> > into this:
> > 
> > Author: Jeff King
> > Email: p...@peff.net
> > 
> > I think we probably should, and remove the one that does so from the
> > reader.
> 
> I think you have to, or else you cannot tell the difference between
> surrounding quotes that need to be stripped, and ones that were
> backslash-escaped. Like:
> 
>   From: "Jeff King" 
>   From: \"Jeff King\" 
> 
> which would both become:
> 
>   Author: "Jeff King"
>   Email: p...@peff.net
> 
> though I am not sure the latter one is actually valid; you might need to
> be inside syntactic quotes in order to include backslashed quotes. I
> haven't read rfc2822 carefully recently enough to know.
> 
> Anyway, I think that:
> 
>   From: One "Two \"Three\" Four" Five
> 
> may also be valid. So the quote-stripping in the reader is not just "at
> the outside", but may need to handle interior syntactic quotes, too. So
> it really makes sense for me to clean and sanitize as much as possible
> in one step, and then make the parser of mailinfo as dumb as possible.
> 

Makes sense, the current itteration of my patch already strips exterior
quotes, no matter where they happen.

I will send a patch soon.


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 12:30:06PM -0700, Junio C Hamano wrote:

> Another small thing I am not sure about is if the \ quoting can hide
> an embedded newline in the author name.  Would we end up turning
> 
>   From: "Jeff \
> King" 
> 
> or somesuch into
> 
>   Author: Jeff
> King
> Email: p...@peff.net
> 
> ;-)

Heh, yeah. That is another reason to clean up and sanitize as much as
possible before stuffing it into another text format that will be
parsed.

> So let's roll the \" -> " into mailinfo.
> 
> I am not sure if we also should remove the surrounding "", i.e. we
> currently do not turn this
> 
>   From: "Jeff King" 
> 
> into this:
> 
>   Author: Jeff King
> Email: p...@peff.net
> 
> I think we probably should, and remove the one that does so from the
> reader.

I think you have to, or else you cannot tell the difference between
surrounding quotes that need to be stripped, and ones that were
backslash-escaped. Like:

  From: "Jeff King" 
  From: \"Jeff King\" 

which would both become:

  Author: "Jeff King"
  Email: p...@peff.net

though I am not sure the latter one is actually valid; you might need to
be inside syntactic quotes in order to include backslashed quotes. I
haven't read rfc2822 carefully recently enough to know.

Anyway, I think that:

  From: One "Two \"Three\" Four" Five

may also be valid. So the quote-stripping in the reader is not just "at
the outside", but may need to handle interior syntactic quotes, too. So
it really makes sense for me to clean and sanitize as much as possible
in one step, and then make the parser of mailinfo as dumb as possible.

-Peff


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Junio C Hamano
Jeff King  writes:

> On Wed, Sep 14, 2016 at 10:43:18AM -0700, Junio C Hamano wrote:
>
>> I think we can go either way and it does not matter all that much if
>> "mailinfo" changes its output or the reader of "mailinfo" output
>> changes its input--we will either be munging data read from "From:"
>> when producing the "Author:" line, or taking the "Author:" output by
>> mailinfo and removing the quotes.
>
> Yeah, that was the part I was wondering about in my original response.
> What is the output of mailinfo _supposed_ to be, and do we consider that
> at all public (i.e., are there are other tools besides "git am" that
> build on mailinfo)?
>
> At least "am" already does some quote-stripping, so any de-quoting added
> in mailinfo is potentially a regression (if we indeed care about keeping
> the output stable).

Another small thing I am not sure about is if the \ quoting can hide
an embedded newline in the author name.  Would we end up turning

From: "Jeff \
King" 

or somesuch into

Author: Jeff
King
Email: p...@peff.net

;-)

> But if we are OK with that, it seems to me that mailinfo is the best
> place to do the de-quoting, because then its output is well-defined:
> everything after "Author:" up to the newline is the name.

There are other things mailinfo does, like turning this

From: p...@peff.net (Jeff King)

into

Author: Jeff King
Email: p...@peff.net

and

From: Uh "foo" Bar p...@peff.net (Jeff King)

into

Author: Uh "foo" Bar (Jeff King)
Email: p...@peff.net

So let's roll the \" -> " into mailinfo.

I am not sure if we also should remove the surrounding "", i.e. we
currently do not turn this

From: "Jeff King" 

into this:

Author: Jeff King
Email: p...@peff.net

I think we probably should, and remove the one that does so from the
reader.



Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 10:43:18AM -0700, Junio C Hamano wrote:

> I think we can go either way and it does not matter all that much if
> "mailinfo" changes its output or the reader of "mailinfo" output
> changes its input--we will either be munging data read from "From:"
> when producing the "Author:" line, or taking the "Author:" output by
> mailinfo and removing the quotes.

Yeah, that was the part I was wondering about in my original response.
What is the output of mailinfo _supposed_ to be, and do we consider that
at all public (i.e., are there are other tools besides "git am" that
build on mailinfo)?

At least "am" already does some quote-stripping, so any de-quoting added
in mailinfo is potentially a regression (if we indeed care about keeping
the output stable).

But if we are OK with that, it seems to me that mailinfo is the best
place to do the de-quoting, because then its output is well-defined:
everything after "Author:" up to the newline is the name. Whereas if the
cleanup of the value is split across mailinfo and its reader, then it is
hard to know which side is responsible for which part. mailinfo handles
whitespace unfolding, certainly. What other rfc2822 things does it
handle? What are the rules for dequoting its output?

I'll admit I don't care _too_ much. This is a remote corner of the code
that I hope never to have to look at. I'm mostly just describing how the
problem space makes sense to _me_, and how I would write it if starting
from scratch.

-Peff


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Junio C Hamano
Kevin Daudt  writes:

> When applied the the author of this patch shows up as:
>
> Author: A U Thor" (test) 
>
> So I agree with Jeff[1] where he states that the surrounding quotes
> should be removed, if that's not a problem for git.
>
> [1]:https://public-inbox.org/git/20160914051305.vphknpsikyxi3...@sigill.intra.peff.net/

I think we can go either way and it does not matter all that much if
"mailinfo" changes its output or the reader of "mailinfo" output
changes its input--we will either be munging data read from "From:"
when producing the "Author:" line, or taking the "Author:" output by
mailinfo and removing the quotes.

As an output from mailinfo that looks like this:

Author: "A U Thor"
Email: a...@thor.com

is made into a commit object that has this:

author A U Thor 

we know that the reader of mailinfo output _already_ has some logic
to strip the surrounding double quotes.  That is the only reason why
I think it is a better approach to not dequote in the "mailinfo" but
in the reader to turn

Author: "A \"U\" Thor"
Email: a...@thor.com

into a commit object that has this:

author A "U" Thor 

than updating mailinfo to produce

Author: A "U" Thor
Email: a...@thor.com

and then create the same result.


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Kevin Daudt
On Tue, Sep 13, 2016 at 10:54:47PM -0700, Junio C Hamano wrote:
> Jeff King  writes:
> 
> > It has been a while since I looked at rfc2822, but aren't the quoting
> > and syntax rules different for addresses versus other headers? We would
> > not want to dequote a Subject header, I think.
> 
> You're absolutely right.  RFC2822 does not quite _want_ to dequote
> anything.  As you pointed out in a separate message, we are the one
> who want to strip out "" quoting when mailinfo says
> 
>   Author: "Jeff King"
> 
> to its standard output (aka "info"), and turn it into
> 
>   GIT_AUTHOR_NAME='Jeff King'
> 
> and do so ONLY for the author name.
> 
> So I would think it is the responsibility of the one that reads the
> "info" file that is produced by mailinfo to dequote the backslash
> thing if the mailinfo gave us
> 
>   Author: "Jeff \"Peff\" King"
> 

The RFC makes a distinction between structured fields and unstructured
fields. Quoting would not even be necessary for unstructured fields
(like Subject), so yes, that those fields should be left alone.

Unstructures fields are subject, comments, keywords and optional fields,
the rest is considered structured.

Because the only field where this is a problem is the From field, I
think it would be safe to limit the unquoting just to that field.

My reasoning to do the unquoting here is because it's the RFC requires
the quoting in the first place.

I already noticed a bug in the current unquoting of the author when
adding a comment to the From: field.

From: "A U Thor"  (test)

When applied the the author of this patch shows up as:

Author: A U Thor" (test) 

So I agree with Jeff[1] where he states that the surrounding quotes
should be removed, if that's not a problem for git.

[1]:https://public-inbox.org/git/20160914051305.vphknpsikyxi3...@sigill.intra.peff.net/


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-13 Thread Junio C Hamano
Jeff King  writes:

> It has been a while since I looked at rfc2822, but aren't the quoting
> and syntax rules different for addresses versus other headers? We would
> not want to dequote a Subject header, I think.

You're absolutely right.  RFC2822 does not quite _want_ to dequote
anything.  As you pointed out in a separate message, we are the one
who want to strip out "" quoting when mailinfo says

Author: "Jeff King"

to its standard output (aka "info"), and turn it into

GIT_AUTHOR_NAME='Jeff King'

and do so ONLY for the author name.

So I would think it is the responsibility of the one that reads the
"info" file that is produced by mailinfo to dequote the backslash
thing if the mailinfo gave us

Author: "Jeff \"Peff\" King"



Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-13 Thread Jeff King
On Wed, Sep 14, 2016 at 01:46:12AM +0200, Kevin Daudt wrote:

> diff --git a/t/t5100/quoted-pair.expect b/t/t5100/quoted-pair.expect
> new file mode 100644
> index 000..9fe72e9
> --- /dev/null
> +++ b/t/t5100/quoted-pair.expect
> @@ -0,0 +1,5 @@
> +Author: "Author "The Author" Name"
> +Email: someb...@example.com
> +Subject: testing quoted-pair
> +Date: Sun, 25 May 2008 00:38:18 -0700

So obviously this is much better than including the backslashed quotes.
But I have to wonder why the first line is not:

  Author: Author "The Author" Name

Who is responsible for stripping out the other quotes? I know that they
_do_ get stripped out even in the current code, but it is not clear to
me if that is intentional or an accident.

In Git's world-view (e.g., in commit headers), an ident name continues
until we get to the "<" of the email (or a "\n" terminates the header
line completely). So if mailinfo is converting rfc2822 headers into Git
ident, I'd expect it to fully remove any quotes that are not intended to
be in the name, and everything after "Author: " up to the newline would
become the name.

It's entirely possible I'm missing something subtle about the design of
mailinfo, though.

-Peff


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-13 Thread Jeff King
On Tue, Sep 13, 2016 at 05:04:45PM -0700, Junio C Hamano wrote:

> > @@ -461,6 +506,7 @@ static int check_header(struct mailinfo *mi,
> >  */
> > strbuf_add(&sb, line->buf + len + 2, line->len - len - 
> > 2);
> > decode_header(mi, &sb);
> > +   unescape_quoted_pair(mi, &sb);
> > handle_header(&hdr_data[i], &sb);
> > ret = 1;
> > goto check_header_out;
> 
> I wonder why this call is only in here, not on other headers that
> all call decode_header().  For that matter, I wonder if the call (or
> the logic of the helper function itself) should go at the end of
> decode_header().  After all, this is different kind of decoding; the
> current one knows how to do b/q encoding but forgot about the more
> traditional quoting done with backslash, and you are teaching the
> code that the current decoding it does is insufficient and how to
> handle the one that the original implementors forgot about.

It has been a while since I looked at rfc2822, but aren't the quoting
and syntax rules different for addresses versus other headers? We would
not want to dequote a Subject header, I think.

-Peff


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-13 Thread Kevin Daudt
On Tue, Sep 13, 2016 at 05:04:45PM -0700, Junio C Hamano wrote:
> Kevin Daudt  writes:
> 
> 
> It often is easier to read if smaller of the two are in the if part
> and the larger in else part.  Also your switch/case is indented one
> level too deep.  I.e.
> 

Thanks, I've switched the order and fixed indentation.

> 
> I found the variable name "skip" a bit hard to reason about.  What
> it does is to signal the next round of the processing that we have
> seen a single-byte quote and it should keep the byte it will get, no
> matter what its value is.  It is "skipping" the conditional
> processing, but I'd imagine most people would consider it is
> "keeping the byte".

Yes, I agree and was trying to find a better name. I have renamed it to
"take_next_literally", which indicates better what it means.

> 
> > @@ -461,6 +506,7 @@ static int check_header(struct mailinfo *mi,
> >  */
> > strbuf_add(&sb, line->buf + len + 2, line->len - len - 
> > 2);
> > decode_header(mi, &sb);
> > +   unescape_quoted_pair(mi, &sb);
> > handle_header(&hdr_data[i], &sb);
> > ret = 1;
> > goto check_header_out;
> 
> I wonder why this call is only in here, not on other headers that
> all call decode_header().  For that matter, I wonder if the call (or
> the logic of the helper function itself) should go at the end of
> decode_header().  After all, this is different kind of decoding; the
> current one knows how to do b/q encoding but forgot about the more
> traditional quoting done with backslash, and you are teaching the
> code that the current decoding it does is insufficient and how to
> handle the one that the original implementors forgot about.

Makes sense, it should be applied to all headers (I missed the other
decode_header calls).

I will send a new version later.


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-13 Thread Junio C Hamano
Kevin Daudt  writes:

> +static int unescape_quoted_pair(struct mailinfo *mi, struct strbuf *line)
> +{
> + struct strbuf outbuf = STRBUF_INIT;
> + const char *in = line->buf;
> + int c, skip=0;
> + char escape_context=0;

Have SP around '=', i.e.

int c, skip = 0;
char escape_context = 0;

> + while ((c = *in++) != 0) {
> + if (!skip) {
> + switch (c) {
> + case '"':
> +...
> + break;
> + }
> + } else {
> + skip = 0;
> + }
> +
> + strbuf_addch(&outbuf, c);
> + }

It often is easier to read if smaller of the two are in the if part
and the larger in else part.  Also your switch/case is indented one
level too deep.  I.e.

while (...) {
if (skip) {
skip = 0;
} else {
switch (c) {
case '"':
do this;
...
}
}
strbuf_addch(...);
}

I found the variable name "skip" a bit hard to reason about.  What
it does is to signal the next round of the processing that we have
seen a single-byte quote and it should keep the byte it will get, no
matter what its value is.  It is "skipping" the conditional
processing, but I'd imagine most people would consider it is
"keeping the byte".

> @@ -461,6 +506,7 @@ static int check_header(struct mailinfo *mi,
>*/
>   strbuf_add(&sb, line->buf + len + 2, line->len - len - 
> 2);
>   decode_header(mi, &sb);
> + unescape_quoted_pair(mi, &sb);
>   handle_header(&hdr_data[i], &sb);
>   ret = 1;
>   goto check_header_out;

I wonder why this call is only in here, not on other headers that
all call decode_header().  For that matter, I wonder if the call (or
the logic of the helper function itself) should go at the end of
decode_header().  After all, this is different kind of decoding; the
current one knows how to do b/q encoding but forgot about the more
traditional quoting done with backslash, and you are teaching the
code that the current decoding it does is insufficient and how to
handle the one that the original implementors forgot about.

Thanks.