Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Thu, Sep 15, 2016 at 07:15:33AM +0200, Kevin Daudt wrote: > > > Another small thing I am not sure about is if the \ quoting can hide > > > an embedded newline in the author name. Would we end up turning > > > > > > From: "Jeff \ > > > King" > > > > > > or somesuch into > > > > > > Author: Jeff > > > King > > > Email: p...@peff.net > > > > > > ;-) > > > > Heh, yeah. That is another reason to clean up and sanitize as much as > > possible before stuffing it into another text format that will be > > parsed. > > A quoted string cannot contain newlines according to the RFC, so I think > we don't need to care about that. I wondered how we handled something like: From: =?UTF-8?q?J=0Aff=20King?= which sticks a newline into the middle of the buffer. We do decode it that way, but eventually call cleanup_space() which converts a run of 1 or more isspace() characters into a single space (0x20). So you end up with: Author: J ff King which is probably reasonable. > Makes sense, the current itteration of my patch already strips exterior > quotes, no matter where they happen. > > I will send a patch soon. Great. Thanks for working on this. -Peff
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Wed, Sep 14, 2016 at 12:38:20PM -0700, Jeff King wrote: > On Wed, Sep 14, 2016 at 12:30:06PM -0700, Junio C Hamano wrote: > > > Another small thing I am not sure about is if the \ quoting can hide > > an embedded newline in the author name. Would we end up turning > > > > From: "Jeff \ > > King" > > > > or somesuch into > > > > Author: Jeff > > King > > Email: p...@peff.net > > > > ;-) > > Heh, yeah. That is another reason to clean up and sanitize as much as > possible before stuffing it into another text format that will be > parsed. A quoted string cannot contain newlines according to the RFC, so I think we don't need to care about that. > > > So let's roll the \" -> " into mailinfo. > > > > I am not sure if we also should remove the surrounding "", i.e. we > > currently do not turn this > > > > From: "Jeff King" > > > > into this: > > > > Author: Jeff King > > Email: p...@peff.net > > > > I think we probably should, and remove the one that does so from the > > reader. > > I think you have to, or else you cannot tell the difference between > surrounding quotes that need to be stripped, and ones that were > backslash-escaped. Like: > > From: "Jeff King" > From: \"Jeff King\" > > which would both become: > > Author: "Jeff King" > Email: p...@peff.net > > though I am not sure the latter one is actually valid; you might need to > be inside syntactic quotes in order to include backslashed quotes. I > haven't read rfc2822 carefully recently enough to know. > > Anyway, I think that: > > From: One "Two \"Three\" Four" Five > > may also be valid. So the quote-stripping in the reader is not just "at > the outside", but may need to handle interior syntactic quotes, too. So > it really makes sense for me to clean and sanitize as much as possible > in one step, and then make the parser of mailinfo as dumb as possible. > Makes sense, the current itteration of my patch already strips exterior quotes, no matter where they happen. I will send a patch soon.
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Wed, Sep 14, 2016 at 12:30:06PM -0700, Junio C Hamano wrote: > Another small thing I am not sure about is if the \ quoting can hide > an embedded newline in the author name. Would we end up turning > > From: "Jeff \ > King" > > or somesuch into > > Author: Jeff > King > Email: p...@peff.net > > ;-) Heh, yeah. That is another reason to clean up and sanitize as much as possible before stuffing it into another text format that will be parsed. > So let's roll the \" -> " into mailinfo. > > I am not sure if we also should remove the surrounding "", i.e. we > currently do not turn this > > From: "Jeff King" > > into this: > > Author: Jeff King > Email: p...@peff.net > > I think we probably should, and remove the one that does so from the > reader. I think you have to, or else you cannot tell the difference between surrounding quotes that need to be stripped, and ones that were backslash-escaped. Like: From: "Jeff King" From: \"Jeff King\" which would both become: Author: "Jeff King" Email: p...@peff.net though I am not sure the latter one is actually valid; you might need to be inside syntactic quotes in order to include backslashed quotes. I haven't read rfc2822 carefully recently enough to know. Anyway, I think that: From: One "Two \"Three\" Four" Five may also be valid. So the quote-stripping in the reader is not just "at the outside", but may need to handle interior syntactic quotes, too. So it really makes sense for me to clean and sanitize as much as possible in one step, and then make the parser of mailinfo as dumb as possible. -Peff
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
Jeff King writes: > On Wed, Sep 14, 2016 at 10:43:18AM -0700, Junio C Hamano wrote: > >> I think we can go either way and it does not matter all that much if >> "mailinfo" changes its output or the reader of "mailinfo" output >> changes its input--we will either be munging data read from "From:" >> when producing the "Author:" line, or taking the "Author:" output by >> mailinfo and removing the quotes. > > Yeah, that was the part I was wondering about in my original response. > What is the output of mailinfo _supposed_ to be, and do we consider that > at all public (i.e., are there are other tools besides "git am" that > build on mailinfo)? > > At least "am" already does some quote-stripping, so any de-quoting added > in mailinfo is potentially a regression (if we indeed care about keeping > the output stable). Another small thing I am not sure about is if the \ quoting can hide an embedded newline in the author name. Would we end up turning From: "Jeff \ King" or somesuch into Author: Jeff King Email: p...@peff.net ;-) > But if we are OK with that, it seems to me that mailinfo is the best > place to do the de-quoting, because then its output is well-defined: > everything after "Author:" up to the newline is the name. There are other things mailinfo does, like turning this From: p...@peff.net (Jeff King) into Author: Jeff King Email: p...@peff.net and From: Uh "foo" Bar p...@peff.net (Jeff King) into Author: Uh "foo" Bar (Jeff King) Email: p...@peff.net So let's roll the \" -> " into mailinfo. I am not sure if we also should remove the surrounding "", i.e. we currently do not turn this From: "Jeff King" into this: Author: Jeff King Email: p...@peff.net I think we probably should, and remove the one that does so from the reader.
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Wed, Sep 14, 2016 at 10:43:18AM -0700, Junio C Hamano wrote: > I think we can go either way and it does not matter all that much if > "mailinfo" changes its output or the reader of "mailinfo" output > changes its input--we will either be munging data read from "From:" > when producing the "Author:" line, or taking the "Author:" output by > mailinfo and removing the quotes. Yeah, that was the part I was wondering about in my original response. What is the output of mailinfo _supposed_ to be, and do we consider that at all public (i.e., are there are other tools besides "git am" that build on mailinfo)? At least "am" already does some quote-stripping, so any de-quoting added in mailinfo is potentially a regression (if we indeed care about keeping the output stable). But if we are OK with that, it seems to me that mailinfo is the best place to do the de-quoting, because then its output is well-defined: everything after "Author:" up to the newline is the name. Whereas if the cleanup of the value is split across mailinfo and its reader, then it is hard to know which side is responsible for which part. mailinfo handles whitespace unfolding, certainly. What other rfc2822 things does it handle? What are the rules for dequoting its output? I'll admit I don't care _too_ much. This is a remote corner of the code that I hope never to have to look at. I'm mostly just describing how the problem space makes sense to _me_, and how I would write it if starting from scratch. -Peff
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
Kevin Daudt writes: > When applied the the author of this patch shows up as: > > Author: A U Thor" (test) > > So I agree with Jeff[1] where he states that the surrounding quotes > should be removed, if that's not a problem for git. > > [1]:https://public-inbox.org/git/20160914051305.vphknpsikyxi3...@sigill.intra.peff.net/ I think we can go either way and it does not matter all that much if "mailinfo" changes its output or the reader of "mailinfo" output changes its input--we will either be munging data read from "From:" when producing the "Author:" line, or taking the "Author:" output by mailinfo and removing the quotes. As an output from mailinfo that looks like this: Author: "A U Thor" Email: a...@thor.com is made into a commit object that has this: author A U Thor we know that the reader of mailinfo output _already_ has some logic to strip the surrounding double quotes. That is the only reason why I think it is a better approach to not dequote in the "mailinfo" but in the reader to turn Author: "A \"U\" Thor" Email: a...@thor.com into a commit object that has this: author A "U" Thor than updating mailinfo to produce Author: A "U" Thor Email: a...@thor.com and then create the same result.
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Tue, Sep 13, 2016 at 10:54:47PM -0700, Junio C Hamano wrote: > Jeff King writes: > > > It has been a while since I looked at rfc2822, but aren't the quoting > > and syntax rules different for addresses versus other headers? We would > > not want to dequote a Subject header, I think. > > You're absolutely right. RFC2822 does not quite _want_ to dequote > anything. As you pointed out in a separate message, we are the one > who want to strip out "" quoting when mailinfo says > > Author: "Jeff King" > > to its standard output (aka "info"), and turn it into > > GIT_AUTHOR_NAME='Jeff King' > > and do so ONLY for the author name. > > So I would think it is the responsibility of the one that reads the > "info" file that is produced by mailinfo to dequote the backslash > thing if the mailinfo gave us > > Author: "Jeff \"Peff\" King" > The RFC makes a distinction between structured fields and unstructured fields. Quoting would not even be necessary for unstructured fields (like Subject), so yes, that those fields should be left alone. Unstructures fields are subject, comments, keywords and optional fields, the rest is considered structured. Because the only field where this is a problem is the From field, I think it would be safe to limit the unquoting just to that field. My reasoning to do the unquoting here is because it's the RFC requires the quoting in the first place. I already noticed a bug in the current unquoting of the author when adding a comment to the From: field. From: "A U Thor" (test) When applied the the author of this patch shows up as: Author: A U Thor" (test) So I agree with Jeff[1] where he states that the surrounding quotes should be removed, if that's not a problem for git. [1]:https://public-inbox.org/git/20160914051305.vphknpsikyxi3...@sigill.intra.peff.net/
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
Jeff King writes: > It has been a while since I looked at rfc2822, but aren't the quoting > and syntax rules different for addresses versus other headers? We would > not want to dequote a Subject header, I think. You're absolutely right. RFC2822 does not quite _want_ to dequote anything. As you pointed out in a separate message, we are the one who want to strip out "" quoting when mailinfo says Author: "Jeff King" to its standard output (aka "info"), and turn it into GIT_AUTHOR_NAME='Jeff King' and do so ONLY for the author name. So I would think it is the responsibility of the one that reads the "info" file that is produced by mailinfo to dequote the backslash thing if the mailinfo gave us Author: "Jeff \"Peff\" King"
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Wed, Sep 14, 2016 at 01:46:12AM +0200, Kevin Daudt wrote: > diff --git a/t/t5100/quoted-pair.expect b/t/t5100/quoted-pair.expect > new file mode 100644 > index 000..9fe72e9 > --- /dev/null > +++ b/t/t5100/quoted-pair.expect > @@ -0,0 +1,5 @@ > +Author: "Author "The Author" Name" > +Email: someb...@example.com > +Subject: testing quoted-pair > +Date: Sun, 25 May 2008 00:38:18 -0700 So obviously this is much better than including the backslashed quotes. But I have to wonder why the first line is not: Author: Author "The Author" Name Who is responsible for stripping out the other quotes? I know that they _do_ get stripped out even in the current code, but it is not clear to me if that is intentional or an accident. In Git's world-view (e.g., in commit headers), an ident name continues until we get to the "<" of the email (or a "\n" terminates the header line completely). So if mailinfo is converting rfc2822 headers into Git ident, I'd expect it to fully remove any quotes that are not intended to be in the name, and everything after "Author: " up to the newline would become the name. It's entirely possible I'm missing something subtle about the design of mailinfo, though. -Peff
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Tue, Sep 13, 2016 at 05:04:45PM -0700, Junio C Hamano wrote: > > @@ -461,6 +506,7 @@ static int check_header(struct mailinfo *mi, > > */ > > strbuf_add(&sb, line->buf + len + 2, line->len - len - > > 2); > > decode_header(mi, &sb); > > + unescape_quoted_pair(mi, &sb); > > handle_header(&hdr_data[i], &sb); > > ret = 1; > > goto check_header_out; > > I wonder why this call is only in here, not on other headers that > all call decode_header(). For that matter, I wonder if the call (or > the logic of the helper function itself) should go at the end of > decode_header(). After all, this is different kind of decoding; the > current one knows how to do b/q encoding but forgot about the more > traditional quoting done with backslash, and you are teaching the > code that the current decoding it does is insufficient and how to > handle the one that the original implementors forgot about. It has been a while since I looked at rfc2822, but aren't the quoting and syntax rules different for addresses versus other headers? We would not want to dequote a Subject header, I think. -Peff
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
On Tue, Sep 13, 2016 at 05:04:45PM -0700, Junio C Hamano wrote: > Kevin Daudt writes: > > > It often is easier to read if smaller of the two are in the if part > and the larger in else part. Also your switch/case is indented one > level too deep. I.e. > Thanks, I've switched the order and fixed indentation. > > I found the variable name "skip" a bit hard to reason about. What > it does is to signal the next round of the processing that we have > seen a single-byte quote and it should keep the byte it will get, no > matter what its value is. It is "skipping" the conditional > processing, but I'd imagine most people would consider it is > "keeping the byte". Yes, I agree and was trying to find a better name. I have renamed it to "take_next_literally", which indicates better what it means. > > > @@ -461,6 +506,7 @@ static int check_header(struct mailinfo *mi, > > */ > > strbuf_add(&sb, line->buf + len + 2, line->len - len - > > 2); > > decode_header(mi, &sb); > > + unescape_quoted_pair(mi, &sb); > > handle_header(&hdr_data[i], &sb); > > ret = 1; > > goto check_header_out; > > I wonder why this call is only in here, not on other headers that > all call decode_header(). For that matter, I wonder if the call (or > the logic of the helper function itself) should go at the end of > decode_header(). After all, this is different kind of decoding; the > current one knows how to do b/q encoding but forgot about the more > traditional quoting done with backslash, and you are teaching the > code that the current decoding it does is insufficient and how to > handle the one that the original implementors forgot about. Makes sense, it should be applied to all headers (I missed the other decode_header calls). I will send a new version later.
Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields
Kevin Daudt writes: > +static int unescape_quoted_pair(struct mailinfo *mi, struct strbuf *line) > +{ > + struct strbuf outbuf = STRBUF_INIT; > + const char *in = line->buf; > + int c, skip=0; > + char escape_context=0; Have SP around '=', i.e. int c, skip = 0; char escape_context = 0; > + while ((c = *in++) != 0) { > + if (!skip) { > + switch (c) { > + case '"': > +... > + break; > + } > + } else { > + skip = 0; > + } > + > + strbuf_addch(&outbuf, c); > + } It often is easier to read if smaller of the two are in the if part and the larger in else part. Also your switch/case is indented one level too deep. I.e. while (...) { if (skip) { skip = 0; } else { switch (c) { case '"': do this; ... } } strbuf_addch(...); } I found the variable name "skip" a bit hard to reason about. What it does is to signal the next round of the processing that we have seen a single-byte quote and it should keep the byte it will get, no matter what its value is. It is "skipping" the conditional processing, but I'd imagine most people would consider it is "keeping the byte". > @@ -461,6 +506,7 @@ static int check_header(struct mailinfo *mi, >*/ > strbuf_add(&sb, line->buf + len + 2, line->len - len - > 2); > decode_header(mi, &sb); > + unescape_quoted_pair(mi, &sb); > handle_header(&hdr_data[i], &sb); > ret = 1; > goto check_header_out; I wonder why this call is only in here, not on other headers that all call decode_header(). For that matter, I wonder if the call (or the logic of the helper function itself) should go at the end of decode_header(). After all, this is different kind of decoding; the current one knows how to do b/q encoding but forgot about the more traditional quoting done with backslash, and you are teaching the code that the current decoding it does is insufficient and how to handle the one that the original implementors forgot about. Thanks.