Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing

2016-09-16 Thread Jeff King
On Fri, Sep 16, 2016 at 12:12:50PM -0700, Junio C Hamano wrote:

> > +static int check_header_raw(struct mailinfo *mi,
> > +   char *buf, size_t len,
> > +   struct strbuf *hdr_data[], int overwrite) {
> > +   const struct strbuf sb = {0, len, buf};
> > +   return check_header(mi, , hdr_data, overwrite);
> > +}
> 
> IIUC, this is a helper for callers that do not have a strbuf but
> instead have  pair to perform the same check_header() the
> callers that have strbuf can do.
> 
> As check_header() uses the strbuf as a read-only entity, wrapping
> the  pair in a temporary strbuf like this is safe.
> 
> The incoming  should conceptually be "const char *", but it's
> OK.

I think the "right" way to do this would be to continue taking a "char
*", and then strbuf_attach() it. That saves us from unexpectedly
violating any strbuf invariants.

If our assumption that check_header() does not touch the
contents turns out to be wrong, neither strategy would inform our
caller, though. I think you'd want something like:

  assert(sb.buf == buf);

after check_header() returns (though I guess we are in theory protected
by the "const").

That being said...

> If check_header() didn't call any helper function that gets passed
>  as a strbuf, or if convertiong the helper function to take a
>  pair instead, I would actually suggest refactoring this
> the other way around, though.  That is, move the implementation of
> check_header() to this function, updating its reference to line->buf
> and line->len to reference to  and , and then make
> check_header() a thin wrapper that does
> 
> check_header(mi, const struct strbuf *line,
>  struct strbuf *hdr_data[], int overwrite)
> {
> return check_header_raw(mi, line->buf, line->len,
> hdr_data, overwrite);
> }

This is _way_ better, and it looks like check_header() could handle it
easily. Looking at it, I also suspect the cascading if in that function
could be made more pleasant by modeling cmp_header()'s interface after
skip_prefix_mem(), but that is totally orthogonal and optional.

-Peff


Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing

2016-09-16 Thread Junio C Hamano
Jonathan Tan  writes:

> Within the processing of the commit message, check for a scissors line
> or a patchbreak line first (before checking for in-body headers) so that
> a subsequent patch modifying the processing of in-body headers would not
> cause a scissors line or patchbreak line to be misidentified.
>
> If a line could be both an in-body header and a scissors line (for
> example, "From: -- >8 --"), this is considered a fatal error
> (previously, it would be interpreted as an in-body header).

The scissors line is designed to allow garbage other than scissors
and perforation marks to be on the same line, i.e.

/*
 * The mark must be at least 8 bytes long (e.g. "-- >8 --").
 * Even though there can be arbitrary cruft on the same line
 * (e.g. "cut here"), in order to avoid misidentification, the
 * perforation must occupy more than a third of the visible
 * width of the line, and dashes and scissors must occupy more
 * than half of the perforation.
 */

Even though it is not likely for people to do so, it would probably
be nicer if we can treat

From: -- >8 -- cut -- >8 -- >8 -- here -- >8 --

as a scissors line instead of making it a fatal error, by treating
that "From:" as just a random garbage.

But this is a minor point.  It is not worth to make it work like so
if the resulting code will become messier.

> The following enumeration shows that processing is the same except (as
> described above) the in-body header + scissors line case.
>
> o in-body header (check_header OK)
>   o passes UTF-8 conversion
> o [described above] is scissors line
> o [not possible] is patchbreak line
> o [not possible] is blank line
> o is none of the above - processed as header
>   o fails UTF-8 conversion - processed as header
> o not in-body header
>   o passes UTF-8 conversion
> o is scissors line - processed as scissors
> o is patchbreak line - processed as patchbreak
> o is blank line - ignored if in header_stage
> o is none of the above - log message
>   o fails UTF-8 conversion - input error
>
> As for the result left in "line" (after the invocation of
> handle_commit_msg), it is unused (by its caller, handle_filter, and by
> handle_filter's callers, handle_boundary and handle_body) unless this
> line is a patchbreak line, in which case handle_patch is subsequently
> called (in handle_filter) on "line". In this case, "line" must have
> passed UTF-8 conversion both before and after this patch, so the result
> is still the same overall.
>
> Signed-off-by: Jonathan Tan 
> ---
>  mailinfo.c | 145 
> -
>  1 file changed, 115 insertions(+), 30 deletions(-)
>
> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..23a56c2 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -340,23 +340,56 @@ static struct strbuf *decode_b_segment(const struct 
> strbuf *b_seg)
>   return out;
>  }
>  
> -static int convert_to_utf8(struct mailinfo *mi,
> -struct strbuf *line, const char *charset)
> +/*
> + * Attempts to convert line into UTF-8, storing the result in line.
> + *
> + * This differs from convert_to_utf8 in that conversion non-success is not
> + * considered an error case - mi->input_error is not set, and no error 
> message
> + * is printed.
> + *
> + * If the conversion is unnecessary, returns 0 and stores NULL in old_buf (if
> + * old_buf is not NULL).
> + *
> + * If the conversion is successful, returns 0 and stores the unconverted 
> string
> + * in old_buf and old_len (if they are respectively not NULL).
> + *
> + * If the conversion is unsuccessful, returns -1.
> + */
> +static int try_convert_to_utf8(const struct mailinfo *mi, struct strbuf 
> *line,
> +const char *charset, char **old_buf,
> +size_t *old_len)
>  {
> - char *out;
> + char *utf8;
>  
> - if (!mi->metainfo_charset || !charset || !*charset)
> + if (!mi->metainfo_charset || !charset || !*charset ||
> + same_encoding(mi->metainfo_charset, charset)) {
> + if (old_buf)
> + *old_buf = NULL;
>   return 0;
> + }
>  
> - if (same_encoding(mi->metainfo_charset, charset))
> + utf8 = reencode_string(line->buf, mi->metainfo_charset, charset);
> + if (utf8) {
> + char *temp = strbuf_detach(line, old_len);
> + if (old_buf)
> + *old_buf = temp;
> + strbuf_attach(line, utf8, strlen(utf8), strlen(utf8));
>   return 0;
> - out = reencode_string(line->buf, mi->metainfo_charset, charset);
> - if (!out) {
> + }
> + return -1;
> +}
> +
> +/*
> + * Converts line into UTF-8, setting mi->input_error to -1 upon failure.
> + */
> +static int convert_to_utf8(struct mailinfo *mi,
> +struct strbuf