[PATCH 00/17] Peace with CRLF

Junio C Hamano Wed, 28 Oct 2015 15:34:20 -0700

We have too many topics titled "War on something"; let's try to make
peace for a change.


This is a continuation to $gmane/275735, which is filed in the
archive under another mailing list:

  http://thread.gmane.org/gmane.comp.version-control.msysgit/21773

We read "text" files from filesystem (or from the standard input) in
various places in the code.  Some of them are written by us, but
others are often prepared by an editor driven by human user.  Even
though we write (and expect) lines in these text files to be
terminated with LF (not CRLF), the end-user's editor can be told to
use CRLF termination, and on some platforms that may be the default.

Because many codepaths (e.g. commit log message) first pass the
contents of such a file through stripspace(), and as a side effect
of discarding the whitespace at the end of each line, lines
terminated with CRLF are normalized to LF terminated lines (but we
do not lose a lone CR in the middle of a non-blank string), this is
not a problem in many codepaths.  But not all of the codepaths are
safe.

Typically, we use strbuf_getline() to read these "text" files, which
reads a single line up to the first LF into a buffer, discard that
LF at the end, and returns (an incomplete last line gets returned
as-is).  In theory, these places that expect to read "text", we
should be able to update the lotic to read a line up to the first
LF, discard that LF, together with a CR if one appears immediately
before that LF, without breaking anything.

I inspected all the callsites of this function to see if it is safe
to use such an updated logic at these callsites, and did not find
anything problematic.  I could update strbuf_getline() in place, but
just to be extra careful, this series instead introduces another
helper, strbuf_gets(), that is aware of this CRLF business, and
convert the ones that are safe to update as we verify.

At the end of this message, you will find my notes while inspecting
the current codebase as of 37023ba3 (Seventh batch for 2.7,
2015-10-26).

 * This series converts only the callers of strbuf_getline() in the
   category [A], i.e. the current code would misbehave when fed a
   file with CRLF-terminated lines and use the data with an unwanted
   CR appended at the end, and with the update the code should work
   as intended with such a file, without breaking the expected
   behaviour when working on a file with LF-terminated lines.

 * Callers that expect to read from our own output do not have to
   accomodate CRLF-terminated lines, but they can be updated to do
   so safely if we do not rely on the ability to express a payload
   that has CR at the end.  For example, the insn sheet "rebase -i"
   uses typically end with commit log summary that we ourselves do
   not read or use, so even if the end-user on a platform with LF
   lines deliberately does insert \r at the end of the line and
   strbuf_gets() removed that \r from the payload, no unexpected
   behaviour should happen.  They are categorized as [B] in the
   attached notes (and this series does not touch them).

 * Some callers just strbuf_trim() or otherwise have logic that
   tolerates whitespaces at the end of the line.  For them, it does
   not make any difference whether strbuf_getline() or strbuf_gets()
   is used to read the lines to be processed.  They are caregorized
   as [C] in the attached notes (and this series does not touch
   them).

 * I haven't found a caller that wants to only see LF-terminated
   lines (in other words, "ABC\r\n" must be treated as a line with 4
   bytes payload on it, A B C and CR), but the survey reserves
   category name [X] for this empty set ;-).

Double-checking my classification for callers in [B] and [C] and
making them all use strbuf_gets() would be a good microproject for
GSoC in coming years ;-) If all callers with strbuf_getline() that
uses '\n' as the terminator disappears at the end, that would be
ideal.

Junio C Hamano (17):
  strbuf: add strbuf_gets()
  check-attr, check-ignore, checkout-index: read paths with strbuf_gets()
  update-index: read --index-info with strbuf_gets()
  update-index: read list of paths with strbuf_gets() under --stdin
  mktree: read textual tree representation with strbuf_gets()
  hash-object: read --stdin-paths with strbuf_gets()
  revision: read --stdin with strbuf_gets()
  rev-parse: read parseopt spec with strbuf_gets()
  ident.c: read /etc/mailname with strbuf_gets()
  remote.c: read $GIT_DIR/remotes/* with strbuf_gets()
  clone/sha1_file: read info/alternates with strbuf_gets()
  transport-helper: read helper response with strbuf_gets()
  cat-file: read batch stream with strbuf_gets()
  column: read lines with strbuf_gets()
  send-pack: read list of refs with strbuf_gets()
  grep: read -f file with strbuf_gets()
  test-sha1-array: read command stream with strbuf_gets()

 builtin/am.c             | 23 ++++-------------------
 builtin/cat-file.c       |  2 +-
 builtin/check-attr.c     |  4 +++-
 builtin/check-ignore.c   |  5 ++++-
 builtin/checkout-index.c |  4 +++-
 builtin/clone.c          |  2 +-
 builtin/column.c         |  2 +-
 builtin/grep.c           |  2 +-
 builtin/hash-object.c    |  2 +-
 builtin/mktree.c         |  4 +++-
 builtin/rev-parse.c      |  4 ++--
 builtin/send-pack.c      |  2 +-
 builtin/update-index.c   |  9 +++++++--
 ident.c                  |  2 +-
 remote.c                 |  2 +-
 revision.c               |  9 ++-------
 sha1_file.c              |  2 +-
 strbuf.c                 | 16 ++++++++++++++--
 strbuf.h                 |  7 +++++++
 test-sha1-array.c        |  2 +-
 transport-helper.c       |  2 +-
 21 files changed, 60 insertions(+), 47 deletions(-)

-- >8 -- survey of current strbuf_getline() callers -- >8 --

updating == "teaching strbuf_getline() to strip CR that comes
             immediately before the LF"

[A] may be problematic on CRLF systems, "updating" will help fixing.

[B] CRLF line ending may or may not appear, but "updating" does not hurt.

[C] CRLF line ending may appear, but the code is already safe.

[X] The caller cares about distinction between CRLF and LF, and "updating"
    introduce breakage.


bisect.c uses it to read from BISECT_NAMES, which is a text
file written by redirecting "rev-parse --sq-quote" into it. [B]

bisect.c uses it to read from BISECT_EXPECTED_REV, which is
a "ref-like" file written by update_ref(). [B]

*** This should be cleaned up to use read_ref() or something.

ident.c reads from /etc/mailname, which is a text file. [A]
*** mailnamebuf may want to be strbuf_strip()'ed, and doing so is
another way to fix this.

credential-cache--daemon.c uses it to read requests, which is a text
channel. [B]

credential-store.c uses it to read from the credential file, which is
a text file. However, it is parsed by credential_from_url() that is
not affected by an extra CR at the end. [C]

credential-store.c parrots the original in CRLF as CRLF with "store",
while writing a new one with LF.  This codepath will start normalizing
the resulting file to LF, which may not be a bad thing. [C]

credential.c uses it to read from the helper. [B]

daemon.c uses it to read from a child process's output in order to
save it to the log.  The current code may have sent string with CR at
the end to logerror(). [A]

fast-import.c uses it to read the command stream, which is text. [B]

remote-curl.c uses it as a transport helper to read commadn stream,
which is text [B]

remote-testsvn.c uses it as a transport helper to read commadn stream,
which is text [B]

remote.c uses it to read $GIT_DIR/remotes/* files, which is text. [A]

sequencer.c uses it to read $GIT_DIR/sequencer/head that it itself
writes. [B]

builtin/clone.c and sha1_file.c uses it to read from an alternates
file, which is text.  [A]

shell.c does end-user interactions ;-) [B]

test-sha1-array.c reads command stream, which is text [A].

transport-helper.c reads helper output, which is text [A].

walker.c reads from the standard input a <target> HT <ref> per line,
which is text [A].

wt-status.c reads from rebase-i insn, which is text, but it uses
strbuf_trim() on it [C].

wt-status.c also reads from various "ref-like" things, all of which
are internally generated by us [B].

builtin/am.c does a lot to process mail input. [A]

builtin/cat-file.c uses it to read --batch input, which is text [A].

builtin/checkout-index.c, builtin/check-attr.c and
builtin/check-ignore.c use it to read --stdin, which is text [A].

builtin/check-mailmap.c uses it to read --stdin, but the tail end of
the input line is not used [C].

builtin/clean.c does end-user interactions; it trims so it is safe [C].

builtin/column.c reads from standard input text. [A]

builtin/commit.c reads from MERGE_HEAD which is internally generated
[B].

builtin/fetch-pack.c reads list of refs with --stdin, which is
internally generated [B].

builtin/grep.c uses it to read patterns from a file with -f [A].

builtin/hash-object.c uses it with --stdin-paths [A].

mailinfo uses it to read from its input [B].

builtin/mktree.c reads etxt-formatted tree representation [A].

builtin/notes.c reads from the standard input upon "git notes copy";
the input is trimmed [C].

builtin/pull.c reads FETCH_HEAD which is internally generated [B].

builtin/repack.c reads the packname output by pack-objects which is
internally generated [B].

builtin/rev-parse.c uses it for its parseopt mode [A].

builtin/send-pack.c uses it to read the refs from --stdin [A].

builtin/update-index.c reads "--index-info" [A].

builtin/update-index.c reads paths from "--stdin" [A].


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 00/17] Peace with CRLF

Reply via email to