subject:"RFC\: Native clean\/smudge filter for UTF\-16 files"

Re: RFC: Native clean/smudge filter for UTF-16 files

2017-12-04 Thread Jeff King

On Sun, Dec 03, 2017 at 07:48:01PM +0100, Lars Schneider wrote:

> > - if core.convertEncoding is true, then for any file with an
> >   encoding=foo attribute, internally run iconv(foo, utf8) in
> >   convert_to_git(), and likewise iconv(utf8, foo) in
> >   convert_to_working_tree.
> > 
> > - I'm not sure if core.convertEncoding should be enabled by default. If
> >   it's a noop as long as there's no encoding attribute, then it's
> >   probably fine. But I would not want accidental conversion or any
> >   slowdown for the common case that the user wants no conversion.
> 
> I think we should mimic the behavior of "eol=crlf/lf" attribute.
> 
> AFAIK whenever I set "*.ext text eol=crlf", then I can be sure the 
> file is checked out with CRLF independent of any of my local config
> settings. Isn't that correct? I would expect a similar behavior if
> "*.ext text encoding=utf16" is set. Wouldn't that mean that we do
> not need a "core.convertEncoding" config?

Yeah, on further thought, that's probably the right thing. Both "eol"
and "encoding" attributes are definite indications of what should happen
(unlike "text", which is just saying you _could_ convert line endings if
you wished to, and therefore has to be used in conjunction with a config
setting).

I like the name "encoding" for the attribute, but I do wonder if this
would bite anybody using it already for other purposes (like gitk).

> > There is one other approach, which is to really store utf-16 in the
> > repository and better teach the diff tools to handle it (which are
> > really the main thing in git that cares about looking into the blob
> > contents). You can do this already with a textconv filter, but:
> > 
> >  1. It's slow (though cacheable).
> > 
> >  2. It doesn't work unless each repo configures the filter (so not on
> > sites like GitHub, unless we define a micro-format that diff=utf16
> > should be textconv'd on display, and get all implementations to
> > respect that).
> 
> Actually, rendering diffs on Git hosting sites such as GitHub is one
> of my goals. Therefore, storing content as UTF-16 wouldn't be a solution
> for me.

If there were a convention for specifying the attribute, I think sites
like GitHub would start respecting it in the server-side diffs (though
like I said, we could also just auto-detect via BOM without even
requiring any attributes to be set up).

> >  3. Textconv patches look good, but can't be applied. This occasionally
> > makes things awkward, depending on your workflow.
> 
> TBH I dont't understand what you mean here. What do you mean with
> "textconv patches"?

I mean the patch produced by "git diff" when textconv is in effect. That
patch cannot be applied to the original content. E.g.:

  git init
  echo "* diff=foo" >.git/info/attributes
  git config diff.foo.textconv "sed s/^/foo:/"

  echo base >file
  git add file
  git commit -m base

  echo change >file
  git diff >patch

  git reset --hard
  git apply patch

That works in the absence of the textconv, but not with it. (For a real
binary file, you'd probably need "diff --binary" to produce a usable
patch, but the principle is the same).

-Peff

Re: RFC: Native clean/smudge filter for UTF-16 files

2017-12-03 Thread Lars Schneider


> On 24 Nov 2017, at 19:04, Jeff King  wrote:
> 
> On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:
> 
>> Alternatively, I could add a native attribute to Git that translates UTF-16 
>> to UTF-8 and back. A conversion function is already available in "mingw.h" 
>> [3]
>> on Windows. Limiting this feature to Windows wouldn't be a problem from my
>> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
>> could look like this:
>> 
>>*.txttext encoding=utf-16
>> 
>> There was a previous discussion on the topic and Jonathan already suggested
>> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
>> is already present but, as far as I can tell, is only used by the git gui
>> for viewing [5].
> 
> I would not want to see a proliferation of built-in filters, but it
> really seems like text-encoding conversion is a broad and practical one
> that many people might benefit from. So just like line-ending
> conversion, which _could_ be done by generic filters, it makes sense to
> me to support it natively for speed and simplicity.
> 
>> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
>> "encoding=utf-16" on Windows would have a chance to get accepted?
> 
> You haven't fully specified the semantics here, so let me sketch out
> what I think it ought to look like:

Thank you :-)


> - declare utf8 the "canonical" in-repo representation, just as we have
>   declared LF for line endings (alternatively this could be
>   configurable, but if we can get away with declaring utf8 the one true
>   encoding, that cuts out a lot of corner cases).

100% agree on UTF-8 as canonical representation and I wouldn't make 
that configurable as it would open a can of worms.


> - if core.convertEncoding is true, then for any file with an
>   encoding=foo attribute, internally run iconv(foo, utf8) in
>   convert_to_git(), and likewise iconv(utf8, foo) in
>   convert_to_working_tree.
> 
> - I'm not sure if core.convertEncoding should be enabled by default. If
>   it's a noop as long as there's no encoding attribute, then it's
>   probably fine. But I would not want accidental conversion or any
>   slowdown for the common case that the user wants no conversion.

I think we should mimic the behavior of "eol=crlf/lf" attribute.

AFAIK whenever I set "*.ext text eol=crlf", then I can be sure the 
file is checked out with CRLF independent of any of my local config
settings. Isn't that correct? I would expect a similar behavior if
"*.ext text encoding=utf16" is set. Wouldn't that mean that we do
not need a "core.convertEncoding" config?

I do 100% agree that it must be a noop if there is no encoding 
attribute present.


> - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I
>   don't think people consistently have all utf-16 files (the way they
>   might have all CRLF files) rather a few files that must be utf-16.

Agreed!


> - I have actually seen two types of utf-16 in git repos in the wild:
>   files which really must be utf-16 (because some tool demands it) and
>   files which happen to be utf-16, but could just as easily be utf-8
>   (and the user simply does not notice and commits utf-16, but doesn't
>   realize it until much later when their diffs are unreadable).
> 
>   For the first case, the "encoding" thing above would work fine. For
>   the second case, in theory we could have an option that takes any
>   file with a "text" attribute and no "encoding" attribute, and
>   converts it to utf-8.

TBH I would label that a "non-goal". Because AFAIK we cannot reliability
detect the encoding automatically.


>   I suspect that's opening a can of worms for false positives similar
>   to core.autocrlf. And performance drops as we try to guess the
>   encoding and convert all incoming data.
> 
>   So I mention it mostly as a direction I think we probably _don't_
>   want to go. Anybody with the "this could have been utf-8 all along"
>   type of file can remedy it by converting and committing the result.

100 % agree.


> Omitting all of the "we shouldn't do this" bullet points, it seems
> pretty simple and sane to me.
> 
> There is one other approach, which is to really store utf-16 in the
> repository and better teach the diff tools to handle it (which are
> really the main thing in git that cares about looking into the blob
> contents). You can do this already with a textconv filter, but:
> 
>  1. It's slow (though cacheable).
> 
>  2. It doesn't work unless each repo configures the filter (so not on
> sites like GitHub, unless we define a micro-format that diff=utf16
> should be textconv'd on display, and get all implementations to
> respect that).

Actually, rendering diffs on Git hosting sites such as GitHub is one
of my goals. Therefore, storing content as UTF-16 wouldn't be a solution
for me.


>  3. Textconv patches look good, but can't be applied. This occasionally
> makes things awkward,

Re: RFC: Native clean/smudge filter for UTF-16 files

2017-11-24 Thread Junio C Hamano

Jeff King  writes:

> So anyway, that is an alternate strategy, but I think I like "canonical
> in-repo text is utf-8" approach a lot more, since then git operations
> work consistently. There are still a few rough edges (e.g., I'm not sure

Sounds like a good way forward.

> if you could apply a utf-8 patch directly to a utf-16 working tree file.
> Certainly not using "patch", but I'm not sure how well "git apply" would
> handle that case either). But I think it would mostly Just Work as long
> as people were willing to set their encoding attributes.

It should work (or fail) just like applying LF patch to CRLF working
tree, so I wouldn't worry too much about it.

Thanks.

Re: RFC: Native clean/smudge filter for UTF-16 files

2017-11-24 Thread Jeff King

On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:

> Alternatively, I could add a native attribute to Git that translates UTF-16 
> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
> on Windows. Limiting this feature to Windows wouldn't be a problem from my
> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
> could look like this:
> 
> *.txttext encoding=utf-16
>
> There was a previous discussion on the topic and Jonathan already suggested
> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
> is already present but, as far as I can tell, is only used by the git gui
> for viewing [5].

I would not want to see a proliferation of built-in filters, but it
really seems like text-encoding conversion is a broad and practical one
that many people might benefit from. So just like line-ending
conversion, which _could_ be done by generic filters, it makes sense to
me to support it natively for speed and simplicity.

> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
> "encoding=utf-16" on Windows would have a chance to get accepted?

You haven't fully specified the semantics here, so let me sketch out
what I think it ought to look like:

 - declare utf8 the "canonical" in-repo representation, just as we have
   declared LF for line endings (alternatively this could be
   configurable, but if we can get away with declaring utf8 the one true
   encoding, that cuts out a lot of corner cases).

 - if core.convertEncoding is true, then for any file with an
   encoding=foo attribute, internally run iconv(foo, utf8) in
   convert_to_git(), and likewise iconv(utf8, foo) in
   convert_to_working_tree.

 - I'm not sure if core.convertEncoding should be enabled by default. If
   it's a noop as long as there's no encoding attribute, then it's
   probably fine. But I would not want accidental conversion or any
   slowdown for the common case that the user wants no conversion.

 - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I
   don't think people consistently have all utf-16 files (the way they
   might have all CRLF files) rather a few files that must be utf-16.

 - I have actually seen two types of utf-16 in git repos in the wild:
   files which really must be utf-16 (because some tool demands it) and
   files which happen to be utf-16, but could just as easily be utf-8
   (and the user simply does not notice and commits utf-16, but doesn't
   realize it until much later when their diffs are unreadable).

   For the first case, the "encoding" thing above would work fine. For
   the second case, in theory we could have an option that takes any
   file with a "text" attribute and no "encoding" attribute, and
   converts it to utf-8.

   I suspect that's opening a can of worms for false positives similar
   to core.autocrlf. And performance drops as we try to guess the
   encoding and convert all incoming data.

   So I mention it mostly as a direction I think we probably _don't_
   want to go. Anybody with the "this could have been utf-8 all along"
   type of file can remedy it by converting and committing the result.

Omitting all of the "we shouldn't do this" bullet points, it seems
pretty simple and sane to me.

There is one other approach, which is to really store utf-16 in the
repository and better teach the diff tools to handle it (which are
really the main thing in git that cares about looking into the blob
contents). You can do this already with a textconv filter, but:

  1. It's slow (though cacheable).

  2. It doesn't work unless each repo configures the filter (so not on
 sites like GitHub, unless we define a micro-format that diff=utf16
 should be textconv'd on display, and get all implementations to
 respect that).

  3. Textconv patches look good, but can't be applied. This occasionally
 makes things awkward, depending on your workflow.

  4. You have to actually mark each file with an attribute, which is
 slightly annoying and more thing to remember to do (I see this from
 the server side, since people often commit utf-16 without any
 attributes, and then get annoyed when they see the file marked as
 binary).

We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and
doing an "auto-textconv" to utf-8 (for our human-readable diffs only, of
course). That solves (1), (2), and (4), but leaves (3). I actually
looked into using libicu to do it not just for UTF-16, but to detect any
encoding. It turned out to be really slow, though. :)

So anyway, that is an alternate strategy, but I think I like "canonical
in-repo text is utf-8" approach a lot more, since then git operations
work consistently. There are still a few rough edges (e.g., I'm not sure
if you could apply a utf-8 patch directly to a utf-16 working tree file.
Certainly not using "patch", but I'm not sure how well "git apply" would
handle that case either). But I

Re: RFC: Native clean/smudge filter for UTF-16 files

2017-11-23 Thread Torsten Bögershausen

On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:
> Hi,
> 
> I am working with a team that owns a repository with lots of UTF-16 files.
> Converting these files to UTF-8 is no good option as downstream applications
> require the UTF-16 encoding. Keeping the files in UTF-16 is no good option
> either as Git and Git related tools (e.g. GitHub) consider the files binary
> and consequently do not render diffs.
> 
> The obvious solution is to setup a clean/smudge filter like this [1]:
> [filter "winutf16"]
> clean = iconv -f utf-16 -t utf-8
> smudge = iconv -f utf-8 -t utf-16
> 
> In general this works well but the "per-file" clean/smudge process invocation 
> can be slow for many files. I could apply the same trick that we used for Git
> LFS and write a iconv that processes all files with a single invocation (see
> "edcc85814c convert: add filter..process option" [2]). 
> 
> Alternatively, I could add a native attribute to Git that translates UTF-16 
> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
> on Windows. Limiting this feature to Windows wouldn't be a problem from my
> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
> could look like this:
> 
> *.txttext encoding=utf-16
> 

I would probably vote for UTF-16LE.
But we can specify whatever iconv allows, see below.

[]
> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
> "encoding=utf-16" on Windows would have a chance to get accepted?

Yes.
The thing is that we have convert.c and utf8.c (reencode_string_iconv()
and possible strbuf.c, which feels that they are in charge here.

Having a path using mingw.h would mean that this is Windows only,
and that is not a good solution.
(I sometimes host files on a Linux box, export them via SAMBA to
 Mac and Windows. Having that kind of setup allows to commit
 directly on the Linux fileserver)

But even if you don't have that setup, cross platform is a must, I would say.
There may be possible optimizations using xutftowcsn() and friends under
Windows, but they may come later into the picture (or are not needed)
What file sizes are we talking about ?
100K ?
100M ?

It is even possible to hook into the streaming interface.

> 
> Thanks,
> Lars
>

RFC: Native clean/smudge filter for UTF-16 files

2017-11-23 Thread Lars Schneider

Hi,

I am working with a team that owns a repository with lots of UTF-16 files.
Converting these files to UTF-8 is no good option as downstream applications
require the UTF-16 encoding. Keeping the files in UTF-16 is no good option
either as Git and Git related tools (e.g. GitHub) consider the files binary
and consequently do not render diffs.

The obvious solution is to setup a clean/smudge filter like this [1]:
[filter "winutf16"]
clean = iconv -f utf-16 -t utf-8
smudge = iconv -f utf-8 -t utf-16

In general this works well but the "per-file" clean/smudge process invocation 
can be slow for many files. I could apply the same trick that we used for Git
LFS and write a iconv that processes all files with a single invocation (see
"edcc85814c convert: add filter..process option" [2]). 

Alternatively, I could add a native attribute to Git that translates UTF-16 
to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
on Windows. Limiting this feature to Windows wouldn't be a problem from my
point of view as UTF-16 is only relevant on Windows anyways. The attribute 
could look like this:

*.txttext encoding=utf-16

There was a previous discussion on the topic and Jonathan already suggested
a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
is already present but, as far as I can tell, is only used by the git gui
for viewing [5].

Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
"encoding=utf-16" on Windows would have a chance to get accepted?

Thanks,
Lars

[1] https://github.com/msysgit/msysgit/issues/113#issue-13142846
[2] https://github.com/git/git/commit/edcc85814c87ebd7f3b1b7d3979fac3dfb84d308
[3] 
https://github.com/git/git/blob/14c63a9dc093d6738454f6369a4f5663ca732cf7/compat/mingw.h#L501-L533
[4] https://public-inbox.org/git/20101022195331.GA12014@burratino/
[5] https://github.com/git/git/commit/1ffca60f0b0395e1e593e64d66e7ed3c47d8517e

Re: RFC: Native clean/smudge filter for UTF-16 files

Re: RFC: Native clean/smudge filter for UTF-16 files

Re: RFC: Native clean/smudge filter for UTF-16 files

Re: RFC: Native clean/smudge filter for UTF-16 files

Re: RFC: Native clean/smudge filter for UTF-16 files

RFC: Native clean/smudge filter for UTF-16 files

6 matches

Site Navigation

Mail list logo

Footer information