Re: git-rebase is ignoring working-tree-encoding

2018-11-08 Thread Torsten Bögershausen
On Wed, Nov 07, 2018 at 05:38:18AM +0100, Adrián Gimeno Balaguer wrote:
> Hello Torsten,
> 
> Thanks for answering.
> 
> To answer your question, I removed the comments mentioning "rebase" because
> the encoding issue I reported also happens on simpler operations
> (described in the PR). The problem is not directly related to
> rebasing, so I thought it better to drop them and avoid unrelated
> confusion.
> 
> Let's get back to the problem. Each system has a default endianness.
> Also, for working-tree-encoding in .gitattributes, Git behaves
> differently depending on the attribute's value and the contents of the
> file it refers to. When I put the value "UTF-16", the file
> must have a BOM, or Git complains. If I instead put the value
> "UTF-16BE" or "UTF-16LE", Git refuses to operate on the file if it has a
> UTF-16 BOM, whichever endianness that BOM indicates.
> 
> My initial goal was, given a UTF-16LE file, to be able to view
> human-readable diffs whenever I change it (and yes, it must
> be Little Endian). This file also has a BOM. Now, what are the
> options with Git currently (considering only working-tree-encoding)? If I
> put working-tree-encoding=UTF-16, then I can view readable diffs and
> commit the file, but here is the main problem: Git loses the information
> about which endianness the file originally had. After
> staging/committing, it re-encodes the file from UTF-8 (as stored
> internally) to UTF-16 in the default system endianness. In my case it
> chose Big Endian, which breaks the project's requirement. That is
> why I ended up writing a fixup script to change the encoding back to
> UTF-16LE.

OK, I think I understand your problem now.
The file format which you ask for could be named "UTF-16-BOM-LE",
but that does not exist in reality.
If you use UTF-16, then there must be a BOM, and if there is a BOM,
then a Unicode-aware application -should- be able to handle it.

Why does your project require such a format ?

> 
> On the other hand, once I set working-tree-encoding=UTF-16LE, Git
> prohibited me from committing the file and even from viewing human-readable
> diffs (the output simply says it's a binary file). The
> internal location of these errors is the function in utf8.c that I
> changed in the PR. I hope this is clearer!
> 
> Finally, Git's behaviour here is based on Unicode standards,
> which is why I acknowledged that my changes violate them after
> referring to a link which is present in the utf8.h file.

[]


Re: git-rebase is ignoring working-tree-encoding

2018-11-06 Thread Adrián Gimeno Balaguer
Hello Torsten,

Thanks for answering.

To answer your question, I removed the comments mentioning "rebase" because
the encoding issue I reported also happens on simpler operations
(described in the PR). The problem is not directly related to
rebasing, so I thought it better to drop them and avoid unrelated
confusion.

Let's get back to the problem. Each system has a default endianness.
Also, for working-tree-encoding in .gitattributes, Git behaves
differently depending on the attribute's value and the contents of the
file it refers to. When I put the value "UTF-16", the file
must have a BOM, or Git complains. If I instead put the value
"UTF-16BE" or "UTF-16LE", Git refuses to operate on the file if it has a
UTF-16 BOM, whichever endianness that BOM indicates.

My initial goal was, given a UTF-16LE file, to be able to view
human-readable diffs whenever I change it (and yes, it must
be Little Endian). This file also has a BOM. Now, what are the
options with Git currently (considering only working-tree-encoding)? If I
put working-tree-encoding=UTF-16, then I can view readable diffs and
commit the file, but here is the main problem: Git loses the information
about which endianness the file originally had. After
staging/committing, it re-encodes the file from UTF-8 (as stored
internally) to UTF-16 in the default system endianness. In my case it
chose Big Endian, which breaks the project's requirement. That is
why I ended up writing a fixup script to change the encoding back to
UTF-16LE.
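
For illustration, a fixup step of that kind might look like the sketch
below. This is not the actual script from the project; `to_utf16le` is a
hypothetical name. It assumes the file has an even number of bytes, and it
relies on UTF-16BE and UTF-16LE differing only in the byte order within
each 16-bit code unit, so swapping every byte pair with `dd conv=swab`
converts the BOM and the text together.

```shell
#!/bin/sh
# Sketch: rewrite a UTF-16BE file (with BOM) in place as UTF-16LE (with BOM).
# to_utf16le is a hypothetical helper name, not part of Git.
to_utf16le () {
	file=$1
	# Read the first two bytes as hex, e.g. "feff" for a big-endian BOM.
	bom=$(od -An -tx1 -N2 "$file" | tr -d ' \n')
	case "$bom" in
	feff)
		# Big-endian BOM: swap every byte pair, which also turns
		# the BOM FE FF into FF FE.
		dd conv=swab <"$file" >"$file.tmp" 2>/dev/null &&
		mv "$file.tmp" "$file"
		;;
	fffe)
		# Already little-endian with a BOM: nothing to do.
		;;
	*)
		echo "to_utf16le: $file has no UTF-16 BOM, leaving it alone" >&2
		return 1
		;;
	esac
}
```

Such a function could be called on each affected file after Git has
written out the working tree, for example `to_utf16le test.txt`.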

On the other hand, once I set working-tree-encoding=UTF-16LE, Git
prohibited me from committing the file and even from viewing human-readable
diffs (the output simply says it's a binary file). The
internal location of these errors is the function in utf8.c that I
changed in the PR. I hope this is clearer!

Finally, Git's behaviour here is based on Unicode standards,
which is why I acknowledged that my changes violate them after
referring to a link which is present in the utf8.h file.

On Tue, Nov 6, 2018 at 21:16, Torsten Bögershausen wrote:
>
> On Mon, Nov 05, 2018 at 07:10:14PM +0100, Torsten Bögershausen wrote:
> > On Mon, Nov 05, 2018 at 05:24:39AM +0100, Adrián Gimeno Balaguer wrote:
> >
> > []
> >
> > > https://github.com/git/git/pull/550
> >
> > []
> >
> > > This is covered in the PR mentioned above. Thanks for the feedback.
> >
> > Thanks for the code,
> > I will have a look (in the next few days)
> >
> > >
> > > --
> > > Adrián
>
> Hej Adrián,
>
> I still haven't managed to fully understand your problem.
> I tried to rewrite your test according to my understanding.
> It can be fetched here (or copied from this message, see below):
>
> https://github.com/tboegi/git/tree/tb.181106_UTF16LE_commit
>
> The commit of an empty file seems to work for me. The initial
> report mentioned a "rebase", which is not in the test case (TC)?
>
> Is the following what you intended to test?
>
> #!/bin/sh
> test_description='UTF-16 LE/BE file encoding using working-tree-encoding'
>
>
> . ./test-lib.sh
>
> # We specify the UTF-16LE BOM manually, to not depend on programs such as iconv.
> utf16leBOM=$(printf '\377\376')
>
> test_expect_success 'Stage empty UTF-16LE file as binary' '
> >empty_0.txt &&
> echo "empty_0.txt binary" >>.gitattributes &&
> git add empty_0.txt
> '
>
>
> test_expect_success 'Stage empty file with enc=UTF.16BE' '
> >utf16le_0.txt &&
> echo "utf16le_0.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
> git add utf16le_0.txt
> '
>
>
> test_expect_success 'Create and stage UTF-16LE file with only BOM' '
> printf "$utf16leBOM" >utf16le_1.txt &&
> echo "utf16le_1.txt text working-tree-encoding=UTF-16" >>.gitattributes &&
> git add utf16le_1.txt
> '
>
> test_expect_success 'Dont stage UTF-16LE file with only BOM with enc=UTF.16BE' '
> printf "$utf16leBOM" >utf16le_2.txt &&
> echo "utf16le_2.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
> test_must_fail git add utf16le_2.txt
> '
>
> test_expect_success 'commit all files' '
> test_tick &&
> git commit -m "Commit all 3 files"
> '
>
> test_expect_success 'All committed files have the same sha' '
> git ls-files -s --eol >tmp1 &&
> sed -e "s!  i/none.*!!" actual &&
> >expect &&
> test_cmp expect actual
> '
>
> test_done



-- 
Adrián


Re: git-rebase is ignoring working-tree-encoding

2018-11-06 Thread Torsten Bögershausen
On Mon, Nov 05, 2018 at 07:10:14PM +0100, Torsten Bögershausen wrote:
> On Mon, Nov 05, 2018 at 05:24:39AM +0100, Adrián Gimeno Balaguer wrote:
> 
> []
> 
> > https://github.com/git/git/pull/550
>  
> []
>  
> > This is covered in the PR mentioned above. Thanks for the feedback.
> 
> Thanks for the code,
> I will have a look (in the next few days)
> 
> > 
> > -- 
> > Adrián

Hej Adrián,

I still haven't managed to fully understand your problem.
I tried to rewrite your test according to my understanding.
It can be fetched here (or copied from this message, see below):

https://github.com/tboegi/git/tree/tb.181106_UTF16LE_commit

The commit of an empty file seems to work for me. The initial
report mentioned a "rebase", which is not in the test case (TC)?

Is the following what you intended to test?

#!/bin/sh
test_description='UTF-16 LE/BE file encoding using working-tree-encoding'


. ./test-lib.sh

# We specify the UTF-16LE BOM manually, to not depend on programs such as iconv.
utf16leBOM=$(printf '\377\376')

test_expect_success 'Stage empty UTF-16LE file as binary' '
>empty_0.txt &&
echo "empty_0.txt binary" >>.gitattributes &&
git add empty_0.txt
'


test_expect_success 'Stage empty file with enc=UTF.16BE' '
>utf16le_0.txt &&
echo "utf16le_0.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
git add utf16le_0.txt
'


test_expect_success 'Create and stage UTF-16LE file with only BOM' '
printf "$utf16leBOM" >utf16le_1.txt &&
echo "utf16le_1.txt text working-tree-encoding=UTF-16" >>.gitattributes &&
git add utf16le_1.txt
'

test_expect_success 'Dont stage UTF-16LE file with only BOM with enc=UTF.16BE' '
printf "$utf16leBOM" >utf16le_2.txt &&
echo "utf16le_2.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
test_must_fail git add utf16le_2.txt
'

test_expect_success 'commit all files' '
test_tick &&
git commit -m "Commit all 3 files"
'

test_expect_success 'All committed files have the same sha' '
git ls-files -s --eol >tmp1 &&
sed -e "s!  i/none.*!!" actual &&
>expect &&
test_cmp expect actual
'

test_done


Re: git-rebase is ignoring working-tree-encoding

2018-11-05 Thread Torsten Bögershausen
On Mon, Nov 05, 2018 at 05:24:39AM +0100, Adrián Gimeno Balaguer wrote:

[]

> https://github.com/git/git/pull/550
 
[]
 
> This is covered in the PR mentioned above. Thanks for the feedback.

Thanks for the code,
I will have a look (in the next few days)

> 
> -- 
> Adrián


Re: git-rebase is ignoring working-tree-encoding

2018-11-04 Thread Adrián Gimeno Balaguer
On Sun, Nov 4, 2018 at 18:07, Torsten Bögershausen wrote:
>
> Thanks for the report.
> I have tried to follow the problem from your verbal descriptions
> (and the PR) but I need to admit that I don't fully understand the
> problem (yet).

I have created a PR in Git's repository. You can read an updated
description there:

https://github.com/git/git/pull/550

> Could you try to write some instructions on how to reproduce it?
> A number of shell instructions would be great,
> ideally some kind of "test case", like the tests in
> the t/ directory in Git.
>
> It would be nice to be able to reproduce it.
> And if there is a bug, to get it fixed.

This is covered in the PR mentioned above. Thanks for the feedback.

-- 
Adrián


Re: git-rebase is ignoring working-tree-encoding

2018-11-04 Thread brian m. carlson
On Sun, Nov 04, 2018 at 05:37:09PM +0100, Adrián Gimeno Balaguer wrote:
> I wrote an easy "counterpart" fix which instead only prohibits a BOM of
> the opposite endianness (for example, if
> working-tree-encoding=UTF-16LE, then finding a UTF-16BE BOM in the
> file would cause Git to signal the error right before committing,
> diffing, etc.). That way the user would be encouraged to change the file's
> encoding to match the one specified in working-tree-encoding before
> these actions are allowed, which prevents Git from encoding to the
> wrong endianness when the file is written out. In a few repository tests,
> this new behaviour worked as expected. But then I realized this
> solution would perhaps be unacceptable for Git's source code, as it
> would violate that Unicode standard. Anyway, here is a PR in my Git
> fork with the changes I made, for reference:

I actually think such a solution (although I haven't looked at your
patch) would be fine, and I would encourage you to send it to the list.
It's my understanding that many people on Windows want to write things
in UTF-16 encoding but only little-endian with a BOM.  Allowing them to
write that, even if Git won't be able to guarantee producing that, would
be fine, as long as the data is what we expect.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204




Re: git-rebase is ignoring working-tree-encoding

2018-11-04 Thread Torsten Bögershausen
On Fri, Nov 02, 2018 at 03:30:17AM +0100, Adrián Gimeno Balaguer wrote:
> I’m attempting to perform fixups via git-rebase of UTF-16 LE files
> (the project I’m working on requires that exact encoding on certain
> files). When the rebase is complete, Git changes that file’s encoding
> to UTF-16 BE. I have been using the newer working-tree-encoding
> attribute in .gitattributes. I’m using Git for Windows.
> 
> $ git version
> git version 2.19.1.windows.1
> 
> Here is a sample UTF-16 LE file (with BOM and LF endings) with the
> following attributes in .gitattributes:
> 
> test.txt eol=lf -text working-tree-encoding=UTF-16
> 
> I put eol=lf and -text to tell Git to not change the encoding of the
> file on checkout, but that doesn’t even help. Besides, the newer
> working-tree-encoding allows me to view human-readable diffs of that
> file (in GitHub Desktop and Git Bash). Now, note that doing for
> example consecutive commits to the same file does not affect the
> UTF-16 LE encoding. And before I discovered this attribute, the whole
> thing was even worse when squash/fixup rebasing, as Git would modify
> the file with Chinese characters (when manually setting it as text via
> .gitattributes).
> 
> So, again, the problem with the .gitattributes line shown above is that
> after a fixup rebase, the encoding of UTF-16 LE files changes to UTF-16 BE.
> 
> For a long time, I have been working with the involved UTF-16 LE files set as
> binary via .gitattributes (e.g. “test.txt binary”), so that Git would
> not modify the file encoding, but this doesn’t allow me to view the
> diffs upon changes in GitHub Desktop, which I want (and neither via
> git diff).

Thanks for the report.
I have tried to follow the problem from your verbal descriptions
(and the PR) but I need to admit that I don't fully understand the
problem (yet).

Could you try to write some instructions on how to reproduce it?
A number of shell instructions would be great,
ideally some kind of "test case", like the tests in
the t/ directory in Git.

It would be nice to be able to reproduce it.
And if there is a bug, to get it fixed.


Re: git-rebase is ignoring working-tree-encoding

2018-11-04 Thread Adrián Gimeno Balaguer
On Sun, Nov 4, 2018 at 16:48, brian m. carlson wrote:
> Do things work for you if you write this as "UTF-16LE"?  When you use
> working-tree-encoding, the file is stored internally as UTF-8, but it's
> serialized to the specified encoding when written out.

When I use UTF-16LE or UTF-16BE, I can't commit or view diffs of
the specified files, because Git prohibits a BOM in these cases and
shows an error when I attempt to commit. But the project requires the BOM
as well. I even experimented with fixing this issue within
Git's source. It turns out that Git is following a Unicode rule that
says a BOM is not permitted when the exact UTF-16BE/UTF-16LE
(and UTF-32 variants) MIME encoding types are declared:

https://github.com/git/git/blob/master/utf8.h#L87

> Asking for "UTF-16" is ambiguous: there are two endiannesses, and so as
> long as you get a BOM in the output, either one is an acceptable option.
> Which one you get is dependent on what the underlying code thinks is the
> default, and traditionally for Unix systems and Unix tools that's been
> big-endian.  If you want a particular endianness, you should specify it.

I wrote an easy "counterpart" fix which instead only prohibits a BOM of
the opposite endianness (for example, if
working-tree-encoding=UTF-16LE, then finding a UTF-16BE BOM in the
file would cause Git to signal the error right before committing,
diffing, etc.). That way the user would be encouraged to change the file's
encoding to match the one specified in working-tree-encoding before
these actions are allowed, which prevents Git from encoding to the
wrong endianness when the file is written out. In a few repository tests,
this new behaviour worked as expected. But then I realized this
solution would perhaps be unacceptable for Git's source code, as it
would violate that Unicode standard. Anyway, here is a PR in my Git
fork with the changes I made, for reference:

https://github.com/AdRiAnIlloO/git/pull/1
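
The rule described above can be sketched in shell as follows. This only
illustrates the idea and is not the change made to utf8.c in the PR;
`bom_conflicts` is a hypothetical name, and only the two UTF-16 cases are
covered (the UTF-32 variants are left out).

```shell
#!/bin/sh
# Sketch of the proposed rule only; this is not the code in utf8.c.
# bom_conflicts is a hypothetical name. It succeeds (returns 0) when the
# file starts with the BOM of the opposite endianness to the declared one.
bom_conflicts () {
	encoding=$1
	file=$2
	# First two bytes as hex: "fffe" is the LE BOM, "feff" the BE BOM.
	bom=$(od -An -tx1 -N2 "$file" | tr -d ' \n')
	case "$encoding:$bom" in
	UTF-16LE:feff | UTF-16BE:fffe)
		return 0 ;;
	*)
		# Matching BOM, no BOM, or another encoding: no conflict.
		return 1 ;;
	esac
}
```

With this rule, a UTF-16LE file carrying its own little-endian BOM would
pass the check, while a big-endian BOM under
working-tree-encoding=UTF-16LE would be reported.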

At this point, the solution I recently came up with for my project was
to write some shell code that fixes the endianness of the files Git
re-encodes to UTF-16BE during its write-out process (or a "working
tree refresh" in my own words), within the same script that I use to
pack assets including the localization files.

> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204



-- 
Adrián


Re: git-rebase is ignoring working-tree-encoding

2018-11-04 Thread brian m. carlson
On Fri, Nov 02, 2018 at 03:30:17AM +0100, Adrián Gimeno Balaguer wrote:
> I’m attempting to perform fixups via git-rebase of UTF-16 LE files
> (the project I’m working on requires that exact encoding on certain
> files). When the rebase is complete, Git changes that file’s encoding
> to UTF-16 BE. I have been using the newer working-tree-encoding
> attribute in .gitattributes. I’m using Git for Windows.
> 
> $ git version
> git version 2.19.1.windows.1
> 
> Here is a sample UTF-16 LE file (with BOM and LF endings) with the
> following attributes in .gitattributes:
> 
> test.txt eol=lf -text working-tree-encoding=UTF-16

Do things work for you if you write this as "UTF-16LE"?  When you use
working-tree-encoding, the file is stored internally as UTF-8, but it's
serialized to the specified encoding when written out.

Asking for "UTF-16" is ambiguous: there are two endiannesses, and so as
long as you get a BOM in the output, either one is an acceptable option.
Which one you get is dependent on what the underlying code thinks is the
default, and traditionally for Unix systems and Unix tools that's been
big-endian.  If you want a particular endianness, you should specify it.
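
One way to see which byte order a given file ended up with is to look at
its first two bytes. The helper below is only a sketch; the name
`utf16_byte_order` is made up, and it assumes the file is UTF-16 in the
first place.

```shell
#!/bin/sh
# utf16_byte_order is a hypothetical helper: it prints which UTF-16 BOM,
# if any, the given file starts with.
utf16_byte_order () {
	case "$(od -An -tx1 -N2 "$1" | tr -d ' \n')" in
	fffe) echo little-endian ;;	# BOM bytes FF FE
	feff) echo big-endian ;;	# BOM bytes FE FF
	*)    echo no-bom ;;
	esac
}
```

To check what the local iconv picks for plain "UTF-16", something like
`printf A | iconv -f UTF-8 -t UTF-16 >probe.bin` followed by
`utf16_byte_order probe.bin` should work; the answer depends on the iconv
implementation in use.
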
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

