Re: git-rebase is ignoring working-tree-encoding
On Wed, Nov 07, 2018 at 05:38:18AM +0100, Adrián Gimeno Balaguer wrote:
> Hello Torsten,
>
> Thanks for answering.
>
> To answer your question, I removed the comments mentioning "rebase"
> since my reported encoding issue happens on simpler operations
> (described in the PR), and the problem is not directly related to
> rebasing, so I considered it better in order to avoid unrelated
> confusion.
>
> Let's get back to the problem. Each system has a default endianness.
> Also, with working-tree-encoding in .gitattributes, Git behaves
> differently depending on the attribute's value and the contents of the
> referenced file. When I set the value to "UTF-16", the file must have
> a BOM, or Git complains. Otherwise, if I set the value to "UTF-16BE"
> or "UTF-16LE", Git prohibits operations if the file has a BOM for that
> main encoding (UTF-16 here), which can relate to either endianness.
>
> My initial goal was, given a UTF-16LE file, to be able to view
> human-readable diffs whenever I make a change to it (and yes, it must
> be little-endian). Plus, this file had a BOM. Now, what are the
> options with Git currently (considering only working-tree-encoding)?
> If I set working-tree-encoding=UTF-16, I can view readable diffs and
> commit the file, but here is the main problem: Git loses the
> information about the file's original endianness, so after
> staging/committing it re-encodes the file from UTF-8 (as stored
> internally) to UTF-16 with the default system endianness. In my case
> that was big-endian, which breaks the project's requirement. That is
> why I ended up writing a fixup script to change the encoding back to
> UTF-16LE.

OK, I think I understand your problem now.
The file format which you ask for could be named "UTF-16-BOM-LE",
but that does not exist in reality.
If you use UTF-16, then there must be a BOM, and if there is a BOM,
then a Unicode-aware application -should- be able to handle it.

Why does your project require such a format?

> On the other hand, once I set working-tree-encoding=UTF-16LE, Git
> prohibited me from committing the file and even from viewing
> human-readable diffs (the output simply says it's a binary file).
> These errors originate in the function in utf8.c that I changed in
> the PR. I hope I was clearer!
>
> Finally, Git's behaviour here is based on Unicode standards, which is
> why I acknowledged that my changes violated them after following a
> link that is present in the utf8.h file.

[]
Re: git-rebase is ignoring working-tree-encoding
Hello Torsten,

Thanks for answering.

To answer your question, I removed the comments mentioning "rebase"
since my reported encoding issue happens on simpler operations
(described in the PR), and the problem is not directly related to
rebasing, so I considered it better in order to avoid unrelated
confusion.

Let's get back to the problem. Each system has a default endianness.
Also, with working-tree-encoding in .gitattributes, Git behaves
differently depending on the attribute's value and the contents of the
referenced file. When I set the value to "UTF-16", the file must have
a BOM, or Git complains. Otherwise, if I set the value to "UTF-16BE"
or "UTF-16LE", Git prohibits operations if the file has a BOM for that
main encoding (UTF-16 here), which can relate to either endianness.

My initial goal was, given a UTF-16LE file, to be able to view
human-readable diffs whenever I make a change to it (and yes, it must
be little-endian). Plus, this file had a BOM. Now, what are the
options with Git currently (considering only working-tree-encoding)?
If I set working-tree-encoding=UTF-16, I can view readable diffs and
commit the file, but here is the main problem: Git loses the
information about the file's original endianness, so after
staging/committing it re-encodes the file from UTF-8 (as stored
internally) to UTF-16 with the default system endianness. In my case
that was big-endian, which breaks the project's requirement. That is
why I ended up writing a fixup script to change the encoding back to
UTF-16LE.

On the other hand, once I set working-tree-encoding=UTF-16LE, Git
prohibited me from committing the file and even from viewing
human-readable diffs (the output simply says it's a binary file).
These errors originate in the function in utf8.c that I changed in
the PR. I hope I was clearer!

Finally, Git's behaviour here is based on Unicode standards, which is
why I acknowledged that my changes violated them after following a
link that is present in the utf8.h file.

--
Adrián
Re: git-rebase is ignoring working-tree-encoding
On Mon, Nov 05, 2018 at 07:10:14PM +0100, Torsten Bögershausen wrote:
> On Mon, Nov 05, 2018 at 05:24:39AM +0100, Adrián Gimeno Balaguer wrote:
>
> []
>
> > https://github.com/git/git/pull/550
>
> []
>
> > This is covered in the mentioned PR above. Thanks for feedback.
>
> Thanks for the code,
> I will have a look (in the next few days)
>
> >
> > --
> > Adrián

Hej Adrián,

I still didn't manage to fully understand your problem.
I tried to convert your test into my understanding.
It can be fetched here (or copied from this message, see below):

https://github.com/tboegi/git/tree/tb.181106_UTF16LE_commit

The commit of an empty file seems to work for me. In the initial
report a "rebase" was mentioned, which is not in the test case?

Is the following what you intended to test?

#!/bin/sh
test_description='UTF-16 LE/BE file encoding using working-tree-encoding'

. ./test-lib.sh

# We specify the UTF-16LE BOM manually, to not depend on programs such as iconv.
utf16leBOM=$(printf '\377\376')

test_expect_success 'Stage empty UTF-16LE file as binary' '
	>empty_0.txt &&
	echo "empty_0.txt binary" >>.gitattributes &&
	git add empty_0.txt
'

test_expect_success 'Stage empty file with enc=UTF.16BL' '
	>utf16le_0.txt &&
	echo "utf16le_0.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
	git add utf16le_0.txt
'

test_expect_success 'Create and stage UTF-16LE file with only BOM' '
	printf "$utf16leBOM" >utf16le_1.txt &&
	echo "utf16le_1.txt text working-tree-encoding=UTF-16" >>.gitattributes &&
	git add utf16le_1.txt
'

test_expect_success 'Dont stage UTF-16LE file with only BOM with enc=UTF.16BE' '
	printf "$utf16leBOM" >utf16le_2.txt &&
	echo "utf16le_2.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
	test_must_fail git add utf16le_2.txt
'

test_expect_success 'commit all files' '
	test_tick &&
	git commit -m "Commit all 3 files"
'

test_expect_success 'All committed files have the same sha' '
	git ls-files -s --eol >tmp1 &&
	sed -e "s! i/none.*!!" <tmp1 >actual &&
	>expect &&
	test_cmp expect actual
'

test_done
Re: git-rebase is ignoring working-tree-encoding
On Mon, Nov 05, 2018 at 05:24:39AM +0100, Adrián Gimeno Balaguer wrote:

[]

> https://github.com/git/git/pull/550

[]

> This is covered in the mentioned PR above. Thanks for feedback.

Thanks for the code,
I will have a look (in the next few days)

>
> --
> Adrián
Re: git-rebase is ignoring working-tree-encoding
On Sun, 4 Nov 2018 at 18:07, Torsten Bögershausen wrote:
>
> Thanks for the report.
> I have tried to follow the problem from your verbal descriptions
> (and the PR) but I need to admit that I don't fully understand the
> problem (yet).

I have created a PR in Git's repository. You can read an updated
description there:

https://github.com/git/git/pull/550

> Could you try to create some instructions how to reproduce it?
> A number of shell instructions would be great,
> in the best case some kind of "test case", like the tests in
> the t/ directory in Git.
>
> It would be nice to be able to reproduce it.
> And if there is a bug, to get it fixed.

This is covered in the PR mentioned above. Thanks for the feedback.

--
Adrián
Re: git-rebase is ignoring working-tree-encoding
On Sun, Nov 04, 2018 at 05:37:09PM +0100, Adrián Gimeno Balaguer wrote:
> I wrote a "counterpart" easy fix which instead only prohibits a BOM for
> the opposite endianness (for example, if
> working-tree-encoding=UTF-16LE, then finding a UTF-16BE BOM in the
> file would cause Git to signal the error right before committing,
> diffing, etc.). That way the user would be encouraged to modify the
> file's encoding to match the one specified in working-tree-encoding
> before allowing these actions, therefore preventing Git from encoding
> to the wrong endianness after the file is written out. With a few
> repository tests, this new behaviour worked as expected. But then I
> realized this solution would perhaps be unacceptable for Git's source
> code as it would violate that Unicode standard. Anyway, here is a PR
> in my Git fork with the changes I made, for reference:

I actually think such a solution (although I haven't looked at your
patch) would be fine, and I would encourage you to send it to the list.

It's my understanding that many people on Windows want to write things
in UTF-16 encoding but only little-endian with a BOM. Allowing them to
write that, even if Git won't be able to guarantee producing that, would
be fine, as long as the data is what we expect.
--
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
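
The rule proposed in the quoted text above (reject only a BOM of the opposite byte order) can be modelled in a few lines of shell. This is only a sketch of the policy, not the code from the PR or from Git's utf8.c. The helper name check_bom and the file names are assumptions made for illustration, and it relies on head -c, od and mktemp being available.

```shell
#!/bin/sh
# check_bom ENCODING FILE: model of the proposed rule (hypothetical helper,
# not Git's implementation). Fails only when FILE starts with the BOM of the
# byte order opposite to ENCODING.
check_bom () {
	# Dump the first two bytes as hex, e.g. "fffe".
	first=$(head -c 2 "$2" | od -An -tx1 | tr -d ' \n')
	case "$1:$first" in
	UTF-16LE:feff|UTF-16BE:fffe)
		echo "error: BOM in '$2' contradicts $1" >&2
		return 1
		;;
	esac
	return 0
}

tmp=$(mktemp -d)
printf '\377\376A\000' >"$tmp/le.txt"	# "A" as UTF-16LE with BOM
printf '\376\377\000A' >"$tmp/be.txt"	# "A" as UTF-16BE with BOM

check_bom UTF-16LE "$tmp/le.txt" && echo "le.txt accepted as UTF-16LE"
check_bom UTF-16LE "$tmp/be.txt" || echo "be.txt rejected as UTF-16LE"
```

Under this model a little-endian file with a BOM passes when the attribute says UTF-16LE, which is the case the thread is about, while a file whose BOM contradicts the attribute is still refused.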
Re: git-rebase is ignoring working-tree-encoding
On Fri, Nov 02, 2018 at 03:30:17AM +0100, Adrián Gimeno Balaguer wrote:
> I’m attempting to perform fixups via git-rebase of UTF-16 LE files
> (the project I’m working on requires that exact encoding on certain
> files). When the rebase is complete, Git changes that file’s encoding
> to UTF-16 BE. I have been using the newer working-tree-encoding
> attribute in .gitattributes. I’m using Git for Windows.
>
> $ git version
> git version 2.19.1.windows.1
>
> Here is a sample UTF-16 LE file (with BOM and LF endings) with the
> following attributes in .gitattributes:
>
> test.txt eol=lf -text working-tree-encoding=UTF-16
>
> I put eol=lf and -text to tell Git to not change the encoding of the
> file on checkout, but that doesn’t even help. Besides, the newer
> working-tree-encoding allows me to view human-readable diffs of that
> file (in GitHub Desktop and Git Bash). Now, note that doing, for
> example, consecutive commits to the same file does not affect the
> UTF-16 LE encoding. And before I discovered this attribute, the whole
> thing was even worse when squash/fixup rebasing, as Git would modify
> the file with Chinese characters (when manually setting it as text via
> .gitattributes).
>
> So, again, the problem with the .gitattributes line shown above is
> that after fixup rebasing, the encoding of UTF-16 LE files changes to
> UTF-16 BE.
>
> For a long time, I have been working with the involved UTF-16 LE files
> set as binary via .gitattributes (e.g. “test.txt binary”), so that Git
> would not modify the file encoding, but this doesn’t allow me to view
> the diffs upon changes in GitHub Desktop, which I want (and neither
> via git diff).

Thanks for the report.
I have tried to follow the problem from your verbal descriptions
(and the PR) but I need to admit that I don't fully understand the
problem (yet).

Could you try to create some instructions how to reproduce it?
A number of shell instructions would be great,
in the best case some kind of "test case", like the tests in
the t/ directory in Git.

It would be nice to be able to reproduce it.
And if there is a bug, to get it fixed.
Re: git-rebase is ignoring working-tree-encoding
On Sun, 4 Nov 2018 at 16:48, brian m. carlson wrote:
> Do things work for you if you write this as "UTF-16LE"? When you use
> working-tree-encoding, the file is stored internally as UTF-8, but it's
> serialized to the specified encoding when written out.

When I use UTF-16LE or UTF-16BE, I can't commit or view diffs of the
specified files, because Git prohibits a BOM in these cases and shows
an error when I attempt to commit. But the project also requires the
BOM. I even experimented with fixing this issue within Git's source.
It turns out that Git is following a Unicode rule that says that a BOM
is not permitted when declaring the exact UTF-16BE/UTF-16LE (and UTF-32
variants) MIME encoding types:

https://github.com/git/git/blob/master/utf8.h#L87

> Asking for "UTF-16" is ambiguous: there are two endiannesses, and so as
> long as you get a BOM in the output, either one is an acceptable option.
> Which one you get is dependent on what the underlying code thinks is the
> default, and traditionally for Unix systems and Unix tools that's been
> big-endian. If you want a particular endianness, you should specify it.

I wrote a "counterpart" easy fix which instead only prohibits a BOM for
the opposite endianness (for example, if
working-tree-encoding=UTF-16LE, then finding a UTF-16BE BOM in the
file would cause Git to signal the error right before committing,
diffing, etc.). That way the user would be encouraged to modify the
file's encoding to match the one specified in working-tree-encoding
before allowing these actions, therefore preventing Git from encoding
to the wrong endianness after the file is written out. With a few
repository tests, this new behaviour worked as expected. But then I
realized this solution would perhaps be unacceptable for Git's source
code as it would violate that Unicode standard. Anyway, here is a PR
in my Git fork with the changes I made, for reference:

https://github.com/AdRiAnIlloO/git/pull/1

At this point, the solution I recently came up with for my project was
writing some shell code to fix the endianness of the files re-encoded
to UTF-16BE after Git's write-out process (or a "working tree refresh"
in my own words), within the same script that I use to pack assets,
including the localization files.

> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204

--
Adrián
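
The fixup script mentioned in this message is not shown anywhere in the thread. As a rough idea of what such a step could look like, here is a minimal sketch that assumes the file Git wrote out starts with a UTF-16BE BOM and has an even number of bytes. The helper name to_utf16le and the file name strings.txt are hypothetical. It uses dd with conv=swab rather than iconv, so the BOM is carried over along with the text.

```shell
#!/bin/sh
# to_utf16le FILE: if FILE starts with the UTF-16BE BOM (FE FF), rewrite it
# in place as UTF-16LE. Hypothetical helper; the script from the thread is
# not shown, so this is only one possible approach.
to_utf16le () {
	first=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
	if test "$first" = feff
	then
		# conv=swab swaps every pair of bytes, which flips each 16-bit
		# code unit (BOM included) from big-endian to little-endian.
		dd if="$1" of="$1.swab" conv=swab 2>/dev/null &&
		mv "$1.swab" "$1"
	fi
}

tmp=$(mktemp -d)
printf '\376\377\000H\000i' >"$tmp/strings.txt"	# "Hi" as UTF-16BE with BOM
to_utf16le "$tmp/strings.txt"
od -An -tx1 "$tmp/strings.txt"			# bytes are now ff fe 48 00 69 00
```

Because the function checks the BOM first, running it again on a file that is already little-endian leaves the file untouched, which makes it safe to call after every checkout or rebase.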
Re: git-rebase is ignoring working-tree-encoding
On Fri, Nov 02, 2018 at 03:30:17AM +0100, Adrián Gimeno Balaguer wrote:
> I’m attempting to perform fixups via git-rebase of UTF-16 LE files
> (the project I’m working on requires that exact encoding on certain
> files). When the rebase is complete, Git changes that file’s encoding
> to UTF-16 BE. I have been using the newer working-tree-encoding
> attribute in .gitattributes. I’m using Git for Windows.
>
> $ git version
> git version 2.19.1.windows.1
>
> Here is a sample UTF-16 LE file (with BOM and LF endings) with the
> following attributes in .gitattributes:
>
> test.txt eol=lf -text working-tree-encoding=UTF-16

Do things work for you if you write this as "UTF-16LE"? When you use
working-tree-encoding, the file is stored internally as UTF-8, but it's
serialized to the specified encoding when written out.

Asking for "UTF-16" is ambiguous: there are two endiannesses, and so as
long as you get a BOM in the output, either one is an acceptable option.
Which one you get is dependent on what the underlying code thinks is the
default, and traditionally for Unix systems and Unix tools that's been
big-endian. If you want a particular endianness, you should specify it.
--
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
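
To make the ambiguity described above concrete, the following sketch writes the letter "A" as UTF-16 in both byte orders and reports which BOM each file starts with. The helper name bom_kind and the file names are made up for this illustration and are not part of Git. The byte values FF FE (little-endian) and FE FF (big-endian) are the standard UTF-16 byte order marks.

```shell
#!/bin/sh
# bom_kind FILE: print which UTF-16 byte order mark, if any, starts FILE.
# (bom_kind is a hypothetical helper for illustration, not a Git command.)
bom_kind () {
	# Dump the first two bytes as hex, e.g. "fffe".
	first=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
	case "$first" in
	fffe) echo "UTF-16LE BOM" ;;
	feff) echo "UTF-16BE BOM" ;;
	*)    echo "no UTF-16 BOM" ;;
	esac
}

tmp=$(mktemp -d)

# The letter "A" (U+0041) after a BOM, once in each byte order.
printf '\377\376A\000' >"$tmp/le.txt"	# FF FE 41 00
printf '\376\377\000A' >"$tmp/be.txt"	# FE FF 00 41
# The same letter as UTF-16LE without a BOM.
printf 'A\000' >"$tmp/nobom.txt"	# 41 00

bom_kind "$tmp/le.txt"
bom_kind "$tmp/be.txt"
bom_kind "$tmp/nobom.txt"
```

Both le.txt and be.txt are valid "UTF-16" files in the sense used above, since each announces its byte order with a BOM. Only the explicit names UTF-16LE and UTF-16BE pin the byte order down, and those are the names for which Git 2.19 refuses a BOM.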