Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-11 Thread Magnus Ihse Bursie
On Fri, 11 Apr 2025 03:35:11 GMT, Sergey Bylokhov  wrote:

>> I have checked the entire code base for incorrect encodings, but luckily 
>> enough these were the only remaining problems I found. 
>> 
>> BOM (byte-order mark) is a method used for distinguishing big and little 
>> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is 
>> discouraged. In the words of the Unicode Consortium: "Use of a BOM is 
>> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful 
>> of files. These should be removed.
>> 
>> Methodology used: 
>> 
>> I have run four different tools for using different heuristics for 
>> determining the encoding of a file:
>> * chardetect (the original, slow-as-molasses Perl program, which also had 
>> the worst performing heuristics of all; I'll rate it 1/5)
>> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
>> * enca (targeted towards obscure code pages)
>> * libmagic / `file  --mime-encoding`
>> 
>> They all agreed on pure ASCII files (which is easy to check), and these I 
>> just ignored/accepted as good. The handling of pure binary files differed 
>> between the tools; most detected them as binary but some suggested arcane 
>> encodings for specific (often small) binary files. To keep my sanity, I 
>> decided that files ending in any of these extensions were binary, and I did 
>> not check them further:
>> * 
>> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>> 
>> From the remaining list of non-ascii, non-known-binary files I selected two 
>> overlapping and exhaustive subsets:
>> * All files where at least one tool claimed it to be UTF-8
>> * All files where at least one tool claimed it to be *not* UTF-8
>> 
>> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
>> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
>> names-of-files-to-check.txt)`, and visually examining the results). At this 
>> stage, I found several files where unicode were unnecessarily used instead 
>> of pure ASCII, and I treated those files separately. Other from that, my 
>> inspection revealed no obvious encoding errors. This list comprised of about 
>> 2000 files, so I did not spend too much time on each file. The assumption, 
>> after all, was that these files are okay.
>> 
>> For the second subset, I checked every non-ASCII character (using the same 
>> method). This list was about 300+ files. Most of them were okay far as I can 
>> tell; I can confirm encodings for European languages 100%, but JCK encodings 
>> could theoretically be wrong; they looked sane but I cannot read and confirm 
>> fully. Several were in fact pure...
>
> src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11:
> 
>> 9: تخصص الشفرة الموحدة "يونِكود" رقما وحيدا لكل محرف في جميع اللغات 
>> العالمية، وذلك بغض النظر عن نوع الحاسوب أو البرامج المستخدمة. وقد تـم تبني 
>> مواصفة "يونِكود" مــن قبـل قادة الصانعين لأنظمة الحواسيب فـي العالم، مثل 
>> شركات آي.بي.إم. (IBM)، أبـل (APPLE)، هِيـْولِـت بـاكـرد (Hewlett-Packard) ، 
>> مايكروسوفت (Microsoft)، أوراكِـل (Oracle) ، صن (Sun) وغيرها. كما أن 
>> المواصفات والمقاييس الحديثة (مثل لغة البرمجة "جافا" "JAVA" ولغة "إكس إم إل" 
>> "XML" التي تستخدم لبرمجة الانترنيت) تتطلب استخدام "يونِكود". علاوة على ذلك ، 
>> فإن "يونِكود" هي الطـريـقـة الرسـمية لتطبيق المقيـاس الـعـالـمي إيزو ١٠٦٤٦  (ISO 10646) .
>> 10: 
>> 11: إن بزوغ مواصفة "يونِكود" وتوفُّر الأنظمة التي تستخدمه وتدعمه، يعتبر من 
>> أهم الاختراعات الحديثة في عولمة البرمجيات لجميع اللغات في العالم. وإن 
>> استخدام "يونِكود" في عالم الانترنيت سيؤدي إلى توفير كبير مقارنة مع استخدام 
>> المجموعات التقليدية للمحارف المشفرة. كما أن استخدام "يونِكود" سيُمكِّن 
>> المبرمج من كتابة البرنامج مرة واحدة، واستخدامه على أي نوع من الأجهزة أو 
>> الأنظمة، ولأي لغة أو دولة في العالم أينما كانت، دون الحاجة لإعادة البرمجة أو 
>> إجراء أي تعديل. وأخيرا، فإن استخدام "يونِكود" سيمكن البيانات من الانتقال عبر 
>> الأنظمة والأجهزة المختلفة دون أي خطورة لتحريفها، مهما تعددت الشركات الصانعة للأنظمة واللغات، والدول التي تمر من خلالها هذه البيانات.
> 
> Looks like most of the changes in java2d/* are related to spaces at the end 
> of the line?

No, those are just incidental changes (see 
https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The actual 
change for the java2d files is the removal of the initial UTF-8 BOM. GitHub has 
a hard time showing this, though, since the BOM is not visible.
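
For anyone wanting to reproduce the change locally: the BOM is just the three 
bytes EF BB BF at the very start of the file. Below is a minimal Java sketch 
(an illustration only, not the tooling used for this PR) that strips such a 
BOM from the files given on the command line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StripUtf8Bom {
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path path = Path.of(name);
            byte[] bytes = Files.readAllBytes(path);
            // A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
            if (bytes.length >= 3
                    && (bytes[0] & 0xFF) == 0xEF
                    && (bytes[1] & 0xFF) == 0xBB
                    && (bytes[2] & 0xFF) == 0xBF) {
                // Rewrite the file without the first three bytes.
                Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
                System.out.println("Stripped BOM from " + name);
            }
        }
    }
}
```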

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039258980


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-11 Thread Eirik Bjørsnøs
On Fri, 11 Apr 2025 10:21:32 GMT, Magnus Ihse Bursie  wrote:

>> src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11:
>> 
>>> 9: تخصص الشفرة الموحدة "يونِكود" رقما وحيدا لكل محرف في جميع اللغات 
>>> العالمية، وذلك بغض النظر عن نوع الحاسوب أو البرامج المستخدمة. وقد تـم تبني 
>>> مواصفة "يونِكود" مــن قبـل قادة الصانعين لأنظمة الحواسيب فـي العالم، مثل 
>>> شركات آي.بي.إم. (IBM)، أبـل (APPLE)، هِيـْولِـت بـاكـرد (Hewlett-Packard) ، 
>>> مايكروسوفت (Microsoft)، أوراكِـل (Oracle) ، صن (Sun) وغيرها. كما أن 
>>> المواصفات والمقاييس الحديثة (مثل لغة البرمجة "جافا" "JAVA" ولغة "إكس إم إل" 
>>> "XML" التي تستخدم لبرمجة الانترنيت) تتطلب استخدام "يونِكود". علاوة على ذلك 
>>> ، فإن "يونِكود" هي الطـريـقـة الرسـمية لتطبيق المقيـاس الـعـالـمي إيزو ١٠٦٤٦  (ISO 10646) .
>>> 10: 
>>> 11: إن بزوغ مواصفة "يونِكود" وتوفُّر الأنظمة التي تستخدمه وتدعمه، يعتبر من 
>>> أهم الاختراعات الحديثة في عولمة البرمجيات لجميع اللغات في العالم. وإن 
>>> استخدام "يونِكود" في عالم الانترنيت سيؤدي إلى توفير كبير مقارنة مع استخدام 
>>> المجموعات التقليدية للمحارف المشفرة. كما أن استخدام "يونِكود" سيُمكِّن 
>>> المبرمج من كتابة البرنامج مرة واحدة، واستخدامه على أي نوع من الأجهزة أو 
>>> الأنظمة، ولأي لغة أو دولة في العالم أينما كانت، دون الحاجة لإعادة البرمجة 
>>> أو إجراء أي تعديل. وأخيرا، فإن استخدام "يونِكود" سيمكن البيانات من الانتقال 
>>> عبر الأنظمة والأجهزة المختلفة دون أي خطورة لتحريفها، مهما تعددت الشركات الصانعة للأنظمة واللغات، والدول التي تمر من خلالها هذه البيانات.
>> 
>> Looks like most of the changes in java2d/* are related to spaces at the end 
>> of the line?
>
> No, those are just incidental changes (see 
> https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The 
> actual change for the java2d files is the removal of the initial UTF-8 BOM. 
> GitHub has a hard time showing this, though, since the BOM is not visible.

I found the side-by-side diff in IntelliJ useful here, as it said "UTF-8 BOM" 
vs. "UTF-8".

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039263227


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Sergey Bylokhov
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11:

> 9: تخصص الشفرة الموحدة "يونِكود" رقما وحيدا لكل محرف في جميع اللغات العالمية، 
> وذلك بغض النظر عن نوع الحاسوب أو البرامج المستخدمة. وقد تـم تبني مواصفة 
> "يونِكود" مــن قبـل قادة الصانعين لأنظمة الحواسيب فـي العالم، مثل شركات 
> آي.بي.إم. (IBM)، أبـل (APPLE)، هِيـْولِـت بـاكـرد (Hewlett-Packard) ، 
> مايكروسوفت (Microsoft)، أوراكِـل (Oracle) ، صن (Sun) وغيرها. كما أن المواصفات 
> والمقاييس الحديثة (مثل لغة البرمجة "جافا" "JAVA" ولغة "إكس إم إل" "XML" التي 
> تستخدم لبرمجة الانترنيت) تتطلب استخدام "يونِكود". علاوة على ذلك ، فإن 
> "يونِكود" هي الطـريـقـة الرسـمية لتطبيق المقيـاس الـعـالـمي إيزو ١٠٦
 ٤٦  (ISO 10646) .
> 10: 
> 11: إن بزوغ مواصفة "يونِكود" وتوفُّر الأنظمة التي تستخدمه وتدعمه، يعتبر من 
> أهم الاختراعات الحديثة في عولمة البرمجيات لجميع اللغات في العالم. وإن استخدام 
> "يونِكود" في عالم الانترنيت سيؤدي إلى توفير كبير مقارنة مع استخدام المجموعات 
> التقليدية للمحارف المشفرة. كما أن استخدام "يونِكود" سيُمكِّن المبرمج من كتابة 
> البرنامج مرة واحدة، واستخدامه على أي نوع من الأجهزة أو الأنظمة، ولأي لغة أو 
> دولة في العالم أينما كانت، دون الحاجة لإعادة البرمجة أو إجراء أي تعديل. 
> وأخيرا، فإن استخدام "يونِكود" سيمكن البيانات من الانتقال عبر الأنظمة والأجهزة 
> المختلفة دون أي خطورة لتحريفها، مهما تعددت الشركات الصانعة للأنظمة واللغات، والدول التي تمر من خلالها هذه البيانات.

Looks like most of the changes in java2d/* are related to spaces at the end of 
the line?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038746193


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Magnus Ihse Bursie
On Thu, 10 Apr 2025 18:30:22 GMT, Eirik Bjørsnøs  wrote:

>> If this is a French name, it's e acute: é.
>
>> If this is a French name, it's e acute: é.
> 
> Supported by this Wikipedia page listing S.L as an LCMS developer:
> 
> https://en.wikipedia.org/wiki/Little_CMS

It's not a mistake in capitalization; it's a mix-up between two different 
characters in two different encodings. (Probably ISO-8859-1 mistaken for ANSI, 
IIRC.)

I verified the developer's name against the original file in the LCMS repo.
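
As a rough illustration of the kind of mix-up involved (my reconstruction, not 
the exact history of lcms.md): `é` is the single byte 0xE9 in ISO-8859-1 but 
the two bytes 0xC3 0xA9 in UTF-8, so decoding with the wrong charset mangles 
the name in either direction:

```java
import java.nio.charset.StandardCharsets;

public class EncodingMixupDemo {
    public static void main(String[] args) {
        String name = "Sébastien Léon";

        // ISO-8859-1 bytes (é = 0xE9) decoded as UTF-8: the lone 0xE9 is not a
        // valid UTF-8 sequence, so it is replaced with U+FFFD.
        String latin1ReadAsUtf8 = new String(
                name.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);

        // UTF-8 bytes (é = 0xC3 0xA9) decoded as ISO-8859-1: the two bytes show
        // up as the two characters "Ã©".
        String utf8ReadAsLatin1 = new String(
                name.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        System.out.println(latin1ReadAsUtf8); // prints S?bastien L?on (? = U+FFFD)
        System.out.println(utf8ReadAsLatin1); // prints SÃ©bastien LÃ©on
    }
}
```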

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038362034


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Magnus Ihse Bursie
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

The whitespace changes are my editor removing whitespace at the end of lines. 
This is something we enforce for many file types, but the check does not yet 
formally include .txt files. I have been working on and off on extending the 
set of files covered by this check, so in general I have not tried to 
circumvent my editor when it strips trailing whitespace, even for files where 
jcheck does not yet require it.

-

PR Comment: https://git.openjdk.org/jdk/pull/24566#issuecomment-2795201480


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Magnus Ihse Bursie
On Thu, 10 Apr 2025 19:06:35 GMT, Eirik Bjørsnøs  wrote:

> (BTW, I enjoyed seeing separate commits for the encoding and BOM changes, 
> makes it easier to verify each!)

Thanks! I very much like reviewing PRs that have separate logical commits, so I 
try to produce them myself. I'm glad to hear it was appreciated.

-

PR Comment: https://git.openjdk.org/jdk/pull/24566#issuecomment-2795203125


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Eirik Bjørsnøs
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

LGTM.

There are some whitespace-related changes in this PR which seem okay, but have 
no mention in either the JBS issue or the PR description.

Perhaps a short mention of this intention in either place would be good for 
future historians.

(BTW, I enjoyed seeing separate commits for the encoding and BOM changes, makes 
it easier to verify each!)

-

Marked as reviewed by eirbjo (Committer).

PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2758055634


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Eirik Bjørsnøs
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

src/java.desktop/share/legal/lcms.md line 103:

> 101: Tim Zaman
> 102: Amir Montazery and Open Source Technology Improvement Fund (ostif.org), 
> Google, for fuzzer fundings.
> 103: ```

This introduces an empty trailing line. I see you have removed trailing 
whitespace elsewhere.

Was this intentional, to avoid the file ending with the three ticks?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038071768


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Eirik Bjørsnøs
On Thu, 10 Apr 2025 17:23:37 GMT, Raffaello Giulietti  
wrote:

> If this is a French name, it's e acute: é.

Supported by this Wikipedia page listing S.L as an LCMS developer:

https://en.wikipedia.org/wiki/Little_CMS

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038022994


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Naoto Sato
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

Marked as reviewed by naoto (Reviewer).

-

PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2757716905


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Erik Joelsson
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

Marked as reviewed by erikj (Reviewer).

-

PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2757703868


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Raffaello Giulietti
On Thu, 10 Apr 2025 17:09:27 GMT, Naoto Sato  wrote:

>> I have checked the entire code base for incorrect encodings, but luckily 
>> enough these were the only remaining problems I found. 
>> 
>> BOM (byte-order mark) is a method used for distinguishing big and little 
>> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is 
>> discouraged. In the words of the Unicode Consortium: "Use of a BOM is 
>> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful 
>> of files. These should be removed.
>> 
>> Methodology used: 
>> 
>> I have run four different tools for using different heuristics for 
>> determining the encoding of a file:
>> * chardetect (the original, slow-as-molasses Perl program, which also had 
>> the worst performing heuristics of all; I'll rate it 1/5)
>> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
>> * enca (targeted towards obscure code pages)
>> * libmagic / `file  --mime-encoding`
>> 
>> They all agreed on pure ASCII files (which is easy to check), and these I 
>> just ignored/accepted as good. The handling of pure binary files differed 
>> between the tools; most detected them as binary but some suggested arcane 
>> encodings for specific (often small) binary files. To keep my sanity, I 
>> decided that files ending in any of these extensions were binary, and I did 
>> not check them further:
>> * 
>> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>> 
>> From the remaining list of non-ascii, non-known-binary files I selected two 
>> overlapping and exhaustive subsets:
>> * All files where at least one tool claimed it to be UTF-8
>> * All files where at least one tool claimed it to be *not* UTF-8
>> 
>> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
>> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
>> names-of-files-to-check.txt)`, and visually examining the results). At this 
>> stage, I found several files where unicode were unnecessarily used instead 
>> of pure ASCII, and I treated those files separately. Other from that, my 
>> inspection revealed no obvious encoding errors. This list comprised of about 
>> 2000 files, so I did not spend too much time on each file. The assumption, 
>> after all, was that these files are okay.
>> 
>> For the second subset, I checked every non-ASCII character (using the same 
>> method). This list was about 300+ files. Most of them were okay far as I can 
>> tell; I can confirm encodings for European languages 100%, but JCK encodings 
>> could theoretically be wrong; they looked sane but I cannot read and confirm 
>> fully. Several were in fact pure...
>
> src/java.desktop/share/legal/lcms.md line 72:
> 
>> 70: Mateusz Jurczyk (Google)
>> 71: Paul Miller
>> 72: Sébastien Léon
> 
> I cannot comment on capitalization here, but if we wanted to lowercase them, 
> should they be e-grave instead of e-acute?

If this is a French name, it's e acute: é.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037917708


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Naoto Sato
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

src/java.desktop/share/legal/lcms.md line 72:

> 70: Mateusz Jurczyk (Google)
> 71: Paul Miller
> 72: Sébastien Léon

I cannot comment on capitalization here, but if we wanted to lowercase them, 
should they be e-grave instead of e-acute?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037895884


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Magnus Ihse Bursie
On Thu, 10 Apr 2025 11:46:45 GMT, Raffaello Giulietti  
wrote:

> I guess the difference at L.1 in the various files is just the BOM?

Yes.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037357899


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Raffaello Giulietti
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

I only checked that these 13 files are UTF-8 encoded and without a BOM.
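
For reference, one way such a check can be scripted (a sketch under my own 
assumptions, not necessarily how it was verified here) is to decode each file 
with a strict UTF-8 decoder and flag any leading BOM:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8NoBomCheck {
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Path.of(name));

            // A UTF-8 BOM would be the bytes EF BB BF at the start of the file.
            boolean hasBom = bytes.length >= 3
                    && (bytes[0] & 0xFF) == 0xEF
                    && (bytes[1] & 0xFF) == 0xBB
                    && (bytes[2] & 0xFF) == 0xBF;

            // Use a strict decoder so malformed input fails instead of being
            // silently replaced with U+FFFD.
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            boolean validUtf8;
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                validUtf8 = true;
            } catch (CharacterCodingException e) {
                validUtf8 = false;
            }

            System.out.println(name + ": "
                    + (validUtf8 ? "valid UTF-8" : "NOT valid UTF-8")
                    + (hasBom ? ", with BOM" : ", no BOM"));
        }
    }
}
```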

-

Marked as reviewed by rgiulietti (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2756936848


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Raffaello Giulietti
On Thu, 10 Apr 2025 10:14:40 GMT, Magnus Ihse Bursie  wrote:

>> I have checked the entire code base for incorrect encodings, but luckily 
>> enough these were the only remaining problems I found. 
>> 
>> BOM (byte-order mark) is a method used for distinguishing big and little 
>> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is 
>> discouraged. In the words of the Unicode Consortium: "Use of a BOM is 
>> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful 
>> of files. These should be removed.
>> 
>> Methodology used: 
>> 
>> I have run four different tools for using different heuristics for 
>> determining the encoding of a file:
>> * chardetect (the original, slow-as-molasses Perl program, which also had 
>> the worst performing heuristics of all; I'll rate it 1/5)
>> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
>> * enca (targeted towards obscure code pages)
>> * libmagic / `file  --mime-encoding`
>> 
>> They all agreed on pure ASCII files (which is easy to check), and these I 
>> just ignored/accepted as good. The handling of pure binary files differed 
>> between the tools; most detected them as binary but some suggested arcane 
>> encodings for specific (often small) binary files. To keep my sanity, I 
>> decided that files ending in any of these extensions were binary, and I did 
>> not check them further:
>> * 
>> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>> 
>> From the remaining list of non-ascii, non-known-binary files I selected two 
>> overlapping and exhaustive subsets:
>> * All files where at least one tool claimed it to be UTF-8
>> * All files where at least one tool claimed it to be *not* UTF-8
>> 
>> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
>> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
>> names-of-files-to-check.txt)`, and visually examining the results). At this 
>> stage, I found several files where unicode were unnecessarily used instead 
>> of pure ASCII, and I treated those files separately. Other from that, my 
>> inspection revealed no obvious encoding errors. This list comprised of about 
>> 2000 files, so I did not spend too much time on each file. The assumption, 
>> after all, was that these files are okay.
>> 
>> For the second subset, I checked every non-ASCII character (using the same 
>> method). This list was about 300+ files. Most of them were okay far as I can 
>> tell; I can confirm encodings for European languages 100%, but JCK encodings 
>> could theoretically be wrong; they looked sane but I cannot read and confirm 
>> fully. Several were in fact pure...
>
> src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 497:
> 
>> 495: /*
>> 496:   The algorithm below is based on Intel publication:
>> 497:   "Fast SHA-256 Implementations on Intel(R) Architecture Processors" by 
>> Jim Guilford, Kirk Yap and Vinodh Gopal.
> 
> Note: There is of course a Unicode `®` symbol, which is what it was 
> originally before it was botched here, but I found no reason to keep it, and 
> in the spirit of JDK-8354213, I thought it better to use pure ASCII here.

I guess the difference at L.1 in the various files is just the BOM?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037161789


Re: RFR: 8354266: Fix non-UTF-8 text encoding

2025-04-10 Thread Magnus Ihse Bursie
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie  wrote:

> I have checked the entire code base for incorrect encodings, but luckily 
> enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little 
> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. 
> In the words of the Unicode Consortium: "Use of a BOM is neither required nor 
> recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These 
> should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for 
> determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the 
> worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I 
> just ignored/accepted as good. The handling of pure binary files differed 
> between the tools; most detected them as binary but some suggested arcane 
> encodings for specific (often small) binary files. To keep my sanity, I 
> decided that files ending in any of these extensions were binary, and I did 
> not check them further:
> * 
> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two 
> overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C 
> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat 
> names-of-files-to-check.txt)`, and visually examining the results). At this 
> stage, I found several files where unicode were unnecessarily used instead of 
> pure ASCII, and I treated those files separately. Other from that, my 
> inspection revealed no obvious encoding errors. This list comprised of about 
> 2000 files, so I did not spend too much time on each file. The assumption, 
> after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same 
> method). This list was about 300+ files. Most of them were okay far as I can 
> tell; I can confirm encodings for European languages 100%, but JCK encodings 
> could theoretically be wrong; they looked sane but I cannot read and confirm 
> fully. Several were in fact pure binary files, but without any telling 
> exten...

src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 497:

> 495: /*
> 496:   The algorithm below is based on Intel publication:
> 497:   "Fast SHA-256 Implementations on Intel(R) Architecture Processors" by 
> Jim Guilford, Kirk Yap and Vinodh Gopal.

Note: There is of course a Unicode `®` symbol, which is what it was originally 
before it was botched here, but I found no reason to keep it, and in the spirit 
of JDK-8354213, I thought it better to use pure ASCII here.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037012318