Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Fri, 11 Apr 2025 03:35:11 GMT, Sergey Bylokhov wrote: >> I have checked the entire code base for incorrect encodings, but luckily >> enough these were the only remaining problems I found. >> >> BOM (byte-order mark) is a method used for distinguishing big and little >> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is >> discouraged. In the words of the Unicode Consortium: "Use of a BOM is >> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful >> of files. These should be removed. >> >> Methodology used: >> >> I have run four different tools for using different heuristics for >> determining the encoding of a file: >> * chardetect (the original, slow-as-molasses Perl program, which also had >> the worst performing heuristics of all; I'll rate it 1/5) >> * uchardet (a modern version by freedesktop, used by e.g. Firefox) >> * enca (targeted towards obscure code pages) >> * libmagic / `file --mime-encoding` >> >> They all agreed on pure ASCII files (which is easy to check), and these I >> just ignored/accepted as good. The handling of pure binary files differed >> between the tools; most detected them as binary but some suggested arcane >> encodings for specific (often small) binary files. To keep my sanity, I >> decided that files ending in any of these extensions were binary, and I did >> not check them further: >> * >> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` >> >> From the remaining list of non-ascii, non-known-binary files I selected two >> overlapping and exhaustive subsets: >> * All files where at least one tool claimed it to be UTF-8 >> * All files where at least one tool claimed it to be *not* UTF-8 >> >> For the first subset, I checked every non-ASCII character (using `C_ALL=C >> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat >> names-of-files-to-check.txt)`, and visually examining the results). At this >> stage, I found several files where unicode were unnecessarily used instead >> of pure ASCII, and I treated those files separately. Other from that, my >> inspection revealed no obvious encoding errors. This list comprised of about >> 2000 files, so I did not spend too much time on each file. The assumption, >> after all, was that these files are okay. >> >> For the second subset, I checked every non-ASCII character (using the same >> method). This list was about 300+ files. Most of them were okay far as I can >> tell; I can confirm encodings for European languages 100%, but JCK encodings >> could theoretically be wrong; they looked sane but I cannot read and confirm >> fully. Several were in fact pure... > > src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11: > >> 9: تخصص الشفرة الموحدة "يونِكود" رقما وحيدا لكل محرف في جميع اللغات >> العالمية، وذلك بغض النظر عن نوع الحاسوب أو البرامج المستخدمة. وقد تـم تبني >> مواصفة "يونِكود" مــن قبـل قادة الصانعين لأنظمة الحواسيب فـي العالم، مثل >> شركات آي.بي.إم. (IBM)، أبـل (APPLE)، هِيـْولِـت بـاكـرد (Hewlett-Packard) ، >> مايكروسوفت (Microsoft)، أوراكِـل (Oracle) ، صن (Sun) وغيرها. كما أن >> المواصفات والمقاييس الحديثة (مثل لغة البرمجة "جافا" "JAVA" ولغة "إكس إم إل" >> "XML" التي تستخدم لبرمجة الانترنيت) تتطلب استخدام "يونِكود". علاوة على ذلك ، >> فإن "يونِكود" هي الطـريـقـة الرسـمية لتطبيق المقيـاس الـعـالـمي إيزو ١٠� �٤٦ (ISO 10646) . >> 10: >> 11: إن بزوغ مواصفة "يونِكود" وتوفُّر الأنظمة التي تستخدمه وتدعمه، يعتبر من >> أهم الاختراعات الحديثة في عولمة البرمجيات لجميع اللغات في العالم. 
وإن >> استخدام "يونِكود" في عالم الانترنيت سيؤدي إلى توفير كبير مقارنة مع استخدام >> المجموعات التقليدية للمحارف المشفرة. كما أن استخدام "يونِكود" سيُمكِّن >> المبرمج من كتابة البرنامج مرة واحدة، واستخدامه على أي نوع من الأجهزة أو >> الأنظمة، ولأي لغة أو دولة في العالم أينما كانت، دون الحاجة لإعادة البرمجة أو >> إجراء أي تعديل. وأخيرا، فإن استخدام "يونِكود" سيمكن البيانات من الانتقال عبر >> الأنظمة والأجهزة المختلفة دون أ ي خطورة لتحريفها، مهما تعددت الشركات الصانعة للأنظمة واللغات، والدول التي تمر من خلالها هذه البيانات. > > Looks like most of the changes in java2d/* are related to spaces at the end > of the line? No, those are just incidental changes (see https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The actual change for the java2d files is the removal of the initial UTF-8 BOM. GitHub has a hard time showing this, though, since the BOM is not visible. - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039258980
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Fri, 11 Apr 2025 10:21:32 GMT, Magnus Ihse Bursie wrote: >> src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11: >> >>> 9: تخصص الشفرة الموحدة "يونِكود" رقما وحيدا لكل محرف في جميع اللغات >>> العالمية، وذلك بغض النظر عن نوع الحاسوب أو البرامج المستخدمة. وقد تـم تبني >>> مواصفة "يونِكود" مــن قبـل قادة الصانعين لأنظمة الحواسيب فـي العالم، مثل >>> شركات آي.بي.إم. (IBM)، أبـل (APPLE)، هِيـْولِـت بـاكـرد (Hewlett-Packard) ، >>> مايكروسوفت (Microsoft)، أوراكِـل (Oracle) ، صن (Sun) وغيرها. كما أن >>> المواصفات والمقاييس الحديثة (مثل لغة البرمجة "جافا" "JAVA" ولغة "إكس إم إل" >>> "XML" التي تستخدم لبرمجة الانترنيت) تتطلب استخدام "يونِكود". علاوة على ذلك >>> ، فإن "يونِكود" هي الطـريـقـة الرسـمية لتطبيق المقيـاس الـعـالـمي إيزو ١٠ ٦٤٦ (ISO 10646) . >>> 10: >>> 11: إن بزوغ مواصفة "يونِكود" وتوفُّر الأنظمة التي تستخدمه وتدعمه، يعتبر من >>> أهم الاختراعات الحديثة في عولمة البرمجيات لجميع اللغات في العالم. وإن >>> استخدام "يونِكود" في عالم الانترنيت سيؤدي إلى توفير كبير مقارنة مع استخدام >>> المجموعات التقليدية للمحارف المشفرة. كما أن استخدام "يونِكود" سيُمكِّن >>> المبرمج من كتابة البرنامج مرة واحدة، واستخدامه على أي نوع من الأجهزة أو >>> الأنظمة، ولأي لغة أو دولة في العالم أينما كانت، دون الحاجة لإعادة البرمجة >>> أو إجراء أي تعديل. وأخيرا، فإن استخدام "يونِكود" سيمكن البيانات من الانتقال >>> عبر الأنظمة والأجهزة المختلفة دون � �ي خطورة لتحريفها، مهما تعددت الشركات الصانعة للأنظمة واللغات، والدول التي تمر من خلالها هذه البيانات. >> >> Looks like most of the changes in java2d/* are related to spaces at the end >> of the line? > > No, that are just incidental changes (see > https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The > actual change for the java2d files is the removal of the initial UTF-8 BOM. > Github has a hard time showing this though, since the BOM is not visible. I found the side-by-side diff in IntelliJ useful here, as it said "UTF-8 BOM" vs. "UTF-8". - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039263227
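For readers who want to see the otherwise invisible BOM for themselves, the following is a minimal Java sketch, not part of this change, that checks whether a file starts with the three UTF-8 BOM bytes (EF BB BF) and strips them if present; the class name and command-line handling are invented purely for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StripUtf8Bom {
    // The UTF-8 encoding of U+FEFF, i.e. the three bytes a UTF-8 BOM adds at the start of a file.
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);
        byte[] content = Files.readAllBytes(file);
        if (content.length >= 3 && Arrays.equals(content, 0, 3, UTF8_BOM, 0, 3)) {
            System.out.println(file + ": UTF-8 BOM found, removing it");
            // Rewrite the file without the first three bytes.
            Files.write(file, Arrays.copyOfRange(content, 3, content.length));
        } else {
            System.out.println(file + ": no UTF-8 BOM");
        }
    }
}
```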
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11: > 9: تخصص الشفرة الموحدة "يونِكود" رقما وحيدا لكل محرف في جميع اللغات العالمية، > وذلك بغض النظر عن نوع الحاسوب أو البرامج المستخدمة. وقد تـم تبني مواصفة > "يونِكود" مــن قبـل قادة الصانعين لأنظمة الحواسيب فـي العالم، مثل شركات > آي.بي.إم. (IBM)، أبـل (APPLE)، هِيـْولِـت بـاكـرد (Hewlett-Packard) ، > مايكروسوفت (Microsoft)، أوراكِـل (Oracle) ، صن (Sun) وغيرها. كما أن المواصفات > والمقاييس الحديثة (مثل لغة البرمجة "جافا" "JAVA" ولغة "إكس إم إل" "XML" التي > تستخدم لبرمجة الانترنيت) تتطلب استخدام "يونِكود". علاوة على ذلك ، فإن > "يونِكود" هي الطـريـقـة الرسـمية لتطبيق المقيـاس الـعـالـمي إيزو ١٠٦ ٤٦ (ISO 10646) . > 10: > 11: إن بزوغ مواصفة "يونِكود" وتوفُّر الأنظمة التي تستخدمه وتدعمه، يعتبر من > أهم الاختراعات الحديثة في عولمة البرمجيات لجميع اللغات في العالم. 
وإن استخدام > "يونِكود" في عالم الانترنيت سيؤدي إلى توفير كبير مقارنة مع استخدام المجموعات > التقليدية للمحارف المشفرة. كما أن استخدام "يونِكود" سيُمكِّن المبرمج من كتابة > البرنامج مرة واحدة، واستخدامه على أي نوع من الأجهزة أو الأنظمة، ولأي لغة أو > دولة في العالم أينما كانت، دون الحاجة لإعادة البرمجة أو إجراء أي تعديل. > وأخيرا، فإن استخدام "يونِكود" سيمكن البيانات من الانتقال عبر الأنظمة والأجهزة > المختلفة دون أ� � خطورة لتحريفها، مهما تعددت الشركات الصانعة للأنظمة واللغات، والدول التي تمر من خلالها هذه البيانات. Looks like most of the changes in java2d/* are related to spaces at the end of the line? - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038746193
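As background to the methodology quoted above, the `ggrep -P "[^\x00-\x7F]"` step simply flags any byte outside the 7-bit ASCII range. A rough Java equivalent is sketched below purely for illustration; the class name and output format are invented, and unlike grep it reports only the first non-ASCII byte in each file rather than every matching line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FindNonAscii {
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path file = Path.of(name);
            byte[] bytes = Files.readAllBytes(file);
            // Report the first byte outside the 7-bit ASCII range, if any.
            for (int i = 0; i < bytes.length; i++) {
                int b = bytes[i] & 0xFF;
                if (b > 0x7F) {
                    System.out.printf("%s: non-ASCII byte 0x%02X at offset %d%n", file, b, i);
                    break;
                }
            }
        }
    }
}
```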
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 18:30:22 GMT, Eirik Bjørsnøs wrote: >> If this is a French name, it's e acute: é. > >> If this is a French name, it's e acute: é. > > Supported by this Wikipedia page listing S.L as an LCMS developer: > > https://en.wikipedia.org/wiki/Little_CMS It's not a mistake in capitalization; it's two different characters resulting from two different encodings. (Probably ISO-8859-1 mistaken for ANSI, IIRC.) I verified the developer's name against the original file in the LCMS repo. - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038362034
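To illustrate the kind of encoding mix-up being described, the small Java demo below (not taken from the PR) encodes the name as UTF-8 and then decodes the same bytes as ISO-8859-1, which is one common way a single accented character ends up rendered as two wrong characters; the exact pair of encodings involved in lcms.md may have been different.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String name = "Sébastien Léon";
        // Encode as UTF-8, then (incorrectly) decode the same bytes as ISO-8859-1.
        byte[] utf8Bytes = name.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println("Correct:  " + name);     // Sébastien Léon
        System.out.println("Mojibake: " + misread);  // SÃ©bastien LÃ©on
    }
}
```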
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... The whitespace changes are my editor removing whitespace at the end of a line. This is something we enforce for many file types, but the check does not yet formally include .txt files. I have been working from time to time on extending the set of files covered by this check, so in general I have not tried to circumvent my editor when it strips trailing whitespace, even for files where jcheck does not yet require it. - PR Comment: https://git.openjdk.org/jdk/pull/24566#issuecomment-2795201480
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 19:06:35 GMT, Eirik Bjørsnøs wrote: > (BTW, I enjoyed seeing separate commits for the encoding and BOM changes, makes it easier to verify each!) Thanks! I very much like reviewing PRs that have separate logical commits, so I try to produce such PRs myself. I'm glad to hear it was appreciated. - PR Comment: https://git.openjdk.org/jdk/pull/24566#issuecomment-2795203125
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... LGTM. There are some whitespace-related changes in this PR which seem okay, but have no mention in either the JBS issue or the PR description. Perhaps a short mention of this intention in either place would be good for future historians. (BTW, I enjoyed seeing separate commits for the encoding and BOM changes; it makes it easier to verify each!) - Marked as reviewed by eirbjo (Committer). PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2758055634
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... src/java.desktop/share/legal/lcms.md line 103: > 101: Tim Zaman > 102: Amir Montazery and Open Source Technology Improvement Fund (ostif.org), > Google, for fuzzer fundings. > 103: ``` This introduces an empty trailing line. I see you have removed trailing whitespace elsewhere. Was this intentional, to avoid the file ending with the three ticks? - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038071768
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 17:23:37 GMT, Raffaello Giulietti wrote: > If this is a French name, it's e acute: é. Supported by this Wikipedia page listing S.L as an LCMS developer: https://en.wikipedia.org/wiki/Little_CMS - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038022994
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... Marked as reviewed by naoto (Reviewer). - PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2757716905
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... Marked as reviewed by erikj (Reviewer). - PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2757703868
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 17:09:27 GMT, Naoto Sato wrote: >> I have checked the entire code base for incorrect encodings, but luckily >> enough these were the only remaining problems I found. >> >> BOM (byte-order mark) is a method used for distinguishing big and little >> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is >> discouraged. In the words of the Unicode Consortium: "Use of a BOM is >> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful >> of files. These should be removed. >> >> Methodology used: >> >> I have run four different tools for using different heuristics for >> determining the encoding of a file: >> * chardetect (the original, slow-as-molasses Perl program, which also had >> the worst performing heuristics of all; I'll rate it 1/5) >> * uchardet (a modern version by freedesktop, used by e.g. Firefox) >> * enca (targeted towards obscure code pages) >> * libmagic / `file --mime-encoding` >> >> They all agreed on pure ASCII files (which is easy to check), and these I >> just ignored/accepted as good. The handling of pure binary files differed >> between the tools; most detected them as binary but some suggested arcane >> encodings for specific (often small) binary files. To keep my sanity, I >> decided that files ending in any of these extensions were binary, and I did >> not check them further: >> * >> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` >> >> From the remaining list of non-ascii, non-known-binary files I selected two >> overlapping and exhaustive subsets: >> * All files where at least one tool claimed it to be UTF-8 >> * All files where at least one tool claimed it to be *not* UTF-8 >> >> For the first subset, I checked every non-ASCII character (using `C_ALL=C >> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat >> names-of-files-to-check.txt)`, and visually examining the results). At this >> stage, I found several files where unicode were unnecessarily used instead >> of pure ASCII, and I treated those files separately. Other from that, my >> inspection revealed no obvious encoding errors. This list comprised of about >> 2000 files, so I did not spend too much time on each file. The assumption, >> after all, was that these files are okay. >> >> For the second subset, I checked every non-ASCII character (using the same >> method). This list was about 300+ files. Most of them were okay far as I can >> tell; I can confirm encodings for European languages 100%, but JCK encodings >> could theoretically be wrong; they looked sane but I cannot read and confirm >> fully. Several were in fact pure... > > src/java.desktop/share/legal/lcms.md line 72: > >> 70: Mateusz Jurczyk (Google) >> 71: Paul Miller >> 72: Sébastien Léon > > I cannot comment on capitalization here, but if we wanted to lowercase them, > should they be e-grave instead of e-acute? If this is a French name, it's e acute: é. - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037917708
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... src/java.desktop/share/legal/lcms.md line 72: > 70: Mateusz Jurczyk (Google) > 71: Paul Miller > 72: Sébastien Léon I cannot comment on capitalization here, but if we wanted to lowercase them, should they be e-grave instead of e-acute? - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037895884
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 11:46:45 GMT, Raffaello Giulietti wrote: > I guess the difference at L.1 in the various files is just the BOM? Yes. - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037357899
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... I only checked these 13 files to be UTF-8 encoded and without BOM. - Marked as reviewed by rgiulietti (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2756936848
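For anyone who wants to repeat such a check mechanically rather than by eye, here is one possible Java sketch, not the procedure used by the reviewers, that reports whether a file decodes as strict UTF-8 and whether it begins with a UTF-8 BOM; the class name and output format are invented for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckUtf8NoBom {
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path file = Path.of(name);
            byte[] bytes = Files.readAllBytes(file);

            // A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
            boolean hasBom = bytes.length >= 3
                    && (bytes[0] & 0xFF) == 0xEF
                    && (bytes[1] & 0xFF) == 0xBB
                    && (bytes[2] & 0xFF) == 0xBF;

            // A decoder configured to REPORT errors rejects any byte sequence that is
            // not well-formed UTF-8 instead of silently substituting replacement chars.
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            boolean validUtf8;
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                validUtf8 = true;
            } catch (CharacterCodingException e) {
                validUtf8 = false;
            }

            System.out.printf("%s: valid UTF-8 = %b, has BOM = %b%n", file, validUtf8, hasBom);
        }
    }
}
```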
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:14:40 GMT, Magnus Ihse Bursie wrote: >> I have checked the entire code base for incorrect encodings, but luckily >> enough these were the only remaining problems I found. >> >> BOM (byte-order mark) is a method used for distinguishing big and little >> endian UTF-16 encodings. There is a special UTF-8 BOM, but it is >> discouraged. In the words of the Unicode Consortium: "Use of a BOM is >> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful >> of files. These should be removed. >> >> Methodology used: >> >> I have run four different tools for using different heuristics for >> determining the encoding of a file: >> * chardetect (the original, slow-as-molasses Perl program, which also had >> the worst performing heuristics of all; I'll rate it 1/5) >> * uchardet (a modern version by freedesktop, used by e.g. Firefox) >> * enca (targeted towards obscure code pages) >> * libmagic / `file --mime-encoding` >> >> They all agreed on pure ASCII files (which is easy to check), and these I >> just ignored/accepted as good. The handling of pure binary files differed >> between the tools; most detected them as binary but some suggested arcane >> encodings for specific (often small) binary files. To keep my sanity, I >> decided that files ending in any of these extensions were binary, and I did >> not check them further: >> * >> `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` >> >> From the remaining list of non-ascii, non-known-binary files I selected two >> overlapping and exhaustive subsets: >> * All files where at least one tool claimed it to be UTF-8 >> * All files where at least one tool claimed it to be *not* UTF-8 >> >> For the first subset, I checked every non-ASCII character (using `C_ALL=C >> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat >> names-of-files-to-check.txt)`, and visually examining the results). At this >> stage, I found several files where unicode were unnecessarily used instead >> of pure ASCII, and I treated those files separately. Other from that, my >> inspection revealed no obvious encoding errors. This list comprised of about >> 2000 files, so I did not spend too much time on each file. The assumption, >> after all, was that these files are okay. >> >> For the second subset, I checked every non-ASCII character (using the same >> method). This list was about 300+ files. Most of them were okay far as I can >> tell; I can confirm encodings for European languages 100%, but JCK encodings >> could theoretically be wrong; they looked sane but I cannot read and confirm >> fully. Several were in fact pure... > > src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 497: > >> 495: /* >> 496: The algorithm below is based on Intel publication: >> 497: "Fast SHA-256 Implementations on Intel(R) Architecture Processors" by >> Jim Guilford, Kirk Yap and Vinodh Gopal. > > Note: There is of course a unicode `®` symbol, which is what it was > originally before it was botched here, but I found no reason to keep this, > and in the spirit of JDK-8354213, I thought it better to use pure ASCII here. I guess the difference at L.1 in the various files is just the BOM? - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037161789
Re: RFR: 8354266: Fix non-UTF-8 text encoding
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily > enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little > endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. > In the words of the Unicode Consortium: "Use of a BOM is neither required nor > recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These > should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for > determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the > worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I > just ignored/accepted as good. The handling of pure binary files differed > between the tools; most detected them as binary but some suggested arcane > encodings for specific (often small) binary files. To keep my sanity, I > decided that files ending in any of these extensions were binary, and I did > not check them further: > * > `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two > overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C > ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat > names-of-files-to-check.txt)`, and visually examining the results). At this > stage, I found several files where unicode were unnecessarily used instead of > pure ASCII, and I treated those files separately. Other from that, my > inspection revealed no obvious encoding errors. This list comprised of about > 2000 files, so I did not spend too much time on each file. The assumption, > after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same > method). This list was about 300+ files. Most of them were okay far as I can > tell; I can confirm encodings for European languages 100%, but JCK encodings > could theoretically be wrong; they looked sane but I cannot read and confirm > fully. Several were in fact pure binary files, but without any telling > exten... src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 497: > 495: /* > 496: The algorithm below is based on Intel publication: > 497: "Fast SHA-256 Implementations on Intel(R) Architecture Processors" by > Jim Guilford, Kirk Yap and Vinodh Gopal. Note: There is of course a unicode `®` symbol, which is what it was originally before it was botched here, but I found no reason to keep this, and in the spirit of JDK-8354213, I thought it better to use pure ASCII here. - PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037012318