[Touch-packages] [Bug 1774857] Re: sort doesn't sort and uniq loses data for many non-Latin scripts on UTF-8 locales

2019-04-29 Thread Seth Arnold
Probably related:
https://bugzilla.redhat.com/show_bug.cgi?id=1336308

and probably related:
https://sourceware.org/git/?p=glibc.git;a=commit;h=b11643c21c5c9d67a69c8ae952e5231ce002e7f1

Thanks

** Bug watch added: Red Hat Bugzilla #1336308
   https://bugzilla.redhat.com/show_bug.cgi?id=1336308

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to coreutils in Ubuntu.
https://bugs.launchpad.net/bugs/1774857

Title:
  sort doesn't sort and uniq loses data for many non-Latin scripts on
  UTF-8 locales

Status in coreutils package in Ubuntu:
  New
Status in glibc package in Ubuntu:
  New

Bug description:
  I’ve found out that sort doesn’t sort strings for many non-Latin
  scripts at all if the locale you’re using is one of en_US.UTF-8,
  fr_FR.UTF-8 or fi_FI.UTF-8 (probably others, too, but these are the
  ones I have tested). For locales ”C” and ko_KR.UTF-8, things work as
  expected. Here’s a test case:

  Open xterm, launch sort and input some lines of Syriac, Ethiopic,
  Korean, Japanese (Hiragana or Katakana, not Han) or Thai text
  repeating one of the lines twice. Here’s an example in Syriac:

  ܡܠܬܐ
  ܒܝܬܐ
  ܒܪܢܫܐ
  ܡܠܬܐ

  Sort produces the following:

  ܡܠܬܐ
  ܒܝܬܐ
  ܡܠܬܐ
  ܒܪܢܫܐ

  Here the strings are ordered only by their length, not by their
  characters. Even the two instances of the word ܡܠܬܐ end up on
  non-adjacent lines (1 and 3). The expected sort order based on
  Unicode code points would be:

  ܒܝܬܐ
  ܒܪܢܫܐ
  ܡܠܬܐ
  ܡܠܬܐ
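
The expected order above can be cross-checked outside of sort. This is a minimal sketch in Python, assuming only that Python orders str values code point by code point without consulting the locale. The helper name uniq_adjacent is made up for this illustration and mimics what uniq does with adjacent duplicates:

```python
# Sort the Syriac test lines by Unicode code point, independent of any locale.
lines = ["ܡܠܬܐ", "ܒܝܬܐ", "ܒܪܢܫܐ", "ܡܠܬܐ"]

# Python's default str ordering compares code points, so no collation
# tables are involved.
by_code_point = sorted(lines)


def uniq_adjacent(items):
    """Drop repeated adjacent items, like uniq (hypothetical helper)."""
    result = []
    for item in items:
        if not result or result[-1] != item:
            result.append(item)
    return result


deduplicated = uniq_adjacent(by_code_point)

for line in by_code_point:
    print(line)
print("distinct lines:", len(deduplicated))
```

The input contains three distinct words, so a correct sort | uniq should keep three lines rather than two.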

  If you further pass sort’s output to uniq, it produces the following:

  ܡܠܬܐ
  ܒܪܢܫܐ

  Here the word on line 2 ܒܝܬܐ is completely lost since, like sort, uniq
  seems to consider all Syriac strings of equal length as the same.

  Although this issue depends on the locale, I think it is not a locale
  issue per se, since perl seems to handle similar cases as expected.
  For instance, the following command produces the expected result:

  perl -CDS -e 'use locale; use utf8; @str = ("ܡܠܬܐ", "ܒܝܬܐ", "ܒܪܢܫܐ",
  "ܡܠܬܐ"); foreach $i (sort @str) { print "$i\n"; }'

  Curiously enough, codepoints in Plane 1 seem to count as two
  codepoints of the basic plane, so that if you sort | uniq the
  following (six codepoints of Syriac and three codepoints of
  Phoenician):

  ܥܠܝܟܘܢ
  𐤁𐤉𐤕

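  you get "ܥܠܝܟܘܢ" as the result, whereas the Phoenician line is lost.
  This is presumably because UTF-8 encodes each Plane 1 character in
  four bytes, twice the two bytes of a Syriac character, so both lines
  are 12 bytes long (surrogate pairs are a UTF-16 feature and do not
  occur in UTF-8).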
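
The sizes behind this example can be checked directly. This is a minimal sketch in Python. The three Phoenician letters are written as escapes and are an assumption chosen for illustration, since any Plane 1 characters encode to the same sizes. Surrogate pairs belong to UTF-16; in UTF-8 a Plane 1 character is a single four-byte sequence:

```python
# Compare code point counts and encoded sizes of the two lines.
syriac = "ܥܠܝܟܘܢ"  # six Syriac code points from the basic plane
# Three Phoenician letters written as escapes; the exact letters are an
# assumption for illustration, any Plane 1 characters behave the same.
phoenician = "\U00010901\U00010909\U00010915"

for name, text in (("Syriac", syriac), ("Phoenician", phoenician)):
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2
    print(name, code_points, utf8_bytes, utf16_units)

# UTF-8 refuses to encode surrogate code points; they exist only in UTF-16.
try:
    "\ud802".encode("utf-8")
    surrogate_encodable = True
except UnicodeEncodeError:
    surrogate_encodable = False
print("surrogate encodable in UTF-8:", surrogate_encodable)
```

Both lines occupy the same number of UTF-8 bytes, which fits the idea that the conflation follows encoded length rather than the number of characters. This report does not confirm that mechanism.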

  Also curiously, LTR scripts seem to conflate with each other and RTL
  scripts among themselves but not across the directionality line, so
  that if you sort | uniq the following (three codepoints each in
  Ethiopic, Hangul, Syriac, Hiragana and Thai):

  ዘመን
  스물셋
  ܐܢܐ
  わたし
  ฟ้า

  you are left with:

  ܐܢܐ
  ዘመን

  That’s one line of Syriac and one line of Ethiopic; everything else
  was lost. This issue does not seem to affect most Indic scripts
  (Devanagari, Bengali, Telugu etc.) or Arabic. For CJK, things work as
  expected for the main Unicode block (4E00..9FFF) but not for Extension
  A (3400..4DBF, such as 㗖 or 㡘 or 㰋). For Greek, monotonic accents work
  fine but all polytonic letters are conflated (αὐλὸς and αὐλῆς conflate
  to αὐλῆς). For Hebrew, letters and vowel marks work fine but
  cantillation marks are conflated.
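
The five-script example above can also be grouped by encoded size. This is a minimal sketch in Python. Nothing in this report confirms the mechanism, but one reading that fits the two surviving lines is that lines of equal UTF-8 byte length are being treated as equal, regardless of directionality:

```python
from collections import defaultdict

# The five three-code-point lines from the report.
samples = {
    "Ethiopic": "ዘመን",
    "Hangul": "스물셋",
    "Syriac": "ܐܢܐ",
    "Hiragana": "わたし",
    "Thai": "ฟ้า",
}

# Group script names by the UTF-8 byte length of their sample line.
by_byte_length = defaultdict(list)
for script, text in samples.items():
    by_byte_length[len(text.encode("utf-8"))].append(script)

for byte_length, scripts in sorted(by_byte_length.items()):
    print(byte_length, "bytes:", ", ".join(scripts))
```

Syriac ends up alone in one group and the other four scripts share the second group, matching the one Syriac line and one other line that survive sort | uniq.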

  
  Description:  Ubuntu 18.04 LTS
  Release:  18.04

  coreutils:
    Installed: 8.28-1ubuntu1
    Candidate: 8.28-1ubuntu1
    Version table:
   *** 8.28-1ubuntu1 500
          500 http://mr.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
          100 /var/lib/dpkg/status

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: coreutils 8.28-1ubuntu1
  ProcVersionSignature: Ubuntu 4.15.0-22.24-generic 4.15.17
  Uname: Linux 4.15.0-22-generic x86_64
  ApportVersion: 2.20.9-0ubuntu7.1
  Architecture: amd64
  CurrentDesktop: ubuntu:GNOME
  Date: Sun Jun  3 10:13:06 2018
  InstallationDate: Installed on 2017-02-13 (474 days ago)
  InstallationMedia: Ubuntu 16.10 "Yakkety Yak" - Release amd64 (20161012.2)
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=
   LANG=fi_FI.UTF-8
   SHELL=/bin/bash
  SourcePackage: coreutils
  UpgradeStatus: Upgraded to bionic on 2018-05-31 (2 days ago)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/1774857/+subscriptions

-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 1774857] Re: sort doesn't sort and uniq loses data for many non-Latin scripts on UTF-8 locales

2018-12-08 Thread Adam Conrad
Using the first test case, this does appear to be fixed in cosmic (glibc
2.28) and beyond, and to affect only bionic (glibc 2.27), which certainly
implies either an upstream or Debian fix slipped in between the two.
I'm not sure I'll have the bandwidth to dig into it this SRU cycle, but
I'll try to look again when I can.


[Touch-packages] [Bug 1774857] Re: sort doesn't sort and uniq loses data for many non-Latin scripts on UTF-8 locales

2018-08-29 Thread Miikka-Markus Alhonen
One user on debbugs.gnu.org reported that the problem is more likely
related to the locale / glibc than coreutils, and that it occurs on
Ubuntu 18.04 but not Fedora 28, in case that helps any. He thought it
might have already been fixed in glibc, since Fedora tends to be more up
to date than Ubuntu.

** Also affects: glibc (Ubuntu)
   Importance: Undecided
   Status: New


[Touch-packages] [Bug 1774857] Re: sort doesn't sort and uniq loses data for many non-Latin scripts on UTF-8 locales

2018-08-18 Thread Miikka-Markus Alhonen
Since nobody has reacted to this report for a couple of months, I
decided to file an upstream report at
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=32472
