[Bug 684317] Re: Incorrect alphabetical sort order in thunar with non-latin (eg. cyrillic) file names

Bug Watch Updater Mon, 26 Jan 2015 20:56:35 -0800

Launchpad has imported 31 comments from the remote bug at
https://bugzilla.xfce.org/show_bug.cgi?id=7110.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2011-01-16T21:49:56+00:00 Alexander wrote:

Created attachment 3358
the screenshot

Sometimes when I open a folder in Thunar (item arrangement set to "By
Name") I see it put some folders into the "second series" of
arrangement.

For example. it goes 0-9, then A-Z, then А-Я (Cyrillic), and then...
again 0-9, then A-Z, then А-Я! The folders (haven't ever seen it happen
with files, only with folders) from the first and the second series
aren't the same, but I have no idea about what defines which one does a
folder belong to.

So, having both non-Cyrillic and Cyrillic folders in one folder may lead
to having this bug. Workaround: change item arrangement to any other (e.
g. "By Modification Date"), then change it back to "By Name".

Screenshot included:
http://www4.picturepush.com/photo/a/4880597/img/4880597.png

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/2

------------------------------------------------------------------------
On 2012-04-10T07:31:09+00:00 Masato HASHIMOTO wrote:

Created attachment 4311
Screenshot in Japanese

This issue seems to always occur and jufofu's workaround doesn't work on
thunar-git.
I get this issue in Japanese.
Attached is screenshot of thunar and bash.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/4

------------------------------------------------------------------------
On 2012-04-10T07:37:33+00:00 Masato HASHIMOTO wrote:

Created attachment 4312
test sample

Attached is test sample files of Comment #1.
Each numbers following Japanese characters in file name are unicode codepoint
of the ja characters.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/5

------------------------------------------------------------------------
On 2012-04-14T20:22:04+00:00 Stephan Arts wrote:

The problem seems to be caused by (the use of) glib.

g_utf8_get_char () returns '0' on the first character.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/6

------------------------------------------------------------------------
On 2012-04-14T20:30:35+00:00 Stephan Arts wrote:

To clarify: 'utf-8' ordering fails due to the problem described above.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/7

------------------------------------------------------------------------
On 2012-04-30T09:35:21+00:00 Andrzej wrote:

Created attachment 4375
A fix.

Seems to work here.

I know nothing about Thunar internals so I can't guarantee that the
patch is correct.

(thank you Hashimoto-san, greetings from Japan).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/8

------------------------------------------------------------------------
On 2012-04-30T13:22:51+00:00 Stephan Arts wrote:

This patch would sort the items as followed:

(test) John Doe.txt
あおい輝彦.3042-304a-3044-8f1d-5f66.txt
Alan Smithee.txt
一ノ瀬泰造.4e00-30ce-702c-6cf0-9020.txt
一条忠頼.4e00-6761-5fe0-983c.txt
一青窈.4e00-9752-7a88.txt
堀口雅也.5800-53e3-96c5-4e5f.txt
堀孝史.5800-5b5d-53f2.txt
朱謙之.6731-8b19-4e4b.txt

Did we run into another bug with the sorting algorithm?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/9

------------------------------------------------------------------------
On 2012-04-30T17:30:18+00:00 Andrzej wrote:

Created attachment 4377
More fixes.

I've found two more bugs:
- comparison function should not return 0 ("equal"), even if we're using case
insensitive sorting
- arguments of strcoll were truncated to a single character - strcoll doesn't
like it and returns a different result than with a longer string.

The sorting order now closely follows behavior of strcoll, so if there are
any problems with it, they are likely coming from strcoll.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/10

------------------------------------------------------------------------
On 2012-04-30T17:55:21+00:00 Andrzej wrote:

Created attachment 4378
More fixes.

Added one more bugfix - a check for filename length of otherwise
identical utf-8 filenames.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/11

------------------------------------------------------------------------
On 2012-05-01T04:12:25+00:00 Andrzej wrote:

IMHO the code is ready to be used, I don't have anything else to add.

There are some remaining issues, which cannot be easily fixed:

1. 'A' < 'a' but 'ą' < 'Ą' - this is because former is coming from ascii code
comparison, and the latter from strcoll.
Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14039

Possible solutions:
- always use strcoll - gives a consistent ('a' < 'A' and 'ą' < 'Ą') ordering
but is slower for ascii characters, especially in case insensitive mode.
- just flip 'a-z' and 'A-Z' codes manually [1] (also gives 'a' < 'A' and 'ą'
< 'Ą')
- wait for http://sourceware.org/bugzilla/show_bug.cgi?id=14039 to be
resolved (would give 'A' < 'a' and 'Ą' < 'ą' but that's very unlikely)

2. あ < a < あa < aa < あaa
Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14038

No solution (but hopefully this will be fixed upstream). If fixed,
then the workaround in the patch (g_strconcat) will not be necessary, so
we can then improve performance a bit by removing it.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/12

------------------------------------------------------------------------
On 2012-05-01T04:34:08+00:00 Andrzej wrote:

Created attachment 4379
Swap ascii codes a-z and A-Z

This is a patch implementing the solution 1.2 from comment #9. It's
likely much faster than solution 1.1.

It *changes* the sorting order of ascii characters to make it consistent
with the order of non-ascii ones.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/13

------------------------------------------------------------------------
On 2012-05-01T08:37:46+00:00 Andrzej wrote:

Got some feedback from glibc bugzilla

1. They recommend using strxfrm for converting the string so that it
matches strcoll ordering during simple comparison.

However, strxfrm itself is pretty heavy, if we wanted "proper"
sorting we could simply switch to using strcoll on all strings. So, my
suggestion is to use the patch swapping 'a-z' for 'A-Z' maybe not the
prettiest but it does 90% of strxfrm at almost 0 cost.

2. Weird ordering of Japanese characters and our workaround - apparently
there are no Japanese language definitions in iso14651_t1_common file,
which means they are ignored in the first pass and handled in the second
one.

They said that the "workaround" is indeed a correct way of using
strcoll as there might be other ignored characters.

There was no indication whether Japanese definition will be added to
the iso14651_t1_common file but the bug was not closed so I imagine that
still on the table.

My conclusion:
Current patches are doing as much as we can without sacrificing performance in
ascii case (otherwise we could switch to strcoll completely). Other errors are
mostly caused by limitations of strcoll in glibc (possibly will be resolved
later).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/14

------------------------------------------------------------------------
On 2012-05-01T13:16:11+00:00 Andrzej wrote:

Created attachment 4380
sort using g_utf8_collate_key_for_filename()

After discussion on IRC we have decided to try the
g_utf8_collate_key_for_filename() function. It doesn't support number
sort (and there is no way to add it efficiently), but should do a better
job at sorting, and can potentially be faster (sorting itself is done by
a key comparison, cost of collation is unknown).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/15

------------------------------------------------------------------------
On 2012-05-01T13:27:56+00:00 Andrzej wrote:

Created attachment 4381
plugged a memory leak in the previous patch

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/16

------------------------------------------------------------------------
On 2012-05-02T05:23:52+00:00 Masato HASHIMOTO wrote:

(In reply to comment #13)
> Created attachment 4381 [details]
> plugged a memory leak in the previous patch

Andrzej-san:

Sorry for late reply.
Your patch works fine!!
Thank you for your quick work in spite of the Golden Week :)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/17

------------------------------------------------------------------------
On 2012-05-02T07:15:55+00:00 Gymka-p wrote:

note: "ū" is in wrong place, it's between "j" and "k" but it should be
in the end of alphabet(i looked to Maori, Hawaiian, Marshallese,
Lithuanian, Livonian, Latvian and Cornish alphabets in all these
alphabets that letter is in the end before "v" or "w").

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/18

------------------------------------------------------------------------
On 2012-05-02T07:43:29+00:00 Andrzej wrote:

(In reply to comment #15)
> note: "ū" is in wrong place, it's between "j" and "k" but it should be in
> the end of alphabet

Which patch are you using, and what's is your locale (LC_COLLATE)?

I've checked that with LC_COLLATE=POSIX "ū" is after "z"
I don't have Lithuanian locale installed so I can't check it here but different
locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8 "ū" is between
"u" and "v")

Note that with patch #13 sorting is done by glib (and ultimately by
glibc), so if you see are any errors they come either from an error in
your system configuration or from a bug in these libraries (glibc).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/19

------------------------------------------------------------------------
On 2012-05-02T07:51:20+00:00 Gymka-p wrote:

(In reply to comment #16)
> (In reply to comment #15)
> > note: "ū" is in wrong place, it's between "j" and "k" but it should be in
> > the end of alphabet
>
> Which patch are you using, and what's is your locale (LC_COLLATE)?
>
> I've checked that with LC_COLLATE=POSIX "ū" is after "z"
> I don't have Lithuanian locale installed so I can't check it here but
> different locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8
> "ū" is between "u" and "v")
>
> Note that with patch #13 sorting is done by glib (and ultimately by glibc),
> so if you see are any errors they come either from an error in your system
> configuration or from a bug in these libraries (glibc).

i'm using patch from comment #13.
LC_COLLATE=C
in my system, variable "LANG" has value which you mentioning, in my case
lt_LT.UTF-8

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/20

------------------------------------------------------------------------
On 2012-05-02T08:18:07+00:00 Andrzej wrote:

Most likely there is no bug (if you use correct LC_COLLATE), or if there
is, it is not in thunar.

Try this:
/close *all* thunar windows /
$ thunar -q
$ LC_COLLATE=lt_LT.UTF-8 thunar

If there are any problems tell me about it on #xfce (irc.freenode.net).
Bugzilla is not a support forum.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/21

------------------------------------------------------------------------
On 2012-08-14T02:27:06+00:00 Bw-owlet wrote:

The latest version of Thunar (1.4.0) incorrectly sorts contents of
folders with cyrillic letters in files and folders names.

Here is the contents of one folder "sorted" by name (ascending).
Looking from top to bottom I see file names starting with...
* cyrillic upper letters
* digits
* cyrillic lower letters
* again digits
* again cyrillic lower letters
* english lower letters
* and again cyrillic lower letters

Concrete example of another folder "sorted" by name:
> голубь.txt
> иволга.txt
> аист.txt
> орёл.txt
> сова.txt

Absoluletly wrong order. The file with name that starts with 'A' is in
the middle.

Cyrillic alpabet is:

Аа Бб Вв Гг Дд Ее Ёё Жж Зз Ии Йй Кк Лл Мм Нн Оо Пп Рр Сс Тт Уу Фф Хх Цц
Чч Шш Щщ ЬЬ Ыы ЪЪ Ээ Юю Яя

Meanwhile, "ls -1" gives right order
> аист.txt
> голубь.txt
> иволга.txt
> орёл.txt
> сова.txt

PCManFM and other filemanagers give right sort order.

So, Thunar DOES NOT sort with "ls". Also, sort order in Thunar CAN NOT
be changed using LC_COLLATE. It ignores this variable, but instead uses
its own "mega-wise" algorithm.

What's the matter, guys?! Prior to version 1.4.0, everything was OK in
Thunar.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/22

------------------------------------------------------------------------
On 2012-08-16T11:56:15+00:00 guoxh wrote:

Same problem here, for Chinese filenames.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/23

------------------------------------------------------------------------
On 2012-10-03T10:00:58+00:00 8-nick wrote:

Can people help here a bit with some test files?

Name the files the following way: $(name).$(expected_position).txt, so
for example "аист.1.txt", "голубь.2.txt"

Talking about Cyrillic/non-Cyrillic here, Chinese. All that don't fit
into [a-Z]

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/24

------------------------------------------------------------------------
On 2012-10-03T10:37:57+00:00 8-nick wrote:

And please mention the used LC_COLLATE.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/25

------------------------------------------------------------------------
On 2012-10-03T12:31:08+00:00 Bw-owlet wrote:

My variants in Cyrillic:

Variant 1

> голубь.2.txt
> аист.1.txt

Variant 2

> вишня.4.txt
> груша.5.txt
> апельсин.2.txt
> банан.3.txt
> ананас.1.txt
> киви.6.txt
> лимон.7.txt
> яблоко.8.txt

Locale settings. All is English.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/26

------------------------------------------------------------------------
On 2012-10-03T20:40:27+00:00 Andrzej wrote:

Created attachment 4647
test case

Sorting order as in nautilus. ls uses slightly different sort order (no
numeric sort, non-alphanumeric characters). andrzejr/utf8_collate
behaves mostly like nautilus except for the '#' sign.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/27

------------------------------------------------------------------------
On 2012-10-03T21:13:10+00:00 8-nick wrote:

Created attachment 4648
Updated patch

Slightly updated patch that peeks the case-folded name, if equal use the
case-hash. Saves some hashing and memory.

We could reduce the hashing to do this on the fly in the
thunar_file_compare_by_name function, but that's too much imho.

The special case in nautilus for '.' and '#' might be useful. Duno if
there are locales that put other characters in front of '.'/'#'

IMHO we should keep the hidden no-case option

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/28

------------------------------------------------------------------------
On 2012-10-03T21:55:05+00:00 Andrzej wrote:

Works well for me. Good idea with the optimization. Lazy hashing could
make sense when other methods of sorting are used (e.g. by modification
time) and only if they don't fall back to compare by name. IMHO benefit
not worth the complexity.

I have no preference for special characters ("#", "."). I don't know why
nautilus is treating them differently.

I also feel leaving case-sensitive option for POSIX locale users is OK.
We should probably change the default to case insensitive sort, to avoid
confusion.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/29

------------------------------------------------------------------------
On 2012-10-04T06:56:57+00:00 8-nick wrote:

The default is already case-insensitive in Thunar, so that doesn't need
to change.

Nautilus sorts 'hidden' files after the other names, instead of showing
them first. GTK+ doesn't and there are also bugs for that in the gnome
bugtracker: https://bugzilla.gnome.org/show_bug.cgi?id=358812

The change obviously fixed sorting locales, but are there also
situations where Thunar does a better job?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/30

------------------------------------------------------------------------
On 2012-10-04T16:30:39+00:00 8-nick wrote:

Pushed patch in 1fcb0e7 if there are sorting regressions please open a
new bug.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/31

------------------------------------------------------------------------
On 2012-10-12T09:57:04+00:00 8-nick wrote:

*** Bug 9218 has been marked as a duplicate of this bug. ***

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/32

------------------------------------------------------------------------
On 2012-10-30T07:56:20+00:00 8-nick wrote:

*** Bug 3724 has been marked as a duplicate of this bug. ***

Reply at:
https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/684317/comments/33

** Changed in: thunar
Status: Unknown => Fix Released

** Changed in: thunar
Importance: Unknown => Low

** Bug watch added: Sourceware.org Bugzilla #14039
http://sourceware.org/bugzilla/show_bug.cgi?id=14039

** Bug watch added: Sourceware.org Bugzilla #14038
http://sourceware.org/bugzilla/show_bug.cgi?id=14038

** Bug watch added: GNOME Bug Tracker #358812
https://bugzilla.gnome.org/show_bug.cgi?id=358812

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/684317

Title:
Incorrect alphabetical sort order in thunar with non-latin (eg.
cyrillic) file names

To manage notifications about this bug go to:
https://bugs.launchpad.net/thunar/+bug/684317/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 684317] Re: Incorrect alphabetical sort order in thunar with non-latin (eg. cyrillic) file names

Reply via email to