[Bug 71386] Re: 'sort' does not correctly sort non-latin utf-8 encoded text

Luzius Thöny Wed, 16 May 2007 14:34:06 -0700

ok, let me explain my purpose: in a nutshell, i want to compile some
statistics on the frequencies of the individual symbols in a specific
text. for example, i want to know how much more frequent an 's' is
compared to a 't'. the way to achieve this is to split the text up so
that every letter/symbol occurs on an individual line, then sort it, and
finally count the lines with the same symbol using 'uniq -c'. my sed
script is intended to do just this (except the 'uniq -c' part), and i
believe it is correct the way i wrote it.
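
for comparison, here is a minimal sketch of the same count done without
'sort' at all, so that locale collation cannot affect the grouping. it is
not the sed script from this report: the function name, the sample string
and the choice to skip whitespace are all assumptions made for
illustration.

```python
from collections import Counter


def symbol_frequencies(text):
    """Count how often each symbol (Unicode code point) occurs in text.

    Assumption: whitespace is not of interest and is skipped.
    """
    # Counter groups by exact code point, so identical symbols are always
    # counted together regardless of the current locale.
    return Counter(ch for ch in text if not ch.isspace())


if __name__ == "__main__":
    # Hypothetical sample, not the text attached to the bug report.
    sample = "ʃʌt ðə ʃʌtəz"
    for symbol, count in symbol_frequencies(sample).most_common():
        # Layout loosely mimics 'uniq -c': count first, then the symbol.
        print(f"{count:7d} {symbol}")
```

note that this counts code points, so a base letter followed by a
combining diacritic would be counted as two separate symbols.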


the result i'm currently getting from the script run on the above text
is attached, and it just looks very wrong to me. you can see that the
normal letters (like 'n', 'r', or 's') are correctly sorted onto
adjacent lines in the result, but the IPA symbols like 'ʃ' or 'ʌ' are
not: they are scattered across different places in the result file.

** Attachment added: "result of the 'sorting' sed command"
   http://librarian.launchpad.net/7676987/sorted.txt

-- 
'sort' does not correctly sort non-latin utf-8 encoded text
https://bugs.launchpad.net/bugs/71386
You received this bug notification because you are a member of Ubuntu
Bugs, which is the bug contact for Ubuntu.

