OK, let me explain my purpose. In a nutshell, I want to gather statistics on the frequencies of the individual symbols in a specific text. For example, I want to know how much more frequent an 's' is than a 't'. The way to achieve this is to split the text up so that every letter/symbol occurs on its own line, then sort the lines, and finally count the lines with the same symbol using 'uniq -c'. Since 'uniq' only merges identical lines that are adjacent, the counts depend on the sort step grouping them properly. My sed script is intended to do just this (except the 'uniq -c' part), and I believe it is correct the way I wrote it.
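To cross-check the counts independently of 'sort', the same tally can be sketched in a few lines of Python. This is not the sed script from this report, only an illustration: the function name symbol_frequencies and the sample string are made up, and the sketch counts Unicode code points, so a combining diacritic would be counted separately from its base letter.

```python
from collections import Counter


def symbol_frequencies(text):
    """Count how often each non-whitespace code point occurs in text."""
    # Whitespace (spaces, newlines, tabs) is skipped; everything else,
    # including IPA symbols such as 'ʃ' or 'ʌ', is counted as-is.
    return Counter(ch for ch in text if not ch.isspace())


# Hypothetical sample input, not the text from the bug report.
sample = "ʃʌn sʌn ʃip"
counts = symbol_frequencies(sample)

# Print in the same "count symbol" layout that 'uniq -c' produces.
for symbol, n in counts.most_common():
    print(n, symbol)
```

Because the counting is done with a hash table rather than by comparing adjacent lines, it does not depend on how the locale collates non-Latin characters.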
The result I'm currently getting from the script run on the above text is attached, and it looks wrong to me. The normal letters (like 'n', 'r', or 's') are correctly sorted onto adjacent lines in the result, but the IPA symbols like 'ʃ' or 'ʌ' are not: they turn up in several different places in the result file.

** Attachment added: "result of the 'sorting' sed command"
   http://librarian.launchpad.net/7676987/sorted.txt

--
'sort' does not correctly sort non-latin utf-8 encoded text
https://bugs.launchpad.net/bugs/71386
