Re: [NTG-context] sort-lan.lua nitpicks and sorting
On 2-5-2010 3:59, Philipp Gesang wrote:

> 1. In sort-lan.lua, line 101 should read «['r'] = "r"», and line 144
>    «['r'] = 26, -- r».

i patched the file

> 2. Although I read the disclaimer about said file being “preliminary and
>    incomplete” -- is there some rationale behind the range of integers for
>    each language mapping? The mapping for English goes from 1 to 51,
>    interleaving 2 integers for each letter (which is odd because it should
>    start from index 3 with “a”, shouldn't it?), while the Czech one goes
>    from 1 to 40 without skipping, Finnish and Austrian from 1 to 58.

some old (ruby) code was used etc etc

> What about mapping them onto a larger but common scale that would
> alleviate multilingual sorting so that the alphabetical representation of
> the phoneme /a/ maps to the same value over different languages?† E.g.
>
>     ["a"] = 3, -- in a Latin mapping,
>     ["α"] = 3, -- in Greek mapping,
>     ["а"] = 3, -- in a Russian mapping.

hm, interesting ... feel free to reshuffle and provide patches

> † I know this is impractical for many writing systems and even within the
> set of Latin or Greek based alphabets it largely depends on a given
> purpose how much precision you need in sorting.

indeed but we can have multiple variants and are not bound to specific conventions

Hans

-
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-
___
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___
Re: [NTG-context] sort-lan.lua nitpicks and sorting
On 2010-05-02 <15:59:53>, Philipp Gesang wrote:

> Hi again,
>
> 1. In sort-lan.lua, line 101 should read «['r'] = "r"», and line 144
>    «['r'] = 26, -- r».

In lines 152 and 109, concerning the character “ů” (uring in unicode speak), there's a typo: the key should be “uc(0x016F)” instead of “uc(0x01F6)”. The long quantities “ó” and “ý” are missing as well. They belong after their short counterparts.

I append a diff for the file.

Philipp

--
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

--- /home/laokoon/base/sort-lan.lua  2010-04-07 23:10:04.0 +0200
+++ sort-lan.lua  2010-05-03 09:28:23.813291928 +0200
@@ -98,7 +98,8 @@
 ['o'] = "o",
 ['p'] = "p",
 ['q'] = "q",
-['s'] = "r",
+['r'] = "r",
+[uc(0x00F3)] = uc(0x00F3), -- oacute
 [uc(0x0147)] = uc(0x0147), -- rcaron
 ['s'] = "s",
 [uc(0x0161)] = uc(0x0161), -- scaron
@@ -106,11 +107,12 @@
 [uc(0x0165)] = uc(0x0165), -- tcaron
 ['u'] = "u",
 [uc(0x00FA)] = "u",
-[uc(0x01F6)] = "u",
+[uc(0x016F)] = "u",
 ['v'] = "v",
 ['w'] = "w",
 ['x'] = "x",
 ['y'] = "y",
+[uc(0x00FD)] = uc(0x00FD), -- yacute
 ['z'] = "z",
 [uc(0x017E)] = uc(0x017E), -- zcaron
 }
@@ -139,23 +141,25 @@
 ['n'] = 21, -- n
 [uc(0x0147)] = 22, -- ncaron
 ['o'] = 23, -- o
-['p'] = 24, -- p
-['q'] = 25, -- q
-['s'] = 26, -- r
-[uc(0x0147)] = 27, -- rcaron
-['s'] = 28, -- s
-[uc(0x0161)] = 29, -- scaron
-['t'] = 30, -- t
-[uc(0x0165)] = 31, -- tcaron
-['u'] = 32, -- u
-[uc(0x00FA)] = 33, -- uacute
-[uc(0x01F6)] = 34, -- uring
-['v'] = 35, -- v
-['w'] = 36, -- w
-['x'] = 37, -- x
-['y'] = 38, -- y
-['z'] = 39, -- z
-[uc(0x017E)] = 40, -- zcaron
+[uc(0x00F3)] = 24, -- oacute
+['p'] = 25, -- p
+['q'] = 26, -- q
+['r'] = 27, -- r
+[uc(0x0147)] = 28, -- rcaron
+['s'] = 29, -- s
+[uc(0x0161)] = 30, -- scaron
+['t'] = 31, -- t
+[uc(0x0165)] = 32, -- tcaron
+['u'] = 33, -- u
+[uc(0x00FA)] = 34, -- uacute
+[uc(0x016F)] = 35, -- uring
+['v'] = 36, -- v
+['w'] = 37, -- w
+['x'] = 38, -- x
+['y'] = 39, -- y
+[uc(0x00FD)] = 40, -- yacute
+['z'] = 41, -- z
+[uc(0x017E)] = 42, -- zcaron
 }

 -- French
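Renumbering a block of indices by hand, as in the last hunk, is error-prone. A minimal sketch of a sanity check follows, assuming a mapping is a plain Lua table from characters to integers. The helper name `check_mapping` and the sample table are hypothetical and not part of sort-lan.lua.

```lua
-- Hypothetical helper, not part of sort-lan.lua: report duplicate values
-- and gaps in a character -> integer sort mapping.
local function check_mapping(mapping)
    local seen, max, problems = { }, 0, { }
    for _, index in pairs(mapping) do
        if seen[index] then
            problems[#problems+1] = string.format("duplicate index %d", index)
        end
        seen[index] = true
        if index > max then
            max = index
        end
    end
    for index = 1, max do
        if not seen[index] then
            problems[#problems+1] = string.format("missing index %d", index)
        end
    end
    table.sort(problems) -- pairs() has no fixed order, so sort the report
    return problems
end

-- Sample data, made up for illustration: "t" reuses 2, so 3 is skipped.
local sample = { r = 1, s = 2, t = 2, u = 4 }
local problems = check_mapping(sample)
for _, message in ipairs(problems) do
    print(message)
end
```

A check like this only sees values. A key that is repeated inside a Lua table constructor is not an error, and only one of the entries survives, so repeated keys still have to be caught by reading the source.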
[NTG-context] sort-lan.lua nitpicks and sorting
Hi again,

1. In sort-lan.lua, line 101 should read «['r'] = "r"», and line 144
   «['r'] = 26, -- r».

2. Although I read the disclaimer about said file being “preliminary and
   incomplete” -- is there some rationale behind the range of integers for
   each language mapping? The mapping for English goes from 1 to 51,
   interleaving 2 integers for each letter (which is odd because it should
   start from index 3 with “a”, shouldn't it?), while the Czech one goes
   from 1 to 40 without skipping, Finnish and Austrian from 1 to 58.

   What about mapping them onto a larger but common scale that would
   alleviate multilingual sorting so that the alphabetical representation
   of the phoneme /a/ maps to the same value over different languages?† E.g.

       ["a"] = 3, -- in a Latin mapping,
       ["α"] = 3, -- in Greek mapping,
       ["а"] = 3, -- in a Russian mapping.

3. Is it intended that the digraph “ch” resolves (temporarily) to
   http://www.fileformat.info/info/unicode/char/ff01/index.htm
   according to line 72?

Feel free to state more general opinions on the sorting topic as I am playing with different ways of sorting my bibliography.

I will be glad about any advice,

Philipp

† I know this is impractical for many writing systems and even within the set of Latin or Greek based alphabets it largely depends on a given purpose how much precision you need in sorting.
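The common-scale idea from point 2 could look roughly like the sketch below. It assumes a single shared weight table and a UTF-8 encoded source file. The table `common`, the helper names and the weights themselves are made up for illustration; this is not how sort-lan.lua is organized.

```lua
-- Hypothetical shared scale: letters that play the same role in different
-- scripts get the same weight. The values are illustrative only.
local common = {
    ["a"] = 3, ["α"] = 3, ["а"] = 3, -- Latin, Greek, Cyrillic
    ["b"] = 5, ["β"] = 5, ["б"] = 5,
}

-- Split a UTF-8 string into characters: one lead byte followed by any
-- number of continuation bytes (0x80-0xBF).
local function characters(str)
    local list = { }
    for char in string.gmatch(str, "[^\128-\191][\128-\191]*") do
        list[#list+1] = char
    end
    return list
end

-- Turn a word into a list of weights; unknown characters get weight 0.
local function weights(word)
    local result = { }
    for i, char in ipairs(characters(word)) do
        result[i] = common[char] or 0
    end
    return result
end

-- Compare two words weight by weight; on a shared prefix the shorter wins.
local function before(first, second)
    local wa, wb = weights(first), weights(second)
    for i = 1, math.min(#wa, #wb) do
        if wa[i] ~= wb[i] then
            return wa[i] < wb[i]
        end
    end
    return #wa < #wb
end

local words = { "ba", "αβ", "аа" } -- Latin, Greek, Cyrillic
table.sort(words, before)
print(table.concat(words, " "))
```

With these weights the Cyrillic word sorts first, then the Greek one, then the Latin one, because only the weights are compared and the script is ignored. Words from different scripts that map to identical weight lists compare as equal, so a real setup would probably want a second key, such as the script or the raw string, to break ties.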