Chris Little wrote:
Daniel Owens wrote:
The other MAJOR problem is that the dictionary keys are always capitalized, which makes it really awkward to use for Greek. Can I lobby again for a change in that? How many Greek students are used to looking up words in capitals? I was taught using lower case letters, and many of the capitals are really fuzzy. When reading I can work them out based on context sometimes because I have the rest of the word to clue me in. Capitals also make accent marks look strange. Then there is the issue of sort order again...

I quite agree. No language with letter casing uses capital forms as its primary form; capital letters are less recognizable and slow reading speed.

That said, I don't quite know how we ought to solve the issue. We can't simply lowercase the existing keys, since many would actually need to incorporate capitals (e.g. personal & place names). And we'll need to do some kind of case folding when we do key lookups.

Making keys be cased and doing case folding at runtime handles part of the issue. However, key sorting becomes more difficult and we have to guard against the possibility of keys that are identical except for casing (e.g. "a" and "A").
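To make the idea concrete, here is a minimal sketch (plain Python, not SWORD code) of case-preserving keys with case-folded lookup, including the guard against keys that are identical except for casing:

```python
# Sketch: store cased display keys, look them up via case folding.
# str.casefold() does full Unicode case folding (e.g. Greek final
# sigma), which plain lower() can miss.
def build_index(display_keys):
    index = {}
    for key in display_keys:
        folded = key.casefold()
        # guard against two keys identical except for casing
        if folded in index and index[folded] != key:
            raise ValueError(
                f"case-folded collision: {key!r} vs {index[folded]!r}")
        index[folded] = key
    return index

index = build_index(["Abraham", "logos"])
print(index["ABRAHAM".casefold()])  # Abraham — cased display key survives
```

The point is that casing is preserved for display while lookup happens against the folded form, and collisions like "a" vs "A" are detected at build time rather than silently overwriting.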

It is a hard problem, but not intractable. I think it might require a new module type.

Some thoughts: Lookup and collation are two different problems that have a single solution today. Lookup is the process of taking input and finding one or more entries. Collation is the ordering of entries for the purpose of display. These don't have to have a single solution.

Today, our modules use a strict byte ordering of the upper-case representation of each entry's term. For latin-1/cp1252, this gives a well-defined, though sometimes inappropriate, ordering. For UTF-8, the situation is more complex. A given glyph can have more than one representation in UTF-8: an accented letter may be a single code point, or a base letter followed by one or more combining accents in any order. Without normalization of the entry's term (we've settled on NFC), the ordering is not well-defined. With normalization it is, but it may still produce an inappropriate ordering.
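The two-representations problem can be seen directly with Python's stdlib unicodedata (a sketch of the normalization step, not our actual code):

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (the NFC form)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (the NFD form)

# Same glyph on screen, different byte sequences — so byte ordering
# of unnormalized terms is not well-defined.
assert composed != decomposed
# NFC collapses the decomposed form back to the single code point.
assert unicodedata.normalize("NFC", decomposed) == composed
print(len(composed), len(decomposed))  # 1 2
```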

Given this well-defined order, lookup of a word begins by converting it to upper case (and, for a UTF-8 dictionary, converting it to NFC as well); a binary search can then be performed, which lands on the nearest first match in the collated list.
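A minimal sketch of that lookup path (Python stdlib, with made-up Greek headwords for illustration):

```python
import bisect
import unicodedata

def normalize(term):
    # upper case + NFC, mirroring the module's key normalization
    return unicodedata.normalize("NFC", term.upper())

# the collated list: normalized keys in strict code-point order
keys = sorted(normalize(k) for k in ["λόγος", "θεός", "ἀγάπη"])

def lookup(word):
    target = normalize(word)
    i = bisect.bisect_left(keys, target)  # nearest first match
    return keys[i] if i < len(keys) else None

print(lookup("θ"))  # lands on the nearest entry at or after "Θ"
```

Because bisect_left returns the insertion point, a partial input like "θ" still lands on the first entry at or after it, which is exactly the "nearest first match" behavior described above.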

When each dictionary module is built, the input file does not need to be ordered. As each entry is added, it is stored against its normalized key. If a subsequent entry normalizes to the same key as a prior one, the key will no longer point to the first entry but to the subsequent one. (On a side note, the dat file will still contain the first entry; if nothing points to it, it is orphaned.)

One of the impacts of this mechanism is that there cannot be two entries with the same "key". Yet many dictionaries have multiple entries under the same headword, so I think we should have a solution that provides for this.
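One simple shape for such a solution (a sketch, not a proposal for the on-disk format): let each normalized key hold a list of (display key, entry) pairs instead of a single entry.

```python
from collections import defaultdict

# one normalized key can hold several (display key, entry) pairs,
# so a second entry no longer silently replaces the first
index = defaultdict(list)

def add_entry(display_key, entry_text):
    index[display_key.casefold()].append((display_key, entry_text))

# hypothetical entries for illustration
add_entry("Λόγος", "entry for the proper-noun sense")
add_entry("λόγος", "entry for the common noun")
print(len(index["ΛΌΓΟΣ".casefold()]))  # 2 — both entries kept
```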

I think there needs to be a notion of an internal sort key and an external display key on a per-entry basis. Lookup would need to be against the internal key, so a routine would be needed to convert/normalize input into the form of the internal key and use that for lookup.

ICU has the notion of a collation key, which can be used for such a purpose. (I think we've gotten to the point where ICU is a requirement for UTF-8 modules.) In ICU, the collation key is locale dependent. (For example, German and French sort accented letters differently. In Spanish dictionaries, at least older ones, ch was treated as a separate letter sorted after c.) I really don't see any way around having a static collation for a module. If so, the collation would need to be fixed with respect to either a fixed locale or a locale based on the language of the module.

The other aspect of lookup is that we will be producing accented dictionaries, but we want them to work with unaccented texts. For example, we have unaccented Greek texts, and Hebrew can be shown without vowel points or cantillation. The next round of Greek and Hebrew dictionaries will have accents and vowel points, so finding one or more accented entries from an unaccented input needs to work.
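Mark stripping is mechanical once terms are decomposed: drop the combining marks and recompose. A stdlib sketch of that lookup key (Greek accents and breathings, Hebrew vowel points, and cantillation marks are all Unicode category Mn):

```python
import unicodedata

def strip_marks(word):
    # decompose, drop combining marks, recompose
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", bare)

# index accented entries under their unaccented forms
entries = ["λόγος", "θεός"]
unaccented = {}
for e in entries:
    unaccented.setdefault(strip_marks(e), []).append(e)

print(unaccented["λογος"])  # ['λόγος']
```

An unaccented input then matches every accented entry that strips to the same form, which is the one-to-many behavior described above.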

We may also want to tackle lookup by transliteration.
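Transliteration fits the same pattern: it is just one more normalized lookup key per entry. A toy sketch (the mapping table below is purely illustrative, not a real transliteration scheme):

```python
# Toy Greek-to-Latin table — illustrative only, not a real scheme
GREEK_TO_LATIN = {
    "λ": "l", "ο": "o", "γ": "g", "σ": "s", "ς": "s",
    "θ": "th", "ε": "e",
}

def transliterate(word):
    # pass unknown characters through unchanged
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in word)

# the transliteration becomes an extra lookup key for the same entry
translit_index = {transliterate("λογος"): "λογος"}
print(translit_index["logos"])  # λογος
```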

For us to have multiple lookup mechanisms but a single collation, I think this argues for separating lookup from collation. I don't think we want to show all the different ways an entry is indexed.

So, lookup depends on normalized input that matches normalized index(es). The result of a lookup is an entry which has a position in a collation.
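The whole separation can be summarized in a few lines (a structural sketch with hypothetical data, not the module format):

```python
# Many lookup keys, one fixed collation.
entries = {1: "λόγος", 2: "Λόγος"}   # entry id -> display form

# several normalized keys may resolve to the same entry ids
lookup_index = {
    "λογος": [1, 2],   # unaccented form
    "logos": [1, 2],   # transliteration
}

collation = [2, 1]     # fixed display order of entry ids

def find(normalized_input):
    ids = set(lookup_index.get(normalized_input, []))
    # results are presented in collation order, not index order
    return [entries[i] for i in collation if i in ids]

print(find("logos"))  # ['Λόγος', 'λόγος']
```

The indexes stay invisible to the user; only the entries, in their one collated order, are ever displayed.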

As to solving the unique-key problem, tei2mod could be changed to check whether there is already an entry with a given normalized key; if there is, append a non-printing character to the end and try again. Or simply change the engine to allow duplicates.
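The append-and-retry idea in a few lines (the choice of ZERO WIDTH SPACE here is my assumption for illustration, not what tei2mod would necessarily use):

```python
# Append a non-printing character (ZERO WIDTH SPACE — an assumed
# choice for this sketch) until the normalized key is unique.
ZWSP = "\u200b"

def unique_key(index, key):
    candidate = key
    while candidate in index:
        candidate += ZWSP
    return candidate

index = {}
for key in ["ΛΌΓΟΣ", "ΛΌΓΟΣ", "ΛΌΓΟΣ"]:
    index[unique_key(index, key)] = "entry text"
print(len(index))  # 3 distinct keys
```

The keys remain visually identical while the byte sequences differ, so sorting and display are unaffected; the cost is that exact-match lookup now finds only the first entry unless lookup also scans the suffixed variants.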

I implemented this many years ago in Perl, running on a computer with 128 MB of RAM. To see it, go to: http://nexis.com/sources
Some info:
Search and sorting are independent.
Each entry is indexed on several keys. Lookup can be against one or more of them.
There can be more than one entry with the same key.

The search result is ordered according to the end-user's locale as provided by their browser, if that locale is supported; otherwise it falls back to a default ordering. You will notice that the ordering takes noise words into account and orders numbers properly. You might notice other complexities too. All of it is handled by normalization followed by generating a collation key for the appropriate locale(s).

In Him,
   DM








_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
