Chris Little wrote:
Daniel Owens wrote:
The other MAJOR problem is that the dictionary keys are always capitalized, which makes it really awkward to use for Greek. Can I lobby again for a change in that? How many Greek students are used to looking up words in capitals? I was taught using lower case letters, and many of the capitals are really fuzzy. When reading I can work them out based on context sometimes because I have the rest of the word to clue me in. Capitals also make accent marks look strange. Then there is the issue of sort order again...

I quite agree. No language with letter casing uses capital forms as its primary form; capital letters are less recognizable and slow reading speed.

That said, I don't quite know how we ought to solve the issue. We can't simply lowercase the existing keys, since many would actually need to incorporate capitals (e.g. personal & place names). And we'll need to do some kind of case folding when we do key lookups.

Making keys be cased and doing case folding at runtime handles part of the issue. However, key sorting becomes more difficult and we have to guard against the possibility of keys that are identical except for casing (e.g. "a" and "A").
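To make the idea concrete, here is a minimal sketch (plain Python, not SWORD code) of case-preserving keys with case-folded lookup, including the guard against keys that are identical except for casing:

```python
# Sketch: store cased display keys, look them up via case folding.
# str.casefold() does full Unicode case folding (e.g. Greek final
# sigma), which plain lower() can miss.
def build_index(display_keys):
    index = {}
    for key in display_keys:
        folded = key.casefold()
        # guard against two keys identical except for casing
        if folded in index and index[folded] != key:
            raise ValueError(
                f"case-folded collision: {key!r} vs {index[folded]!r}")
        index[folded] = key
    return index

index = build_index(["Abraham", "logos"])
print(index["ABRAHAM".casefold()])  # Abraham — cased display key survives
```

The point is that casing is preserved for display while lookup happens against the folded form, and collisions like "a" vs "A" are detected at build time rather than silently overwriting.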

It is a hard problem, but not intractable. I think it might require a new module type.

Some thoughts: Lookup and collation are two different problems that have a single solution today. Lookup is the process of taking input and finding one or more entries. Collation is the ordering of entries for the purpose of display. These don't have to have a single solution.

Today, our modules use a strict byte ordering of the upper-case representation of each entry's term. For latin-1/cp1252, this gives a well-defined, though sometimes inappropriate, ordering. For UTF-8, the situation is more complex. A given glyph can have more than one representation in UTF-8: an accented letter may be a single code point, or a base letter followed by one or more combining accents in any order. Without normalization of the entry's term (we've settled on NFC), the ordering is not well-defined. With normalization it is, but it may still produce an inappropriate ordering.
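The two-representations problem can be seen directly with Python's stdlib unicodedata (a sketch of the normalization step, not our actual code):

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (the NFC form)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (the NFD form)

# Same glyph on screen, different byte sequences — so byte ordering
# of unnormalized terms is not well-defined.
assert composed != decomposed
# NFC collapses the decomposed form back to the single code point.
assert unicodedata.normalize("NFC", decomposed) == composed
print(len(composed), len(decomposed))  # 1 2
```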

Given this well-defined order, lookup of a word begins by converting it to upper case (and, for a UTF-8 dictionary, converting it to NFC as well); a binary search can then be performed, which lands on the nearest first match in the collated list.
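A minimal sketch of that lookup path (Python stdlib, with made-up Greek headwords for illustration):

```python
import bisect
import unicodedata

def normalize(term):
    # upper case + NFC, mirroring the module's key normalization
    return unicodedata.normalize("NFC", term.upper())

# the collated list: normalized keys in strict code-point order
keys = sorted(normalize(k) for k in ["λόγος", "θεός", "ἀγάπη"])

def lookup(word):
    target = normalize(word)
    i = bisect.bisect_left(keys, target)  # nearest first match
    return keys[i] if i < len(keys) else None

print(lookup("θ"))  # lands on the nearest entry at or after "Θ"
```

Because bisect_left returns the insertion point, a partial input like "θ" still lands on the first entry at or after it, which is exactly the "nearest first match" behavior described above.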

When each dictionary module is built, the input file does not need to be ordered. As each entry is added, it is stored against its normalized key. If a subsequent entry normalizes to the same key as a prior one, the key will no longer point to the first entry but to the subsequent one. (On a side note, the dat file will still contain the first entry; if nothing points to it, it is orphaned.)

One of the impacts of this mechanism is that there cannot be two entries with the same "key". Yet many dictionaries have multiple entries under the same headword, so I think we should have a solution that provides for this.
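One simple shape for such a solution (a sketch, not a proposal for the on-disk format): let each normalized key hold a list of (display key, entry) pairs instead of a single entry.

```python
from collections import defaultdict

# one normalized key can hold several (display key, entry) pairs,
# so a second entry no longer silently replaces the first
index = defaultdict(list)

def add_entry(display_key, entry_text):
    index[display_key.casefold()].append((display_key, entry_text))

# hypothetical entries for illustration
add_entry("Λόγος", "entry for the proper-noun sense")
add_entry("λόγος", "entry for the common noun")
print(len(index["ΛΌΓΟΣ".casefold()]))  # 2 — both entries kept
```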

I think there needs to be a notion of an internal sort key and an external display key on a per-entry basis. Lookup would need to be against the internal key, so a routine would be needed to convert/normalize input into the form of the internal key and use that for lookup.

ICU has the notion of a collation key, which can be used for such a purpose. (I think we've gotten to the point where ICU is a requirement for UTF-8 modules.) In ICU, the collation key is locale dependent. (For example, German and French sort accented letters differently. In Spanish dictionaries, at least older ones, ch was treated as a separate letter sorted after c.) I really don't see any way around having a static collation for a module. If so, the collation would need to be fixed with respect to either a fixed locale or a locale based on the language of the module.

The other aspect of lookup is that we will be producing accented dictionaries, but we want them to work with unaccented texts. For example, we have unaccented Greek texts, and Hebrew can be shown without vowel points or cantillation. The next round of Greek and Hebrew dictionaries will have accents and vowel points, so finding one or more accented entries from an unaccented input needs to work.
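Mark stripping is mechanical once terms are decomposed: drop the combining marks and recompose. A stdlib sketch of that lookup key (Greek accents and breathings, Hebrew vowel points, and cantillation marks are all Unicode category Mn):

```python
import unicodedata

def strip_marks(word):
    # decompose, drop combining marks, recompose
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", bare)

# index accented entries under their unaccented forms
entries = ["λόγος", "θεός"]
unaccented = {}
for e in entries:
    unaccented.setdefault(strip_marks(e), []).append(e)

print(unaccented["λογος"])  # ['λόγος']
```

An unaccented input then matches every accented entry that strips to the same form, which is the one-to-many behavior described above.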

We may also want to tackle lookup by transliteration.
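Transliteration fits the same pattern: it is just one more normalized lookup key per entry. A toy sketch (the mapping table below is purely illustrative, not a real transliteration scheme):

```python
# Toy Greek-to-Latin table — illustrative only, not a real scheme
GREEK_TO_LATIN = {
    "λ": "l", "ο": "o", "γ": "g", "σ": "s", "ς": "s",
    "θ": "th", "ε": "e",
}

def transliterate(word):
    # pass unknown characters through unchanged
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in word)

# the transliteration becomes an extra lookup key for the same entry
translit_index = {transliterate("λογος"): "λογος"}
print(translit_index["logos"])  # λογος
```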

For us to have multiple lookup mechanisms but a single collation, I think this argues for separating lookup from collation. I don't think we want to show all the different ways an entry is indexed.

So, lookup depends on normalized input that matches normalized index(es). The result of a lookup is an entry which has a position in a collation.
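The whole separation can be summarized in a few lines (a structural sketch with hypothetical data, not the module format):

```python
# Many lookup keys, one fixed collation.
entries = {1: "λόγος", 2: "Λόγος"}   # entry id -> display form

# several normalized keys may resolve to the same entry ids
lookup_index = {
    "λογος": [1, 2],   # unaccented form
    "logos": [1, 2],   # transliteration
}

collation = [2, 1]     # fixed display order of entry ids

def find(normalized_input):
    ids = set(lookup_index.get(normalized_input, []))
    # results are presented in collation order, not index order
    return [entries[i] for i in collation if i in ids]

print(find("logos"))  # ['Λόγος', 'λόγος']
```

The indexes stay invisible to the user; only the entries, in their one collated order, are ever displayed.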

As to solving the unique-key problem, tei2mod could be changed to check whether there is already an entry with a given normalized key; if there is, append a non-printing character to the end and try again. Or simply change the engine to allow duplicates.
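The append-and-retry idea in a few lines (the choice of ZERO WIDTH SPACE here is my assumption for illustration, not what tei2mod would necessarily use):

```python
# Append a non-printing character (ZERO WIDTH SPACE — an assumed
# choice for this sketch) until the normalized key is unique.
ZWSP = "\u200b"

def unique_key(index, key):
    candidate = key
    while candidate in index:
        candidate += ZWSP
    return candidate

index = {}
for key in ["ΛΌΓΟΣ", "ΛΌΓΟΣ", "ΛΌΓΟΣ"]:
    index[unique_key(index, key)] = "entry text"
print(len(index))  # 3 distinct keys
```

The keys remain visually identical while the byte sequences differ, so sorting and display are unaffected; the cost is that exact-match lookup now finds only the first entry unless lookup also scans the suffixed variants.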

I implemented this many years ago in Perl, running on a computer with 128 MB of RAM. To see it, go to: http://nexis.com/sources
Some info:
Search and sorting are independent.
Each entry is indexed on several keys. Lookup can be against one or more of them.
There can be more than one entry with the same key.

The search result is ordered according to the end-user's locale as provided by their browser, if that locale is supported; otherwise it falls back to a default ordering. You will notice that the ordering takes noise words into account and orders numbers properly. You might notice other complexities too. All of it is handled by normalization followed by generating a collation key for the appropriate locale(s).

In Him,
   DM








_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
