Dear Arthur, dear Mojca,

Attached you will find a zip file named AlbanianHyphenation.zip.
This is the result of my efforts with the substantial help of MoA Sabina Koliqi, originally an Albanian graduate in Albanian Literature, later an Italian professor with a degree in Education Teaching. I do not know the Albanian language, but it is Dr Koliqi's mother tongue and was part of her university studies; I know how to build hyphenation patterns. We joined our competences, and the above .zip file contains our results; in particular, the hyph-sq.tex file contains the UTF-8 encoded patterns, with a preamble modeled on the other pattern files distributed with TeX Live.
We looked for a hyphenated Albanian word list, but we could not find any. Dr Koliqi extracted a word list from a couple of chapters of an Albanian book and tried to create a hyphenated Albanian word list. Then I took up the challenge, but I was unsuccessful with the patgen program that is distributed with the TeX system; its documentation is very scarce and refers to the Omega program. As a result we abandoned the patgen solution and moved to another approach that I find very effective, even if it requires a lot of "elbow grease".
The approach is based on LuaLaTeX and its ability to load a pattern file on the fly and to hyphenate a list of words given as simple text. This is provided by the package testhyphens.sty and its checkhyphens environment. As you can see from the zipped file, the source abanian-test-lualatex-2.tex also loads the multicol.sty package in order to typeset the result in four-column mode; of course the setting for four columns can be changed to 1 (one) column, and the result may be used as a dictionary if patgen is to be used to find another (different) pattern set created without any use of elbow grease. My previous experience with other languages taught me that this elbow grease, spent by a sufficiently well educated person, produces better results than patgen. Of course this statement is not valid for certain languages, English in the first place, because patterns are based on spelling and not on pronunciation; for English in both its main incarnations, British and US, there are errors that cannot be corrected because there are homographs that are pronounced, and hence hyphenated, differently depending on whether they are nouns or verbs: for example "the record" and "I record"; "the analyses" and "he analyses".
Therefore we started with a basic list of a dozen patterns (the single-letter patterns with implied 0 values on both sides were omitted, and only the Albanian digraphs were considered). After each run of the LuaLaTeX compilation Dr Koliqi would correct the wrong hyphenation points on the printed list; I would modify the pattern list; and we would iterate until all words were correctly hyphenated. Not very professional, you might think, but very effective.
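To make the role of the digits in the patterns concrete, here is a minimal Python sketch of how TeX-style (Liang) patterns are applied to a word: every matching pattern contributes its digits to the gaps between letters, the highest digit wins in each gap, and an odd value allows a break. The four patterns are invented for illustration and are not taken from hyph-sq.tex; the function names are likewise hypothetical. Breaks at the very edges of the word are excluded, which corresponds to hyphenmin values of 1.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'l2l' into its letters 'll' and gap weights [0, 2, 0]."""
    letters, weights = "", [0]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)      # digit applies to the gap before the next letter
        else:
            letters += ch
            weights.append(0)          # implied 0 after every letter
    return letters, weights


def hyphenate(word, patterns):
    """Insert '-' at every gap whose highest pattern weight is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] = weight of the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights is None:
                continue
            for offset, weight in enumerate(weights):
                pos = start + offset
                scores[pos] = max(scores[pos], weight)
    pieces, last = [], 0
    for k in range(1, len(word)):      # gap before word[k]; word edges are skipped
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


# Invented patterns: break after 'a' and 'o', never inside the digraphs 'll' and 'sh'.
HYPOTHETICAL_PATTERNS = ["a1", "o1", "l2l", "s2h"]
print(hyphenate("shkolla", HYPOTHETICAL_PATTERNS))  # shko-lla
```

Correcting a wrong break then amounts to adding or changing one pattern and running the whole word list again, which is the loop described above.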
Albanian hyphenation is peculiar; Albanians say they have an alphabet made up of more than 30 letters. While interacting with Dr Koliqi I found out that Albanian lacks a word for "letter" in the sense implied by any computer encoding, from ASCII to UTF-8; therefore "sh", "dh", "zh", and similar digraphs are called by the same name as "a", "b", "c", and so on. Eventually we reached a mutual understanding, and we could proceed pretty rapidly.
We worked on an initial set of a little more than 2600 words; then we reduced the set to the current one contained in the LuaLaTeX source file. Unlike a patgen-generated set, the pattern set we built does not merely minimize the probability of hyphenation errors; on our word list the number of wrongly hyphenated words is zero.
Notice: the LuaTeX source file sets both the left and right hyphenmin values to 1; in practice the hyphenation language description file should set both to the value 2. I always build the hyphen sets with the value 1, because I imagine that in some rare cases of narrow column typesetting, the correct justification may be achieved with this not too professional typographical setting.
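The effect of the two hyphenmin values can be sketched in Python as a simple filter on the break points that the patterns allow. The function name and the candidate positions are assumptions made for illustration; a position k means a break between the first k letters and the rest of the word.

```python
def allowed_breaks(word, candidate_breaks, left_min, right_min):
    """Keep only breaks that leave at least left_min letters before
    and at least right_min letters after the hyphen."""
    return [k for k in candidate_breaks
            if left_min <= k <= len(word) - right_min]


# Hypothetical candidates for a seven-letter word: after 1, 4 and 6 letters.
print(allowed_breaks("shkolla", [1, 4, 6], 1, 1))  # [1, 4, 6]
print(allowed_breaks("shkolla", [1, 4, 6], 2, 2))  # [4]
```

Building the patterns with both values set to 1 therefore tests every break point, including those that a setting of 2 would later suppress.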
But the word set we worked on is limited, and it is possible that once Albanian users actually apply this pattern set to their own documents, some more patterns or a list of hyphenation exceptions might become necessary. I might be available to modify such patterns for a short while; at my age I am not going to live for ever; therefore the Albanian TeX community should take over.
All the best
Claudio

On 16/06/2020 15:22, Arthur Reutenauer wrote:
> Dear Claudio,
>
> On Mon, Jun 15, 2020 at 11:57:33PM +0200, Claudio Beccari wrote:
>> I can certainly ask the student to allow distributing her thesis, but I believe it will not be of great utility, because, as I said, the thesis is in Italian, with very few stretches in Albanian, where the needed rare hyphen points were set by hand.
>
> I think the list of hyphenated words would be very useful, so if she’s ready to publish that, it would be really great.
>
> Best,
> Arthur
<<attachment: AlbanianHyphenation.zip>>
