Re: [NTG-context] Ligature suppression word list

2021-04-12 Thread denis.maier
Hi,

A small update on this one:
I’ve built a small Python script that uses the patterns from the selnolig 
package to extract words with suspicious ligatures from the word list provided 
by the Uni Leipzig corpus project. Run over a corpus of more than 1 million 
words, the script produces the attached list of about 790 words, so the result 
is not huge. I still need to check whether these words are already in the 
goodies file or whether I need to add them.
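
The check against the goodies file could be a plain set difference. A minimal
sketch, assuming both lists are available as one word per line and that the
goodies entries may carry the | marker; the function names are made up for
illustration:

```python
def normalize(word):
    """Drop '|' break markers and surrounding whitespace."""
    return word.replace("|", "").strip()


def missing_words(candidates, goodies_words):
    """Return the candidates (sorted) that do not yet occur in the goodies list."""
    known = {normalize(w) for w in goodies_words}
    found = {normalize(w) for w in candidates}
    found.discard("")  # ignore empty lines
    return sorted(found - known)


corpus_hits = ["Auftrag", "Auflagefläche", "Hofladen"]
goodies = ["Auf|lagefläche", "Auftrag"]
print(missing_words(corpus_hits, goodies))
```

The comparison is case-sensitive on purpose, since the attached list contains
both capitalized and lowercase forms.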

Anyway, I was thinking about making such a script more generic. Think of 
something along the lines of:
pdftotext book.pdf | showIncorrectLigatures.py > incorrect-ligatures.txt
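
Such a filter could look roughly like the following. This is only a sketch:
the stem list is a tiny hand-picked stand-in for the selnolig patterns (which
are not reproduced here), and suspicious_words is a hypothetical name. It takes
lines of text and reports words in which one of the stems is directly followed
by f, i, l, or t.

```python
import re

# Hand-picked stems ending in f; a stand-in for the real selnolig rules.
STEMS = ("auf", "kauf", "brief", "schlaf", "tief", "hof", "dorf", "kopf")

# A stem followed by f, i, l or t would form an ff, fi, fl or ft ligature
# across what is likely a morpheme boundary.
SUSPECT = re.compile(r"(?:%s)(?=[filt])" % "|".join(STEMS), re.IGNORECASE)
WORD = re.compile(r"[^\W\d_]+")  # runs of letters, including umlauts


def suspicious_words(lines):
    """Return the sorted set of words containing a suspicious stem boundary."""
    hits = set()
    for line in lines:
        for word in WORD.findall(line):
            if SUSPECT.search(word):
                hits.add(word)
    return sorted(hits)


# In a pipeline one would pass sys.stdin (after "import sys") instead of a list.
sample = ["Der Auftrag liegt im Hofladen.", "Wir kaufen die neue Auflage."]
for word in suspicious_words(sample):
    print(word)
```

A crude rule like this also flags words such as "Hoffnung", where the ligature
is fine, so the output is a list of candidates for manual review rather than a
finished word list.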

Denis


From: ntg-context on behalf of rh...@t-online.de
Sent: Wednesday, 7 April 2021 20:20
To: ntg-context@ntg.nl
Subject: Re: [NTG-context] Ligature suppression word list



Message: 2
Date: Tue, 6 Apr 2021 15:03:54 +
From: <denis.ma...@ub.unibe.ch>
To: <j.ha...@xs4all.nl>, <ntg-context@ntg.nl>
Subject: Re: [NTG-context] Ligature suppression word list
Message-ID: <41e6530172b54bffb7a82febff0a6...@ub.unibe.ch>
Content-Type: text/plain; charset="iso-8859-1"


-----Original Message-----
From: Hans Hagen <j.ha...@xs4all.nl>
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Maier, Denis
Christian (UB) <denis.ma...@ub.unibe.ch>
Subject: Re: [NTG-context] Ligature suppression word list

[…]




2. A bigger solution might be to use selnolig's patterns in a script
   that can be run over a large corpus, such as the DWDS (Digitales
   Wörterbuch der deutschen Sprache). That should give us a more
   complete list of words where ligatures must be suppressed.

where is that DWDS ... i can write some code to deal with it (i'd rather start
from the source than from some interpretation; who knows what more there
is to uncover)

As it turns out, the linguists who helped with the selnolig package used 
another corpus: Stuttgart "Deutsch" Web as Corpus.
They describe their approach in this paper: 
https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf

A lot of corpora can be found here: https://wortschatz.uni-leipzig.de/de
especially here: https://wortschatz.uni-leipzig.de/de/download/German

There are corpora of many other languages, too, such as English, French, Dutch, 
Spanish, Russian, Japanese, Latin, …

HTH

Ralf

Auftrag
Auftritt
Auftakt
Auflage
Auflösung
Aufträge
fünften
Auftraggeber
Auflagen
Auffassung
fünfte
Aufführung
Auftragseingang
auftreten
Auftritte
Fünftel
Straftaten
Auftragsbestand
Aufforderung
Auftreten
Auflistung
Auftrieb
Aufführungen
Straftat
auffällig
Auftragsvolumen
Auftragsbücher
Auffällig
Auftragseingänge
Schieflage
Auftragslage
auflösen
Auftragsbestätigung
Aufträgen
Cheftrainer
Fünfte
Auffinden
Kopftuch
auffallend
auffällige
elften
Auffassungen
Auflaufform
Auftraggebers
Auftritten
auffangen
offline
Auffahrt
Auftragswert
Fünfter
Auftaktveranstaltung
Auftragserteilung
Straftatbestand
auffüllen
Auflauf
Auftragnehmer
Dorffest
Kaufinteresse
Straftäter
auftauchen
elfte
Auffallend
fünften Mal
fünfter
Auffälligkeiten
Auftragsvergabe
Kaufinteressenten
auffindbar
auftretenden
Auffahrunfall
Prüfling
aufladen
auflaufen
Auftrages
Auftrags
Dampflokomotiven
auffälliges
fünftägigen
Auflösungen
Auftragen
Hofladen
Schlaflosigkeit
auftretende
Aufladen
Aufladung
Auflockerung
Auftaktspiel
Auftragsbüchern
Auftritts
Briefträger
Prüflinge
Schlaflabor
Schlaftabletten
kampflos
Auftaktquartal
Auftauchen
Auftragsabwicklung
Auftragseingängen
Dampflok
Hoffest
Tiefflug
auffallen
auffrischen
auffälliger
auffälligsten
schlaflose
Auffrischung
Auffällige
Auflehnung
Auflieger
Auftraggebern
Auftragssumme
Brieftasche
Elfter
Straftätern
auffälligste
fünftes
schlaflose Nächte
tarifliche
Auffahrunfälle
Auflagenhöhe
Auflassung
Auftaktsieg
Auftragsbearbeitung
Auftragseingangs
Auftragsmord
Auftragsrückgang
Auftretens
Kopfteil
Kopftuchverbot
Scheffler
Tariflohn
Tieflader
auffälligen
auftragsbezogen
Auffanglager
Aufführungspraxis
Auflockerungen
Auflösungsvertrag
Auftanken
Auftragnehmers
Auftragsbestände
Auftragsbuch
Auftragseinbruch
Auftragsplus
Auftrittsmöglichkeiten
Brieffreund
Golffahrer
Kopftücher
Offline
Tiefladern
auffallende
auffordern
aufführen
auflegen
auftauen
auftragen
auftreiben
fünftägige
schlaflosen
Auffahrten
Auffahrunfällen
Auffanggesellschaft
Aufforderungscharakter
Aufforstung
Auffälligstes
Auffüllen
Auflagefläche
Auflaufen
Auflösungserscheinungen
Auftauen
Auftragsannahme
Auftragsarbeiten
Auftragskomposition
Auftragsminus
Auftragsmorde
Auftrittsort
Brieffreundin
Brieffreundschaft
Brieftaschen
Brieftauben
Cheftrainer Andreas Hirsch
Fünften
Fünftes
Kampffeld
Kaufinteressent
Kopfleiste
Surftipps
Surftips
Tiefland
aufleben
auflockern
auftanken
auftrumpfen
stoffliche
tiefliegenden
A

Re: [NTG-context] Ligature suppression word list

2021-04-08 Thread Hans Hagen

On 4/8/2021 9:37 PM, Arthur Rosendahl wrote:


   Dutch, by contrast, does not seem so well served: the OpenTaal group
is dormant and no longer offers the hyphenated word list that was once
available (that was already the case five years ago).  The most relevant
page I find: https://www.opentaal.org/projecten/woordafbreking is from
2009.  There have apparently been recent updates by a single person (who
incidentally sometimes contributes to the German hyphenation working
group), but they’re rather generic.
fwiw: They are active in collecting words (they also do stuff for 
OpenOffice). Dutch patterns don't change much because the hyphenation is 
syllable-based and predictable enough, I think. There haven't been that 
many releases of Dutch patterns.


Hans

-
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
   tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-
___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki : http://contextgarden.net
___


Re: [NTG-context] Ligature suppression word list

2021-04-08 Thread Arthur Rosendahl
On Sat, Apr 03, 2021 at 06:02:10PM +0200, Hans Hagen wrote:
> german is just an example, dutch has some specific things, and i bet other
> languages have their demands so my aim is some general mechanism

  I appreciate that, but if you want to have data of sufficiently good
quality to use this mechanism for individual languages, you need to
invest a *lot* of time for each one of them.  German is one of the very
few languages I know of that has an active group of people working to
produce that data, the “Trennmuster people”, as Mojca calls them ;-)
Their word list supports many fine points of typography, even those that
few programs can use, for example weighted hyphenation.  Ligature
prevention came in as a side project.

  Dutch, by contrast, does not seem so well served: the OpenTaal group
is dormant and no longer offers the hyphenated word list that was once
available (that was already the case five years ago).  The most relevant
page I find: https://www.opentaal.org/projecten/woordafbreking is from
2009.  There have apparently been recent updates by a single person (who
incidentally sometimes contributes to the German hyphenation working
group), but they’re rather generic.

Best,

Arthur


Re: [NTG-context] Ligature suppression word list

2021-04-08 Thread denis.maier

From: ntg-context on behalf of rh...@t-online.de
Sent: Wednesday, 7 April 2021 20:20
To: ntg-context@ntg.nl
Subject: Re: [NTG-context] Ligature suppression word list

A lot of corpora can be found here: https://wortschatz.uni-leipzig.de/de
especially here: https://wortschatz.uni-leipzig.de/de/download/German

There are corpora of many other languages, too, such as English, French, Dutch, 
Spanish, Russian, Japanese, Latin, …

HTH

Ralf


Wow, exactly what I was looking for. Thanks!

Denis


Re: [NTG-context] Ligature suppression word list

2021-04-07 Thread rha17

> Message: 2
> Date: Tue, 6 Apr 2021 15:03:54 +
> From: <denis.ma...@ub.unibe.ch>
> To: <j.ha...@xs4all.nl>, <ntg-context@ntg.nl>
> Subject: Re: [NTG-context] Ligature suppression word list
> Message-ID: <41e6530172b54bffb7a82febff0a6...@ub.unibe.ch>
> Content-Type: text/plain; charset="iso-8859-1"
> 
>> -----Original Message-----
>> From: Hans Hagen <j.ha...@xs4all.nl>
>> Sent: Saturday, 3 April 2021 17:58
>> To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Maier, Denis
>> Christian (UB) <denis.ma...@ub.unibe.ch>
>> Subject: Re: [NTG-context] Ligature suppression word list

[…]

>> 
>>> 2. A bigger solution might be to use selnolig's patterns in a script
>>>    that can be run over a large corpus, such as the DWDS (Digitales
>>>    Wörterbuch der deutschen Sprache). That should give us a more
>>>    complete list of words where ligatures must be suppressed.
>> 
>> where is that DWDS ... i can write some code to deal with it (i'd rather 
>> start from the source than from some interpretation; who knows what more
>> there is to uncover)
> 
> As it turns out, the linguists who helped with the selnolig package used 
> another corpus: Stuttgart "Deutsch" Web as Corpus.
> They describe their approach in this paper: 
> https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf

A lot of corpora can be found here: https://wortschatz.uni-leipzig.de/de
especially here: https://wortschatz.uni-leipzig.de/de/download/German

There are corpora of many other languages, too, such as English, French, Dutch, 
Spanish, Russian, Japanese, Latin, …

HTH

Ralf



Re: [NTG-context] Ligature suppression word list

2021-04-06 Thread denis.maier
> -----Original Message-----
> From: Hans Hagen
> Sent: Saturday, 3 April 2021 17:58
> To: mailing list for ConTeXt users; Maier, Denis
> Christian (UB)
> Subject: Re: [NTG-context] Ligature suppression word list
> 
> On 4/3/2021 5:06 PM, denis.ma...@ub.unibe.ch wrote:
> > […]
> >  2. A bigger solution might be to use selnolig's patterns in a script
> > that can be run over a large corpus, such as the DWDS (Digitales
> > Wörterbuch der deutschen Sprache). That should give us a more
> > complete list of words where ligatures must be suppressed.
> 
> where is that DWDS ... i can write some code to deal with it (i'd rather start
> from the source than from some interpretation; who knows what more there
> is to uncover)

As it turns out, the linguists who helped with the selnolig package used 
another corpus: Stuttgart "Deutsch" Web as Corpus.
They describe their approach in this paper: 
https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf

Denis



Re: [NTG-context] Ligature suppression word list

2021-04-06 Thread denis.maier


> -----Original Message-----
> From: Hans Hagen
> Sent: Saturday, 3 April 2021 17:58
> To: mailing list for ConTeXt users; Maier, Denis
> Christian (UB)
> Subject: Re: [NTG-context] Ligature suppression word list
> 
> On 4/3/2021 5:06 PM, denis.ma...@ub.unibe.ch wrote:
> > […]
> >  2. A bigger solution might be to use selnolig's patterns in a script
> > that can be run over a large corpus, such as the DWDS (Digitales
> > Wörterbuch der deutschen Sprache). That should give us a more
> > complete list of words where ligatures must be suppressed.
> 
> where is that DWDS ... i can write some code to deal with it (i'd rather start
> from the source than from some interpretation; who knows what more there
> is to uncover)

The DWDS is here: https://www.dwds.de/
But I still need to check how we can extract the words from there...

Denis


Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Thangalin
Untested. Lists are not subject to copyright, so public domain should be
legal, even though SE posts are CC-BY-SA. When a word has a single suffix
or prefix (e.g., safflower/s), the two words are listed together, rather
than using an explicit suffix/prefix section.

return {
name   = "english",
version= "1.00",
comment= "English ligature suppression",
author = "Mico Loretan, Dave Jarvis, & Hans Hagen",
copyright  = "Public domain",
options= {
{
actions = {
["|"] = "noligature"
},
words = [[
]],
},
{
patterns = {
fi  = "f|i",
fl  = "f|l",
},
words = [[
-- f|i
deafish
dwarfish
elfish
oafish
selfish
serfish
unselfish
wolfish

-- f|l
beefless
briefless
hoofless
leafless
roofless
selfless
turfless
]],
suffixes = [[
ness
ly
]],
},
{
patterns = {
fi  = "f|i",
},
words = [[
proofing
]],
prefixes = [[
air-
child-
fire-
flame-
moth-
rust-
sound-
water-
weather-
]],
},
{
patterns = {
ff  = "f|f",
fi  = "f|i",
fl  = "f|l",
ffi = "f|fi",
ffl = "f|fl",
},
words = [[
-- f|f
bookshelfful
mantelshelfful
shelfful

-- f|i
elfin

chafing
leafing
loafing
sheafing
strafing
vouchsafing
beefing
reefing
briefing
debriefing
coifing
fifing
jackknifing
knifing
midwifing
waifing
wifing

goofing
hoofing
roofing
reroofing
spoofing
whoofing
woofing

gulfing
begulfing
engulfing
ingulfing
golfing
rolfing
selfing
wolfing
barfing
bedwarfing
dwarfing
enserfing
kerfing
scarfing
snarfing
surfing
windsurfing
turfing
wharfing

beefier
comfier
goofier
gulfier
leafier
surfier
turfier
beefiest
comfiest
goofiest
gulfiest
leafiest
surfiest
turfiest

beefily
goofily
goofiness

-- f|l
aloofly
briefly
chiefly
deafly
liefly

calflike
dwarflike
elflike
gulflike
hooflike
leaflike
rooflike
serflike
sheaflike
shelflike
surflike
turflike
waiflike
wolflike

halflife
shelflife
halfline
roofline

leaflet
leaflets
leafleted
leafleting
leafletting
leafletted
leafleteer

pdflatex

-- f|fi
chaffinch
wolffish

-- f|fl
safflower
safflowers
]],
},
{
patterns = {
ffi = "ff|i",
},
words = [[
-- ff|i
cuffing
]],
prefixes = [[
hand
un
]],
},
{
patterns = {
ffi = "ff|i",
},
words = [[
-- ff|i
feoffing
]],
  

Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Hans Hagen

On 4/3/2021 6:30 PM, Thangalin wrote:

A starting list of English non-ligatures:

https://english.stackexchange.com/a/50957/22099 



The entire SE thread has additional resources and is quite informative.

So can you make a file from that like we made as starting point for German?

Hans



Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Hans Hagen

On 4/3/2021 5:06 PM, denis.ma...@ub.unibe.ch wrote:





For those interested, that file only has ligature prevention definitions.

{
   actions = {
   ["|"] = "noligature"
   },
   words = [[
   Auf|lagefläche
   Auf|lageflächen
   Auf|lagenziffer
   Auf|lagenziffern
   ]],
},

can be (lig prevention already in words):

{
   words = [[
   Auf|lagefläche
   Auf|lageflächen
   Auf|lagenziffer
   Auf|lagenziffern
   ]],
},

or the more efficient (first match only):

{
   actions = {
   ["|"] = "noligature"
   },
   matches = { 1 },
   words = [[
   Auflagefläche
   Auflageflächen
   Auflagenziffer
   Auflagenziffern
   ]],
},

or if you want all matches:

{
   actions = {
   ["|"] = "noligature"
   },
   words = [[
   Auflagefläche
   Auflageflächen
   Auflagenziffer
   Auflagenziffern
   ]],
},

or when you want no kerns either (of course one can also use the patterns 
key):

{
   actions = {
   ["|"] = "noligature nokern"
   },
   words = [[
   ef|fe
   ]],
},
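
For longer lists it may be convenient to generate such a block from a plain
word list. A rough sketch in Python; the layout simply mirrors the blocks
shown above, render_block is a made-up helper, and the output has not been
checked against ConTeXt:

```python
def render_block(words, action="noligature", indent="   "):
    """Format words as one options block in the shape used in this thread."""
    lines = [
        "{",
        f"{indent}actions = {{",
        f'{indent}["|"] = "{action}"',
        f"{indent}}},",
        f"{indent}words = [[",
    ]
    lines.extend(f"{indent}{word}" for word in words)
    lines.extend([f"{indent}]],", "},"])
    return "\n".join(lines)


print(render_block(["Auf|lagefläche", "Auf|lagenziffer"]))
```

Passing action="noligature nokern" would produce the variant without kerns.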

btw, users will also be able to do this in a document source

\startlanguageoptions[de]
Zapf|innovation
whatever+innovation
\stoplanguageoptions

this gives ligature prevention in the first word and a compound word in the 
second.

so, one way to see what we need is for users to analyze their 
'exceptions' (if they have any defined at all), so that we can spot 
possible tricks that are needed,


(i might actually combine this with exceptions that normally come after 
this stage)


Hans




Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Thangalin
A starting list of English non-ligatures:

https://english.stackexchange.com/a/50957/22099

The entire SE thread has additional resources and is quite informative.


Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Hans Hagen

On 4/3/2021 5:06 PM, denis.ma...@ub.unibe.ch wrote:


 1. The new language options features include a tracker that allows for
tracking for which words in a given document ligature prevention
happened, and which words haven’t been touched by the mechanism. It
should be possible to analyze the log file and to create lists of
words with ligatures. Should be a rather simple step to derive new
words for the ligature-suppression wordlist.
I already have some code for that but can't make you an update (garden 
is / will be down for some days due to maintenance).


Hans



Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Hans Hagen

On 4/3/2021 5:20 PM, Arthur Rosendahl wrote:

On Sat, Apr 03, 2021 at 03:06:22PM +, denis.ma...@ub.unibe.ch wrote:

What do you think?


   I think you should collaborate with the group of volunteers working on
German hyphenation and related topics.  They have a mailing list (in
German): https://lists.dante.de/mailman/listinfo/trennmuster which is
quite active and where Mico Loretan, the author of selnolig,
occasionally posts.  I’m sure they’ll be happy to help with suggestions
and collaborative efforts, even if all of the main contributors use
LaTeX.


german is just an example, dutch has some specific things, and i bet 
other languages have their demands, so my aim is some general mechanism 
(for which much is already in place btw) ... we're talking about what i 
tag as 'language goodies', just like we have 'font goodies' .. a plug-in 
system


Hans



Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Hans Hagen

On 4/3/2021 5:06 PM, denis.ma...@ub.unibe.ch wrote:

[…]
 2. A bigger solution might be to use selnolig's patterns in a script
    that can be run over a large corpus, such as the DWDS (Digitales
    Wörterbuch der deutschen Sprache). That should give us a more
    complete list of words where ligatures must be suppressed.


where is that DWDS ... i can write some code to deal with it (i'd rather 
start from the source than from some interpretation; who knows what more 
there is to uncover)


additional info: we're talking about a mechanism that is more or less 
integrated in the hyphenation loop, where we can also handle compound words 
(if needed with details about how to influence their hyphenation), so the 
above question involves:


- exceptions to exceptions
- replacements before hyphenation
- compound words (including lhmin/rhmin overloads)
- (left right two sided) ligature and/or kern prevention

and whatever we like/need more (within reasonable bounds),

Hans



Re: [NTG-context] Ligature suppression word list

2021-04-03 Thread Arthur Rosendahl
On Sat, Apr 03, 2021 at 03:06:22PM +, denis.ma...@ub.unibe.ch wrote:
> What do you think?

  I think you should collaborate with the group of volunteers working on
German hyphenation and related topics.  They have a mailing list (in
German): https://lists.dante.de/mailman/listinfo/trennmuster which is
quite active and where Mico Loretan, the author of selnolig,
occasionally posts.  I’m sure they’ll be happy to help with suggestions
and collaborative efforts, even if all of the main contributors use
LaTeX.

Arthur


[NTG-context] Ligature suppression word list

2021-04-03 Thread denis.maier
Hi everyone

Now that Hans has implemented the new ligature suppression mechanism via 
language goodies - thanks again Hans! - we now need to come up with wordlists.

I've started working on a list of German words with ligatures that should be 
suppressed. The list is derived from the word list that comes with the lualatex 
selnolig package: 
https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex

You can find the current list here: 
https://github.com/denismaier/context-nolig-wordlist

The list is currently organized as follows:


  1.  L. 25-35: This block specifies words where automatic pattern matching is 
more difficult than usual because the words contain multiple ligatures, some of 
which must be suppressed while others must be preserved. In the case of 
« Auflagefläche » it's even the same combination of letters. So here, we use the 
bar | to manually indicate points where no ligature must occur.
  2.  L. 36ff.: The vast majority of words are currently in this block, which 
specifies words where an ff, fl, fi, ffi, or ffl ligature has to be broken up 
after the first f.
  3.  L. 1804ff.: This block contains words where ffi, ffl, or fff ligatures 
have to be prevented after the second f, so the first two fs form a ligature.
  4.  The remaining blocks, starting at l. 1900, l. 2073, l. 2157, l. 2225, and 
l. 2277, suppress ligatures for « ft » and « fft », « fb » and « ffb », « fh » 
and « ffh », « fj » and « ffj », and « fk » and « ffk ».
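
To make the difference between blocks 2 and 3 concrete, here is a small Python
sketch that places the bar inside a letter cluster, either after the first or
after the second f. mark_break is a hypothetical helper, not part of ConTeXt or
selnolig; it also shows why « Auflagefläche » needs care, since only the first
fl may be broken.

```python
def mark_break(word, cluster, after, first_only=True):
    """Insert '|' into `cluster` after its first `after` letters.

    With first_only=True only the first occurrence in `word` is marked.
    """
    marked = cluster[:after] + "|" + cluster[after:]
    return word.replace(cluster, marked, 1 if first_only else -1)


print(mark_break("Auflage", "fl", 1))        # break after the first f
print(mark_break("stoffliche", "ffl", 2))    # break after the second f
print(mark_break("Auflagefläche", "fl", 1))  # only the first fl is broken
```

Marking every occurrence (first_only=False) would also split the fl in
"fläche", which is exactly the case the manual bar in block 1 avoids.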

Obviously, that list is far from complete, and the question is whether it ever 
can be. Please have a look and feel free to propose more words to be included, 
either via mail or directly on GitHub.

More generally, there's the question of how such a list should be extended. I 
was thinking about two options:

  1.  The new language options feature includes a tracker that reports for 
which words in a given document ligature prevention happened, and which words 
haven't been touched by the mechanism. It should be possible to analyze the log 
file and to create lists of words with ligatures. Deriving new words for the 
ligature-suppression wordlist from those should be a rather simple step.
  2.  A bigger solution might be to use selnolig's patterns in a script that 
can be run over a large corpus, such as the DWDS (Digitales Wörterbuch der 
deutschen Sprache). That should give us a more complete list of words where 
ligatures must be suppressed.

What do you think?

Best,
Denis