Re: [NTG-context] Transliteration

2022-02-03 Thread Hans Hagen via ntg-context

On 2/3/2022 10:01 PM, Mojca Miklavec wrote:

On Thu, 3 Feb 2022 at 21:41, Hans Hagen wrote:



I have also merged the Serbian hyphenation patterns, so there is no need
to switch the language in order to have hyphenation in transliterated text.
That was possible because cyrillic and latin scripts use different code
points, and there are no conflicts in patterns.
So I suggest merging the patterns for Serbian cyrillic and latin.


I'd like to hear Arthur / Mojca on that ... we can of course load them
both, but if that is an upstream merge I'll wait for that


Yes, loading both patterns at once is definitely the correct approach.
That's what the rest of the TeX world already does (at least LuaTeX
and XeTeX; not pdfTeX, of course), see
 
https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/loadhyph/loadhyph-sr-latn.tex

We have two sets of Cyrillic patterns (and several Latin ones as
well), so composing a single file was a bit of a (somewhat political)
challenge.
Now at least in theory the users are free to choose which of the two
sets of patterns they want.

I never checked what ConTeXt was doing with the Serbian patterns.
Personally I would suggest taking hyph-sh-cyrl.pat.txt and hyph-sh-latn.pat.txt.

we currently do this:

{ "sr", "hyph-sr", "serbian", false, { "hyph-sr-cyrl", "hyph-sr-latn" }, },

so you suggest replacing that with the "sh" variants?

Hans

-
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
   tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-
___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki : http://contextgarden.net
___


Re: [NTG-context] Transliteration

2022-02-03 Thread Mojca Miklavec via ntg-context
On Thu, 3 Feb 2022 at 21:41, Hans Hagen wrote:
>
> > I have also merged the Serbian hyphenation patterns, so there is no need
> > to switch the language in order to have hyphenation in transliterated text.
> > That was possible because cyrillic and latin scripts use different code
> > points, and there are no conflicts in patterns.
> > So I suggest merging the patterns for Serbian cyrillic and latin.
>
> I'd like to hear Arthur / Mojca on that ... we can of course load them
> both, but if that is an upstream merge I'll wait for that

Yes, loading both patterns at once is definitely the correct approach.
That's what the rest of the TeX world already does (at least LuaTeX
and XeTeX; not pdfTeX, of course), see

https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/loadhyph/loadhyph-sr-latn.tex

We have two sets of Cyrillic patterns (and several Latin ones as
well), so composing a single file was a bit of a (somewhat political)
challenge.
Now at least in theory the users are free to choose which of the two
sets of patterns they want.

I never checked what ConTeXt was doing with the Serbian patterns.
Personally I would suggest taking hyph-sh-cyrl.pat.txt and hyph-sh-latn.pat.txt.

Mojca


Re: [NTG-context] Transliteration

2022-02-03 Thread Hans Hagen via ntg-context

On 2/3/2022 8:15 PM, Ivan Pešić via ntg-context wrote:

Hello!
I've been working on a Serbian book and I had to transliterate it from 
Cyrillic to Latin.
There's been some nice improvement in transliteration, and I would like 
to propose a small change.
One of the peculiarities that the current transliteration mechanisms (both 
the internal one and the third-party module from Philipp Gesang)
don't handle is that Љ, Њ and Џ are transliterated to Lj, Nj and Dž in 
normal words that start a sentence, or in names that normally start 
with a capital letter,
but in titles written in all capitals they should be transliterated to 
LJ, NJ and DŽ.
So, the quick solution was to update the current mapping vector and add 
another one (attached) that maps Cyrillic capitals to LJ, NJ and DŽ,
and to set the correct 30 letters used in the Serbian language.
It requires a bit more manual work to set the correct mapping for 
all-capitals text, but it works.
I have also merged the Serbian hyphenation patterns, so there is no need 
to switch the language in order to have hyphenation in transliterated text.
That was possible because the Cyrillic and Latin scripts use different code 
points, and there are no conflicts in the patterns.

So I suggest merging the patterns for Serbian Cyrillic and Latin.


I'd like to hear Arthur / Mojca on that ... we can of course load them 
both, but if that is an upstream merge I'll wait for that


you can actually map multiple to multiple in the transliteration tables

["foo"] = "oof"

and such, and in the next version there is also an exception mechanism 
that permits cloning a transliteration and adding exceptions


There is another issue if one wants to use a drop cap and have the rest of 
that first word, and several following words, typeset in small 
caps.
If that first letter is Љ (or one of the other two letters that transliterate 
as digraphs), then the second letter of the digraph is not typeset in small 
caps, because it gets injected before the group that turns on small caps.
For example:

\placeinitial
Љ{\sc уди нису знали}

but this is quite a special case...
you can use \settransliteration{name} locally so as part of a style 
specification (there is also \resettransliteration)


the next upload has some more that Sreeram is currently documenting on 
the wiki


Hans

-
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
   tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-


[NTG-context] Transliteration

2022-02-03 Thread Ivan Pešić via ntg-context

Hello!
I've been working on a Serbian book and I had to transliterate it from 
Cyrillic to Latin.
There's been some nice improvement in transliteration, and I would like 
to propose a small change.
One of the peculiarities that the current transliteration mechanisms (both 
the internal one and the third-party module from Philipp Gesang)
don't handle is that Љ, Њ and Џ are transliterated to Lj, Nj and Dž in 
normal words that start a sentence, or in names that normally start 
with a capital letter,
but in titles written in all capitals they should be transliterated to 
LJ, NJ and DŽ.
So, the quick solution was to update the current mapping vector and add 
another one (attached) that maps Cyrillic capitals to LJ, NJ and DŽ,
and to set the correct 30 letters used in the Serbian language.
It requires a bit more manual work to set the correct mapping for 
all-capitals text, but it works.
I have also merged the Serbian hyphenation patterns, so there is no need 
to switch the language in order to have hyphenation in transliterated text.
That was possible because the Cyrillic and Latin scripts use different code 
points, and there are no conflicts in the patterns.

So I suggest merging the patterns for Serbian Cyrillic and Latin.

There is another issue if one wants to use a drop cap and have the rest of 
that first word, and several following words, typeset in small 
caps.
If that first letter is Љ (or one of the other two letters that transliterate 
as digraphs), then the second letter of the digraph is not typeset in small 
caps, because it gets injected before the group that turns on small caps.
For example:

   \placeinitial
   Љ{\sc уди нису знали}

but this is quite a special case...

Regards,
Ivan
return {
  transliterations = {
    -- "c2l": sentence and title case, digraphs become Lj, Nj, Dž
    ["c2l"] = {
      mapping = {
        ["А"] = "A",  ["а"] = "a",
        ["Б"] = "B",  ["б"] = "b",
        ["В"] = "V",  ["в"] = "v",
        ["Г"] = "G",  ["г"] = "g",
        ["Д"] = "D",  ["д"] = "d",
        ["Ђ"] = "Đ",  ["ђ"] = "đ",
        ["Е"] = "E",  ["е"] = "e",
        ["Ж"] = "Ž",  ["ж"] = "ž",
        ["З"] = "Z",  ["з"] = "z",
        ["И"] = "I",  ["и"] = "i",
        ["Ј"] = "J",  ["ј"] = "j",
        ["К"] = "K",  ["к"] = "k",
        ["Л"] = "L",  ["л"] = "l",
        ["Љ"] = "Lj", ["љ"] = "lj",
        ["М"] = "M",  ["м"] = "m",
        ["Н"] = "N",  ["н"] = "n",
        ["Њ"] = "Nj", ["њ"] = "nj",
        ["О"] = "O",  ["о"] = "o",
        ["П"] = "P",  ["п"] = "p",
        ["Р"] = "R",  ["р"] = "r",
        ["С"] = "S",  ["с"] = "s",
        ["Т"] = "T",  ["т"] = "t",
        ["Ћ"] = "Ć",  ["ћ"] = "ć",
        ["У"] = "U",  ["у"] = "u",
        ["Ф"] = "F",  ["ф"] = "f",
        ["Х"] = "H",  ["х"] = "h",
        ["Ц"] = "C",  ["ц"] = "c",
        ["Ч"] = "Č",  ["ч"] = "č",
        ["Џ"] = "Dž", ["џ"] = "dž",
        ["Ш"] = "Š",  ["ш"] = "š",
      }
    },
    -- "C2L": all-capitals text, digraphs become LJ, NJ, DŽ
    ["C2L"] = {
      mapping = {
        ["А"] = "A",  ["а"] = "a",
        ["Б"] = "B",  ["б"] = "b",
        ["В"] = "V",  ["в"] = "v",
        ["Г"] = "G",  ["г"] = "g",
        ["Д"] = "D",  ["д"] = "d",
        ["Ђ"] = "Đ",  ["ђ"] = "đ",
        ["Е"] = "E",  ["е"] = "e",
        ["Ж"] = "Ž",  ["ж"] = "ž",
        ["З"] = "Z",  ["з"] = "z",
        ["И"] = "I",  ["и"] = "i",
        ["Ј"] = "J",  ["ј"] = "j",
        ["К"] = "K",  ["к"] = "k",
        ["Л"] = "L",  ["л"] = "l",
        ["Љ"] = "LJ", ["љ"] = "lj",
        ["М"] = "M",  ["м"] = "m",
        ["Н"] = "N",  ["н"] = "n",
        ["Њ"] = "NJ", ["њ"] = "nj",
        ["О"] = "O",  ["о"] = "o",
        ["П"] = "P",  ["п"] = "p",
        ["Р"] = "R",  ["р"] = "r",
        ["С"] = "S",  ["с"] = "s",
        ["Т"] = "T",  ["т"] = "t",
        ["Ћ"] = "Ć",  ["ћ"] = "ć",
        ["У"] = "U",  ["у"] = "u",
        ["Ф"] = "F",  ["ф"] = "f",
        ["Х"] = "H",  ["х"] = "h",
        ["Ц"] = "C",  ["ц"] = "c",
        ["Ч"] = "Č",  ["ч"] = "č",
        ["Џ"] = "DŽ", ["џ"] = "dž",
        ["Ш"] = "Š",  ["ш"] = "š",
      }
    }
  }
}
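
The case rule behind the two vectors (Lj, Nj, Dž in ordinary words, but LJ, NJ, DŽ in all-capitals text) could in principle be decided per letter instead of by picking a vector by hand. What follows is only a minimal sketch in Python, not ConTeXt code and not what the attached vectors do. The name transliterate_digraphs is made up for illustration, only the three digraph capitals are handled, and the heuristic (look at the neighbouring letter in the same word) is an assumption.

```python
# Capital digraph letters: (title-case form, all-capitals form).
DIGRAPHS = {
    "Љ": ("Lj", "LJ"),
    "Њ": ("Nj", "NJ"),
    "Џ": ("Dž", "DŽ"),
}


def transliterate_digraphs(text):
    """Replace Љ, Њ and Џ, choosing the form from the neighbouring letter.

    All other characters are passed through unchanged; a full mapping
    would handle them separately.
    """
    out = []
    for i, ch in enumerate(text):
        if ch not in DIGRAPHS:
            out.append(ch)
            continue
        title, upper = DIGRAPHS[ch]
        following = text[i + 1] if i + 1 < len(text) else ""
        preceding = text[i - 1] if i > 0 else ""
        if following.isalpha():
            # Word continues: an uppercase neighbour means all-capitals text.
            use_upper = following.isupper()
        elif preceding.isalpha():
            # Word-final digraph: fall back to the preceding letter.
            use_upper = preceding.isupper()
        else:
            # Isolated letter (for example a drop cap): keep the title form.
            use_upper = False
        out.append(upper if use_upper else title)
    return "".join(out)
```

A single-letter word stays ambiguous under this heuristic, so some manual override would still be needed for titles.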


Re: [NTG-context] transliteration russian

2010-10-31 Thread Jano Kula

Hi!

On 10/30/2010 11:34 AM, Khaled Hosny wrote:

On Sat, Oct 30, 2010 at 10:17:11AM +0200, Hans Hagen wrote:

On 30-10-2010 12:05, Khaled Hosny wrote:

On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:

By far the easiest and most portable solution would be if you could
convince Taco to implement something like latin a is equivalent to
cyrillic a as far as hyphenation is concerned (which could also solve
many other problems that we have). Actually, you can already do that
by redefining \lccode of latin a to point to cyrillic a (and do that
for the whole alphabet), but then you need to make sure that you don't
use any commands for lowercasing/uppercasing words. If you need
details, I can help you out, but first exact transliteration rules are
needed.


I was thinking: since using \lccode for hyphenation is really a weird
choice (I'm sure Don had a good reason back then, but such things are
usually no longer relevant), and since it is used in a sort of
controlled environment (playing with \lccodes for hyphenation is not
everyone's toy), maybe LuaTeX can break backward compatibility in
the hyphenation area and have a dedicated new code, \hycode or
something, only for hyphenation purposes (backward compatibility
could perhaps be kept by using it in addition to \lccode).

What do you think?


just any letter (catcode letter) would do and the rest is to be
controlled by the patterns


The issue here is that we want to make some characters equivalent to each
other, e.g. ' and ’ which are needed for some languages, without the
need to duplicate the patterns.


Before jumping too deep into the subject, consider whether it is really 
worth the effort. There is not much more than titles written in 
transliterated text. No continuous reading.


My experience says that, whatever the language of the original title, the 
reader usually expects hyphenation similar to that of the main text. 
Whenever I've used English patterns in English titles (even citations), 
they were changed by the Czech proofreader -- though they were 
perfectly correct in English -- to resemble Czech patterns. I'm not 
saying it is the right approach, but from the readers' and proofreaders' 
point of view, if one reads in Czech and doesn't know English patterns or 
even English, patterns different from Czech are disturbing.


Jano

___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki : http://contextgarden.net
___


Re: [NTG-context] transliteration russian

2010-10-31 Thread Khaled Hosny
On Sun, Oct 31, 2010 at 07:12:20PM +0100, Jano Kula wrote:
 Hi!
 
 On 10/30/2010 11:34 AM, Khaled Hosny wrote:
 On Sat, Oct 30, 2010 at 10:17:11AM +0200, Hans Hagen wrote:
 On 30-10-2010 12:05, Khaled Hosny wrote:
 On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:
 By far the easiest and most portable solution would be if you could
 convince Taco to implement something like latin a is equivalent to
 cyrillic a as far as hyphenation is concerned (which could also solve
 many other problems that we have). Actually, you can already do that
 by redefining \lccode of latin a to point to cyrillic a (and do that
 for the whole alphabet), but then you need to make sure that you don't
 use any commands for lowercasing/uppercasing words. If you need
 details, I can help you out, but first exact transliteration rules are
 needed.
 
 I was thinking: since using \lccode for hyphenation is really a weird
 choice (I'm sure Don had a good reason back then, but such things are
 usually no longer relevant), and since it is used in a sort of
 controlled environment (playing with \lccodes for hyphenation is not
 everyone's toy), maybe LuaTeX can break backward compatibility in
 the hyphenation area and have a dedicated new code, \hycode or
 something, only for hyphenation purposes (backward compatibility
 could perhaps be kept by using it in addition to \lccode).
 
 What do you think?
 
 just any letter (catcode letter) would do and the rest is to be
 controlled by the patterns
 
 The issue here is that we want to make some characters equivalent to each
 other, e.g. ' and ’ which are needed for some languages, without the
 need to duplicate the patterns.
 
 Before jumping too deep into the subject, consider whether it is really
 worth the effort. There is not much more than titles written in
 transliterated text. No continuous reading.

It is not about the problem in this thread specifically, but rather another
issue that was raised recently on the XeTeX mailing list: basically, if one
is using the curly apostrophe (’), all hyphenation patterns that depend on
the ASCII one (') will not be taken into account.
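
The kind of equivalence meant here can be pictured as a normalisation step applied to a word before it is matched against the patterns. This is a minimal sketch in Python under that assumption. It is not how LuaTeX or XeTeX work internally, and the names EQUIVALENT and hyphenation_key are invented for illustration.

```python
# Characters that should count as the same letter for hyphenation purposes.
# Here: the curly apostrophe (U+2019) is treated like the ASCII apostrophe.
EQUIVALENT = {"\u2019": "'"}


def hyphenation_key(word):
    """Return the form of a word that would be matched against the patterns.

    The word is lowercased (roughly the role \\lccode plays in TeX) and
    every character listed in EQUIVALENT is replaced by its stand-in, so
    the patterns only need to contain one variant of each character.
    """
    return "".join(EQUIVALENT.get(ch, ch) for ch in word.lower())
```

With such a step, patterns written with ' would also apply to words typed with ’, without duplicating the patterns.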

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer


Re: [NTG-context] transliteration russian

2010-10-30 Thread Hans Hagen

On 30-10-2010 12:05, Khaled Hosny wrote:

On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:

By far the easiest and most portable solution would be if you could
convince Taco to implement something like latin a is equivalent to
cyrillic a as far as hyphenation is concerned (which could also solve
many other problems that we have). Actually, you can already do that
by redefining \lccode of latin a to point to cyrillic a (and do that
for the whole alphabet), but then you need to make sure that you don't
use any commands for lowercasing/uppercasing words. If you need
details, I can help you out, but first exact transliteration rules are
needed.


I was thinking: since using \lccode for hyphenation is really a weird
choice (I'm sure Don had a good reason back then, but such things are
usually no longer relevant), and since it is used in a sort of
controlled environment (playing with \lccodes for hyphenation is not
everyone's toy), maybe LuaTeX can break backward compatibility in
the hyphenation area and have a dedicated new code, \hycode or
something, only for hyphenation purposes (backward compatibility
could perhaps be kept by using it in addition to \lccode).

What do you think?


just any letter (catcode letter) would do and the rest is to be 
controlled by the patterns


Hans

-
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
 | www.pragma-pod.nl
-


Re: [NTG-context] transliteration russian

2010-10-30 Thread Taco Hoekwater

On 10/30/2010 10:17 AM, Hans Hagen wrote:

On 30-10-2010 12:05, Khaled Hosny wrote:

On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:

By far the easiest and most portable solution would be if you could
convince Taco to implement something like latin a is equivalent to
cyrillic a as far as hyphenation is concerned


You could try to convince me, but that would take considerable effort
because that is a form of cheating that I am not comfortable with.

Besides, in the non-trivial cases, a single cyrillic letter maps to
multiple latin ones, and setting that up as an internal remapping
is not trivial.

There is a simpler solution, I think: treat transliterations as a
separate language on the macro side. Generating the patterns for
that new language is simple if the transliteration rules are correct;
just do the replacements like so:

  ‘я’-‘j8a’
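
As an illustration of the replacement step above, here is a minimal sketch in Python. It assumes that every letter of a multi-letter replacement is joined with the digit 8, as in the я example, so that no break can fall inside the transliterated letter. The function name, the one-entry table and the sample pattern are hypothetical, not taken from any real pattern file.

```python
def transliterate_pattern(pattern, table):
    """Rewrite one hyphenation pattern into its transliterated form.

    Characters found in `table` are replaced by their transliteration.
    If the transliteration has several letters, they are joined with "8"
    so the resulting pattern forbids a break inside that group.
    Digits, dots and unmapped letters are copied unchanged.
    """
    pieces = []
    for ch in pattern:
        replacement = table.get(ch, ch)
        pieces.append("8".join(replacement))
    return "".join(pieces)


# Hypothetical one-entry table and pattern, for illustration only.
table = {"я": "ja"}
print(transliterate_pattern("1я", table))
```

Whether the generated patterns conflict with each other would still have to be checked against the real pattern set.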

Best wishes,
Taco


Re: [NTG-context] transliteration russian

2010-10-30 Thread Khaled Hosny
On Sat, Oct 30, 2010 at 10:17:11AM +0200, Hans Hagen wrote:
 On 30-10-2010 12:05, Khaled Hosny wrote:
 On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:
 By far the easiest and most portable solution would be if you could
 convince Taco to implement something like latin a is equivalent to
 cyrillic a as far as hyphenation is concerned (which could also solve
 many other problems that we have). Actually, you can already do that
 by redefining \lccode of latin a to point to cyrillic a (and do that
 for the whole alphabet), but then you need to make sure that you don't
 use any commands for lowercasing/uppercasing words. If you need
 details, I can help you out, but first exact transliteration rules are
 needed.
 
 I was thinking: since using \lccode for hyphenation is really a weird
 choice (I'm sure Don had a good reason back then, but such things are
 usually no longer relevant), and since it is used in a sort of
 controlled environment (playing with \lccodes for hyphenation is not
 everyone's toy), maybe LuaTeX can break backward compatibility in
 the hyphenation area and have a dedicated new code, \hycode or
 something, only for hyphenation purposes (backward compatibility
 could perhaps be kept by using it in addition to \lccode).
 
 What do you think?
 
 just any letter (catcode letter) would do and the rest is to be
 controlled by the patterns

The issue here is that we want to make some characters equivalent to each
other, e.g. ' and ’ which are needed for some languages, without the
need to duplicate the patterns.

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer


Re: [NTG-context] transliteration russian

2010-10-30 Thread Philipp Gesang
On 2010-10-30 01:06:33, Andrzej Orłowski-Skoczyk wrote:
 On 10/30/2010 12:47 AM, Philipp Gesang wrote:
  As others already pointed out, with a small number of strings
  Steffen might get acceptable results by using the patterns of a
  similar language. Although real transliterations work best with
  Czech or Slovak, this peculiar transcription might be better off
  with Polish or even (judging by the use of ‘sh’) standard
  English.
 
 I'm afraid Polish will not do (Polish always hyphenates sz-cz, though in
 Russian shch is one character; and such).

Of course, your point is clear. Still I think Polish would be of
more use than Czech in this case because it shares more
similarities with the transcribed Russian. E.g. Russian and
Polish have ‘ks’ where Czech has ‘x’; both RuPl allow ‘ki’ and
‘gi’ which is illegal in Cz; and Czech lacks a native ‘g’, while
others have kept it. Thus you can hope for more valid hyphenation
points if you use the Polish patterns, don’t you?

 
 I'm afraid no Slavic language will do unless there is one that uses
 Latin script _and_ soft/hard sign (yer) - these are tricky, not similar
 to anything you meet in Polish/Czech and so on.

None of them are perfect, but most cases don’t require
perfection. Trans[cription|literation] rarely occurs in masses,
so often I just insert the break points by hand and forget about
it.

Regards, Philipp


 -- 
 Andrzej Orłowski-Skoczyk

-- 
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments




Re: [NTG-context] transliteration russian

2010-10-30 Thread Steffen Wolfrum

Am 30.10.2010 um 00:15 schrieb Andrzej Orłowski-Skoczyk:

 
 Warning: the transliteration used in Steffen's document is (or at least
 the example is) lossy and as such will likely produce wrong hyphenation
 output no matter the applied method of making TeX hyphenate it.
 
 The transliteration (in the example) is also inconsistent - if you tried
 to reverse transliterate it to Cyrillic, you would not only miss some
 characters, but you would also get some other characters wrong.
 




Andrzej,

thanks for your statement! 

Thus I will leave it to the author to draw in the appropriate break points when 
reading the first proof. After all it's her text.

I think the results would not be in proportion to the effort if we tried 
to work on a general solution on the ConTeXt/LuaTeX side. At least not for 
this specific project.


My question starting this thread was made under the assumption of a good 
transliteration ...


Thank you all for your very interesting hints and notes!

Steffen


[NTG-context] transliteration russian

2010-10-29 Thread Steffen Wolfrum
Hi all,

I am just about to typeset a book by a Russian author, written in English, 
but with a lot of Russian literature listed in the bibliography.
The titles of these sources are Russian but in Latin transliteration, like 
this:
O koordinacii mezhdunarodnyh i vneshnejekonomicheskih svjazej subjektov 
Rossijskoj Federacii

But even though I assigned \language[ru], the word vneshnejekonomicheskih, 
for example, does not get hyphenated.
And there are some dozen more titles that show the same problem ...

Is this failure to hyphenate caused by the transliteration?
Do I have to choose another \language key?

Yours,
Steffen


Re: [NTG-context] transliteration russian

2010-10-29 Thread Thomas A. Schmitz

On Oct 29, 2010, at 1:18 PM, Steffen Wolfrum wrote:

 Hi all,
 
 I am just about to typeset a book by a Russian author, written in English, 
 but with a lot of Russian literature listed in the bibliography.
 The titles of these sources are Russian but in Latin transliteration, like 
 this:
 O koordinacii mezhdunarodnyh i vneshnejekonomicheskih svjazej subjektov 
 Rossijskoj Federacii
 
 But even though I assigned \language[ru], the word vneshnejekonomicheskih, 
 for example, does not get hyphenated.
 And there are some dozen more titles that show the same problem ...
 
 Is this failure to hyphenate caused by the transliteration?
 Do I have to choose another \language key?
 
Of course. To the LuaTeX parser, the transliterated Russian is just 
gobbledygook; the hyphenation patterns expect proper Unicode (Cyrillic) input.

Thomas



Re: [NTG-context] transliteration russian

2010-10-29 Thread Jano Kula

On 10/29/2010 01:58 PM, Thomas A. Schmitz wrote:


On Oct 29, 2010, at 1:18 PM, Steffen Wolfrum wrote:


Hi all,

I am just about to typeset a book by a Russian author, written in English, 
but with a lot of Russian literature listed in the bibliography.
The titles of these sources are Russian but in Latin transliteration, like 
this:
O koordinacii mezhdunarodnyh i vneshnejekonomicheskih svjazej subjektov 
Rossijskoj Federacii

But even though I assigned \language[ru], the word vneshnejekonomicheskih, 
for example, does not get hyphenated.
And there are some dozen more titles that show the same problem ...

Is this failure to hyphenate caused by the transliteration?
Do I have to choose another \language key?


I would expect Slavic languages (cz, pl) to give better results in 
hyphenation of this transliterated text, though they will not give 
perfect results and exceptions will be needed. I'm assuming a reader 
who expects Russian hyphenation rules in these cases.


Jano



Re: [NTG-context] transliteration russian

2010-10-29 Thread Mojca Miklavec
On Fri, Oct 29, 2010 at 13:18, Steffen Wolfrum wrote:
 Hi all,

 I am just about to typeset a book by a Russian author, written in English, 
 but with a lot of Russian literature listed in the bibliography.
 The titles of these sources are Russian but in Latin transliteration, like 
 this:
 O koordinacii mezhdunarodnyh i vneshnejekonomicheskih svjazej subjektov 
 Rossijskoj Federacii

 But even though I assigned \language[ru], the word vneshnejekonomicheskih, 
 for example, does not get hyphenated.
 And there are some dozen more titles that show the same problem ...

 Is this failure to hyphenate caused by the transliteration?
 Do I have to choose another \language key?

Dear Steffen,

The Russian patterns only cover the Cyrillic part. Serbian patterns
are the only ones that cover both scripts, but even then the patterns
themselves are seen as two different languages by TeX.

The best thing to do would be to transliterate Russian patterns into
Latin script (under one condition: transliteration needs to be
one-to-one; if one cyrillic glyph transliterates into two latin
characters, that doesn't help you). If you use LuaTeX you may then
load the patterns on the fly.

Another easy option would be to load any other slavic patterns as
Jano suggested and then add exceptions where needed. I'm not sure if
transliterated patterns belong to hyph-utf8. (If nothing else, Russian
is transliterated differently into Slovenian for example, so one would
formally then need transliteration from Russian to any other given
language written in Cyrillic script).

[still under assumption that you use LuaTeX and that transliteration
is one-to-one]
By far the easiest and most portable solution would be if you could
convince Taco to implement something like latin a is equivalent to
cyrillic a as far as hyphenation is concerned (which could also solve
many other problems that we have). Actually, you can already do that
by redefining \lccode of latin a to point to cyrillic a (and do that
for the whole alphabet), but then you need to make sure that you don't
use any commands for lowercasing/uppercasing words. If you need
details, I can help you out, but first exact transliteration rules are
needed.

Mojca


Re: [NTG-context] transliteration russian

2010-10-29 Thread Khaled Hosny
On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:
 By far the easiest and most portable solution would be if you could
 convince Taco to implement something like latin a is equivalent to
 cyrillic a as far as hyphenation is concerned (which could also solve
 many other problems that we have). Actually, you can already do that
 by redefining \lccode of latin a to point to cyrillic a (and do that
 for the whole alphabet), but then you need to make sure that you don't
 use any commands for lowercasing/uppercasing words. If you need
 details, I can help you out, but first exact transliteration rules are
 needed.

I was thinking: since using \lccode for hyphenation is really a weird
choice (I'm sure Don had a good reason back then, but such things are
usually no longer relevant), and since it is used in a sort of
controlled environment (playing with \lccodes for hyphenation is not
everyone's toy), maybe LuaTeX can break backward compatibility in
the hyphenation area and have a dedicated new code, \hycode or
something, only for hyphenation purposes (backward compatibility
could perhaps be kept by using it in addition to \lccode).

What do you think?

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer


Re: [NTG-context] transliteration russian

2010-10-29 Thread Andrzej Orłowski-Skoczyk
On 10/29/2010 11:25 PM, Mojca Miklavec wrote:
 The best thing to do would be to transliterate Russian patterns into
 Latin script (under one condition: transliteration needs to be
 one-to-one; if one cyrillic glyph transliterates into two latin
 characters, that doesn't help you). If you use LuaTeX you may then
 load the patterns on the fly.
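
The one-to-one condition can be checked mechanically. A minimal Python
sketch, assuming patterns in the usual TeX form (letters, digits, and a
dot for word boundaries); the mapping and the sample patterns are made
up for illustration and are not taken from the real Russian pattern
file:

```python
def transliterate_pattern(pattern, mapping):
    """Map each letter of a hyphenation pattern; digits and dots stay as-is.

    Raises ValueError when a letter maps to more than one character,
    because the inter-letter digits would then no longer line up.
    """
    out = []
    for ch in pattern:
        if ch.isdigit() or ch == ".":
            out.append(ch)
            continue
        latin = mapping[ch]  # KeyError for letters missing from the mapping
        if len(latin) != 1:
            raise ValueError(f"{ch!r} -> {latin!r} is not one-to-one")
        out.append(latin)
    return "".join(out)


# Hypothetical mapping excerpt and hypothetical patterns.
SAMPLE_MAPPING = {"а": "a", "б": "b", "ш": "sh"}

print(transliterate_pattern(".а1б", SAMPLE_MAPPING))
```

With this mapping a pattern containing 'ш' is rejected, which is the
case the condition above is meant to exclude.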

Warning: the transliteration used in Steffen's document is (or at least
the example is) lossy and as such will likely produce wrong hyphenation
output no matter the applied method of making TeX hyphenate it.

The transliteration (in the example) is also inconsistent - if you tried
to reverse transliterate it to Cyrillic, you would not only miss some
characters, but you would also get some other characters wrong.

Examples:
- 'subjektov' is 'субъектов',
- 'vneshnejekonomicheskih' is 'внешнеэкономических',
thus 'je' stands for both 'ъе' and for 'э'.

This, however, could just be the author's typo. In that case 'subjektov'
should be corrected to 'subektov'.
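
A collision like this can be found automatically once the mapping is
written down. A minimal Python sketch; only the two 'je' entries come
from the example above, the other entries are illustrative filler:

```python
from collections import defaultdict

# Excerpt of a romanization table. The two 'je' entries are the ones
# observed in the example; 'ш' and 'к' are illustrative assumptions.
ROMANIZATION = {
    "ъе": "je",  # 'subjektov' for 'субъектов'
    "э": "je",   # 'vneshnejekonomicheskih' for 'внешнеэкономических'
    "ш": "sh",
    "к": "k",
}


def ambiguous_targets(mapping):
    """Return the Latin strings that more than one Cyrillic source maps to."""
    sources = defaultdict(list)
    for cyrillic, latin in mapping.items():
        sources[latin].append(cyrillic)
    return {lat: sorted(cyr) for lat, cyr in sources.items() if len(cyr) > 1}


print(ambiguous_targets(ROMANIZATION))
```

Any non-empty result means the romanization cannot be reversed without
guessing.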


The way to achieve a univocal (one-to-one) transliteration would be
first to reverse transliterate it to Cyrillic, and then transliterate
back to Latin using ISO 9 transliteration standard:
http://en.wikipedia.org/wiki/ISO_9
The example 'О координации международных и внешнеэкономических связей
субъектов Российской Федерации' would then output 'O koordinacii
meždunarodnyh i vnešneèkonomičeskih svâzej subektov Rossijskoj
Federacii'. This however I wouldn't consider a very human-readable output.
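
For experiments, the ISO 9 mapping for Russian is small enough to write
out. A minimal Python sketch, assuming the ISO 9:1995 correspondences
for the 33 modern Russian letters; the function names are made up for
illustration. Note that in this table ъ maps to the modifier letter
double prime (U+02BA), so the sketch gives 'subʺektov' where the
example above shows 'subektov':

```python
# ISO 9:1995 correspondences for the modern Russian alphabet (lowercase).
ISO9 = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "ë", "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c",
    "ч": "č", "ш": "š", "щ": "ŝ", "ъ": "\u02ba", "ы": "y",
    "ь": "\u02b9", "э": "è", "ю": "û", "я": "â",
}
ISO9_REVERSE = {latin: cyrillic for cyrillic, latin in ISO9.items()}


def _convert(text, table):
    """Map character by character, keeping case and unknown characters."""
    out = []
    for ch in text:
        low = ch.lower()
        mapped = table.get(low)
        if mapped is None:
            out.append(ch)              # spaces, punctuation, other scripts
        elif ch != low:
            out.append(mapped.upper())  # source letter was uppercase
        else:
            out.append(mapped)
    return "".join(out)


def to_iso9(text):
    return _convert(text, ISO9)


def from_iso9(text):
    return _convert(text, ISO9_REVERSE)


title = ("О координации международных и внешнеэкономических связей "
         "субъектов Российской Федерации")
print(to_iso9(title))
print(from_iso9(to_iso9(title)) == title)
```

Because every letter maps to exactly one character, the reverse table
is well defined and the round trip restores the Cyrillic text.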

A very handy tool for experiments can be found here:
http://translit.cc/

As an aside: wouldn't it be much better to just use Cyrillic for that?
-- 
Andrzej Orłowski-Skoczyk


Re: [NTG-context] transliteration russian

2010-10-29 Thread Mojca Miklavec
2010/10/30 Andrzej Orłowski-Skoczyk wrote:
 On 10/29/2010 11:25 PM, Mojca Miklavec wrote:
 The best thing to do would be to transliterate Russian patterns into
 Latin script (under one condition: transliteration needs to be
 one-to-one; if one cyrillic glyph transliterates into two latin
 characters, that doesn't help you). If you use LuaTeX you may then
 load the patterns on the fly.

 Warning: the transliteration used in Steffen's document is (or at least
 the example is) lossy and as such will likely produce wrong hyphenation
 output no matter the applied method of making TeX hyphenate it.

I didn't inspect the transliteration, but now that you point it out -
true, to achieve perfect results, one would need to completely
redesign the patterns.

... or simply use the patterns of some Slavic language and fix the
wrong hyphenations one by one (in particular, sh/ch could easily be
broken apart even though each represents a single letter).

 The example 'О координации международных и внешнеэкономических связей
 субъектов Российской Федерации' would then output 'O koordinacii
 meždunarodnyh i vnešneèkonomičeskih svâzej subektov Rossijskoj
 Federacii'. This however I wouldn't consider a very human-readable output.

... it depends on who the human is. Slavic-speaking countries have no
problem pronouncing čšž ... :) :) :) Quotation marks are a bit weird
though ...

Maybe the most sensible solution (assuming LuaTeX) that would work
perfectly but would not be easy to write could be to input the title
in Cyrillic script, let TeX hyphenate it, and finally output
automatically transliterated string.
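
The idea can be sketched outside TeX. A minimal Python sketch, assuming
some hyphenator has already split the Cyrillic word into pieces; the
romanization table, the function names, and the break point in the
sample word are illustrative assumptions, and this is not how it would
be wired into LuaTeX:

```python
SOFT_HYPHEN = "\u00ad"

# Illustrative romanization excerpt with multi-letter outputs.
ROMAN = {
    "в": "v", "н": "n", "е": "e", "ш": "sh", "ч": "ch", "щ": "shch",
}


def transliterate(piece):
    """Romanize one piece; characters missing from the table pass through."""
    return "".join(ROMAN.get(ch, ch) for ch in piece)


def transliterate_hyphenated(pieces, separator=SOFT_HYPHEN):
    """Romanize each piece separately and rejoin at the original breaks.

    The break points were chosen on the Cyrillic text, so a digraph such
    as 'sh' can never be split: it comes from a single source letter.
    """
    return separator.join(transliterate(p) for p in pieces)


# Assumed output of a Cyrillic hyphenator for 'внешне'.
print(transliterate_hyphenated(["внеш", "не"], separator="-"))
```

The transliteration may be one-to-many here, because it is applied only
after the break points are fixed.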

Mojca


Re: [NTG-context] transliteration russian

2010-10-29 Thread Philipp Gesang
On 2010-10-29 23:25:20, Mojca Miklavec wrote:
 
 The best thing to do would be to transliterate Russian patterns into
 Latin script (under one condition: transliteration needs to be
 one-to-one; if one cyrillic glyph transliterates into two latin

The one in question is rather a transcription (‘romanization’)
than a transliteration, thus unfortunately there is no
bijective mapping (e.g. ‘я’-‘ja’, ‘ш’-‘sh’ etc.). It seems to
be a hybrid between the standard Library of Congress-style
transcription and an older ISO or ГОСТ transliteration. Also, ‘j’
occurs in very odd positions. Whatever it is, we would need the
complete transcription mapping.

As others already pointed out, with a small number of strings
Steffen might get acceptable results by using the patterns of a
similar language. Although real transliterations work best with
Czech or Slovak, this peculiar transcription might be better off
with Polish or even (judging by the use of ‘sh’) standard
English.

@Steffen, if you could convince the author to supply the original
Russian text and if he would agree to use a more common style,
you could let the transliteration module do the job instead
(http://bitbucket.org/phg/transliterator).

 By far the easiest and most portable solution would be if you could
 convince Taco to implement something like latin a is equivalent to
 cyrillic a as far as hyphenation is concerned (which could also solve
 many other problems that we have).

+1. This would be a great feature.

Good night all, Philipp



 
 Mojca

-- 
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments




Re: [NTG-context] transliteration russian

2010-10-29 Thread Andrzej Orłowski-Skoczyk
On 10/30/2010 12:47 AM, Philipp Gesang wrote:
 As others already pointed out, with a small number of strings
 Steffen might get acceptable results by using the patterns of a
 similar language. Although real transliterations work best with
 Czech or Slovak, this peculiar transcription might be better off
 with Polish or even (judging by the use of ‘sh’) standard
 English.

I'm afraid Polish will not do (Polish always hyphenates sz-cz, whereas
in Russian shch is one character; and there are similar cases).

I'm afraid no Slavic language will do unless there is one that uses
Latin script _and_ a soft/hard sign (yer). These are tricky and unlike
anything you meet in Polish, Czech and so on.
-- 
Andrzej Orłowski-Skoczyk