Re: [XeTeX] babel

2016-03-24 Thread Ross Moore
Hi Javier,

On Mar 24, 2016, at 5:59 PM, Javier Bezos wrote:

> Apostolos,
>
>> preface = \textPi \textrho\acctonos \textomicron\textlambda \textomicron\textgamma
>>
>> XeLaTeX is Unicode aware and can handle Unicode strings. Therefore, I fail to see
>> why you are doing things this way. The LGR font encoding is an ancient hack that
>> has no usage anymore.
>
> Of course, in Unicode engines the default captions section
> applies, not the captions.licr subsection.

I think that it is absolutely correct that you build in continuing support
for old encodings that may no longer be used with new documents.

The existence of old documents using such encodings certainly
warrants this — especially in the case of archives that process
old (La)TeX sources to create PDFs on the fly.

It is quite possible that in future these will be required to conform
to modern standards, rather than just reproduce exactly what those
sources did in past decades. Then there is the issue of old documents
being aggregated with newer ones, for “Collected Works”-like publications.

It is quite wrong to say that, because we now have newer, better methods,
those older methods should be discarded entirely.


I’m facing exactly this problem, adapting pdfx.sty to be able to translate
Metadata provided in old encodings (KOI8-R, LGR, OT6, etc.)
automatically into UTF-8, because XMP requires UTF-8 in order
to satisfy the PDF/A, PDF/X and PDF/E standards.
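
For encodings that have a stock codec, the transcoding step described above can be sketched in Python as follows. This is only an illustration: KOI8-R is a standard Python codec, whereas LGR and OT6 are TeX font encodings with no stock codec and would need hand-written mapping tables. The function name `to_utf8` is made up for this sketch and is not part of pdfx.sty.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


# Simulate a metadata field that was stored in KOI8-R (hypothetical sample).
legacy_title = "Привет".encode("koi8_r")
utf8_title = to_utf8(legacy_title, "koi8_r")
print(utf8_title.decode("utf-8"))
```

KOI8-R is a single-byte encoding, so the six Cyrillic letters occupy six bytes before conversion and twelve bytes afterwards.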



> Javier

Keep up the good work.

Cheers,

Ross


Dr Ross Moore

Mathematics Dept | Level 2, S2.638 AHH
Macquarie University, NSW 2109, Australia

T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: 
ross.mo...@mq.edu.au

http://www.maths.mq.edu.au





--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] babel

2016-03-24 Thread Javier Bezos

Apostolos,

> preface = \textPi \textrho\acctonos \textomicron\textlambda \textomicron\textgamma
>
> XeLaTeX is Unicode aware and can handle Unicode strings. Therefore, I fail to see
> why you are doing things this way. The LGR font encoding is an ancient hack that
> has no usage anymore.

Of course, in Unicode engines the default captions section
applies, not the captions.licr subsection.
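
As a rough illustration of that selection rule, here is a Python sketch. It assumes the locale file marks the two blocks with `[captions]` and `[captions.licr]` section headers that `configparser` can read. The sample text, the Unicode caption and the function name `pick_captions` are made up for the example, and the LICR string is the (apparently truncated) one quoted above.

```python
import configparser

# Hypothetical two-section sample; not copied from a real locale file.
SAMPLE = r"""
[captions]
preface = Πρόλογος

[captions.licr]
preface = \textPi \textrho\acctonos \textomicron\textlambda \textomicron\textgamma
"""


def pick_captions(ini_text, unicode_engine):
    """Return the captions that apply for the given kind of engine."""
    # interpolation=None keeps '%' and backslashes in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = "captions" if unicode_engine else "captions.licr"
    return dict(parser[section])


print(pick_captions(SAMPLE, unicode_engine=True)["preface"])
print(pick_captions(SAMPLE, unicode_engine=False)["preface"])
```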

Javier




Re: [XeTeX] babel

2016-03-24 Thread Javier Bezos

Mojca,

Thank you. See my reply to Zdeněk.


> What is the difference between months.format.wide and
> months.stand-alone.wide?


In most languages, none. This distinction is made by the CLDR,
but I wonder if it's useful here, so very likely the format
branch should be removed.


> In Slovenian one sometimes uses the genitive
> form of the date, like
>  Today is "24. marec 2016" (nominative)
>  This happened on "24. marca 2016" (genitive)
> I don't know whether there is any sane way to encode this though.


From the babel manual:

‘More interesting are differences in the sentence structure or related
to it. For example, in Basque the number precedes the name (including
chapters), in Hungarian “from (1)” is “(1)-ből”, but “from (3)” is
“(3)-ból”, in Spanish an item labelled “3.o” may be referred to as
either “ítem 3.o” or “3.er ítem”, and so on.’

So, yes :-).

Javier














Re: [XeTeX] babel

2016-03-24 Thread Mojca Miklavec
On 23 March 2016 at 19:31, Javier Bezos wrote:
> Hi all,
>
> I'm working on a new version of babel, with a new way to define
> languages in a descriptive way, more than in a programmatic one (of
> course, the latter won't be excluded because it's still necessary).
>
> The idea is to create a set of ini files like those you can find on
>
> https://latex-project.org/svnroot/latex2e-public/trunk/required/babel/locales/
>
> They are tentative and some of them are incomplete. I'm working on the
> code to read and 'transform' their data, but in the meanwhile I'd like
> to improve the ini files. The first step in the roadmap is to provide
> real utf-8 strings for captions and dates with current styles so
> that they can be useable even without fontenc.
>
> Any help or comments would be greatly appreciated.

The alphabetic order here is completely confusing:

days.format.wide.fri =
days.format.wide.mon =
days.format.wide.sat =
days.format.wide.sun =
days.format.wide.thu =
days.format.wide.tue =
days.format.wide.wed =
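
A loader does not have to keep the alphabetical key order of the file. A minimal Python sketch of re-sorting such keys into weekday order follows. The list `WEEK_ORDER` (Monday first) and the helper name `sort_day_keys` are assumptions made for this illustration, not something the ini files define.

```python
# Assumed display order; a locale starting the week on Sunday would differ.
WEEK_ORDER = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def sort_day_keys(keys):
    """Sort keys such as 'days.format.wide.fri' by weekday, not alphabetically."""
    return sorted(keys, key=lambda k: WEEK_ORDER.index(k.rsplit(".", 1)[1]))


keys = [
    "days.format.wide.fri",
    "days.format.wide.mon",
    "days.format.wide.sat",
    "days.format.wide.sun",
    "days.format.wide.thu",
    "days.format.wide.tue",
    "days.format.wide.wed",
]
for key in sort_day_keys(keys):
    print(key)
```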

What is the difference between captions and captions.licr?

What is the difference between months.format.wide and
months.stand-alone.wide? In Slovenian one sometimes uses the genitive
form of the date, like
  Today is "24. marec 2016" (nominative)
  This happened on "24. marca 2016" (genitive)
I don't know whether there is any sane way to encode this though.
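
One possible encoding is sketched below in Python. It assumes the format branch holds the form used inside a full date and the stand-alone branch holds the form used on its own, which is how the CLDR describes the two contexts. The dictionary is illustrative only and is not taken from the actual ini files; the helper names are made up.

```python
# Hypothetical data for March, reusing the two forms from the example above.
MONTHS_SL = {
    "months.format.wide.3": "marca",       # form used inside a full date
    "months.stand-alone.wide.3": "marec",  # form used on its own
}


def month_name(data, month, in_date):
    """Pick the month name from the branch that matches the context."""
    branch = "format" if in_date else "stand-alone"
    return data[f"months.{branch}.wide.{month}"]


def full_date(data, day, month, year):
    """Build a date string using the 'format' branch."""
    return f"{day}. {month_name(data, month, in_date=True)} {year}"


print(full_date(MONTHS_SL, 24, 3, 2016))
print(month_name(MONTHS_SL, 3, in_date=False))
```

This covers only two fixed contexts; choosing between nominative and genitive inside running text would still have to be done by the author.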

I don't know how months.format.narrow is used, but a single letter is
completely useless because it's too ambiguous. One uses 24.3.2016 or
24.03.2016 (in tables etc. where aligning is important). Whether or
not there is a space in date.short is debatable. (Officially it's
correct to use a space, but almost nobody uses it.) Officially one is
also supposed to write time with a dot rather than colon, but most use
a colon.
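
The variants mentioned here can be written down as options of a formatter. The following Python sketch is only an illustration; the parameter names `pad`, `space` and `sep` are invented and do not correspond to any key in the ini files.

```python
def short_date(day, month, year, pad=False, space=False):
    """Numeric date: 24.3.2016, 24.03.2016 (padded) or 24. 3. 2016 (spaced)."""
    sep = ". " if space else "."
    if pad:
        return f"{day:02d}{sep}{month:02d}{sep}{year}"
    return f"{day}{sep}{month}{sep}{year}"


def short_time(hour, minute, sep="."):
    """24-hour time with a configurable separator (dot or colon)."""
    return f"{hour}{sep}{minute:02d}"


print(short_date(24, 3, 2016))
print(short_date(24, 3, 2016, pad=True))
print(short_date(24, 3, 2016, space=True))
print(short_time(9, 5), short_time(9, 5, sep=":"))
```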

German typography doesn't use French spacing as far as I know.

For Slovenian:
- OT1 and LY1 are not suitable encodings.
- Glossary is not a Slovenian word. It should probably be "Slovar".
- headto = Prejme is weird
- righthyphenmin = 2
- I don't understand the zillion entries about hyphenchar, but it must
be similar to other European languages.
- Having just "quotes =" might not be sufficient if you want to
automatically support quotes one day like ConTeXt does with
\quote{...} and \quotation{...}. We use two flavours (one can decide
to use either one or the other) and in both flavours one has both
single and double quotes.
  (a) ›single‹ »double«
  (b) ‚single‘ „double“
- What is meant by "exponential = e"? (I use $2{,}1\cdot 10^{-5}$ or
perhaps \times instead of \cdot.) Isn't "e" just a convention for
entering numbers into computers that has absolutely nothing to do with
typography?
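
To make the quotes remark concrete, here is a Python sketch of a table with two flavours and two nesting levels, using the marks listed under (a) and (b). The structure and the names `QUOTES_SL` and `quote` are hypothetical; nothing here reflects how babel or ConTeXt store this data.

```python
# Two flavours, each with single and double marks, as in items (a) and (b).
QUOTES_SL = {
    "a": {"single": ("›", "‹"), "double": ("»", "«")},
    "b": {"single": ("‚", "‘"), "double": ("„", "“")},
}


def quote(text, flavour="a", level="double"):
    """Wrap text in the opening and closing marks of the chosen flavour and level."""
    opening, closing = QUOTES_SL[flavour][level]
    return f"{opening}{text}{closing}"


print(quote("double", flavour="a"))
print(quote("single", flavour="b", level="single"))
```

A single "quotes =" entry can hold only one of these four pairs, which is the limitation pointed out in the list.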

I'm not sure if it's correct to use "po n. št." or just "n. št." (at
some point you will probably have to introduce comments in those ini
files). But we don't have BCE. So you might want to use something like
this (I don't want to certify correctness):

eras.abbreviated.0-alt-variant = pr. Kr.
eras.abbreviated.0 = pr. n. št.
eras.abbreviated.1 = po n. št.
eras.abbreviated.1-alt-variant = po Kr.
eras.wide.0-alt-variant = pred Kristusom
eras.wide.0 = pred našim štetjem
eras.wide.1 = našega štetja % or "po našem štetju"
eras.wide.1-alt-variant = po Kristusu
eras.narrow.0-alt-variant = pr. Kr.
eras.narrow.0 = pr. n. št.
eras.narrow.1 = po n. št.
eras.narrow.1-alt-variant = po Kr.
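
A reader for entries like these needs two things: comment stripping and a fallback from the alt-variant key to the plain key. The Python sketch below assumes `%` starts a comment (as in the wide.1 line above) and that values never contain a literal `%`. Both assumptions, and the function names, are made up for illustration.

```python
ERAS_SL = """\
eras.abbreviated.0-alt-variant = pr. Kr.
eras.abbreviated.0 = pr. n. št.
eras.wide.1 = našega štetja % or "po našem štetju"
"""


def parse_entries(text, comment_char="%"):
    """Parse 'key = value' lines, dropping anything after the comment character."""
    entries = {}
    for line in text.splitlines():
        line = line.split(comment_char, 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def era_label(entries, width, era, alt=False):
    """Return the alt-variant label if requested and present, else the plain one."""
    base = f"eras.{width}.{era}"
    if alt and base + "-alt-variant" in entries:
        return entries[base + "-alt-variant"]
    return entries[base]


entries = parse_entries(ERAS_SL)
print(era_label(entries, "abbreviated", 0, alt=True))
print(era_label(entries, "wide", 1, alt=True))
```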

The following is useless (= nobody will understand):

dayPeriods.format.narrow.am = d
dayPeriods.format.narrow.noon = n
dayPeriods.format.narrow.pm = p

We use numbers 0-23 to denote hour of the day rather than some bogus "d/n/p".

Mojca







Re: [XeTeX] babel

2016-03-24 Thread Javier Bezos

Zdeněk,

Thank you very much. Very useful, and you confirm my suspicion that
the data in the CLDR is not always reliable. Furthermore, it's
obvious it's intended mainly for displaying plain text in
some specific contexts and not for fine typesetting. At first
my idea was to synchronize the ini files more or less regularly
with the CLDR, but now I'm not sure it's a good idea.


> I do not understand the meaning of the encoding field.


The goal is to provide information about which encodings
support or have supported the language, even partially
(definitely, one couldn't say OT1 supports any language
except English and a few others). This field is essentially
informative.


> I understand hyphenchar (should be the same as in English in all mentioned
> languages) but do not understand the other hyphen* fields.


Most of them are intended for luatex (only for the languages
where they make sense, of course).

Javier




The minus sign in both Czech and Slovak should be –

The quotes in both Czech and Slovak are „ and “ (the closing quote has its
codepoint in Unicode but is rarely present in fonts, it is better to use
English opening quote which has the same shape).

In Czech (and maybe also in Slovak) the time separator is a period; in
sports results and timetables a colon is used.

Slovak: the characters Ä Ď Ô Ť in the index look strange to me; this should
be verified by a native Slovak speaker.

Hindi


See the note on the encoding above

A few misprints and missing items in the captions
bib = संदर्भ-ग्रन्थ (or संदर्भ-ग्रंथ)
contents - the version you have is one of the alternatives suggested by
Anshuman Pandey but most books I have bought in India contain अनुक्रम
part = खण्ड (or खंड)
page = पृष्ठ
proof = प्रमाण
glossary = शब्दार्थ सूची

cc, encl, and headto make no sense, I am probably the only man who writes
business e-mails in Hindi...

I have never seen abbreviated months (a native Hindi speaker should help).
The only abbreviations for days of week I have seen at the Aligarh railway
station are:
Monday = सो॰, Tuesday = मं॰, Wednesday = बु॰, Thursday = बृह॰, Friday = शुक॰
(or शुक्र॰, the plate was not clearly readable), Saturday = शनि॰, Sunday =
रवि॰. I would not be surprised if the ॰ punctuation were omitted.

[characters] ङ  and ञ are not used in Hindi, they should be removed from index

frenchspacing – I am afraid that it makes no sense in Hindi or in other
Indic languages. The proper spacing was implemented in GNU Freefont (at
least for Hindi) and is activated automatically by language switching. The
rules are explained (in Hindi only, links to other languages switch to a
different text) at
https://hi.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%80%E0%A4%A1%E0%A4%BF%E0%A4%AF%E0%A4%BE:%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6%E0%A5%80_%E0%A4%AE%E0%A5%87%E0%A4%82_%E0%A4%B8%E0%A4%BE%E0%A4%AE%E0%A4%BE%E0%A4%A8%E0%A5%8D%E0%A4%AF_%E0%A4%97%E0%A4%B2%E0%A4%A4%E0%A4%BF%E0%A4%AF%E0%A4%BE%E0%A4%81

punctuation: danda । and double danda ॥ should be listed as the most
important punctuation
quotes: either English double quotes or English single quotes are used
(depends on the preference of an author and/or a publisher)

number: Both Devanagari and Arabic digits are used; it is hard to say which
one should be the default
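
Whichever set is chosen as the default, switching between the two is a plain one-to-one mapping, as this Python sketch shows. The Devanagari digits are the Unicode characters U+0966 to U+096F; the function name `to_devanagari` is made up for the example.

```python
# Map ASCII digits 0-9 to Devanagari digits U+0966..U+096F.
DEVANAGARI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")


def to_devanagari(number_text):
    """Replace ASCII digits with Devanagari digits; other characters are kept."""
    return number_text.translate(DEVANAGARI_DIGITS)


print(to_devanagari("24.3.2016"))
```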

counters: the way list items are numbered does not conform to the LaTeX
system. I have a normative document on how it should be done; it is written in
Marathi and I probably also have a Hindi version. Unfortunately I have not
found time to implement it so far.



Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

2016-03-23 19:31 GMT+01:00 Javier Bezos:

Hi all,

I'm working on a new version of babel, with a new way to define
languages in a descriptive way, more than in a programmatic one (of
course, the latter won't be excluded because it's still necessary).

The idea is to create a set of ini files like those you can find on


https://latex-project.org/svnroot/latex2e-public/trunk/required/babel/locales/

They are tentative and some of them are incomplete. I'm working on the
code to read and 'transform' their data, but in the meanwhile I'd like
to improve the ini files. The first step in the roadmap is to provide
real utf-8 strings for captions and dates with current styles so
that they can be useable even without fontenc.

Any help or comments would be greatly appreciated.

[Crossposted to xetex and luatex lists.]

Javier













--
Subscriptions,