Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-12-02 Thread Joseph Wright

On 02/12/2019 17:52, Doug McKenna wrote:

> Joseph -
>
> A similar ambiguity occurs later in the README.md file.  It says
>
> - \Umathcode for all letters as TeX class 7 (var)
>
> Does "letters" mean those code points on the TeX side with \catcode 11, or
> those Unicode code points labeled with 'L' in UnicodeData.txt?
>
> If the former, then combining marks (Unicode 'M') should be entered into
> \Umathcode as TeX class 7; if the latter, then presumably not, though it's
> not clear why a math variable name can't have a combining mark.
>
> - Doug McKenna

The former: I've clarified.

Joseph


Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-12-02 Thread Doug McKenna
Joseph -

A similar ambiguity occurs later in the README.md file.  It says

- \Umathcode for all letters as TeX class 7 (var)

Does "letters" mean those code points on the TeX side with \catcode 11, or 
those Unicode code points labeled with 'L' in UnicodeData.txt?

If the former, then combining marks (Unicode 'M') should be entered into 
\Umathcode as TeX class 7; if the latter, then presumably not, though it's not 
clear why a math variable name can't have a combining mark.

- Doug McKenna



Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-12-01 Thread Joseph Wright

On 02/12/2019 05:56, Doug McKenna wrote:

> - \lccode and/or \uccode for non-letter code points
>    for which an upper or lower case mapping is given
>
> The problem with this is that earlier, it is stated that all combining mark
> code points (class code starting with 'M' in the UnicodeData.txt file) are
> to be considered letters (\catcode set to 11).  So there's an ambiguity here
> that needs clearing up.  Does the above apply to combining mark code points
> or not?

You've read something in that is not in the README ;)

The file says

  - `\catcode` 11 for all combining marks (Unicode class "M")

where I've very deliberately kept the TeX 'side' as what *actually
happens* (catcode-11), not said they are 'treated as letters', or similar.

I will clarify that 'letter' here means a codepoint with Unicode
character class "L", and is not linked to the TeX catcode.

> It may be that none of the combining marks in the data file have any case
> mappings, but there's no guarantee that is true.  So the question is, if a
> combining mark has an uppercase or lowercase mapping, does that get
> installed in \lccode and/or \uccode?

Yes, or at least that would be the case in principle: all code points with
upper/lower/title properties are set up.

> Also, there's a confusing typo ("can"?) in
>
> - \lccode and \uccode for all of class "Lt" (title
>    case letters) to the lower can upper case mappings
>    (or if not given to the code point itself)
>
> Should "can" be "and/or"?

It is 'and': you need to set lccode and uccode for these code points.

Joseph



Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-12-01 Thread Doug McKenna
Joseph Wright wrote:

>| Er, I thought the README was reasonably clear, ah well!

Here's an example of something that's not so clear to me.

The README.md file displayed at

  

says

- \lccode and/or \uccode for non-letter code points
  for which an upper or lower case mapping is given

The problem with this is that earlier, it is stated that all combining mark 
code points (class code starting with 'M' in the UnicodeData.txt file) are to 
be considered letters (\catcode set to 11).  So there's an ambiguity here that 
needs clearing up.  Does the above apply to combining mark code points or not?

It may be that none of the combining marks in the data file have any case 
mappings, but there's no guarantee that is true.  So the question is, if a 
combining mark has an uppercase or lowercase mapping, does that get installed 
in \lccode and/or \uccode?
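
Whether any non-letter code points carry case mappings can be checked empirically. The following Python sketch is a rough probe, not the loader's logic: it relies on Python's `str.lower()` and `str.upper()`, which apply full case mappings from Python's bundled Unicode data, so results may differ from the simple mappings in a given UnicodeData.txt. The function names are hypothetical.

```python
import unicodedata
from typing import List, Optional, Tuple


def simple_case_mappings(cp: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (lower, upper) code points, or None where no one-to-one
    mapping to a different code point exists."""
    ch = chr(cp)
    lower, upper = ch.lower(), ch.upper()
    lc = ord(lower) if len(lower) == 1 and lower != ch else None
    uc = ord(upper) if len(upper) == 1 and upper != ch else None
    return lc, uc


def marks_with_case_mappings(limit: int = 0x110000) -> List[int]:
    """List combining marks (category 'M*') below `limit` that have
    at least one case mapping."""
    return [
        cp
        for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith("M")
        and simple_case_mappings(cp) != (None, None)
    ]


# U+01C5 is a title-case letter (class "Lt") with both mappings;
# U+2160 ROMAN NUMERAL ONE is a non-letter (class "Nl") with a lower-case one.
for cp in (0x01C5, 0x2160):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), simple_case_mappings(cp))

# Report combining marks with case mappings in the first few blocks.
print([f"U+{cp:04X}" for cp in marks_with_case_mappings(0x1000)])
```

Calling `marks_with_case_mappings()` without a limit scans the whole code space and answers the question for the Unicode version that Python ships.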

Also, there's a confusing typo ("can"?) in

- \lccode and \uccode for all of class "Lt" (title
  case letters) to the lower can upper case mappings
  (or if not given to the code point itself)

Should "can" be "and/or"?


Doug McKenna


Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-11-28 Thread Ross Moore
Hi Joseph.

On 28 Nov 2019, at 6:29 pm, Joseph Wright
<joseph.wri...@morningstar2.co.uk> wrote:

> On 28/11/2019 00:16, Ross Moore wrote:
> > If by ignoring you mean removing the character entirely, then that is
> > surely not best at all.
> > Most  N Class (Normal) characters would be simply of the default  \mathord
> > class.
>
> That is already the case: it's where IniTeX starts off, chars are mathord.
> So 'nothing to do here'. Also note that some of this information is already
> set from the main Unicode file: it tells us which chars are letters.

OK. That’s what I’d expect.

> > I’d expect others to be mapped instead into a macro that corresponds to
> > something that TeX does support.
> > e.g.
> >  space characters for  thinspace, 2-em space, etc.  in  U+2000 – U+200A
> > can expand into things like:   \, \; \> \quad \qquad  etc.  ( even to
> > constructions like  \mskip1mu )
>
> That's not a generic IniTeX thing, I'm afraid.

Yeah, well there are so many of these extra space characters.
I really don’t know where they are all used in practice by other (non-TeX) apps.

> The Unicode data loaders are explicitly about setting up the basic data in
> Unicode TeX engines that's held in (primitive) tables.
>
> Creating macros is the job of the 'rest' of the format. Here, presumably you
> are thinking of making chars math-active: that's well out-of-scope for the
> loader.

Fair enough; especially if this is all happening before processing any textual
input intended for the typeset page.

> > After all, this is essentially what happens when pdfTeX reads raw Unicode
> > input.
>
> pdfTeX reads bytes, there's not really much comparison. In IniTeX mode, there
> is not much happening with UTF-8 and pdfTeX: perhaps you are thinking of with
> LaTeX?

Yes, sure I’m thinking of LaTeX; at least now that UTF-8 input has become the
default.
Previously there would be (inputenc) package and  .def  file loading.
But, as you say above, this comes later.

One has to wonder then, how much of the Unicode range needs to be (or can be)
handled earlier;
e.g, when there is only one sensible interpretation for the use of specific
characters?
Conversely, how much can, or should, be left to later when there may be a
better idea of which
(classes of) characters are present within the input source?

I suppose that is the kind of question you are dealing with; so I’ll now butt
out of this conversation,
but still watch it if there’s further continuation.

> Joseph



Cheers,

Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.mo...@mq.edu.au
http://www.maths.mq.edu.au
CRICOS Provider Number 2J. Think before you print.
Please consider the environment before printing this email.

This message is intended for the addressee named and may
contain confidential information. If you are not the intended
recipient, please delete it and notify the sender. Views expressed
in this message are those of the individual sender, and are not
necessarily the views of Macquarie University. 




Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Joseph Wright

On 28/11/2019 00:16, Ross Moore wrote:

> If by ignoring you mean removing the character entirely, then that is surely
> not best at all.
>
> Most  N Class (Normal) characters would be simply of the default  \mathord
> class.

That is already the case: it's where IniTeX starts off, chars are
mathord. So 'nothing to do here'. Also note that some of this
information is already set from the main Unicode file: it tells us which
chars are letters.

> I’d expect others to be mapped instead into a macro that corresponds to
> something that TeX does support.
> e.g.
>   space characters for  thinspace, 2-em space, etc.  in  U+2000 – U+200A
> can expand into things like:   \, \; \> \quad \qquad  etc.  ( even to
> constructions like  \mskip1mu )

That's not a generic IniTeX thing, I'm afraid. The Unicode data loaders
are explicitly about setting up the basic data in Unicode TeX engines
that's held in (primitive) tables. Creating macros is the job of the
'rest' of the format. Here, presumably you are thinking of making chars
math-active: that's well out-of-scope for the loader.

> After all, this is essentially what happens when pdfTeX reads raw Unicode
> input.

pdfTeX reads bytes, there's not really much comparison. In IniTeX mode,
there is not much happening with UTF-8 and pdfTeX: perhaps you are
thinking of with LaTeX?

Joseph


Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Joseph Wright

On 28/11/2019 01:26, Doug McKenna wrote:

> Ross wrote:
>
> | If by ignoring you mean removing the character entirely, then that is
> | surely not best at all.
> |
> | Most N Class (Normal) characters would be simply of the default \mathord
> | class.
>
> The parsing code in load-unicode-math-classes.tex installs values in the
> \Umathcode table that comport with some rule, which without too much of a
> close look seems to me to be whether the character code math class read from
> MathClass.txt is one of the eight possibilities that parsing code pays
> attention to, out of the 15 possible ones in the file. Therefore it appears
> to me that all entries in MathClass.txt that are marked with, for instance,
> 'N', are ignored with respect to installing any entry in the \Umathcode
> table.
>
> It may be that such characters in MathClass.txt marked with 'N' take on the
> \mathord attribute by default when TeX finds them within math mode; I'm not
> sure without looking at its code.
>
> Doug McKenna

The loader is intended for use in IniTeX mode and so relies on the
defaults. As you say, characters are already \mathord unless actively
set to something else.

Joseph



Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Doug McKenna
Ross wrote: 

>| If by ignoring you mean removing the character entirely, then that is surely 
>not best at all. 
>| 
>| Most N Class (Normal) characters would be simply of the default \mathord 
>class. 

The parsing code in load-unicode-math-classes.tex installs values in the 
\Umathcode table that comport with some rule, which without too much of a close 
look seems to me to be whether the character code math class read from 
MathClass.txt is one of the eight possibilities that parsing code pays 
attention to, out of the 15 possible ones in the file. Therefore it appears to 
me that all entries in MathClass.txt that are marked with, for instance, 'N', 
are ignored with respect to installing any entry in the \Umathcode table. 

It may be that such characters in MathClass.txt marked with 'N' take on the 
\mathord attribute by default when TeX finds them within math mode; I'm not 
sure without looking at its code. 

Doug McKenna 



From: "Ross Moore"
To: "xetex"
Sent: Wednesday, November 27, 2019 5:16:44 PM
Subject: Re: [XeTeX] Math class initialization in Unicode-aware engine

Hi Joe, Doug

On 28 Nov 2019, at 10:27 am, Joseph Wright
<joseph.wri...@morningstar2.co.uk> wrote:

> > # N - Normal - includes all digits and symbols requiring only one form
> > # D - Diacritic
> > # F - Fence - unpaired delimiter (often used as opening or closing)
> > # G - Glyph_Part - piece of large operator
> > # S - Space
> > # U - Unary - operators that are only unary
> > # X - Special - characters not covered by other classes
> >
> > Unfortunately, the documentation/comments don't say what happens to
> > entries having these other Unicode math codes (N, D, F, G, S, U, and X).
> > Are they completely ignored, or are they mapped to one of the other eight
> > codes that matches what TeX is interested in or only capable of handling?
> >
> > I can imagine that the space character, given Unicode math class 'S' in
> > MathClass.txt, is ignored during this parse. But what happens to the '¬'
> > character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator).
> > Surely the logical not sign is not being ignored during initialization of
> > a Unicode-aware engine, yet the comments in load-unicode-math-classes.tex
> > don't say one way or the other, and it appears to me that the parsing code
> > is ignoring it.
>
> The other Unicode math classes don't really map directly to TeX ones, so
> they are currently ignored. Suggestions for improvements here are of
> course welcome.

If by ignoring you mean removing the character entirely, then that is surely
not best at all.

Most N Class (Normal) characters would be simply of the default \mathord class.

I’d expect others to be mapped instead into a macro that corresponds to
something that TeX does support.
e.g.
space characters for thinspace, 2-em space, etc. in U+2000 – U+200A
can expand into things like: \, \; \> \quad \qquad etc. ( even to constructions
like \mskip1mu )

After all, this is essentially what happens when pdfTeX reads raw Unicode
input.

The G class (Glyph_Part) is a lot harder, as those glyph parts don’t correspond
to any single TeX macro. Think about a very large opening brace spanning 3+
ordinary line widths, say, as may be generated by \left\{ ... \right\}
surrounding some (inner-) displayed math alignment.
On input, the whole grouping would need to be identified and mapped to
appropriate TeX coding.

Basically there is a lot here that needs to be looked at more or less
individually.

I’ve been through this kind of exercise, in reverse, to decide what to specify
as /Alt and /ActualText replacements (for accessibility) for what TeX produces
with various math constructions.
I don’t have definitive answers for everything, but have tried some
possibilities for many things.

> Joseph

Hope this helps.

Ross




Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Ross Moore
Hi Joe, Doug

On 28 Nov 2019, at 10:27 am, Joseph Wright
<joseph.wri...@morningstar2.co.uk> wrote:

> > # N - Normal - includes all digits and symbols requiring only one form
> > # D - Diacritic
> > # F - Fence - unpaired delimiter (often used as opening or closing)
> > # G - Glyph_Part - piece of large operator
> > # S - Space
> > # U - Unary - operators that are only unary
> > # X - Special - characters not covered by other classes
> >
> > Unfortunately, the documentation/comments don't say what happens to
> > entries having these other Unicode math codes (N, D, F, G, S, U, and X).
> > Are they completely ignored, or are they mapped to one of the other eight
> > codes that matches what TeX is interested in or only capable of handling?
> >
> > I can imagine that the space character, given Unicode math class 'S' in
> > MathClass.txt, is ignored during this parse. But what happens to the '¬'
> > character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator).
> > Surely the logical not sign is not being ignored during initialization of
> > a Unicode-aware engine, yet the comments in load-unicode-math-classes.tex
> > don't say one way or the other, and it appears to me that the parsing code
> > is ignoring it.
>
> The other Unicode math classes don't really map directly to TeX ones, so
> they are currently ignored. Suggestions for improvements here are of
> course welcome.

If by ignoring you mean removing the character entirely, then that is surely 
not best at all.

Most  N Class (Normal) characters would be simply of the default  \mathord  
class.

I’d expect others to be mapped instead into a macro that corresponds to 
something that TeX does support.
e.g.
 space characters for  thinspace, 2-em space, etc.  in  U+2000 – U+200A
can expand into things like:   \, \; \> \quad \qquad  etc.  ( even to 
constructions like  \mskip1mu )

After all, this is essentially what happens when pdfTeX reads raw Unicode input.

The G class (Glyph_Part) is a lot harder, as those glyph parts don’t correspond 
to any single
TeX macro. Think about a very large opening brace spanning 3+ ordinary line 
widths, say,
as may be generated by  \left\{ ... \right\}  surrounding some (inner-) 
displayed math alignment.
On input, the whole grouping would need to be identified and mapped to 
appropriate TeX coding.

Basically there is a lot here that needs to be looked at more or less individually.

I’ve been through this kind of exercise, in reverse, to decide what to specify 
as /Alt  and /ActualText
replacements (for accessibility) for what TeX produces with various math 
constructions.
I don’t have definitive answers for everything, but have tried some 
possibilities for many things.


> Joseph


Hope this helps.

Ross






Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Joseph Wright

On 27/11/2019 23:20, Doug McKenna wrote:

> Another question about Unicode-aware TeX engine (e.g., XeTeX) initialization
> files.
>
> The Unicode Consortium provides a file, MathClass.txt, e.g.,
>
> ./texmf-dist/tex/generic/unicode-data/MathClass.txt
>
> It contains a list of lines (and comments).  Field 0 of an entry line is a
> Unicode code point or a range of code points, and field 1 is a single ASCII
> character that declares the Unicode math class to which the code point or
> range of code points belongs.
>
> Comments in that file say that there are (currently) 15 different Unicode
> math class codes:
>
> #   N - Normal - includes all digits and symbols requiring only one form
> #   A - Alphabetic
> #   B - Binary
> #   C - Closing - usually paired with opening delimiter
> #   D - Diacritic
> #   F - Fence - unpaired delimiter (often used as opening or closing)
> #   G - Glyph_Part - piece of large operator
> #   L - Large - n-ary or large operator, often takes limits
> #   O - Opening - usually paired with closing delimiter
> #   P - Punctuation
> #   R - Relation - includes arrows
> #   S - Space
> #   U - Unary - operators that are only unary
> #   V - Vary - operators that can be unary or binary depending on context
> #   X - Special - characters not covered by other classes
>
> During XeTeX format initialization, the file load-unicode-math-classes.tex
> in that same directory is executed, in order to declare to the engine which
> Unicode code points belong to which TeX math classes.  The comments in that
> file say that the classes it pays attention to are those with the following
> Unicode math codes:
>
> % This file parses MathClass.txt, provided by the Unicode Consortium, and sets
> % up the following mapping between Unicode classes and TeX math types
> % - "L" (large)       \mathop
> % - "B" (binary)      \mathbin
> % - "V" (vary)        \mathbin
> % - "R" (relation)    \mathrel
> % - "O" (opening)     \mathopen
> % - "C" (closing)     \mathclose
> % - "P" (punctuation) \mathpunct
> % - "A" (alphabetic)  \mathalpha
>
> That means that there are 7 other Unicode math classes that are unaccounted
> for.
>
> Unfortunately, the documentation/comments don't say what happens to entries
> having these other Unicode math codes (N, D, F, G, S, U, and X).  Are they
> completely ignored, or are they mapped to one of the other eight codes that
> matches what TeX is interested in or only capable of handling?
>
> I can imagine that the space character, given Unicode math class 'S' in
> MathClass.txt, is ignored during this parse.  But what happens to the '¬'
> character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator).
> Surely the logical not sign is not being ignored during initialization of a
> Unicode-aware engine, yet the comments in load-unicode-math-classes.tex
> don't say one way or the other, and it appears to me that the parsing code
> is ignoring it.
>
> The ReadMe.md file
>
> is also deficient in answering this question.
>
> TIA,

Er, I thought the README was reasonably clear, ah well!

The other Unicode math classes don't really map directly to TeX ones, so
they are currently ignored. Suggestions for improvements here are of
course welcome.

Joseph


[XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Doug McKenna
Another question about Unicode-aware TeX engine (e.g., XeTeX) initialization 
files.

The Unicode Consortium provides a file, MathClass.txt, e.g.,

./texmf-dist/tex/generic/unicode-data/MathClass.txt

It contains a list of lines (and comments).  Field 0 of an entry line is a 
Unicode code point or a range of code points, and field 1 is a single ASCII 
character that declares the Unicode math class to which the code point or range 
of code points belongs.
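
To make the layout just described concrete, here is a minimal Python sketch that parses one entry line. It is an illustration only, not the loader's code. It assumes that fields are separated by semicolons, that a range is written with `..`, and that `#` starts a comment; the function name `parse_mathclass_line` is invented for this example.

```python
from typing import Optional, Tuple


def parse_mathclass_line(line: str) -> Optional[Tuple[range, str]]:
    """Parse a MathClass.txt-style line into (code point range, class letter).

    Assumed layout: 'XXXX;C' or 'XXXX..YYYY;C', with '#' starting a comment.
    Returns None for blank and comment-only lines.
    """
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2 or len(fields[1]) != 1:
        raise ValueError(f"unexpected entry: {line!r}")
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    # Field 0 becomes an inclusive range; field 1 is the class letter.
    return range(start, end + 1), fields[1]


for sample in ("00AC;U", "0030..0039;N", "# a comment line"):
    print(sample, "->", parse_mathclass_line(sample))
```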

Comments in that file say that there are (currently) 15 different Unicode math 
class codes:

#   N - Normal - includes all digits and symbols requiring only one form
#   A - Alphabetic
#   B - Binary
#   C - Closing - usually paired with opening delimiter
#   D - Diacritic
#   F - Fence - unpaired delimiter (often used as opening or closing)
#   G - Glyph_Part - piece of large operator
#   L - Large - n-ary or large operator, often takes limits
#   O - Opening - usually paired with closing delimiter
#   P - Punctuation
#   R - Relation - includes arrows
#   S - Space
#   U - Unary - operators that are only unary
#   V - Vary - operators that can be unary or binary depending on context
#   X - Special - characters not covered by other classes

During XeTeX format initialization, the file load-unicode-math-classes.tex in 
that same directory is executed, in order to declare to the engine which 
Unicode code points belong to which TeX math classes.  The comments in that 
file say that the classes it pays attention to are those with the following 
Unicode math codes:

% This file parses MathClass.txt, provided by the Unicode Consortium, and sets
% up the following mapping between Unicode classes and TeX math types
% - "L" (large)       \mathop
% - "B" (binary)      \mathbin
% - "V" (vary)        \mathbin
% - "R" (relation)    \mathrel
% - "O" (opening)     \mathopen
% - "C" (closing)     \mathclose
% - "P" (punctuation) \mathpunct
% - "A" (alphabetic)  \mathalpha

That means that there are 7 other Unicode math classes that are unaccounted for.
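
The count can be checked with a short Python sketch that restates the eight-class mapping from the comments above and subtracts it from the 15 class letters. The dictionary and function names are made up for this illustration; the numeric values are the standard TeX math class numbers 0 to 7. Returning `None` models "the loader makes no assignment", which, per the replies in this thread, leaves the character at the IniTeX default of \mathord.

```python
# Standard TeX math class numbers, as used in \mathcode and \Umathcode.
TEX_CLASS_NUMBER = {
    "\\mathord": 0, "\\mathop": 1, "\\mathbin": 2, "\\mathrel": 3,
    "\\mathopen": 4, "\\mathclose": 5, "\\mathpunct": 6, "\\mathalpha": 7,
}

# Mapping copied from the comments in load-unicode-math-classes.tex.
LOADER_MAP = {
    "L": "\\mathop", "B": "\\mathbin", "V": "\\mathbin", "R": "\\mathrel",
    "O": "\\mathopen", "C": "\\mathclose", "P": "\\mathpunct",
    "A": "\\mathalpha",
}

# The 15 class letters listed in the MathClass.txt comments.
ALL_UNICODE_MATH_CLASSES = "NABCDFGLOPRSUVX"

IGNORED = sorted(set(ALL_UNICODE_MATH_CLASSES) - set(LOADER_MAP))


def tex_math_type(unicode_class: str):
    """Return the TeX math type the loader assigns, or None if the class
    is ignored and the engine default stays in place."""
    return LOADER_MAP.get(unicode_class)


print("ignored classes:", IGNORED)
print("U (e.g. U+00AC NOT SIGN) ->", tex_math_type("U"))
print("A ->", tex_math_type("A"), TEX_CLASS_NUMBER[tex_math_type("A")])
```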

Unfortunately, the documentation/comments don't say what happens to entries 
having these other Unicode math codes (N, D, F, G, S, U, and X).  Are they 
completely ignored, or are they mapped to one of the other eight codes that 
matches what TeX is interested in or only capable of handling?

I can imagine that the space character, given Unicode math class 'S' in 
MathClass.txt, is ignored during this parse.  But what happens to the '¬' 
character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator).  
Surely the logical not sign is not being ignored during initialization of a 
Unicode-aware engine, yet the comments in load-unicode-math-classes.tex don't 
say one way or the other, and it appears to me that the parsing code is 
ignoring it.

The ReadMe.md file



is also deficient in answering this question.

TIA,


Doug McKenna