Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?

2021-08-23 Thread Doug McKenna
George - 

Thanks! 

- Doug McK. 


From: "George N. White III"  
To: "xetex"  
Sent: Sunday, August 22, 2021 8:14:58 AM 
Subject: Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX 
installation? 

On Fri, 20 Aug 2021 at 22:46, Doug McKenna <d...@mathemaesthetics.com> wrote:



> Using XeTeX, I want to typeset a LaTeX document into a PDF file. The LaTeX
> source code in UTF-8 expressly includes the Unicode Replacement character
> (� = U+FFFD) (a black diamond with a question mark in it).
>
> I want to typeset it in this single document using a monospaced font in one
> place, and in another place in a variable-width font.
>
> I understand that XeTeX can take advantage of one's system's installed fonts,
> but my LaTeX file is being generated by another program that doesn't know what
> those fonts are or what glyphs they support. I simply want to guarantee that
> the fonts used are always available when processing that LaTeX file.
>
> I also understand that it's possible to synthesize the glyph graphically
> without using a font, but I'd rather not go that route.
>
> So ... What fixed-width and variable-width OpenType (or other) fonts, if any,
> are always distributed with TeX or TeXLive or whatever that one can rely upon
> to be available for placing this particular glyph in a final PDF file? What
> would be the correct incantation to doing so?

U+FFFD should be in any mainstream general-use OpenType font. It might be
better to ask which OpenType fonts to avoid due to low quality, bugs, lack of
ongoing support, etc. There has been a lot of churn in the available fonts
over the years, so the answer may be different if you need fonts that can be
expected to have long-term support and availability.

-- 
George N. White III 



Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?

2021-08-21 Thread Doug McKenna
Ulrike -

Excellent.  Thank you!

Using \setmainfont{DejaVuSerif.ttf} works on my non-Linux machine, and it is 
not listed as "installed" in my Mac's FontBook, which means it's being used 
solely within the TeXosystem.

DejaVuSerif is a variable-width font.  Is there a similar fixed-width 
OpenType/TrueType font distributed with TeXLive that would work?
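
For illustration, here is a minimal sketch of how both widths might be loaded by file name. It assumes that the same DejaVu package that provides DejaVuSerif.ttf also provides DejaVuSansMono.ttf in the TeX tree, and that this file contains a glyph for U+FFFD; neither assumption is confirmed in this thread.

```latex
% Compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
% Loading by file name lets the font be found in the TeX tree
% even when it is not installed as a system font.
\setmainfont{DejaVuSerif.ttf}
% Assumption: DejaVuSansMono.ttf is present in the same TeX installation.
\setmonofont{DejaVuSansMono.ttf}

\begin{document}
Variable width: \symbol{"FFFD}

\texttt{Fixed width: \symbol{"FFFD}}
\end{document}
```

If the monospaced file is missing, fontspec should stop with the same "cannot be found" error as before, so the assumption is easy to check on a given installation.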


Doug McKenna
Mathemaesthetics, Inc.



- Original Message -
From: "news3" 
To: "xetex" 
Sent: Saturday, August 21, 2021 10:39:25 AM
Subject: Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX  
installation?

Am Sat, 21 Aug 2021 09:25:14 -0600 (MDT) schrieb Doug McKenna:

> Thanks all for your interesting responses. 
> 
> Unfortunately, my possibly poorly worded question remains unanswered. Let me 
> try again. 
> 
> Consider the short example just used: 
> 
> \documentclass{article} 
> \usepackage{fontspec} 
> \setmainfont{DejaVu Serif} 
> 
> \begin{document} 
> fffd 
> \end{document} 
> 
> When I run it, fontspec complains that it can't find the font. So obviously 
> "DejaVu Serif" is not installed, either on my system or anywhere in the 
> bowels of all the ~150,000 TeXLive (2019) files that have been installed in 
> the TDS on my machine. 

No, it only says that it is not found by fontname. Something that
happens often on linux. Try with \setmainfont{DejaVuSerif.ttf}


> So, is there a font name I can use in the \setmainfont{} command
> that is ALWAYS available (upon TeX installation) when processing
> this LaTeX file with XeTeX? Or always available after a certain
> version of a TeX installation? 

I have no idea when DejaVu was added, but it is in TeX Live 2019.
If you also want to support older systems, you could test e.g. on Overleaf.



-- 
Ulrike Fischer 
http://www.troubleshooting-tex.de/


Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?

2021-08-21 Thread Doug McKenna
Thanks all for your interesting responses. 

Unfortunately, my possibly poorly worded question remains unanswered. Let me 
try again. 

Consider the short example just used: 

\documentclass{article} 
\usepackage{fontspec} 
\setmainfont{DejaVu Serif} 

\begin{document} 
fffd 
\end{document} 

When I run it, fontspec complains that it can't find the font. So obviously 
"DejaVu Serif" is not installed, either on my system or anywhere in the bowels 
of all the ~150,000 TeXLive (2019) files that have been installed in the TDS on 
my machine. 

So, is there a font name I can use in the \setmainfont{} command that is ALWAYS 
available (upon TeX installation) when processing this LaTeX file with XeTeX? 
Or always available after a certain version of a TeX installation? 

I want to automatically generate a LaTeX file that lets its user declare 
or override a default font for typesetting the Unicode Replacement character, 
but which doesn't require the user to search for or declare such fonts at all 
in the simplest case. One would think that every OpenType font supporting 
Unicode glyphs would include a glyph for U+FFFD, but it doesn't appear to me 
that that is the case. 
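
One way such a generated file might guard itself is sketched below. The macro names \replacementfont and \replacementchar are hypothetical, and DejaVuSerif.ttf is only the default discussed in this thread; a user could substitute another file name. The sketch relies on fontspec's \IfFontExistsTF and on the e-TeX primitive \iffontchar, and falls back to a plain question mark rather than raising an error.

```latex
% Compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
% Use the default font only if it can actually be found.
\IfFontExistsTF{DejaVuSerif.ttf}
  {\newfontfamily\replacementfont{DejaVuSerif.ttf}}
  {\let\replacementfont\relax}% otherwise stay in the current font
% Hypothetical helper: print U+FFFD if the selected font has the glyph,
% else print "?" so that the document still compiles.
\newcommand\replacementchar{%
  {\replacementfont
   \iffontchar\font"FFFD \char"FFFD \else ?\fi}}

\begin{document}
An undecodable byte was found here: \replacementchar
\end{document}
```

The group inside \replacementchar keeps the font switch local, so the surrounding text is unaffected.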

All I want is for the computer-generated LaTeX file "to just work" out of the 
box (so to speak), so that a naive user isn't faced with an error message the 
first time they typeset it with something like TeXShop. 

Doug McKenna 
Mathemaesthetics, Inc. 




[XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?

2021-08-20 Thread Doug McKenna
Using XeTeX, I want to typeset a LaTeX document into a PDF file. The LaTeX 
source code in UTF-8 expressly includes the Unicode Replacement character (� = 
U+FFFD) (a black diamond with a question mark in it). 

I want to typeset it in this single document using a monospaced font in one 
place, and in another place in a variable-width font. 

I understand that XeTeX can take advantage of one's system's installed fonts, 
but my LaTeX file is being generated by another program that doesn't know what 
those fonts are or what glyphs they support. I simply want to guarantee that 
the fonts used are always available when processing that LaTeX file. 

I also understand that it's possible to synthesize the glyph graphically 
without using a font, but I'd rather not go that route. 

So ... What fixed-width and variable-width OpenType (or other) fonts, if any, 
are always distributed with TeX or TeXLive or whatever that one can rely upon 
to be available for placing this particular glyph in a final PDF file? What 
would be the correct incantation to doing so? 

Thanks. 

Doug McKenna 
Mathemaesthetics, Inc. 



Re: [XeTeX] A LaTeX Unicode initialization desire/question/suggestion

2020-01-12 Thread Doug McKenna
loc() or whatever the equivalent might be on some 
system.  This makes it harder to create a \dump format file, though not 
impossible.  But it wouldn't be (or need to be) compatible with anything in the 
official TeX world.  Regardless, my goal is to see how far one can get without 
needing format files.  Also, see below.

>| The pressure to load more into a
>| format is likely to increase rather than decrease, people often
>| routinely make custom formats preloading large packages like tikz or
>| pstricks for example.

True, but there is a fundamental difference between what I'm working toward, 
and what the TeX infrastructure does.  In the TeX world, every job is a single 
process.  Every time a TeX job is done, a process is launched, the job gets 
done, and the program ends.  It's the Unix/command-line way.  So the format has 
to be loaded (fast) on every job.  Makes perfect sense.

But when your engine is just a library linked into another program that lives 
for a long time, perhaps measured in days, and when the user is running 
multiple jobs from the same program, then there ought to be a way to load the 
format from its source code >once<, and have it live in the engine's memory 
even while job after job is executing on top, with a clean-up after each job 
ends.  This is, after all, completely conformant with everyday use of TeX 
(edit...run job...edit...run job...), not to mention every other computer 
language.  I'm pretty sure that I've architected my code to allow this, 
although it's untested for now.  One step at a time.

>| As noted above, with latex-dev releases you are still going to need
>| the unicode data files to be read using tex macros.

Are these files read more than once, and if so, why?  If not, I don't 
understand why I'm still going to need to read them.

>| Before making any
>| changes to the tex macros you may want to do timings with the these
>| versions. It may be that you choose to reconsider not making (the
>| equivalent of) format files, as just saving the time for setting the
>| lccodes may be a less significant proportion of the startup time.

Agreed.

>| To be in the core tex macros we would need to have the engine
>| incorporated into texlive so that it could be tested as part of our
>| test suite and continuous integration tests.

That doesn't make sense to me.  Adding a couple of lines of code to 
"load-unicode.data.tex" and then determining with regression tests that 
absolutely nothing has changed doesn't involve any third party at all. 

>| However as already discussed in this thread there are several
>| possibilities for you to build something along those lines without
>| requiring any changes to the core macro files, so lack of change here
>| shouldn't be seen as a discouragement and anyway gives you more
>| flexibility with changing names etc while jsbox is being developed.

Duly noted.

>| Returning to your original question as to what constitutes a "Unicode"
>| TeX for LaTeX, we have put some data on the requirements  for extended
>| TeX features in the draft ltnews31 which will be part of next week's
>| latex-dev release, but you can see the sources now at
>| 
>| Primitive Requirements:
>| https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L596
>| 
>| see also
>| 
>| Improved load-times for expl3:
>| https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L169
>| 
>| on the additional items preloaded in the format.

Many thanks!  This is very helpful.


Doug McKenna
Mathemaesthetics, Inc.


Re: [XeTeX] [EXT] A LaTeX Unicode initialization desire/question/suggestion

2020-01-12 Thread Doug McKenna
Phil Taylor wrote: 

>| So because JSBox is required/designed to incorporate all of XeTeX's 
>| features, it must (by definition) implement/provide \Umathcode. 

Just to be clear, JSBox can eventually incorporate all of XeTeX's features 
(primitives), but does not do so now. It doesn't even incorporate pdfTeX's 
features, but it is set up to. I'm merely adding XeTeX features as necessary to 
get the LaTeX macro library installed and then typeset a LaTeX document 
containing no Unicode at all. The problem is that somewhere in the LaTeX format 
initialization the ability to recognize a Unicode character (as opposed to a 
UTF-8 byte sequence) is equated with the assumption that it's being run under 
XeTeX, and that therefore at least some of XeTeX's features are there and can 
be relied upon at format initialization time. 

>| But could not JSbox perform (or simulate) the following : 

>| \let \Umathschar = \Umathchar % use British spelling as synonym 
>| \let \Umathchar = \undefined % inhibit "load-unicode-data.tex"'s special treatment of engines that implement \Umathchar 
>| \input load-unicode-data % since it would seem that you cannot simply skip this step 
>| \let \Umathchar = \Umathschar % restore canonical meaning of \Umathchar 

It could, but it's not my code that's issuing "\input load-unicode-data". The 
reading of "load-unicode-data.tex" is embedded within my version of LaTeX's own 
initialization code, and there's no guarantee that elsewhere in that code there 
isn't some dependence on \Umathchar that such a re-definition might interfere 
with. LaTeX's code has several tests that rely on whether |\Umathchar| is 
defined or not, and even in the latest versions, it is declared that \Umathchar 
existence is the official way to test. Indeed, the latest official comments, as 
David Carlisle brought to my attention in this thread, declare that \Umathchar 
existence testing is the current way to go in all sorts of places. 

Such negative "let's fool some other code to get something done" hacks are 
fragile because they render the other, affected TeX code impossible to 
understand when reading it. Far better and safer is an affirmative addition to 
the various checks already being made that facially means what it says: if 
Unicode character mapping data has been loaded, don't bother. 

Here is perhaps a slightly better hack: 

If it's acceptable as the very first executable line in latex.ltx (or other 
format source files) to test the catcode value of `{ to determine whether a 
format has already been loaded or not, then it should be acceptable within 
"load-unicode-data.tex" (or the like) to include a similar test to determine 
whether to proceed with the TeX parse of the Unicode data, or to bail because 
it's presumable that the tables are already initialized. For example, the first 
non-8-bit Unicode character is: 

0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;

It is safe, I think, to assume that this Unicode character will forever be 
classified as an uppercase letter (with a lowercase mapping value of U+0101). 

When the XeTeX engine begins running, before any TeX source code is 
interpreted, the engine initializes its internal |cat_code| array (all 
1,114,112 slots) with the value |other_char| (12). It then does the usual 
classic TeX initialization to declare ASCII letters as such, etc. Later, during 
the LaTeX format's reading of "load-unicode-data.tex", a simple test to 
determine whether to continue reading the file could be made based on whether 
the catcode value of U+0100 is 11 (letter) or 12 (other). If it's already known 
as a letter, then the catcode table is not in its initial default state, and a 
second initialization is unnecessary. If it's still an |other_char| (12), then 
things need initializing for letter characters and the rest of 
"load-unicode-data.tex" should be executed. 
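
The catcode test described above could be sketched as follows. This is a hypothetical guard, not something present in the distributed "load-unicode-data.tex", and it assumes that U+0100 keeps catcode 12 until the Unicode data has been processed.

```latex
% Hypothetical guard near the top of load-unicode-data.tex.
% If U+0100 is already a letter (catcode 11), the tables are assumed
% to be initialized, so stop reading this file.
\ifnum\catcode"0100=11
  \expandafter\endinput
\fi
```

The \expandafter closes the conditional before \endinput takes effect, so no \fi is left pending when the file ends.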

>>| Furthermore, the purpose of executing "load-unicode-data.tex" is precisely to 
>>| populate the \Umathchar table, as well as other Unicode character tables. 
>>| So these tables have to exist prior to executing the file. 

>| Well, do they, in the case of JSBox? From what you wrote in your original 
>| query, I thought that that [1] was the very thing that you were trying to avoid ... 
>| [1] "executing "load-unicode-data.tex" [in order] to populate the \Umathchar table". 
>| So specifically, does the \Umathchar table have to exist, in JSBox, at the point 
>| that "load-unicode-data.tex" is loaded ? 

I'm trying to avoid initializing these character mapping tables twice, 
especially when the second time (reading this file) rather inefficiently takes 
30 times longer than the first, and accomplishes nothing new. 

Thanks for thinking about my questions, I appreciate it. 

Doug McKenna 



Re: [XeTeX] [EXT] A LaTeX Unicode initialization desire/question/suggestion

2020-01-11 Thread Doug McKenna
Phil Taylor wrote:

>| How about delaying the definition of \Umathcode until after 
>| "load-unicode-data.tex" has been processed ?  Is that possible, and 
>| would it have undesirable side-effects ?

\Umathcode is a XeTeX primitive extension to the TeX language in the service of 
solving a problem in classic TeX, which was that the machinery and syntax of 
the classic TeX \mathchar primitive could not handle 21-bit Unicode values, or 
more than 16 math font families, etc.  So \Umathchar (and a bunch of other 
related extensions all starting with 'U') is defined in XeTeX's WEB source 
code; it exists the moment the engine is launched.  Thus it's not possible to 
delay \Umathchar's definition.

Furthermore, the purpose of executing "load-unicode-data.tex" is precisely to 
populate the \Umathchar table, as well as other Unicode character tables.  So 
these tables have to exist prior to executing the file.  Perhaps I'm 
misunderstanding your question.

In any case, my point is that a TeX engine interested in initializing itself as 
fast as possible (using a different form of the exact same official Unicode 
character data) should be able to avoid processing "load-unicode-data.tex" 
altogether, because doing so ends up being a completely redundant waste of time 
(and, depending upon implementation, space).  XeTeX does not have to care about 
this, but other Unicode engines, certainly the one I'm working on, will care.

A couple of lines of TeX code added to the file appears to me to solve the 
problem, with no downside to creating the XeTeX LaTeX format.


Doug McKenna
Mathemaesthetics, Inc.


[XeTeX] A LaTeX Unicode initialization desire/question/suggestion

2020-01-10 Thread Doug McKenna
t \Umathcode but has no need nor desire 
to execute this file because JSBox's mapping tables have *already* been 
initialized before any TeX code is ever pushed onto its execution stack, the 
same as classic TeX does for simple one-byte characters.

A solution is a dedicated, read-only "last_item" integer value, called, e.g., 
\Unicodedataloaded, whose existence or value prevents "load-unicode-data.tex" 
(or similar) from being executed (further).  The primitive doesn't even have to 
have a value, the fact that it exists can be sufficient to test against.  So 
adding the following lines after the eTeX test at the start of 
"load-unicode-data.tex" would solve the problem, not just for JSBox, but for 
any other future Unicode TeX engine faced with a similar situation.

% Give any Unicode engine the ability to initialize its mapping
% tables in its own way instead of relying on this file, as long
% as it implements a primitive named \Unicodedataloaded.
\ifdefined\Unicodedataloaded
  \expandafter\endinput
\fi

For current XeTeX LaTeX format initialization, there should be no change to how 
things are built.

I implemented this primitive today in JSBox (as a read-only value of 1), and 
made the above change in my local copy of "load-unicode-data.tex".  Executing 
"latex.ini" now takes about 0.5 seconds, which is a considerable improvement over 
1.25 seconds, certainly now within the bounds of what might be an acceptable 
user experience typesetting a Unicode LaTeX document after reading the format's 
source code.

Are there any downsides to this minor change that I'm missing?  Is there a 
better name for the primitive?  What can I do to encourage that the above test 
be officially added to "load-unicode-data.tex"?


Doug McKenna
Mathemaesthetics, Inc.


[XeTeX] Clarification on XeTeX documentation

2019-12-12 Thread Doug McKenna
Two questions:


Question #1:

In the latest document describing XeTeX extensions, dated 2019-12-09, for 
instance, at

<https://ctan.math.illinois.edu/info/xetexref/xetex-reference.pdf>,

in section 2.3 "Maths fonts" (currently on page 14), the following sentence 
needs clarification:

>| In the following commands, ⟨fam.⟩ is a number (0–255) representing
>| font to use in maths. ⟨math type⟩ is the 0–7 number corresponding to
>| the type of math symbol ...

But ⟨fam.⟩ is not a font number (or index).  As ⟨fam.⟩ denotes, it is a font 
family number (or index), where each font family represents a triplet of loaded 
fonts, one each for text, script, and scriptscript situations.
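
As a small plain TeX illustration of that triplet (the family name \demofam is hypothetical; the fonts are the ones plain.tex preloads):

```latex
% Plain TeX fragment; \newfam, \tenrm, \sevenrm and \fiverm come from plain.tex.
\newfam\demofam                    % allocate a family number
\textfont\demofam=\tenrm           % font for text and display styles
\scriptfont\demofam=\sevenrm       % font for script style
\scriptscriptfont\demofam=\fiverm  % font for scriptscript style
$\fam\demofam x^{x^{x}}$           % one family number selects all three fonts
\bye
```

A single family number therefore stands for three fonts, which is why calling it a font number is misleading.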

And throughout other TeX documentation, the word "class" is used to describe 
the purpose of a math character, a 3-bit number between 0 and 7.

I suggest this be amended to read:

In the following commands, ⟨fam.⟩ is a number (0–255) of the math font family. 
⟨math type⟩ is the 0–7 number corresponding to the class of math symbol ...


Question #2:

Later on, in various syntax declarations, e.g.,

>| \Umathcode⟨char slot⟩ [=] ⟨math type⟩ ⟨fam.⟩ ⟨glyph slot⟩

one finds the term ⟨glyph slot⟩.  This is curious, because XeTeX's source code 
parses this argument as an integer, using a procedure named scan_usv_num ("usv" 
stands for Unicode scalar value).  That routine complains about any value 
outside the Unicode range of 0 to "10FFFF as illegal.

But glyph slot is a term usually used to describe the innards of a font, and is 
not the same as a Unicode character/code point/scalar value, which the font 
would internally map to a glyph slot (or index).  Also, every OpenType font is 
limited to no more than 2^{16} (65536) glyph slots, so it's concerning that 
this routine accepts a number that is outside of that range.

If this is the case, another problem is that it is then formally possible that 
a font contains a glyph whose internal slot number, for example, might be "D800 
(a legal 16-bit value that scan_usv_num won't complain about).  But "D800 is 
not a legal Unicode character value, it's a high-surrogate value for forming a 
full 21-bit Unicode character value with another low surrogate value.  "D800 
might be a Unicode scalar value, but it is not a character value.

So my question is:

What is a proper legal value for a ⟨glyph slot⟩?  Alternatively, should ⟨glyph slot⟩ be changed in this documentation to something less ambiguous, such as 
 or  or ?


Doug McKenna
Mathemaesthetics, Inc.



[XeTeX] How much time to build LaTeX format for XeTeX

2019-12-05 Thread Doug McKenna
Given all the parsing of the Unicode character data files during INITEX, and 
all the inputting and creation of the hyphenation trees, how much CPU time 
elapses while building the XeTeX format file for LaTeX?  I'm going to assume 
that the time spent writing out the format at the final \dump command is 
negligible, though I don't really know.

- Doug McKenna


Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-12-02 Thread Doug McKenna
Joseph -

A similar ambiguity occurs later in the README.md file.  It says

- \Umathcode for all letters as TeX class 7 (var)

Does "letters" mean those code points on the TeX side with \catcode 11, or 
those Unicode code points labeled with 'L' in UnicodeData.txt?

If the former, then combining marks (Unicode 'M') should be entered into 
\Umathcode as TeX class 7; if the latter, then presumably not, though it's not 
clear why a math variable name can't have a combining mark.

- Doug McKenna



Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-12-01 Thread Doug McKenna
Joseph Wright wrote:

>| Er, I thought the README was reasonably clear, ah well!

Here's an example of something that's not so clear to me.

The README.md file displayed at

  <https://ctan.org/tex-archive/macros/generic/unicode-data>

says

- \lccode and/or \uccode for non-letter code points
  for which an upper or lower case mapping is given

The problem with this is that earlier, it is stated that all combining mark 
code points (class code starting with 'M' in the UnicodeData.txt file) are to 
be considered letters (\catcode set to 11).  So there's an ambiguity here that 
needs clearing up.  Does the above apply to combining mark code points or not?

It may be that none of the combining marks in the data file have any case 
mappings, but there's no guarantee that is true.  So the question is, if a 
combining mark has an uppercase or lowercase mapping, does that get installed 
in \lccode and/or \uccode?

Also, there's a confusing typo ("can"?) in

- \lccode and \uccode for all of class "Lt" (title
  case letters) to the lower can upper case mappings
  (or if not given to the code point itself)

Should "can" be "and/or"?


Doug McKenna


Re: [XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Doug McKenna
Ross wrote: 

>| If by ignoring you mean removing the character entirely, then that is surely not best at all. 
>| 
>| Most N Class (Normal) characters would be simply of the default \mathord class. 

The parsing code in load-unicode-math-classes.tex installs values in the 
\Umathcode table that comport with some rule, which without too much of a close 
look seems to me to be whether the character code math class read from 
MathClass.txt is one of the eight possibilities that parsing code pays 
attention to, out of the 15 possible ones in the file. Therefore it appears to 
me that all entries in MathClass.txt that are marked with, for instance, 'N', 
are ignored with respect to installing any entry in the \Umathcode table. 

It may be that such characters in MathClass.txt marked with 'N' take on the 
\mathord attribute by default when TeX finds them within math mode; I'm not 
sure without looking at its code. 

Doug McKenna 



From: "Ross Moore"  
To: "xetex"  
Sent: Wednesday, November 27, 2019 5:16:44 PM 
Subject: Re: [XeTeX] Math class initialization in Unicode-aware engine 

Hi Joe, Doug 




On 28 Nov 2019, at 10:27 am, Joseph Wright <joseph.wri...@morningstar2.co.uk> wrote: 




>> # N - Normal - includes all digits and symbols requiring only one form 
>> # D - Diacritic 
>> # F - Fence - unpaired delimiter (often used as opening or closing) 
>> # G - Glyph_Part - piece of large operator 
>> # S - Space 
>> # U - Unary - operators that are only unary 
>> # X - Special - characters not covered by other classes 
>> 
>> Unfortunately, the documentation/comments don't say what happens to entries 
>> having these other Unicode math codes (N, D, F, G, S, U, and X). Are they 
>> completely ignored, or are they mapped to one of the other eight codes that 
>> matches what TeX is interested in or only capable of handling? 
>> 
>> I can imagine that the space character, given Unicode math class 'S' in 
>> MathClass.txt, is ignored during this parse. But what happens to the '¬' 
>> character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator). 
>> Surely the logical not sign is not being ignored during initialization of a 
>> Unicode-aware engine, yet the comments in load-unicode-math-classes.tex don't 
>> say one way or the other, and it appears to me that the parsing code is 
>> ignoring it. 
> 
> The other Unicode math classes don't really map directly to TeX ones, so 
> they are currently ignored. Suggestions for improvements here are of 
> course welcome. 


If by ignoring you mean removing the character entirely, then that is surely 
not best at all. 

Most N Class (Normal) characters would be simply of the default \mathord class. 

I’d expect others to be mapped instead into a macro that corresponds to 
something that TeX does support, e.g. space characters for thinspace, 
2-em space, etc. in U+2000 – U+200A can expand into things like: 
\, \; \> \quad \qquad etc. (even to constructions like \mskip1mu) 

After all, this is essentially what happens when pdfTeX reads raw Unicode 
input. 

The G class (Glyph_Part) is a lot harder, as those glyph parts don’t 
correspond to any single TeX macro. Think about a very large opening brace 
spanning 3+ ordinary line widths, say, as may be generated by 
\left\{ ... \right\} surrounding some (inner-) displayed math alignment. 
On input, the whole grouping would need to be identified and mapped to 
appropriate TeX coding. 

Basically there is a lot here that needs to be looked at more or less 
individually. 

I’ve been through this kind of exercise, in reverse, to decide what to 
specify as /Alt and /ActualText replacements (for accessibility) for what 
TeX produces with various math constructions. I don’t have definitive 
answers for everything, but have tried some possibilities for many things. 


> Joseph 



Hope this helps. 

Ross 


Dr Ross Moore 
Department of Mathematics and Statistics 
12 Wally’s Walk, Level 7, Room 734 
Macquarie University, NSW 2109, Australia 
T: +61 2 9850 8955 | F: +61 2 9850 8114 
M: +61 407 288 255 | E: ross.mo...@mq.edu.au 
http://www.maths.mq.edu.au 




[XeTeX] Math class initialization in Unicode-aware engine

2019-11-27 Thread Doug McKenna
Another question about Unicode-aware TeX engine (e.g., XeTeX) initialization 
files.

The Unicode Consortium provides a file, MathClass.txt, e.g.,

./texmf-dist/tex/generic/unicode-data/MathClass.txt

It contains a list of lines (and comments).  Field 0 of an entry line is a 
Unicode code point or a range of code points, and field 1 is a single ASCII 
character that declares the Unicode math class to which the code point or range 
of code points belongs.

Comments in that file say that there are (currently) 15 different Unicode math 
class codes:

#   N - Normal - includes all digits and symbols requiring only one form
#   A - Alphabetic
#   B - Binary
#   C - Closing - usually paired with opening delimiter
#   D - Diacritic
#   F - Fence - unpaired delimiter (often used as opening or closing)
#   G - Glyph_Part - piece of large operator
#   L - Large - n-ary or large operator, often takes limits
#   O - Opening - usually paired with closing delimiter
#   P - Punctuation
#   R - Relation - includes arrows
#   S - Space
#   U - Unary - operators that are only unary
#   V - Vary - operators that can be unary or binary depending on context
#   X - Special - characters not covered by other classes

During XeTeX format initialization, the file load-unicode-math-classes.tex in 
that same directory is executed, in order to declare to the engine which 
Unicode code points belong to which TeX math classes.  The comments in that 
file say that the classes it pays attention to are those with the following 
Unicode math codes:

% This file parses MathClass.txt, provided by the Unicode Consortium, and sets
% up the following mapping between Unicode classes and TeX math types
% - "L" (large)   \mathop
% - "B" (binary)  \mathbin
% - "V" (vary)\mathbin
% - "R" (relation)\mathrel
% - "O" (opening) \mathopen
% - "C" (closing) \mathclose
% - "P" (punctuation) \mathpunct
% - "A" (alphabetic)  \mathalpha

That means that there are 7 other Unicode math classes that are unaccounted for.

Unfortunately, the documentation/comments don't say what happens to entries 
having these other Unicode math codes (N, D, F, G, S, U, and X).  Are they 
completely ignored, or are they mapped to one of the other eight codes that 
matches what TeX is interested in or only capable of handling?

I can imagine that the space character, given Unicode math class 'S' in 
MathClass.txt, is ignored during this parse.  But what happens to the '¬' 
character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator).  
Surely the logical not sign is not being ignored during initialization of a 
Unicode-aware engine, yet the comments in load-unicode-math-classes.tex don't 
say one way or the other, and it appears to me that the parsing code is 
ignoring it.
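
If such a character is indeed skipped, a format or package could still give it a math code explicitly. The fragment below is only a sketch of what such an assignment might look like, using the \Umathcode syntax from the XeTeX reference. The choice of class 0 (ordinary) and family 0 is an assumption for illustration, not what the unicode-data files do.

```latex
% Run with plain xetex.
% Assumption: treat U+00AC as an ordinary symbol (class 0) from family 0.
\Umathcode"00AC = 0 0 "00AC
% \Umathcodenum without an assignment reads back the bit-packed value.
\message{U+00AC packed math code: \the\Umathcodenum"00AC}
\bye
```

Whether a glyph is actually available then depends on the fonts assigned to the chosen family.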

The ReadMe.md file

<https://ctan.org/tex-archive/macros/generic/unicode-data>

is also deficient in answering this question.

TIA,


Doug McKenna





Re: [XeTeX] Lowercase Unicode code points in hyphenation patterns

2019-11-24 Thread Doug McKenna
Is xgreek.sty loaded as part of creating the LaTeX format?  If not, my 
understanding is that its corrections wouldn't affect any of the hyphenation 
patterns installed from xetex.ini during the format build.

Perhaps this doesn't matter.

- Doug McKenna


- Original Message -
From: "David Carlisle" 
To: "Apostolos Syropoulos" , "xetex" 
Sent: Sunday, November 24, 2019 11:48:32 AM
Subject: Re: [XeTeX] Lowercase Unicode code points in hyphenation patterns

On Sun, 24 Nov 2019 at 18:41, Apostolos Syropoulos via XeTeX
 wrote:
> Of course these tables are all wrong but this is another problem.

Yes there is that.

However it seems better to start from a known standardised base shared
with basically everyone then fix as needed rather than try to come up
with a tex-specific set of mappings covering the whole Unicode code
range and having to document and maintain them and extend each year as
more characters are added.

> I have added the correct \uccodes and \lccodes in xgreek.sty

thanks:-)

David


[XeTeX] Lowercase Unicode code points in hyphenation patterns

2019-11-23 Thread Doug McKenna
When the LaTeX format is built, there are tests for whether or not a 
Unicode-aware TeX engine is doing the work.  I presume that XeTeX is such a 
Unicode-aware engine, though I'm not familiar with what the definition of 
"Unicode-aware TeX engine" actually is (separate issue).

During the input of various hyphenation pattern files (a group for each 
language code), the first such file that uses non-ASCII Unicode code points is 
for Ancient Greek, in the file

/usr/local/texlive/2017/texmf-dist/tex/generic/hyph-utf8/patterns/texhyph-grc.tex

at line 61, which starts out

α1 ε1 η1 ι1 ο1 υ1 ω1 ϊ1 ...

TeX's code and specification say that only lowercase letters can appear in 
pattern words, and the definition within TeX's source code of a lowercase 
letter is any entry in the \lccode table that, when indexed by a character, 
delivers itself.
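
That definition can be written out as a short check. The sketch below is Python, not TeX, and it fakes the \lccode table with Python's own Unicode data. That is an assumption; the engine's table may be filled quite differently. lccode and is_pattern_letter are hypothetical names.

```python
import unicodedata

def lccode(ch):
    """Stand-in for the lccode table: lowercase code point, or 0 for non-letters."""
    if not unicodedata.category(ch).startswith("L"):
        return 0
    lower = ch.lower()
    # Characters whose lowercase form is more than one code point are treated
    # as having no entry (a simplification made for this sketch).
    return ord(lower) if len(lower) == 1 else 0

def is_pattern_letter(ch):
    """True if the table entry indexed by the character is the character itself."""
    return lccode(ch) == ord(ch)

# Start of the pattern line quoted above.
line = "α1 ε1 η1 ι1 ο1 υ1 ω1 ϊ1"
letters = [c for c in line if not c.isdigit() and not c.isspace()]
print(all(is_pattern_letter(c) for c in letters))
print(is_pattern_letter("\u0391"))  # GREEK CAPITAL LETTER ALPHA
```

With Python's data every letter on the quoted line passes the test and the capital alpha does not, so a table derived from the Unicode case mappings would be enough to accept that pattern file.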

But as near as I can tell, during the building of the LaTeX format (i.e., 
running "latex.ltx") there is no TeX source code that installs any of these 
Greek letters into the \lccode table.  Therefore, I'm concluding that the XeTeX 
engine does this itself when it initializes, rather than awaiting any TeX 
source code to do it.

But there are a whole lot of lowercase letters in Unicode, so I'm wondering how 
XeTeX determines legal lowercase letters for initial pattern files?

I've tried looking at some version of the xetex.web code, but without 
illumination, I'm afraid.

TIA,

Doug McKenna





Re: [XeTeX] Hyphenation of strings of more than 63 characters

2016-03-15 Thread Doug McKenna
Simply changing the character count constant could cause some subtle 
problems.

In particular, the allocation size of a "whatsit" language node might also need 
changing, which would require adjusting other code in the core engine that 
assumes a default small size for that language node sub-type of a "whatsit".

Or not.  I can't tell from the TeX source what the bit sizes of these node 
fields are.  But if they're too small to fit a pair of enhanced character count 
limits for hyphenation, there will likely be bugs elsewhere due to truncation 
or wraparound in the arithmetic.
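
The wraparound worry is easy to see in miniature. This is a minimal Python sketch, assuming a purely hypothetical 6-bit field (enough for 0..63) and an 8-bit one. The real field widths in the engine are not known here, and store_in_field is an invented helper.

```python
def store_in_field(value, bits):
    """Keep only the low `bits` bits, as a fixed-width unsigned field would."""
    return value & ((1 << bits) - 1)

# Candidate limits mentioned in this thread, stored in 6-bit and 8-bit fields.
for limit in (63, 64, 255, 1000):
    print(limit, store_in_field(limit, 6), store_in_field(limit, 8))
```

With these assumed widths, 64 stored in six bits comes back as 0, and 1000 stored in eight bits comes back as 232, with no error raised in either case.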

FWIW,

Doug McKenna



- Original Message -
From: "Peter Mukunda Pasedach" <peter.pased...@googlemail.com>
To: "XeTeX (Unicode-based TeX) discussion." <xetex@tug.org>
Sent: Tuesday, March 15, 2016 9:13:08 AM
Subject: Re: [XeTeX] Hyphenation of strings of more than 63 characters

Dear Jonathan,

yes, recompiling xetex is fine!

At 255 characters I still have 32 occurrences left, at 500 two, and at
1000 zero. Thanks for looking into this!
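
Counts of this kind can be reproduced with a few lines of Python. The sketch assumes that a "string" is a whitespace-delimited run and that its length is measured in Unicode code points. TeX's own notion of a hyphenatable word is narrower, so the numbers need not match. count_long_strings is a hypothetical helper and the sample text is synthetic.

```python
def count_long_strings(text, limit):
    """Count whitespace-delimited strings longer than `limit` code points."""
    return sum(1 for word in text.split() if len(word) > limit)

# Synthetic sample: strings of 64, 63 and 300 characters.
sample = "a" * 64 + " " + "b" * 63 + " " + "c" * 300
for limit in (63, 255, 1000):
    print(limit, count_long_strings(sample, limit))
```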

Peter

On Tue, Mar 15, 2016 at 3:46 PM, Jonathan Kew <jfkth...@gmail.com> wrote:
> On 15/3/16 14:24, Peter Mukunda Pasedach wrote:
>>
>> Dear XeTeX list,
>>
>> I am dealing with a collection of texts in Sanskrit, for which the
>> builtin limitation of TeX to not perform hyphenation after the 63rd
>> character of a string is imposing a serious limitation, as such
>> strings do occur. One reason for this is that one can freely form very
>> long compounds, another one is sandhi, in which due to euphonic
>> changes ending and beginning vowels fuse, another one that in Indic
>> scripts if one word ends in a consonant and the next one starts with a
>> vowel they are written together, another reason can be that scribes
>> simply do not use spaces consistently. Thus in the collection of texts
>> that I'm working on, currently comprising 37 files, strings of more
>> than 63 characters occur 1823 times.
>>
>> Is this limitation of 63 characters just an odd remnant of the time
>> TeX was written in, then necessary because of hardware limitations, or
>> does it still make sense? Is there a reasonable way to remove it, or
>> set it significantly higher?
>
>
> I suspect (without actually checking the code) that it would be fairly
> trivial to make it significantly higher (less so to remove it entirely; but
> something like 255 or even 1000-plus would probably be simple).
>
> A change like this would need to be optional, however, so that the
> typesetting of existing documents does not change unless the user
> deliberately chooses the modified behavior.
>
> It's probably too late to be adding a new feature for the TL'16 release; are
> you prepared to recompile xetex yourself from source in order to make such a
> change?
>
> JK
>
>
>
> --
> Subscriptions, Archive, and List information, etc.:
>  http://tug.org/mailman/listinfo/xetex




Re: [XeTeX] Latin Modern, from TFM to Unicode

2013-06-12 Thread Doug McKenna
 the font what to draw.  Since there's no mapping from 
Unicode, then the outside process either needs to know the absolute glyph 
IDs inside the font, or it needs to cause the font to go into some 
internal construction mode, like building a ligature, where the font 
itself knows the sequence and position of the glyphs to use to construct 
the tall symbol.  The latter seems impossible, because the font can't 
know the threshold height at which to stop construction.  The former 
means hard-coding internal glyph IDs somewhere outside the font, which 
I'm hoping is not fragile, but worry it might be.

Sorry for the reams of details, but I'm trying to explain my confusion 
exactly.


Doug McKenna


