Would it be possible to make a slight enhancement to how the Unicode LaTeX
format is created, with respect to "Unicode-aware" engines (an unfortunately
somewhat ill-defined term)? Here's the situation:
My TeX/e-TeX language interpreter, currently called JSBox, is implemented as a
simple C library, so it can be incorporated into any other software (for
instance, I've recently created a Java class wrapper around it, hoping to soon
use it in an Android eBook/app).
JSBox is entirely Unicode-based internally; every TeX algorithm and data
structure has been enhanced to treat a "character" as a 21-bit quantity, rather
than an 8-bit byte. Unlike XeTeX, JSBox does not use TeX language machinery to
decode incoming UTF-8 byte sequences. That happens in JSBox at a lower
("transport") level where all the possible UTF streams (or older 8-bit
encodings) are converted to 21-bit Unicode characters before the language
scanner ever sees anything.
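To make that concrete, here is a minimal C sketch of what such transport-level
decoding amounts to. This is illustrative only, not JSBox's actual code; a
production decoder would also reject overlong 3- and 4-byte encodings and
surrogate code points:

#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from buf (len bytes available) into a
   21-bit code point *cp.  Returns the number of bytes consumed, or
   0 on a malformed sequence.  A sketch, not JSBox's actual code. */
size_t utf8_decode(const uint8_t *buf, size_t len, uint32_t *cp)
{
    if (len == 0) return 0;
    uint8_t b = buf[0];
    size_t n;                                  /* bytes in sequence */
    uint32_t c;
    if (b < 0x80) { *cp = b; return 1; }       /* ASCII fast path */
    else if (b < 0xC2) return 0;               /* stray continuation or overlong lead */
    else if (b < 0xE0) { n = 2; c = b & 0x1F; }
    else if (b < 0xF0) { n = 3; c = b & 0x0F; }
    else if (b < 0xF5) { n = 4; c = b & 0x07; }
    else return 0;                             /* lead byte beyond U+10FFFF */
    if (len < n) return 0;                     /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0; /* bad continuation byte */
        c = (c << 6) | (buf[i] & 0x3F);
    }
    *cp = c;                                   /* always fits in 21 bits */
    return n;
}

By the time the scanner runs, a "character" is just an integer holding a
21-bit value, and none of the TeX-level machinery needs to know how it
arrived.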
One of my goals, in the service of simplicity, is not to rely on dumped format
files. This means that prior to typesetting any document, JSBox must
initialize itself by reading in the source code for whatever format is desired.
I've published an eBook/app for iOS (called "Hilbert Curves") that uses the
JSBox library to typeset its simulated pages. At app launch, it reads in the
macros of "plain.tex" and of the "opmac.tex" markup macro library, and other
files, before executing the TeX source code for the 160-page book. All of this
takes a negligible amount of time from the user's perspective. One of my goals
now is to do something similar for a LaTeX document.
But LaTeX's source code is of course several orders of magnitude longer and
more complex than plain's and opmac's. I've been working on initializing a
LaTeX typesetting job simply by reading in "latex.ini". For what it's worth,
JSBox (configured to record statistics) reports that this parse of "latex.ini"
results in:
7863 macro definitions or re-definitions
(basically a count of all calls to \def, \edef, etc.).
According to Joseph Wright, who recently answered a question of mine posed
here, it takes somewhere between 2 and 3 seconds on his computer to initialize
the LaTeX format for the Unicode-aware XeTeX engine. In the TeX world, it
doesn't really matter how long this takes, since it is the post-parse memory
state that is saved into the binary format file distributed with TeX engines,
to be read back in later, presumably much faster, when a user starts a
typesetting job.
JSBox doesn't rely on any of that. On my 2.2GHz MacBook Pro laptop, JSBox
takes about 1.25 seconds to read "latex.ini" and all its subordinate files
(including some 85 different language hyphenation database files). But it
turns out that 60% of that time is due to executing the file
"load-unicode-data.tex". That file uses TeX macros to read and parse several
large Unicode Consortium files so as to set up various character mapping tables
(catcodes, upper- and lowercase characters, math characters, etc.). The TeX
macros that do this parsing are clever and concise, but they are far from
efficient. I've traced them, and the situation might be described as
"algorithmic churn": parsing and re-parsing and re-re-parsing the same lines.
In contradistinction, JSBox depends on separately preprocessing the various
Unicode Consortium data files (about 2MB of total text data) into a
100-times-smaller (20K) binary file that can be used at interpreter
initialization time. Parsing this binary file using the interpreter's own C
code takes only about 25 ms (1/40th of a second) to initialize JSBox's various
internal Unicode character mapping tables to their non-default values for all
the Unicode characters (code points). That's about 30 times faster than what
happens in "load-unicode-data.tex".
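To give a flavor of why the binary approach is so much faster, here is a
hypothetical C sketch of such an initialization pass. The record layout,
table names, and 4-byte little-endian encoding are invented for illustration
(JSBox's actual format is not public and is surely denser), but the principle
holds: fixed-size binary reads, no tokenizing, no macro expansion.

#include <stdint.h>
#include <stdio.h>

#define MAX_CP 0x110000   /* one past the last Unicode code point */

/* Sketch of an engine's mapping tables, indexed by code point.
   (A real engine would likely use sparser data structures.) */
static uint8_t  catcode[MAX_CP];
static uint32_t lccode[MAX_CP], uccode[MAX_CP], mathcode[MAX_CP];

/* Read a 4-byte little-endian unsigned integer; sets *eof on a
   short read.  The binary layout here is purely hypothetical. */
static uint32_t read_u32(FILE *f, int *eof)
{
    uint8_t b[4];
    if (fread(b, 1, 4, f) != 4) { *eof = 1; return 0; }
    return (uint32_t)b[0] | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Load one record per code point whose properties differ from the
   defaults: (codepoint, catcode, lccode, uccode, mathcode). */
static int load_unicode_tables(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    for (;;) {
        int eof = 0;
        uint32_t cp = read_u32(f, &eof);
        if (eof) break;                             /* clean end of file */
        if (cp >= MAX_CP) { fclose(f); return -1; } /* corrupt file */
        catcode[cp]  = (uint8_t)read_u32(f, &eof);
        lccode[cp]   = read_u32(f, &eof);
        uccode[cp]   = read_u32(f, &eof);
        mathcode[cp] = read_u32(f, &eof);
        if (eof) { fclose(f); return -1; }          /* truncated record */
    }
    fclose(f);
    return 0;
}

Even read naively like this, a pass over a 20K file touches so little data
that a figure like 25 ms is unsurprising.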
So ... It would be really great if there were a way to make the reading of
"load-unicode-data.tex" conditional in some way, so that it works exactly the
same way for XeTeX when building the Unicode LaTeX format, but allows other TeX
language interpreters (such as JSBox) to bypass this inefficient parse of
Unicode character files in favor of whatever the interpreter has otherwise
already done.
The solution, I think, is pretty easy.
"load-unicode-data.tex" already tests for certain compatibility conditions and
short-circuits itself accordingly. Its first executable lines are:
% The data can only be loaded by Unicode engines. Currently this is limited to
% XeTeX and LuaTeX, both of which define \Umathcode.
\ifx\Umathcode\undefined
\expandafter\endinput
\fi
% Just in case, check for the e-TeX extensions.
\ifx\eTeXversion\undefined
\expandafter\endinput
\fi
But the first of these tests is no longer a good one. JSBox is a Unicode/e-TeX
engine that does implement \Umathcode, yet it has neither the need nor the
desire to execute this file, because its mapping tables have *already* been
initialized before any TeX code is ever pushed onto its execution stack, just
as classic TeX's are for simple one-byte characters.
A solution is a dedicated, read-only "last_item" integer value, called, e.g.,
\Unicodedataloaded, whose existence or value prevents "load-unicode-data.tex"
(or similar) from being executed (further). The primitive doesn't even have to
have a value; the mere fact that it exists can be sufficient to test against.
So adding the following lines after the e-TeX test at the start of
"load-unicode-data.tex" would solve the problem, not just for JSBox, but for
any other future Unicode TeX engine faced with a similar situation.
% Give any Unicode engine the ability to initialize its mapping
% tables in its own way instead of relying on this file, as long
% as it implements a primitive named \Unicodedataloaded.
\ifdefined\Unicodedataloaded
\expandafter\endinput
\fi
For current XeTeX LaTeX format initialization, there should be no change to how
things are built.
I implemented this primitive today in JSBox (as a read-only value of 1), and
made the above change in my local copy of "load-unicode-data.tex". Executing
"latex.ini" now takes about .5 second, which is a considerable improvement over
1.25 seconds, certainly now within the bounds of what might be an acceptable
user experience typesetting a Unicode LaTeX document after reading the format's
source code.
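For anyone implementing another engine, the change on the engine side is tiny.
Here is a compilable C sketch of how a read-only "last item" integer like this
might be surfaced; the names and the command/subcode scheme are hypothetical,
not JSBox's actual internals:

#include <stdint.h>

/* Hypothetical subcodes for the engine's read-only "last item"
   integers (the class of commands behind \lastpenalty, \badness,
   \eTeXversion, and so on). */
enum last_item_code {
    LAST_ITEM_ETEX_VERSION,
    LAST_ITEM_UNICODE_DATA_LOADED,
    /* ... */
};

/* Called by the expression scanner when it encounters a token whose
   command code is "last item" -- i.e., from \ifnum, \the, \count
   assignments, etc.  \Unicodedataloaded simply reports 1, meaning
   the engine initialized its Unicode mapping tables itself. */
int32_t last_item_value(enum last_item_code code)
{
    switch (code) {
    case LAST_ITEM_UNICODE_DATA_LOADED:
        return 1;
    case LAST_ITEM_ETEX_VERSION:
        return 2;
    default:
        return 0;
    }
}

One nice property of making it a genuine "last_item" integer, rather than
something detectable only with \ifdefined, is that its value stays available
at the TeX level: \ifnum\Unicodedataloaded>0 also works, and a future revision
of an engine's built-in Unicode tables could be signaled by bumping the value.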
Are there any downsides to this minor change that I'm missing? Is there a
better name for the primitive? What can I do to encourage that the above test
be officially added to "load-unicode-data.tex"?
Doug McKenna
Mathemaesthetics, Inc.