Would it be possible to make a slight enhancement to how the Unicode LaTeX
format is created, with respect to "Unicode-aware" engines (an unfortunately
somewhat ill-defined term)? Here's the situation:
My TeX/e-TeX language interpreter, currently called JSBox, is implemented as a
simple C library, so it can be incorporated into any other software (for
instance, I've recently created a Java class wrapper around it, hoping to soon
use it in an Android eBook/app).
JSBox is entirely Unicode-based internally; every TeX algorithm and data
structure has been enhanced to treat a "character" as a 21-bit quantity, rather
than an 8-bit byte. Unlike XeTeX, JSBox does not use TeX language machinery to
decode incoming UTF-8 byte sequences. That happens in JSBox at a lower
("transport") level where all the possible UTF streams (or older 8-bit
encodings) are converted to 21-bit Unicode characters before the language
scanner ever sees anything.
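To make that concrete, here is a minimal C sketch of what such transport-level
decoding amounts to. This is illustrative only, not JSBox's actual code; a
production decoder would also reject overlong 3- and 4-byte encodings and
surrogate code points:

#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from buf (len bytes available) into a
   21-bit code point *cp.  Returns the number of bytes consumed, or
   0 on a malformed sequence.  A sketch, not JSBox's actual code. */
size_t utf8_decode(const uint8_t *buf, size_t len, uint32_t *cp)
{
    if (len == 0) return 0;
    uint8_t b = buf[0];
    size_t n;                                  /* bytes in sequence */
    uint32_t c;
    if (b < 0x80) { *cp = b; return 1; }       /* ASCII fast path */
    else if (b < 0xC2) return 0;               /* stray continuation or overlong lead */
    else if (b < 0xE0) { n = 2; c = b & 0x1F; }
    else if (b < 0xF0) { n = 3; c = b & 0x0F; }
    else if (b < 0xF5) { n = 4; c = b & 0x07; }
    else return 0;                             /* lead byte beyond U+10FFFF */
    if (len < n) return 0;                     /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0; /* bad continuation byte */
        c = (c << 6) | (buf[i] & 0x3F);
    }
    *cp = c;                                   /* always fits in 21 bits */
    return n;
}

By the time the scanner runs, a "character" is just an integer holding a
21-bit value, and none of the TeX-level machinery needs to know how it
arrived.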
One of my goals, in the service of simplicity, is not to rely on dumped format
files. This means that prior to typesetting any document, JSBox must
initialize itself by reading in the source code for whatever format is desired.
I've published an eBook/app for iOS (called "Hilbert Curves") that uses the
JSBox library to typeset its simulated pages. At app launch, it reads in the
macros of "plain.tex" and of the "opmac.tex" markup macro library, and other
files, before executing the TeX source code for the 160-page book. All of this
takes a negligible amount of time from the user's perspective. One of my goals
now is to do something similar for a LaTeX document.
But LaTeX's source code is of course several orders of magnitude longer and
more complex than plain's and opmac's. I've been working on initializing a
LaTeX typesetting job simply by reading in "latex.ini". For what it's worth,
JSBox (configured to record statistics) reports that this parse of "latex.ini"
results in:
7863 macro definitions or re-definitions
(basically a count of all calls to \def, \edef, etc.).
According to Joseph Wright, who recently answered a question of mine posed
here, it takes somewhere between 2 and 3 seconds on his computer to initialize
the LaTeX format for the Unicode-aware XeTeX engine. In the TeX world, it
doesn't really matter how long this takes, since it is the post-parse memory
state that is saved into the binary format file distributed with TeX engines,
to be read back in later, presumably much faster, when a user starts a
typesetting job.
JSBox doesn't rely on any of that. On my 2.2GHz MacBook Pro laptop, JSBox
takes about 1.25 seconds to read "latex.ini" and all its subordinate files
(including some 85 different language hyphenation database files). But it
turns out that 60% of that time is due to executing the file
"load-unicode-data.tex". That file uses TeX macros to read and parse several
large Unicode Consortium files so as to set up various character mapping tables
(catcodes, upper- and lowercase characters, math characters, etc.). The TeX
macros that do this parsing are clever and concise, but they are far from
efficient. I've traced them, and the situation might be described as
"algorithmic churn": parsing and re-parsing and re-re-parsing the same lines.
In contradistinction, JSBox depends on separately preprocessing the various
Unicode Consortium data files (about 2MB of total text data) into a
100-times-smaller (20K) binary file that can be used at interpreter
initialization time. Parsing this binary file using the interpreter's own C
code takes only about 25 ms (1/40th of a second) to initialize JSBox's various
internal Unicode character mapping tables to their non-default values for all
the Unicode characters (code points). That's about 30 times faster than what
happens in "load-unicode-data.tex".
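To give a flavor of why the binary approach is so much faster, here is a
hypothetical C sketch of such an initialization pass. The record layout,
table names, and 4-byte little-endian encoding are invented for illustration
(JSBox's actual format is not public and is surely denser), but the principle
holds: fixed-size binary reads, no tokenizing, no macro expansion.

#include <stdint.h>
#include <stdio.h>

#define MAX_CP 0x110000   /* one past the last Unicode code point */

/* Sketch of an engine's mapping tables, indexed by code point.
   (A real engine would likely use sparser data structures.) */
static uint8_t  catcode[MAX_CP];
static uint32_t lccode[MAX_CP], uccode[MAX_CP], mathcode[MAX_CP];

/* Read a 4-byte little-endian unsigned integer; sets *eof on a
   short read.  The binary layout here is purely hypothetical. */
static uint32_t read_u32(FILE *f, int *eof)
{
    uint8_t b[4];
    if (fread(b, 1, 4, f) != 4) { *eof = 1; return 0; }
    return (uint32_t)b[0] | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Load one record per code point whose properties differ from the
   defaults: (codepoint, catcode, lccode, uccode, mathcode). */
static int load_unicode_tables(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    for (;;) {
        int eof = 0;
        uint32_t cp = read_u32(f, &eof);
        if (eof) break;                             /* clean end of file */
        if (cp >= MAX_CP) { fclose(f); return -1; } /* corrupt file */
        catcode[cp]  = (uint8_t)read_u32(f, &eof);
        lccode[cp]   = read_u32(f, &eof);
        uccode[cp]   = read_u32(f, &eof);
        mathcode[cp] = read_u32(f, &eof);
        if (eof) { fclose(f); return -1; }          /* truncated record */
    }
    fclose(f);
    return 0;
}

Even read naively like this, a pass over a 20K file touches so little data
that a figure like 25 ms is unsurprising.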
So ... It would be really great if there were a way to make the reading of
"load-unicode-data.tex" conditional in some way, so that it works exactly the
same way for XeTeX when building the Unicode LaTeX format, but allows other TeX
language interpreters (such as JSBox) to bypass this inefficient parse of
Unicode character files in favor of whatever the interpreter has otherwise
already done.
The solution, I think, is pretty easy.
"load-unicode-data.tex" already tests for certain compatibility conditions and
short-circuits itself accordingly. Its first executable lines are:
% The data can only be loaded by Unicode engines. Currently this is limited to
% XeTeX and LuaTeX, both of which define \Umathcode.
\ifx\Umathcode\undefined
\expandafter\endinput
\fi
% Just in case, check for the e-TeX extensions.
\ifx\eTeXversion\undefined
\expandafter\endinput
\fi
But the first of these tests is no longer a good one. JSBox is a Unicode/e-TeX
engine that does implement \Umathcode, yet it has neither the need nor the
desire to execute this file, because its mapping tables have *already* been
initialized before any TeX code is ever pushed onto its execution stack, just
as classic TeX's are for simple one-byte characters.
A solution is a dedicated, read-only "last_item" integer value, called, e.g.,
\Unicodedataloaded, whose existence or value prevents "load-unicode-data.tex"
(or similar) from being executed (further). The primitive doesn't even have to
have a value; the mere fact that it exists can be sufficient to test against.
So adding the following lines after the e-TeX test at the start of
"load-unicode-data.tex" would solve the problem, not just for JSBox, but for
any other future Unicode TeX engine faced with a similar situation.
% Give any Unicode engine the ability to initialize its mapping
% tables in its own way instead of relying on this file, as long
% as it implements a primitive named \Unicodedataloaded.
\ifdefined\Unicodedataloaded
\expandafter\endinput
\fi
For current XeTeX LaTeX format initialization, there should be no change to how
things are built.
I implemented this primitive today in JSBox (as a read-only value of 1), and
made the above change in my local copy of "load-unicode-data.tex". Executing
"latex.ini" now takes about .5 second, which is a considerable improvement over
1.25 seconds, certainly now within the bounds of what might be an acceptable
user experience typesetting a Unicode LaTeX document after reading the format's
source code.
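For anyone implementing another engine, the change on the engine side is tiny.
Here is a compilable C sketch of how a read-only "last item" integer like this
might be surfaced; the names and the command/subcode scheme are hypothetical,
not JSBox's actual internals:

#include <stdint.h>

/* Hypothetical subcodes for the engine's read-only "last item"
   integers (the class of commands behind \lastpenalty, \badness,
   \eTeXversion, and so on). */
enum last_item_code {
    LAST_ITEM_ETEX_VERSION,
    LAST_ITEM_UNICODE_DATA_LOADED,
    /* ... */
};

/* Called by the expression scanner when it encounters a token whose
   command code is "last item" -- i.e., from \ifnum, \the, \count
   assignments, etc.  \Unicodedataloaded simply reports 1, meaning
   the engine initialized its Unicode mapping tables itself. */
int32_t last_item_value(enum last_item_code code)
{
    switch (code) {
    case LAST_ITEM_UNICODE_DATA_LOADED:
        return 1;
    case LAST_ITEM_ETEX_VERSION:
        return 2;
    default:
        return 0;
    }
}

One nice property of making it a genuine "last_item" integer, rather than
something detectable only with \ifdefined, is that its value stays available
at the TeX level: \ifnum\Unicodedataloaded>0 also works, and a future revision
of an engine's built-in Unicode tables could be signaled by bumping the value.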
Are there any downsides to this minor change that I'm missing? Is there a
better name for the primitive? What can I do to encourage that the above test
be officially added to "load-unicode-data.tex"?
Doug McKenna
Mathemaesthetics, Inc.