On 13/01/2020 03:41, Doug McKenna wrote:
Phil Taylor wrote:

| So because JSBox is required/designed to incorporate all of XeTeX's
| features, it must (by definition) implement/provide \Umathcode.

Just to be clear, JSBox can eventually incorporate all of XeTeX's features 
(primitives), but does not do so now. It doesn't even incorporate pdfTeX's 
features, but it is set up to. I'm merely adding XeTeX features as necessary to 
get the LaTeX macro library installed and then typeset a LaTeX document 
containing no Unicode at all. The problem is that somewhere in the LaTeX format 
initialization the ability to recognize a Unicode character (as opposed to a 
UTF-8 byte sequence) is equated with the assumption that it's being run under 
XeTeX, and that therefore at least some of XeTeX's features are there and can 
be relied upon at format initialization time.

At present, there are two engines that implement \Umathcode, etc., 'in the wild', XeTeX and LuaTeX, and they have (over time) come to an agreed position on what core features are available at the macro level. (For example, XeTeX originally called its new primitives \XeTeX..., but they were renamed to \U... to match LuaTeX.)

They have quite a lot of differences too, but a core subset of features is available in both, and its presence is signalled by the fact that they offer \Umathcode. Almost all of the tests in LaTeX look for the relevant primitive directly: for example, when we want \Uchar, we look for \Uchar. However, as you note, there are a few places where finding \Umathcode is by far the easiest marker.
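In plain-TeX terms, the tests in question have roughly this shape (a sketch of the idiom only, not the literal latex.ltx code, which uses \@undefined rather than \undefined):

  \ifx\Uchar\undefined
    % \Uchar is not a primitive here: fall back to 8-bit behaviour
  \else
    % \Uchar exists: safe to use Unicode code points directly
  \fi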

It's quite possible to add additional tests to the core code, provided there is a spec, or at least some notes, on what's available. (For example, (u)pTeX for a long time had no documentation in English, so things were tricky; there is now a basic manual, which allows those of us who do not know Japanese to offer at least some basic support.)
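Such a test again just probes for an engine-specific primitive. For the pTeX family, for instance, one candidate marker is the \kanjiskip glue parameter (a sketch; whether that is the right marker for a given engine is exactly the kind of thing the documentation has to settle):

  \ifx\kanjiskip\undefined
    % not a pTeX-family engine
  \else
    % pTeX/upTeX: Japanese-specific parameters are available
  \fi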

| But could not JSbox perform (or simulate) the following :

| \let \Umathschar = \Umathchar % use British spelling as synonym
| \let \Umathchar = \undefined % inhibit "load-unicode-data.tex"'s special treatment of engines that implement \Umathchar
| \input load-unicode-data % since it would seem that you cannot simply skip this step
| \let \Umathchar = \Umathschar % restore canonical meaning of \Umathchar

It could, but it's not my code that's issuing "\input load-unicode-data". The reading of "load-unicode-data.tex" is embedded within my version of LaTeX's own initialization code, and there's no guarantee that elsewhere in that code there isn't some dependence on \Umathchar that such a re-definition might interfere with. LaTeX's code has several tests that rely on whether \Umathchar is defined or not; indeed, the latest official comments, as David Carlisle brought to my attention in this thread, declare that testing for the existence of \Umathchar is the current way to go in all sorts of places.

I think you mean \Umathcode :)

Each place that uses Unicode features does test for this primitive; if it exists, we have to date been able to assume that a few additional primitives are also available (e-TeX, \Uchar, \Umathchardef), but mainly its presence tells us that we can set \lccode and \uccode values beyond 255.
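Concretely, the pattern is roughly as follows (a sketch only, assuming an e-TeX-based engine so that \ifdefined is available; the real code is spread across latex.ltx and load-unicode-data.tex):

  \ifdefined\Umathcode
    % Unicode engine: code points above 255 are valid table indices
    \lccode"0100="0101 % U+0100 lowercases to U+0101
    \uccode"0101="0100 % and U+0101 uppercases back to U+0100
  \fi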

Here is perhaps a slightly better hack:

If it's acceptable, as the very first executable line in latex.ltx (or other format source files), to test the catcode of `{ to determine whether a format has already been loaded, then it should be acceptable within "load-unicode-data.tex" (or the like) to include a similar test that determines whether to proceed with the TeX parse of the Unicode data, or to bail out because the tables are presumably already initialized. For example, the first non-8-bit Unicode character is:

0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;

It is safe, I think, to assume that this Unicode character will forever be 
classified as an uppercase letter (with a lowercase mapping value of U+0101).
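In code, the guard might look something like the following sketch (using U+0100 as the probe; it can only run on an engine where \lccode accepts code points above 255, which is the only situation in which this file is read anyway):

  \ifnum\lccode"0100="0101
    % tables already initialized (by the engine, or a previous load):
    % stop reading this file here
    \expandafter\endinput
  \fi
  % otherwise fall through and parse UnicodeData.txt as now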

The test at the start of latex.ltx is about making sure we are in IniTeX mode: I'm not sure I'd choose to do that today, but the test is long-standing. For load-unicode-data, the idea was partly that there was really no issue about checking: unlike formats, which might have hidden stuff, here all we are trying to do is get to a known position.

That links to the second reason I'm slightly wary of a test. As written, load-unicode-data ensures that the \lccode, \uccode and \catcode tables are in a state *known to the macro layer*. I know it's slightly strange to you, but as a macro programmer I can't 'know' what different engine developers might do or change, and I certainly don't know exactly what version of UnicodeData.txt you are working from. By doing the initialisation without checking, I can be sure that we are on a known Unicode version.
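(For reference, that long-standing test has roughly this shape; I'm paraphrasing from memory rather than quoting latex.ltx. In virgin IniTeX, `{ still has catcode 12, so seeing catcode 1 means a format got there first:

  \ifnum\catcode`\{=1
    \errmessage{A format is already loaded: this file needs a virgin IniTeX}
  \fi
)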

To be honest, that's all a minor concern: it's much more that there was simply no need to worry about a test. It would be trivial to add one, not least since the Unicode Consortium have a clear position on stability.

I'm trying to avoid initializing these character mapping tables twice, 
especially when the second time (reading this file) rather inefficiently takes 
30 times longer than the first, and accomplishes nothing new.

Like I said, from a macro programmer's POV it accomplishes 'the codes are in a known state that I control', though practically that's not a major thing. (If you were using a Unicode version different from the one XeTeX/LuaTeX use, it would presumably affect only a rather limited subset of characters.)

Joseph
