Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?
George - Thanks!

- Doug McK.

From: "George N. White III"
To: "xetex"
Sent: Sunday, August 22, 2021 8:14:58 AM
Subject: Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?

On Fri, 20 Aug 2021 at 22:46, Doug McKenna <d...@mathemaesthetics.com> wrote:

> Using XeTeX, I want to typeset a LaTeX document into a PDF file. The LaTeX source code in UTF-8
> expressly includes the Unicode Replacement character (� = U+FFFD) (a black diamond with a
> question mark in it). I want to typeset it in this single document using a monospaced font in one place, and in another
> place in a variable-width font.
>
> I understand that XeTeX can take advantage of one's system's installed fonts, but my LaTeX file
> is being generated by another program that doesn't know what those fonts are or what glyphs
> they support. I simply want to guarantee that the fonts used are always available when processing
> that LaTeX file.
>
> I also understand that it's possible to synthesize the glyph graphically without using a font, but I'd
> rather not go that route.
>
> So ... What fixed-width and variable-width OpenType (or other) fonts, if any, are always distributed
> with TeX or TeXLive or whatever that one can rely upon to be available for placing this particular
> glyph in a final PDF file? What would be the correct incantation to doing so?

0xFFFD should be in any mainstream general use OpenType font. It might be better to ask which OT fonts to avoid due to low quality, bugs, lack of ongoing support, etc. There has been a lot of churn in the available fonts over the years, so the answer may be different if you need fonts that can be expected to have long-term support and availability.

-- George N. White III
Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?
Ulrike - Excellent. Thank you!

Using \setmainfont{DejaVuSerif.ttf} works on my non-Linux machine, and it is not listed as "installed" in my Mac's FontBook, which means it's being used solely within the TeXosystem.

DejaVuSerif is a variable-width font. Is there a similar fixed-width OpenType/TrueType font distributed with TeXLive that would work?

Doug McKenna
Mathemaesthetics, Inc.

- Original Message -
From: "news3"
To: "xetex"
Sent: Saturday, August 21, 2021 10:39:25 AM
Subject: Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?

On Sat, 21 Aug 2021 09:25:14 -0600 (MDT), Doug McKenna wrote:

> Thanks all for your interesting responses.
>
> Unfortunately, my possibly poorly worded question remains unanswered. Let me
> try again.
>
> Consider the short example just used:
>
> \documentclass{article}
> \usepackage{fontspec}
> \setmainfont{DejaVu Serif}
>
> \begin{document}
> fffd
> \end{document}
>
> When I run it, fontspec complains that it can't find the font. So obviously
> "DejaVu Serif" is not installed, either on my system or anywhere in the
> bowels of all the ~150,000 TeXLive (2019) files that have been installed in
> the TDS on my machine.

No, it only says that the font is not found by font name, which happens often on Linux. Try with \setmainfont{DejaVuSerif.ttf}

> So, is there a font name I can use in the \setmainfont{} command
> that is ALWAYS available (upon TeX installation) when processing
> this LaTeX file with XeTeX? Or always available after a certain
> version of a TeX installation?

I have no idea when DejaVu was added, but it is in TeX Live 2019. If you also want to support older systems, try e.g. on Overleaf.

-- Ulrike Fischer http://www.troubleshooting-tex.de/
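
The file-name approach from this exchange could be extended to the fixed-width case along the following lines. This is only a sketch: DejaVuSerif.ttf is the file named in the thread, while DejaVuSansMono.ttf is an assumption (it is presumed to ship in the same DejaVu bundle, which nobody in this thread confirms), and whether either file actually carries a glyph for U+FFFD should be checked on the target installation.

```latex
% Sketch for XeLaTeX. Fonts are loaded by file name so that lookup goes
% through the TeX tree rather than the operating system's font list.
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVuSerif.ttf}     % variable-width; file name taken from the thread
\setmonofont{DejaVuSansMono.ttf}  % fixed-width; ASSUMED to be installed alongside it
\begin{document}
Variable-width: \symbol{"FFFD}

Fixed-width: \texttt{\symbol{"FFFD}}
\end{document}
```

Using \symbol{"FFFD} instead of the literal character keeps the generated source pure ASCII, which may be convenient when another program writes the file.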
Re: [XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?
Thanks all for your interesting responses.

Unfortunately, my possibly poorly worded question remains unanswered. Let me try again.

Consider the short example just used:

\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Serif}

\begin{document}
fffd
\end{document}

When I run it, fontspec complains that it can't find the font. So obviously "DejaVu Serif" is not installed, either on my system or anywhere in the bowels of all the ~150,000 TeXLive (2019) files that have been installed in the TDS on my machine.

So, is there a font name I can use in the \setmainfont{} command that is ALWAYS available (upon TeX installation) when processing this LaTeX file with XeTeX? Or always available after a certain version of a TeX installation?

I want to automatically generate such a LaTeX file that permits its user to declare or override a default font for typesetting the Unicode Replacement character, but which doesn't require the user to search for or declare such fonts at all in the simplest case.

One would think that every OpenType font supporting Unicode glyphs would include a glyph for U+FFFD, but it doesn't appear to me that that is the case.

All I want is for the computer-generated LaTeX file "to just work" out of the box (so to speak), so that a naive user isn't faced with an error message the first time they typeset it with something like TeXShop.

Doug McKenna
Mathemaesthetics, Inc.
[XeTeX] Guaranteed Unicode replacement glyph in every TeX installation?
Using XeTeX, I want to typeset a LaTeX document into a PDF file. The LaTeX source code in UTF-8 expressly includes the Unicode Replacement character (� = U+FFFD) (a black diamond with a question mark in it). I want to typeset it in this single document using a monospaced font in one place, and in another place in a variable-width font. I understand that XeTeX can take advantage of one's system's installed fonts, but my LaTeX file is being generated by another program that doesn't know what those fonts are or what glyphs they support. I simply want to guarantee that the fonts used are always available when processing that LaTeX file. I also understand that it's possible to synthesize the glyph graphically without using a font, but I'd rather not go that route. So ... What fixed-width and variable-width OpenType (or other) fonts, if any, are always distributed with TeX or TeXLive or whatever that one can rely upon to be available for placing this particular glyph in a final PDF file? What would be the correct incantation to doing so? Thanks. Doug McKenna Mathemaesthetics, Inc.
Re: [XeTeX] A LaTeX Unicode initialization desire/question/suggestion
loc() or whatever the equivalent might be on some system. This makes it harder to create a \dump format file, though not impossible. But it wouldn't be (or need to be) compatible with anything in the official TeX world. Regardless, my goal is to see how far one can get without needing format files. Also, see below.

>| The pressure to load more into a
>| format is likely to increase rather than decrease, people often
>| routinely make custom formats preloading large packages like tikz or
>| pstricks for example.

True, but there is a fundamental difference between what I'm working toward, and what the TeX infrastructure does. In the TeX world, every job is a single process. Every time a TeX job is done, a process is launched, the job gets done, and the program ends. It's the Unix/command-line way. So the format has to be loaded (fast) on every job. Makes perfect sense.

But when your engine is just a library linked into another program that lives for a long time, perhaps measured in days, and when the user is running multiple jobs from the same program, then there ought to be a way to load the format from its source code >once<, and have it live in the engine's memory even while job after job is executing on top, with a clean-up after each job ends. This is, after all, completely conformant with everyday use of TeX (edit...run job...edit...run job...), not to mention every other computer language.

I'm pretty sure that I've architected my code to allow this, although it's untested for now. One step at a time.

>| As noted above, with latex-dev releases you are still going to need
>| the unicode data files to be read using tex macros.

Are these files read more than once, and if so, why? If not, I don't understand why I'm still going to need to read them.

>| Before making any
>| changes to the tex macros you may want to do timings with these
>| versions. It may be that you choose to reconsider not making (the
>| equivalent of) format files, as just saving the time for setting the
>| lccodes may be a less significant proportion of the startup time.

Agreed.

>| To be in the core tex macros we would need to have the engine
>| incorporated into texlive so that it could be tested as part of our
>| test suite and continuous integration tests.

That doesn't make sense to me. Adding a couple of lines of code to "load-unicode-data.tex" and then determining with regression tests that absolutely nothing has changed doesn't involve any third party at all.

>| However as already discussed in this thread there are several
>| possibilities for you to build something along those lines without
>| requiring any changes to the core macro files, so lack of change here
>| shouldn't be seen as a discouragement and anyway gives you more
>| flexibility with changing names etc while jsbox is being developed.

Duly noted.

>| Returning to your original question as to what constitutes a "Unicode"
>| TeX for LaTeX, we have put some data on the requirements for extended
>| TeX features in the draft ltnews31 which will be part of next week's
>| latex-dev release, but you can see the sources now at
>|
>| Primitive Requirements:
>| https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L596
>|
>| see also
>|
>| Improved load-times for expl3:
>| https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L169
>|
>| on the additional items preloaded in the format.

Many thanks! This is very helpful.

Doug McKenna
Mathemaesthetics, Inc.
Re: [XeTeX] [EXT] A LaTeX Unicode initialization desire/question/suggestion
Phil Taylor wrote:

>| So because JSBox is required/designed to incorporate all of XeTeX's
>| features, it must (by definition) implement/provide \Umathcode.

Just to be clear, JSBox can eventually incorporate all of XeTeX's features (primitives), but does not do so now. It doesn't even incorporate pdfTeX's features, but it is set up to. I'm merely adding XeTeX features as necessary to get the LaTeX macro library installed and then typeset a LaTeX document containing no Unicode at all. The problem is that somewhere in the LaTeX format initialization the ability to recognize a Unicode character (as opposed to a UTF-8 byte sequence) is equated with the assumption that it's being run under XeTeX, and that therefore at least some of XeTeX's features are there and can be relied upon at format initialization time.

>| But could not JSbox perform (or simulate) the following :
>| \let \Umathschar = \Umathchar % use British spelling as synonym
>| \let \Umathchar = \undefined % inhibit "load-unicode-data.tex"'s special treatment of engines that implement \Umathchar
>| \input load-unicode-data % since it would seem that you cannot simply skip this step
>| \let \Umathchar = \Umathschar % restore canonical meaning of \Umathchar

It could, but it's not my code that's issuing "\input load-unicode-data". The reading of "load-unicode-data.tex" is embedded within my version of LaTeX's own initialization code, and there's no guarantee that elsewhere in that code there isn't some dependence on \Umathchar that such a re-definition might interfere with. LaTeX's code has several tests that rely on whether |\Umathchar| is defined or not, and even in the latest versions, it is declared that \Umathchar existence is the official way to test. Indeed, the latest official comments, as David Carlisle brought to my attention in this thread, declare that \Umathchar existence testing is the current way to go in all sorts of places.
Such negative "let's fool some other code to get something done" hacks are fragile because they render the other, affected TeX code impossible to understand when reading it. Far better and safer is an affirmative addition to the various checks already being made that facially means what it says: if Unicode character mapping data has been loaded, don't bother.

Here is perhaps a slightly better hack: If it's acceptable as the very first executable line in latex.ltx (or other format source files) to test the catcode value of `{ to determine whether a format has already been loaded or not, then it should be acceptable within "load-unicode-data.tex" (or the like) to include a similar test to determine whether to proceed with the TeX parse of the Unicode data, or to bail because it's presumable that the tables are already initialized.

For example, the first non-8-bit Unicode character is:

0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;

It is safe, I think, to assume that this Unicode character will forever be classified as an uppercase letter (with a lowercase mapping value of U+0101).

When the XeTeX engine begins running, before any TeX source code is interpreted, the engine initializes its internal |cat_code| array (all 1,114,112 slots) with the value |other_char| (12). It then does the usual classic TeX initialization to declare ASCII letters as such, etc.

Later, during the LaTeX format's reading of "load-unicode-data.tex", a simple test to determine whether to continue reading the file could be made based on whether the catcode value of U+0100 is 11 (letter) or 12 (other). If it's already known as a letter, then the catcode table is not in its initial default state, and a second initialization is unnecessary. If it's still an |other_char| (12), then things need initializing for letter characters and the rest of "load-unicode-data.tex" should be executed.
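
The catcode-based test described above might look something like the following. It is a sketch only: the guard is hypothetical, is not part of the official "load-unicode-data.tex", and where exactly it would have to sit in that file is an assumption.

```tex
% Hypothetical guard (not in the official file). U+0100 is LATIN CAPITAL
% LETTER A WITH MACRON. If the engine has already given it catcode 11
% (letter), assume the Unicode tables are initialized and stop reading.
\ifnum\catcode"0100=11
  \expandafter\endinput
\fi
```

Because \endinput only takes effect at the end of the current line, the \expandafter closes the conditional first, so no \fi is left dangling when the file is abandoned.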
>>| Furthermore, the purpose of executing "load-unicode-data.tex" is precisely to
>>| populate the \Umathchar table, as well as other Unicode character tables.
>>| So these tables have to exist prior to executing the file.

>| Well, do they, in the case of JSBox? From what you wrote in your original
>| query, I thought that that [1] was the very thing that you were trying to avoid ...
>| [1] "executing "load-unicode-data.tex" [in order] to populate the \Umathchar table".
>| So specifically, does the \Umathchar table have to exist, in JSBox, at the point
>| that "load-unicode-data.tex" is loaded ?

I'm trying to avoid initializing these character mapping tables twice, especially when the second time (reading this file) rather inefficiently takes 30 times longer than the first, and accomplishes nothing new.

Thanks for thinking about my questions, I appreciate it.

Doug McKenna
Re: [XeTeX] [EXT] A LaTeX Unicode initialization desire/question/suggestion
Phil Taylor wrote: >| How about delaying the definition of \Umathcode until after >| "load-unicode-data.tex" has been processed ? Is that possible, and >| would it have undesirable side-effects ? \Umathcode is a XeTeX primitive extension to the TeX language in the service of solving a problem in classic TeX, which was that the machinery and syntax of the classic TeX \mathchar primitive could not handle 21-bit Unicode values, or more than 16 math font families, etc. So \Umathchar (and a bunch of other related extensions all starting with 'U') is defined in XeTeX's WEB source code; it exists the moment the engine is launched. Thus it's not possible to delay \Umathchar's definition. Furthermore, the purpose of executing "load-unicode-data.tex" is precisely to populate the \Umathchar table, as well as other Unicode character tables. So these tables have to exist prior to executing the file. Perhaps I'm misunderstanding your question. In any case, my point is that a TeX engine interested in initializing itself as fast as possible (using a different form of the exact same official Unicode character data) should be able to avoid processing "load-unicode-data.tex" altogether, because doing so ends up being a completely redundant waste of time (and, depending upon implementation, space). XeTeX does not have to care about this, but other Unicode engines, certainly the one I'm working on, will care. A couple of lines of TeX code added to the file appears to me to solve the problem, with no downside to creating the XeTeX LaTeX format. Doug McKenna Mathemaesthetics, Inc.
[XeTeX] A LaTeX Unicode initialization desire/question/suggestion
t \Umathcode but has no need nor desire to execute this file because JSBox's mapping tables have *already* been initialized before any TeX code is ever pushed onto its execution stack, the same as classic TeX does for simple one-byte characters.

A solution is a dedicated, read-only "last_item" integer value, called, e.g., \Unicodedataloaded, whose existence or value prevents "load-unicode-data.tex" (or similar) from being executed (further). The primitive doesn't even have to have a value; the fact that it exists can be sufficient to test against.

So adding the following lines after the eTeX test at the start of "load-unicode-data.tex" would solve the problem, not just for JSBox, but for any other future Unicode TeX engine faced with a similar situation.

    % Give any Unicode engine the ability to initialize its mapping
    % tables in its own way instead of relying on this file, as long
    % as it implements a primitive named \Unicodedataloaded.
    \ifdefined\Unicodedataloaded
      \expandafter\endinput
    \fi

For current XeTeX LaTeX format initialization, there should be no change to how things are built.

I implemented this primitive today in JSBox (as a read-only value of 1), and made the above change in my local copy of "load-unicode-data.tex". Executing "latex.ini" now takes about 0.5 seconds, which is a considerable improvement over 1.25 seconds, certainly now within the bounds of what might be an acceptable user experience typesetting a Unicode LaTeX document after reading the format's source code.

Are there any downsides to this minor change that I'm missing? Is there a better name for the primitive? What can I do to encourage that the above test be officially added to "load-unicode-data.tex"?

Doug McKenna
Mathemaesthetics, Inc.
[XeTeX] Clarification on XeTeX documentation
Two questions:

Question #1: In the latest document describing XeTeX extensions, dated 2019-12-09, for instance, at <https://ctan.math.illinois.edu/info/xetexref/xetex-reference.pdf>, in section 2.3 "Maths fonts" (currently on page 14), the following sentence needs clarification:

>| In the following commands, ⟨fam.⟩ is a number (0–255) representing
>| font to use in maths. ⟨math type⟩ is the 0–7 number corresponding to
>| the type of math symbol ...

But ⟨fam.⟩ is not a font number (or index). As ⟨fam.⟩ denotes, it is a font family number (or index), where each font family represents a triplet of loaded fonts, one each for text, script, and scriptscript situations. And throughout other TeX documentation, the word "class" is used to describe the purpose of a math character, a 3-bit number between 0 and 7.

I suggest this be amended to read:

In the following commands, ⟨fam.⟩ is a number (0–255) of the math font family. ⟨math type⟩ is the 0–7 number corresponding to the class of math symbol ...

Question #2: Later on, in various syntax declarations, e.g.,

>| \Umathcode⟨char slot⟩ [=] ⟨math type⟩ ⟨fam.⟩ ⟨glyph slot⟩

one finds the term ⟨glyph slot⟩. This is curious, because XeTeX's source code parses this value as an integer, using a procedure named scan_usv_num ("usv" stands for Unicode scalar value). That routine rejects any value outside the Unicode range of 0 to "10FFFF as illegal. But glyph slot is a term usually used to describe the innards of a font, and is not the same as a Unicode character/code point/scalar value, which the font would internally map to a glyph slot (or index). Also, every OpenType font is limited to no more than 2^{16} (65536) glyph slots, so it's concerning that this routine accepts a number that is outside of that range.

If this is the case, another problem is that it is then formally possible that a font contains a glyph whose internal slot number, for example, might be "D800 (a legal 16-bit value that scan_usv_num won't complain about). But "D800 is not a legal Unicode character value, it's a high-surrogate value for forming a full 21-bit Unicode character value with another low surrogate value. "D800 might be a Unicode scalar value, but it is not a character value.

So my question is: What is a proper legal value for a ⟨glyph slot⟩? Alternatively, should ⟨glyph slot⟩ be changed in this documentation to something less ambiguous?

Doug McKenna
Mathemaesthetics, Inc.
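
For readers unfamiliar with the syntax being discussed, here is a minimal sketch of one such assignment for XeTeX. The choice of character, of class 7, and of family 1 is purely illustrative and not taken from any real format file.

```tex
% Illustrative only: assign U+1D465 (MATHEMATICAL ITALIC SMALL X) the
% math class 7 (variable family) in math font family 1, with the third
% number equal to the character's own code. Syntax as quoted above:
%   \Umathcode<char slot> [=] <math type> <fam.> <glyph slot>
\Umathcode"1D465 = 7 1 "1D465
```

In this example the third number is the same value as the first, which is consistent with the observation above that it is scanned as a Unicode value rather than as a font-internal glyph index.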
[XeTeX] How much time to build LaTeX format for XeTeX
Given all the parsing of the Unicode character data files during INITEX, and all the inputting and creation of the hyphenation trees, how much CPU time elapses while building the XeTeX format file for LaTeX? I'm going to assume that the writing out of the format at the final \dump command is negligible, though I don't really know.

- Doug McKenna
Re: [XeTeX] Math class initialization in Unicde-aware engine
Joseph - A similar ambiguity occurs later in the README.md file. It says - \Umathcode for all letters as TeX class 7 (var) Does "letters" mean those code points on the TeX side with \catcode 11, or those Unicode code points labeled with 'L' in UnicodeData.txt? If the former, then combining marks (Unicode 'M') should be entered into \Umathcode as TeX class 7; if the latter, then presumably not, though it's not clear why a math variable name can't have a combining mark. - Doug McKenna
Re: [XeTeX] Math class initialization in Unicde-aware engine
Joseph Wright wrote:

>| Er, I thought the README was reasonably clear, ah well!

Here's an example of something that's not so clear to me. The README.md file displayed at <https://ctan.org/tex-archive/macros/generic/unicode-data> says

- \lccode and/or \uccode for non-letter code points for which an upper or lower case mapping is given

The problem with this is that earlier, it is stated that all combining mark code points (class code starting with 'M' in the UnicodeData.txt file) are to be considered letters (\catcode set to 11). So there's an ambiguity here that needs clearing up. Does the above apply to combining mark code points or not? It may be that none of the combining marks in the data file have any case mappings, but there's no guarantee that is true. So the question is, if a combining mark has an uppercase or lowercase mapping, does that get installed in \lccode and/or \uccode?

Also, there's a confusing typo ("can"?) in

- \lccode and \uccode for all of class "Lt" (title case letters) to the lower can upper case mappings (or if not given to the code point itself)

Should "can" be "and/or"?

Doug McKenna
Re: [XeTeX] Math class initialization in Unicde-aware engine
Ross wrote:

>| If by ignoring you mean removing the character entirely, then that is surely not best at all.
>|
>| Most N Class (Normal) characters would be simply of the default \mathord class.

The parsing code in load-unicode-math-classes.tex installs values in the \Umathcode table that comport with some rule, which without too much of a close look seems to me to be whether the character code math class read from MathClass.txt is one of the eight possibilities that parsing code pays attention to, out of the 15 possible ones in the file. Therefore it appears to me that all entries in MathClass.txt that are marked with, for instance, 'N', are ignored with respect to installing any entry in the \Umathcode table.

It may be that such characters in MathClass.txt marked with 'N' take on the \mathord attribute by default when TeX finds them within math mode, I'm not sure without looking at its code.

Doug McKenna

From: "Ross Moore"
To: "xetex"
Sent: Wednesday, November 27, 2019 5:16:44 PM
Subject: Re: [XeTeX] Math class initialization in Unicde-aware engine

Hi Joe, Doug

On 28 Nov 2019, at 10:27 am, Joseph Wright <joseph.wri...@morningstar2.co.uk> wrote:

> > # N - Normal - includes all digits and symbols requiring only one form
> > # D - Diacritic
> > # F - Fence - unpaired delimiter (often used as opening or closing)
> > # G - Glyph_Part - piece of large operator
> > # S - Space
> > # U - Unary - operators that are only unary
> > # X - Special - characters not covered by other classes
> >
> > Unfortunately, the documentation/comments don't say what happens to entries
> > having these other Unicode math codes (N, D, F, G, S, U, and X). Are they
> > completely ignored, or are they mapped to one of the other eight codes that
> > matches what TeX is interested in or only capable of handling?
> >
> > I can imagine that the space character, given Unicode math class 'S' in
> > MathClass.txt, is ignored during this parse. But what happens to the '¬'
> > character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator).
> > Surely the logical not sign is not being ignored during initialization of a
> > Unicode-aware engine, yet the comments in load-unicode-math-classes.tex don't
> > say one way or the other, and it appears to me that the parsing code is
> > ignoring it.
>
> The other Unicode math classes don't really map directly to TeX ones, so they are currently ignored. Suggestions for improvements here are of course welcome.

If by ignoring you mean removing the character entirely, then that is surely not best at all.

Most N Class (Normal) characters would be simply of the default \mathord class.

I’d expect others to be mapped instead into a macro that corresponds to something that TeX does support. e.g. space characters for thinspace, 2-em space, etc. in U+2000 – U+200A can expand into things like: \, \; \> \quad \qquad etc. ( even to constructions like \mskip1mu ) After all, this is essentially what happens when pdfTeX reads raw Unicode input.

The G class (Glyph_Part) is a lot harder, as those glyph parts don’t correspond to any single TeX macro. Think about a very large opening brace spanning 3+ ordinary line widths, say, as may be generated by \left\{ ... \right\} surrounding some (inner-) displayed math alignment. On input, the whole grouping would need to be identified and mapped to appropriate TeX coding.

Basically there is a lot here that needs to be looked at more or less individually. I’ve been through this kind of exercise, in reverse, to decide what to specify as /Alt and /ActualText replacements (for accessibility) for what TeX produces with various math constructions. I don’t have definitive answers for everything, but have tried some possibilities for many things.

> Joseph

Hope this helps.

Ross

Dr Ross Moore
Department of Mathematics and Statistics
Macquarie University, NSW 2109, Australia
[XeTeX] Math class initialization in Unicde-aware engine
Another question about Unicode-aware TeX engine (e.g., XeTeX) initialization files.

The Unicode Consortium provides a file, MathClass.txt, e.g.,

./texmf-dist/tex/generic/unicode-data/MathClass.txt

It contains a list of lines (and comments). Field 0 of an entry line is a Unicode code point or a range of code points, and field 1 is a single ASCII character that declares the Unicode math class to which the code point or range of code points belongs. Comments in that file say that there are (currently) 15 different Unicode math class codes:

# N - Normal - includes all digits and symbols requiring only one form
# A - Alphabetic
# B - Binary
# C - Closing - usually paired with opening delimiter
# D - Diacritic
# F - Fence - unpaired delimiter (often used as opening or closing)
# G - Glyph_Part - piece of large operator
# L - Large - n-ary or large operator, often takes limits
# O - Opening - usually paired with closing delimiter
# P - Punctuation
# R - Relation - includes arrows
# S - Space
# U - Unary - operators that are only unary
# V - Vary - operators that can be unary or binary depending on context
# X - Special - characters not covered by other classes

During XeTeX format initialization, the file load-unicode-math-classes.tex in that same directory is executed, in order to declare to the engine which Unicode code points belong to which TeX math classes. The comments in that file say that the classes it pays attention to are those with the following Unicode math codes:

% This file parses MathClass.txt, provided by the Unicode Consortium, and sets
% up the following mapping between Unicode classes and TeX math types
% - "L" (large) \mathop
% - "B" (binary) \mathbin
% - "V" (vary) \mathbin
% - "R" (relation) \mathrel
% - "O" (opening) \mathopen
% - "C" (closing) \mathclose
% - "P" (punctuation) \mathpunct
% - "A" (alphabetic) \mathalpha

That means that there are 7 other Unicode math classes that are unaccounted for.
Unfortunately, the documentation/comments don't say what happens to entries having these other Unicode math codes (N, D, F, G, S, U, and X). Are they completely ignored, or are they mapped to one of the other eight codes that matches what TeX is interested in or only capable of handling?

I can imagine that the space character, given Unicode math class 'S' in MathClass.txt, is ignored during this parse. But what happens to the '¬' character (U+00AC) ("NOT SIGN"), which is assigned 'U' (Unary Operator)? Surely the logical not sign is not being ignored during initialization of a Unicode-aware engine, yet the comments in load-unicode-math-classes.tex don't say one way or the other, and it appears to me that the parsing code is ignoring it.

The README.md file <https://ctan.org/tex-archive/macros/generic/unicode-data> is also deficient in answering this question.

TIA,

Doug McKenna
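
To make the question concrete, this is roughly what an explicit entry for one of the unmapped classes could look like if someone chose to add one by hand. It is a hypothetical local assignment for XeTeX, not something load-unicode-math-classes.tex does; the class value 0 (ordinary) and the family number 0 are both assumptions chosen only for illustration.

```tex
% Hypothetical, not part of load-unicode-math-classes.tex:
% treat U+00AC (NOT SIGN, Unicode math class "U") as an ordinary symbol.
% Class 0 = ordinary; family 0 is an arbitrary illustrative choice.
\Umathcode"00AC = 0 0 "00AC
```

Whether ordinary is the right TeX class for a unary operator is exactly the kind of per-character decision the thread says has not yet been made.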
Re: [XeTeX] Lowercase Unicode code points in hyphenation patterns
Is xgreek.sty loaded as part of creating the LaTeX format? If not, my understanding is that its corrections wouldn't affect any of the hyphenation patterns installed from xetex.ini during the format build. Perhaps this doesn't matter. - Doug McKenna - Original Message - From: "David Carlisle" To: "Apostolos Syropoulos" , "xetex" Sent: Sunday, November 24, 2019 11:48:32 AM Subject: Re: [XeTeX] Lowercase Unicode code points in hyphenation patterns On Sun, 24 Nov 2019 at 18:41, Apostolos Syropoulos via XeTeX wrote: > Of course these tables are all wrong but this is another problem. Yes there is that. However it seems better to start from a known standardised base shared with basically everyone then fix as needed rather than try to come up with a tex-specific set of mappings covering the whole Unicode code range and having to document and maintain them and extend each year as more characters are added. > I have added the correct \uccodes and \lccodes in xgreek.sty thanks:-) David
[XeTeX] Lowercase Unicode code points in hyphenation patterns
When the LaTeX format is built, there are tests for whether or not a Unicode-aware TeX engine is doing the work. I presume that XeTeX is such a Unicode-aware engine, though I'm not familiar with what the definition of "Unicode-aware TeX engine" actually is (separate issue). During the input of various hyphenation pattern files (a group for each language code), the first such file that uses non-ASCII Unicode code points is for Ancient Greek, in the file /usr/local/texlive/2017/texmf-dist/tex/generic/hyph-utf8/patterns/texhyph-grc.tex at line 61, which starts out α1 ε1 η1 ι1 ο1 υ1 ω1 ϊ1 ... TeX's code and specification says that only lowercase letters can appear in pattern words, and the definition within TeX's source code of a lowercase letter is any entry in the \lccode table that, when indexed by a character, delivers itself. But as near as I can tell, during the building of the LaTeX format (i.e., running "latex.ltx") there is no TeX source code that installs any of these Greek letters into the \lccode table. Therefore, I'm concluding that the XeTeX engine does this itself when it initializes, rather than awaiting any TeX source code to do it. But there are a whole lot of lowercase letters in Unicode, so I'm wondering how XeTeX determines legal lowercase letters for initial pattern files? I've tried looking at some version of the xetex.web code, but without illumination, I'm afraid. TIA, Doug McKenna
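
The definition of "lowercase letter" quoted above can be checked directly for any one character. The following is a small diagnostic sketch for XeTeX; the choice of U+03B1 (GREEK SMALL LETTER ALPHA) is just an example, and what it reports will depend on the format in use.

```tex
% Diagnostic sketch: a character counts as lowercase for hyphenation
% patterns when its \lccode entry is the character's own code.
\ifnum\lccode"03B1="03B1
  \message{U+03B1 maps to itself in the lccode table}
\else
  \message{U+03B1 has lccode \the\lccode"03B1}
\fi
```

Running this under INITEX before and after the Unicode data files are read would show at which stage the entry is installed.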
Re: [XeTeX] Hyphenation of strings of more than 63 characters
Simply changing the character-count constant could cause some subtle problems. In particular, the allocation size of a "whatsit" language node might also need changing, which would require adjusting other code in the core engine that assumes a default small size for that language node sub-type of a "whatsit".

Or not. I can't tell from the TeX source what the bit sizes of these node fields are. But if they're too small to fit a pair of enhanced character count limits for hyphenation, there will likely be bugs elsewhere due to truncation or wraparound in the arithmetic.

FWIW,

Doug McKenna

- Original Message -
From: "Peter Mukunda Pasedach" <peter.pased...@googlemail.com>
To: "XeTeX (Unicode-based TeX) discussion." <xetex@tug.org>
Sent: Tuesday, March 15, 2016 9:13:08 AM
Subject: Re: [XeTeX] Hyphenation of strings of more than 63 characters

Dear Jonathan,

yes, recompiling xetex is fine! At 255 characters I still have 32 occurrences left, at 500 two, and at 1000 zero.

Thanks for looking into this!

Peter

On Tue, Mar 15, 2016 at 3:46 PM, Jonathan Kew <jfkth...@gmail.com> wrote:
> On 15/3/16 14:24, Peter Mukunda Pasedach wrote:
>>
>> Dear XeTeX list,
>>
>> I am dealing with a collection of texts in Sanskrit, for which the
>> builtin limitation of TeX to not perform hyphenation after the 63rd
>> character of a string is imposing a serious limitation, as such
>> strings do occur. One reason for this is that one can freely form very
>> long compounds, another one is sandhi, in which due to euphonic
>> changes ending and beginning vowels fuse, another one that in Indic
>> scripts if one word ends in a consonant and the next one starts with a
>> vowel they are written together, another reason can be that scribes
>> simply do not use spaces consistently. Thus in the collection of texts
>> that I'm working on, currently comprising 37 files, strings of more
>> than 63 characters occur 1823 times.
>>
>> Is this limitation of 63 characters just an odd remnant of the time
>> TeX was written in, then necessary because of hardware limitations, or
>> does it still make sense? Is there a reasonable way to remove it, or
>> set it significantly higher?
>
> I suspect (without actually checking the code) that it would be fairly
> trivial to make it significantly higher (less so to remove it entirely; but
> something like 255 or even 1000-plus would probably be simple).
>
> A change like this would need to be optional, however, so that the
> typesetting of existing documents does not change unless the user
> deliberately chooses the modified behavior.
>
> It's probably too late to be adding a new feature for the TL'16 release; are
> you prepared to recompile xetex yourself from source in order to make such a
> change?
>
> JK
>
> --
> Subscriptions, Archive, and List information, etc.:
> http://tug.org/mailman/listinfo/xetex
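The truncation worry raised at the top of this message can be made concrete with a small Python sketch. The field widths used here (6 and 8 bits) are purely assumptions for illustration, since the actual sizes of the node fields are exactly what is unknown; `store_in_field` is a hypothetical helper, not anything in the TeX source.

```python
def store_in_field(value: int, bits: int) -> int:
    # Keep only the low `bits` bits, as a fixed-width unsigned field would.
    return value & ((1 << bits) - 1)

# Candidate limits mentioned in the thread, stored in assumed 6- and 8-bit fields.
for limit in (63, 255, 1000):
    print(limit, store_in_field(limit, 6), store_in_field(limit, 8))
```

If a raised limit ever passed through a field narrower than the new value needs, it would silently wrap to a smaller number rather than fail, which is the kind of bug that would show up far from the constant that was changed.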
Re: [XeTeX] Latin Modern, from TFM to Unicode
the font what to draw. Since there's no mapping from Unicode, the outside process either needs to know the absolute glyph IDs inside the font, or it needs to cause the font to go into some internal construction mode, like building a ligature, where the font itself knows the sequence and position of the glyphs to use to construct the tall symbol.

The latter seems impossible, because the font can't know the threshold height at which to stop construction. The former means hard-coding internal glyph IDs somewhere outside the font, which I'm hoping is not fragile, but I worry it might be.

Sorry for the reams of details, but I'm trying to explain my confusion exactly.

Doug McKenna
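The "hard-coded glyph ID" alternative described above might look roughly like the following Python sketch. Every glyph ID and height here is invented for illustration, and `pick_variant` is a hypothetical helper; the only point is that the threshold height has to be supplied from outside the font.

```python
from typing import Optional

# Hypothetical (glyph_id, height) pairs for successively taller
# versions of one symbol, ordered from shortest to tallest.
VARIANTS = [(1201, 10.0), (1202, 14.0), (1203, 18.0), (1204, 24.0)]

def pick_variant(target_height: float) -> Optional[int]:
    # Return the first glyph ID at least as tall as the caller's threshold.
    for glyph_id, height in VARIANTS:
        if height >= target_height:
            return glyph_id
    # Nothing is tall enough; the symbol would have to be built from pieces.
    return None

print(pick_variant(12.0))
```

A table like `VARIANTS` living outside the font is the fragile part: it is only valid for as long as the font keeps those glyph IDs.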