Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-22 Thread David Carlisle
On Mon, 22 Feb 2021 at 01:28, Ross Moore  wrote:

> Hi Jonathan, and others.
>
>
> There’s actually a pretty easy fix, at least for XeLaTeX.
> The package contains 2 files only:   xstring.sty  and  xstring.tex .
> The .sty is just a 1-liner to load the .tex .
>
> It could be beefed up with:
>
>  \RequirePackage{ifxetex} %   is this still the best package for  \ifxetex ?
> \ifxetex
>   \XeTeXdefaultencoding "iso-8859-1"
> \input{xstring.tex}
>   \XeTeXdefaultencoding "utf8"
> \else
>  \input{xstring.tex}
> \fi
>
>
That would sort of work, but it would be a suboptimal fix: it imposes a
run-time test on everyone just to keep the file in a legacy encoding, when
saving the file as utf-8 (or as ascii with accents written as commands) has
no run-time cost and places the file in the default text encoding used by
almost all current systems.
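
For example, with a comment that actually occurs in xstring.tex, the two
spellings below differ only in how the accent is stored (a sketch; the second
line stands for the single latin-1 byte 0xE9):

  % fin des d\'efinitions LaTeX   <- accent as a TeX command: pure ascii, safe in any engine
  % fin des définitions LaTeX     <- accent as the latin-1 byte 0xE9: invalid as utf-8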

David


Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Ross Moore
Hi Jonathan, and others.

On 22 Feb 2021, at 10:39 am, Jonathan Kew <jfkth...@gmail.com> wrote:

On 21/02/2021 22:55, Ross Moore wrote:
The file reading has failed  before any tex accessible processing has happened 
(see the ebcdic example in the TeXBook)
OK.

Also pdfTeX has no trouble with an xstring example.
It just seems pretty crazy that the comments need to be altered
for that package to be used with XeTeX.

Well as long as the Latin-1 accented characters are only in comments, it 
arguably doesn't "really" matter; xetex logs a warning that it can't interpret 
them, but if you know that part of the line is going to be ignored anyway, you 
can ignore the warning.

There’s actually a pretty easy fix, at least for XeLaTeX.
The package contains 2 files only:   xstring.sty  and  xstring.tex .
The .sty is just a 1-liner to load the .tex .

It could be beefed up with:

 \RequirePackage{ifxetex} %   is this still the best package for  \ifxetex ?
\ifxetex
  \XeTeXdefaultencoding "iso-8859-1"
\input{xstring.tex}
  \XeTeXdefaultencoding "utf8"
\else
 \input{xstring.tex}
\fi

(ignore if straight quotes have become curly ones in my email editor!)



Even nicer would be to beef it up further:
 1. record the current default encoding (is this possible?), then restore from it;
 2. use grouping while deciding what to do,
 expanding the required commands before ending the group.

\showthe\XeTeXdefaultencoding  doesn’t work,
so is there another container that tells what is the default encoding?
Or should we always assume it is UTF-8 and revert to that afterwards?

e.g. something like:

\RequirePackage{ifxetex}
\begingroup
 \def\next{\endgroup \input{xstring.tex}}%
 \ifxetex
  \XeTeXdefaultencoding "iso-8859-1"
  \def\next{\endgroup
   \input{xstring.tex}%
   \XeTeXdefaultencoding "utf8"}%
 \fi
\next

(pdfTeX doesn't care because it simply reads the bytes from the file; any 
interpretation of bytes as one encoding or another is handled at the TeX macro 
level.)

Right.
Which is why I do my PDF development work in pdfTeX before
testing whether it can be adapted also to XeTeX and/or LuaTeX.


JK



Cheers.

Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.mo...@mq.edu.au
http://www.maths.mq.edu.au




Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Jonathan Kew

On 21/02/2021 22:55, Ross Moore wrote:
The file reading has failed  before any tex accessible processing has 
happened (see the ebcdic example in the TeXBook)


OK.
But that’s changing the meaning of bit-order, yes?
Surely we can be past that.


No, it's not about bit-order; it's about changing the mapping of code 
units in the external file to character codes in TeX's internal 
(ASCII-based) code.

\danger \TeX\ always uses the internal character code of Appendix~C
for the standard ASCII characters,
regardless of what external coding scheme actually appears in the files
being read.  Thus, |b| is 98 inside of \TeX\ even when your computer
normally deals with ^{EBCDIC} or some other non-ASCII scheme; the \TeX\
software has been set up to convert text files to internal code, and to
convert back to the external code when writing text files.


the file encoding is failing at the  "convert text files to internal 
code" stage which is before the line buffer of characters is consulted 
to produce the stream of tokens based on catcodes.


Yes, OK; so my model isn’t up to it, as Bruno said.
  … And Jonathan has commented.

Also pdfTeX has no trouble with an xstring example.
It just seems pretty crazy that the comments need to be altered
for that package to be used with XeTeX.



Well as long as the Latin-1 accented characters are only in 
comments, it arguably doesn't "really" matter; xetex logs a warning that 
it can't interpret them, but if you know that part of the line is going 
to be ignored anyway, you can ignore the warning.


(pdfTeX doesn't care because it simply reads the bytes from the file; 
any interpretation of bytes as one encoding or another is handled at the 
TeX macro level.)
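
As a rough illustration of "handled at the TeX macro level" (a sketch, not
from the thread): in pdfTeX every byte above 127 reaches the engine as a
single character token, and a package such as inputenc simply makes those
bytes active so that they expand to accent macros. Here ^^e9 is TeX's
notation for the single byte 0xE9, exactly what a Latin-1 "é" in the file
would deliver:

  % compile with pdflatex
  \documentclass{article}
  \usepackage[latin1]{inputenc}% declares the latin-1 bytes as active characters
  \begin{document}
  % inputenc defines active character 233 to expand to the accent macro \'e,
  % so the next two words print identically:
  caf^^e9 caf\'e
  \end{document}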


JK


Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Jonathan Kew

On 21/02/2021 22:55, Ross Moore wrote:

Hi David,

On 22 Feb 2021, at 8:43 am, David Carlisle <d.p.carli...@gmail.com> wrote:



Surely the line-end characters are already known, and the bits
have been read up to that point *before* tokenisation.


This is not a pdflatex inputenc style utf-8 error failing to map a 
stream of tokens.


It is at the file reading stage and if you have the file encoding 
wrong you do not know reliably what are the ends of lines and you 
haven't interpreted it as tex at all, so the comment character really 
can't have an effect here.


Ummm. Is that really how XeTeX does it?
How then does Jonathan’s
    \XeTeXdefaultencoding "iso-8859-1"
ever work ?
Just a rhetorical question; don’t bother answering.   :-)

This mapping is invisible to the tex macro layer just as you can 
change the internal character code mapping in classic tex to take an 
ebcdic stream, if you do that then read an ascii file you get rubbish 
with no hope to recover.

So I don't think such a switch should be automatic to avoid
reporting encoding errors.

I reported the issue at xstring here
https://framagit.org/unbonpetit/xstring/-/issues/4

I looked at what you said here, and some of it doesn’t seem to be in
accord with my TeXLive installations.

viz.

/usr/local/texlive/2016/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
/usr/local/texlive/2016/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
/usr/local/texlive/2016/.../xstring.tex:%   - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
/usr/local/texlive/2016/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2017/.../xstring.tex:% conditions of the LaTeX Project Public License, either version 1.3
/usr/local/texlive/2017/.../xstring.tex:% and version 1.3 or later is part of all distributions of LaTeX
/usr/local/texlive/2017/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
/usr/local/texlive/2017/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
/usr/local/texlive/2017/.../xstring.tex:%   - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
/usr/local/texlive/2017/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2018/.../xstring.tex:% !TeX encoding = ISO-8859-1
/usr/local/texlive/2018/.../xstring.tex:% Licence    : Released under the LaTeX Project Public License v1.3c %
/usr/local/texlive/2018/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2019/.../xstring.tex:% !TeX encoding = ISO-8859-1
/usr/local/texlive/2019/.../xstring.tex:% Licence    : Released under the LaTeX Project Public License v1.3c %
/usr/local/texlive/2019/.../xstring.tex:%     Plain eTeX.

Prior to 2018, the accents in comments were written as TeX control sequences,
so the file was pure ASCII (and hence also valid UTF-8), but not
intentionally so.

In 2018, the accents in comments became latin-1 characters.
A first line was added: % !TeX encoding = ISO-8859-1
to indicate this.

Such directive comments are useless, except at the beginning of the main 
document source.

They are for Front-End software, not TeX processing, right?


They're for front-end software, but not only for the main document 
source; any file could have an encoding directive to tell the editor how 
to load/save it.




Jonathan, David,
so far as I can tell, it was *never* in UTF-8 with preformed accents.




I have a copy of xstring.tex here (in an old TeXlive tree) that is dated

  \def\xstringversion {1.7c}
  \def\xstringdate{2013/10/13}

where many of the accents (in comments) are encoded "TeX-style" with 
control sequences, but there are also some that are literal accented 
letters -- and they're in utf-8. If I load this file as Latin-1 in my 
editor, those letters are garbled.


(They're even mixed with the TeX-style sequences within a single line, 
sometimes:


% 2) Ensuite, on d\'etokenize ce d\'eveloppement de faÃ§on n'avoir plus que

Notice what happened to "façon" there when read as Latin-1...)

It does sound like they later did a deliberate conversion to Latin-1 
(contrary to what I was guessing); this is unfortunate, in that it means 
the file will be mis-read by software that expects UTF-8, which is the 
de facto default encoding for text these days.


So I think switching to UTF-8 would be a better choice; if they don't 
want to do that, adding a \XeTeXinputencoding line would be helpful.
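
Concretely, the top of xstring.tex could then begin with something like this
(a sketch, not from the thread; the \ifdefined guard is my assumption,
leaning on the fact that xstring already requires e-TeX):

  % !TeX encoding = ISO-8859-1
  \ifdefined\XeTeXinputencoding \XeTeXinputencoding "iso-8859-1" \fi
  % (the switch takes effect from the next line onward, so every non-ASCII
  %  comment below this point would be decoded as Latin-1 under XeTeX)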


JK


Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Ross Moore
Hi David,

On 22 Feb 2021, at 8:43 am, David Carlisle <d.p.carli...@gmail.com> wrote:

Surely the line-end characters are already known, and the bits
have been read up to that point *before* tokenisation.

This is not a pdflatex inputenc style utf-8 error failing to map a stream of 
tokens.

It is at the file reading stage and if you have the file encoding wrong you do 
not know reliably what are the ends of lines and you haven't interpreted it as 
tex at all, so the comment character really can't have an effect here.

Ummm. Is that really how XeTeX does it?
How then does Jonathan’s
   \XeTeXdefaultencoding "iso-8859-1"
ever work ?
Just a rhetorical question; don’t bother answering.   :-)

This mapping is invisible to the tex macro layer just as you can change the 
internal character code mapping in classic tex to take an ebcdic stream, if you 
do that then read an ascii file you get rubbish with no hope to recover.



So I don't think such a switch should be automatic to avoid reporting encoding 
errors.

I reported the issue at xstring here
https://framagit.org/unbonpetit/xstring/-/issues/4


I looked at what you said here, and some of it doesn’t seem to be in accord with
my TeXLive installations.

viz.

/usr/local/texlive/2016/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
/usr/local/texlive/2016/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
/usr/local/texlive/2016/.../xstring.tex:%   - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
/usr/local/texlive/2016/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2017/.../xstring.tex:% conditions of the LaTeX Project Public License, either version 1.3
/usr/local/texlive/2017/.../xstring.tex:% and version 1.3 or later is part of all distributions of LaTeX
/usr/local/texlive/2017/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
/usr/local/texlive/2017/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
/usr/local/texlive/2017/.../xstring.tex:%   - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
/usr/local/texlive/2017/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2018/.../xstring.tex:% !TeX encoding = ISO-8859-1
/usr/local/texlive/2018/.../xstring.tex:% Licence    : Released under the LaTeX Project Public License v1.3c %
/usr/local/texlive/2018/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2019/.../xstring.tex:% !TeX encoding = ISO-8859-1
/usr/local/texlive/2019/.../xstring.tex:% Licence    : Released under the LaTeX Project Public License v1.3c %
/usr/local/texlive/2019/.../xstring.tex:%     Plain eTeX.

Prior to 2018, the accents in comments were written as TeX control sequences,
so the file was pure ASCII (and hence also valid UTF-8), but not
intentionally so.

In 2018, the accents in comments became latin-1 characters.
A first line was added:  % !TeX encoding = ISO-8859-1
to indicate this.

Such directive comments are useless, except at the beginning of the main 
document source.
They are for Front-End software, not TeX processing, right?

Jonathan, David,
so far as I can tell, it was *never* in UTF-8 with preformed accents.



David


that says what follows next is to be interpreted in a different way to what 
came previously?
Until the next switch that returns to UTF-8 or whatever?


If XeTeX is based on eTeX, then this should be possible in that setting.


Even replacing by U+FFFD
is being lenient.

Why has the mouth not realised that this information is to be discarded?
Then no replacement is required at all.

The file reading has failed  before any tex accessible processing has happened 
(see the ebcdic example in the TeXBook)

OK.
But that’s changing the meaning of bit-order, yes?
Surely we can be past that.



\danger \TeX\ always uses the internal character code of Appendix~C
for the standard ASCII characters,
regardless of what external coding scheme actually appears in the files
being read.  Thus, |b| is 98 inside of \TeX\ even when your computer
normally deals with ^{EBCDIC} or some other non-ASCII scheme; the \TeX\
software has been set up to convert text files to internal code, and to
convert back to the external code when writing text files.


the file encoding is failing at the  "convert text files to internal code" 
stage which is before the line buffer of characters is consulted to produce the 
stream of tokens based on catcodes.

Yes, OK; so my model isn’t up to it, as Bruno said.
 … And Jonathan has commented.

Also pdfTeX has no trouble with an xstring example.
It just seems pretty crazy that the comments need to be altered
for that package to be used with XeTeX.


David

Cheers, and thanks for this discussion.


Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.mo...@mq.edu.au
http://www.maths.mq.edu.au

Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Jonathan Kew

On 21/02/2021 21:48, Bruno Le Floch wrote:

I think your model of what XeTeX is doing is missing a step.  It's important to
distinguish two steps, which are a bit mixed up in some of the comments here.
I'm not 100% sure either, so perhaps more knowledgeable people can chime in.

- The file is read line by line; this step requires finding the end of lines,
hence must depend on some encoding (possibly XeTeX allows changing the encoding
for lines that are not yet read).  This puts *characters* (not bytes) in a
buffer.  This is also the step where the \endlinechar is inserted, so any change
to \endlinechar on a given line can only affect the next line.

- The characters are then turned into tokens, one token at a time.  Catcodes can
be changed within a line, and they affect what characters will combine into
tokens, even within the same line.

The problem here is at the first step, where XeTeX cannot find a valid line of
characters in the given encoding.  It might be possible to use package hooks to
change the encoding state for that particular package, but I haven't followed
carefully these new LaTeX developments.



Thanks for this explanation, Bruno -- you're quite right, this is an 
issue at the initial step of reading the external file into the input 
buffer (of characters, not bytes), one line at a time. For this, the 
encoding must be known, and at this stage nothing TeX-ish such as 
\catcode values is yet in play.


Each input file has an encoding associated with it at the time it is 
opened. By default this will be UTF-8, but a different default can be 
set using \XeTeXdefaultencoding; so a workaround for this specific 
problem is to change the default before loading the package, and then 
reset it afterwards.
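
In a LaTeX preamble that workaround comes down to the following (a sketch of
the same snippet posted earlier in this thread):

  \XeTeXdefaultencoding "iso-8859-1" % files opened from now on default to Latin-1
  \usepackage{xstring}               % so xstring.tex is decoded as Latin-1
  \XeTeXdefaultencoding "utf-8"      % restore the usual default for later files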


The encoding used to interpret the *current* input file can also be 
changed on the fly, using \XeTeXinputencoding. This will take effect for 
the *next* line after the line on which it occurs (which has, after all, 
already been decoded from bytes to characters on its way in to the 
buffer, before the \XeTeXinputencoding command could be recognized at all).


This means that if the xstring package maintainers *really* want to keep 
their file in Latin-1 (which I doubt), they could avoid the issue here 
by putting something like


  \ifXeTeX
\XeTeXinputencoding "iso-8859-1"
  \fi

at the top of the file, before any non-ASCII characters occur. But I 
suspect the change of encoding was inadvertent and they should just 
change it back to utf-8, and the problem will go away.


JK


Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Bruno Le Floch
Hi Ross,

On 2/21/21 10:42 PM, Ross Moore wrote:
> Hi Ulrike,
> 
>> On 22 Feb 2021, at 7:52 am, Ulrike Fischer wrote:
>>
>> Am Sun, 21 Feb 2021 20:26:04 + schrieb Ross Moore:
>>
>> > Once you have encountered the (correct) comment character,
>> > what follows on the rest of the line is going to be discarded,
>> > so its encoding is surely irrelevant.
>> >
>> > Why should the whole line need to be fully tokenised,
>> > before the decision is taken as to what part of it is retained?
>>
>> Well you need to find the end of the line to know where to stop with
>> the discarding don't you? So you need to inspect the part after the
>> comment char until you find something that says "newline”.
> 
> My understanding is that this *is* done first.
> Similarly to TeX's  \read <number> to <control sequence>,  which grabs a line
> of input from a file,
> before doing the tokenisation and storing the result in the <control sequence>.
>    page 217 of The TeXbook
> 
> If I’m wrong with this, for high-speed input, then yes you need to know where 
> to
> stop.
> But that’s just as easy, since you stop when a byte is to be tokenised
> as an end-of-line character, and these are known. 
> You need this anyway, even when you have tokenised every byte.
> 
> 
> So all we are saying is that when handling the bytes between
> a comment and its end-of-line, just be a bit more careful.
> 
> It’s not necessary for each byte to be tokenised as valid for UTF-8.
> Maybe change the (Warning) message when you know that you are within
> such a comment, to say so.  That would be more meaningful to a 
> package-writer, 
> and to an author who uses the package, looks in the .log file, and sees the 
> message.
> 
> None of this is changing how the file is ultimately processed;
> it’s just about being friendlier in the human interface.


I think your model of what XeTeX is doing is missing a step.  It's important to
distinguish two steps, which are a bit mixed up in some of the comments here.
I'm not 100% sure either, so perhaps more knowledgeable people can chime in.

- The file is read line by line; this step requires finding the end of lines,
hence must depend on some encoding (possibly XeTeX allows changing the encoding
for lines that are not yet read).  This puts *characters* (not bytes) in a
buffer.  This is also the step where the \endlinechar is inserted, so any change
to \endlinechar on a given line can only affect the next line.

- The characters are then turned into tokens, one token at a time.  Catcodes can
be changed within a line, and they affect what characters will combine into
tokens, even within the same line.

The problem here is at the first step, where XeTeX cannot find a valid line of
characters in the given encoding.  It might be possible to use package hooks to
change the encoding state for that particular package, but I haven't followed
carefully these new LaTeX developments.
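
A tiny plain-TeX experiment makes the two steps visible (an illustrative
sketch, not from the thread; run it with e.g. pdftex or xetex):

  % Step 2: catcodes act within a line: after the \catcode assignment the
  % final ! on this same line is already active, so it expands to "bang":
  \catcode`\!=\active \def!{bang}!
  % Step 1: a line is buffered (and \endlinechar appended) before any of its
  % tokens are executed, so this assignment affects only the lines after it:
  \endlinechar=`\X
  this line is buffered with an X appended to it
  \endlinechar=13 % restore the default; the X appended here dies in this comment
  \bye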

Best,
Bruno


Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread David Carlisle
On Sun, 21 Feb 2021 at 20:27, Ross Moore  wrote:

> Hi David,
>
> Surely the line-end characters are already known, and the bits
> have been read up to that point *before* tokenisation.
>

This is not a pdflatex inputenc style utf-8 error failing to map a stream
of tokens.

It is at the file reading stage and if you have the file encoding wrong you
do not know reliably what are the ends of lines and you haven't interpreted
it as tex at all, so the comment character really can't have an effect
here. This mapping is invisible to the tex macro layer just as you can
change the internal character code mapping in classic tex to take an ebcdic
stream, if you do that then read an ascii file you get rubbish with no hope
to recover.


So provided the tokenisation of the comment character has occurred before
> tackling what comes after it, why would there be a problem?
>
> ... just guessing the encoding (which means guessing where the line and so
> the comment ends)
> is just guesswork.
>
>
> No guesswork intended.
>
>
>> The file encoding specifies the byte stream interpretation before any tex
>> tokenization
>> If the file can not be interpreted as utf-8 then it can't be interpreted
>> at all.
>>
>>
>> Why not?
>> Why can you not have a macro — presumably best on a single line by itself
>> –
>>
>
> there is an xetex   primitive that switches the encoding as Jonathan
> showed, but  guessing a different encoding
> if a file fails to decode properly against a specified encoding is a
> dangerous game to play.
>
>
> I don’t think anyone is asking for that.
>
> I can imagine situations where coding for packages that used to work well
> without UTF-8 may well carry comments involving non-UTF-8 characters.
> (Indeed, there could even be binary bit-mapped images within comment
> sections;
> having bytes not intended to represent any characters at all, in any
> encoding.)
>

That really isn't possible. You are decoding a byte stream as UTF-8; once
you get to a section that does not decode, you could delete it or replace it
byte by byte with the Unicode replacement character, but after that everything
is guesswork and heuristics: just because some later section happens to
decode without error doesn't mean it was correctly decoded as intended.
Imagine if the section had been in UTF-16 rather than latin-1: it is quite
possible to have a stream of bytes that is valid utf-8 and valid utf-16,
and there is no way to step over a commented-out utf-16 section and know
when to switch back to utf-8.



> If such files are now subjected to constraints that formerly did not exist,
> then this is surely not a good thing.
>

That is not what happened here. The constraints always existed. It is not
that the processing changed: the file, which used to be distributed in
UTF-8, is now distributed in latin-1, so it gives warnings when read as UTF-8.



>
> Besides, not all the information required to build PDFs need be related to
> putting characters onscreen, through the typesetting engine.
>
> For example, when building fully-tagged PDFs, there can easily be more
> information
> overall within the tagging (both structure and content) than in the visual
> content itself.
> Thank goodness for Heiko’s packages that allow for re-encoding strings
> between
> different formats that are valid for inclusion within parts of a PDF.
>

But the packages require the files to be read correctly, and that is what
is not happening.


> I’m thinking here about how a section-title appears in:
>  bookmarks, ToC entries, tag-titles, /Alt strings, annotation text for
> hyperlinking, etc.
> as well as visually typeset for on-screen.
> These different representations need to be either derivable from a common
> source,
> or passed in as extra information, encoded appropriately (and not
> necessarily UTF-8).
>
Sure, but that is not related to the problem here, which is that the source
file cannot be read, or rather that it is being incorrectly read as UTF-8
when it is latin-1.

So I don't think such a switch should be automatic to avoid reporting
> encoding errors.
>
> I reported the issue at xstring here
> https://framagit.org/unbonpetit/xstring/-/issues/4
>
>
> David
>
>
> that says what follows next is to be interpreted in a different way to
>> what came previously?
>> Until the next switch that returns to UTF-8 or whatever?
>>
>>
>> If XeTeX is based on eTeX, then this should be possible in that setting.
>>
>>
>> Even replacing by U+FFFD
>> is being lenient.
>>
>>
> Why has the mouth not realised that this information is to be discarded?
> Then no replacement is required at all.
>

The file reading has failed  before any tex accessible processing has
happened (see the ebcdic example in the TeXBook)

\danger \TeX\ always uses the internal character code of Appendix~C
for the standard ASCII characters,
regardless of what external coding scheme actually appears in the files
being read.  Thus, |b| is 98 inside of \TeX\ even when your computer
normally deals with ^{EBCDIC} or some other non-ASCII scheme; the \TeX\
software has been set up to convert text files to internal code, and to
convert back to the external code when writing text files.

Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Ross Moore
Hi Ulrike,

On 22 Feb 2021, at 7:52 am, Ulrike Fischer <ne...@nililand.de> wrote:

Am Sun, 21 Feb 2021 20:26:04 + schrieb Ross Moore:

> Once you have encountered the (correct) comment character,
> what follows on the rest of the line is going to be discarded,
> so its encoding is surely irrelevant.
>
> Why should the whole line need to be fully tokenised,
> before the decision is taken as to what part of it is retained?

Well you need to find the end of the line to know where to stop with
the discarding don't you? So you need to inspect the part after the
comment char until you find something that says "newline”.

My understanding is that this *is* done first.
Similarly to TeX's  \read <number> to <control sequence>,  which grabs a line of input from a
file,
before doing the tokenisation and storing the result in the <control sequence>.
   page 217 of The TeXbook

If I’m wrong with this, for high-speed input, then yes you need to know where 
to stop.
But that’s just as easy, since you stop when a byte is to be tokenised
as an end-of-line character, and these are known.
You need this anyway, even when you have tokenised every byte.


So all we are saying is that when handling the bytes between
a comment and its end-of-line, just be a bit more careful.

It’s not necessary for each byte to be tokenised as valid for UTF-8.
Maybe change the (Warning) message when you know that you are within
such a comment, to say so.  That would be more meaningful to a package-writer,
and to an author who uses the package, looks in the .log file, and sees the 
message.

None of this is changing how the file is ultimately processed;
it’s just about being friendlier in the human interface.




--
Ulrike Fischer
https://www.troubleshooting-tex.de/


All the best.

Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.mo...@mq.edu.au
http://www.maths.mq.edu.au




Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Ulrike Fischer
Am Sun, 21 Feb 2021 20:26:04 + schrieb Ross Moore:

> Once you have encountered the (correct) comment character,
> what follows on the rest of the line is going to be discarded,
> so its encoding is surely irrelevant.
> 
> Why should the whole line need to be fully tokenised,
> before the decision is taken as to what part of it is retained?

Well you need to find the end of the line to know where to stop with
the discarding don't you? So you need to inspect the part after the
comment char until you find something that says "newline".



-- 
Ulrike Fischer 
https://www.troubleshooting-tex.de/



Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Ross Moore
Hi David,

On 21 Feb 2021, at 11:02 pm, David Carlisle <d.p.carli...@gmail.com> wrote:


I don't think there is any reasonable way to say you can comment out parts of a 
file in a different encoding.

I’m not convinced that this ought to be correct for TeX-based software.

TeX (not necessarily XeTeX) has always operated as a finite-state machine.
It *should* be possible to say that this part is encoded as such-and-such,
and a later part encoded differently.

I fully understand that editor software external to TeX might well have 
difficulties
with files that mix encodings this way, but TeX itself has always been 
byte-based
and should remain that way.

A comment character is meant to be viewed as saying that:
 *everything else on this line is to be ignored*
– that’s the impression given by TeX documentation.


But you only know it is a comment character if you can interpret the incoming
byte stream.
If there are encoding errors in that byte stream then everything else is
guesswork.

Who said anything about errors in the byte stream?
Once you have encountered the (correct) comment character,
what follows on the rest of the line is going to be discarded,
so its encoding is surely irrelevant.

Why should the whole line need to be fully tokenised,
before the decision is taken as to what part of it is retained?

In the case of a package file, rather than author input for typesetting,
the intention of the coding is completely unknown;
it is probably all ASCII anyway, except (as in this case) for comments intended
for human eyes only, following a properly declared comment character.


In this particular case with mostly ascii text and a few latin-1 characters it 
may be that you can guess that
the invalid utf-8 is in fact valid latin1 and interpret it that way,

You don’t need to interpret it as anything; that part is to be discarded.

and the guess would be right for this file
but what if the non-utf8 file were utf-16 or latin-2  or

Surely the line-end characters are already known, and the bits
have been read up to that point *before* tokenisation.
So provided the tokenisation of the comment character has occurred before
tackling what comes after it, why would there be a problem?

... just guessing the encoding (which means guessing where the line and so the 
comment ends)
is just guesswork.

No guesswork intended.


The file encoding specifies the byte stream interpretation before any tex 
tokenization
If the file can not be interpreted as utf-8 then it can't be interpreted at all.

Why not?
Why can you not have a macro — presumably best on a single line by itself –

there is an xetex   primitive that switches the encoding as Jonathan showed, 
but  guessing a different encoding
if a file fails to decode properly against a specified encoding is a dangerous 
game to play.

I don’t think anyone is asking for that.

I can imagine situations where coding for packages that used to work well
without UTF-8 may well carry comments involving non-UTF-8 characters.
(Indeed, there could even be binary bit-mapped images within comment sections;
having bytes not intended to represent any characters at all, in any encoding.)

If such files are now subjected to constraints that formerly did not exist,
then this is surely not a good thing.


Besides, not all the information required to build PDFs need be related to
putting characters onscreen, through the typesetting engine.

For example, when building fully-tagged PDFs, there can easily be more 
information
overall within the tagging (both structure and content) than in the visual 
content itself.
Thank goodness for Heiko’s packages that allow for re-encoding strings between
different formats that are valid for inclusion within parts of a PDF.

I’m thinking here about how a section-title appears in:
 bookmarks, ToC entries, tag-titles, /Alt strings, annotation text for 
hyperlinking, etc.
as well as visually typeset for on-screen.
These different representations need to be either derivable from a common 
source,
or passed in as extra information, encoded appropriately (and not necessarily 
UTF-8).


So I don't think such a switch should be automatic to avoid reporting encoding 
errors.

I reported the issue at xstring here
https://framagit.org/unbonpetit/xstring/-/issues/4


David


that says what follows next is to be interpreted in a different way to what 
came previously?
Until the next switch that returns to UTF-8 or whatever?


If XeTeX is based on eTeX, then this should be possible in that setting.


Even replacing by U+FFFD
is being lenient.

Why has the mouth not realised that this information is to be discarded?
Then no replacement is required at all.


David





Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.mo...@mq.edu.au

Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread David Carlisle
On Sun, 21 Feb 2021 at 11:47, Ross Moore  wrote:

> Hi David.
>
> On 21 Feb 2021, at 10:12 pm, David Carlisle <d.p.carli...@gmail.com> wrote:
>
> I think that should be taken up with the xstring maintainers.
>
>
> Is  xstring  intended for use with XeTeX ?
> I suspect not.
> But anyway, there are still issues with this.
>
> (BTW, I wrote this before Jonathan Kew’s response.)
>
>
> I don't think there is any reasonable way to say you can comment out parts
> of a file in a different encoding.
>
>
> I’m not convinced that this ought to be correct for TeX-based software.
>
> TeX (not necessarily XeTeX) has always operated as a finite-state machine.
> It *should* be possible to say that this part is encoded as such-and-such,
> and a later part encoded differently.
>
> I fully understand that editor software external to TeX might well have
> difficulties
> with files that mix encodings this way, but TeX itself has always been
> byte-based
> and should remain that way.
>
> A comment character is meant to be viewed as saying that:
>  *everything else on this line is to be ignored*
> – that’s the impression given by TeX documentation.
>


But you only know it is a comment character if you can interpret the
incoming byte stream.
If there are encoding errors in that byte stream then everything else is
guesswork.

In this particular case, with mostly ascii text and a few latin-1 characters,
it may be that you can guess that the invalid utf-8 is in fact valid latin1
and interpret it that way, and the guess would be right for this file;
but what if the non-utf8 file were utf-16 or latin-2 or ...? Just guessing
the encoding (which means guessing where the line, and so the comment, ends)
is guesswork.



> If it is the documentation that is incorrect, then it should certainly be
> clarified.
>
> For XeTeX and this particular example, it’s probably just a matter of
> checking
> that the non-UTF8 characters occur *after* a UTF-8  ‘%' , and not issuing
> an error message under these conditions.
> A warning, maybe, but not an error.
>



>
> The file encoding specifies the byte stream interpretation before any tex
> tokenization
> If the file can not be interpreted as utf-8 then it can't be interpreted
> at all.
>
>
> Why not?
> Why can you not have a macro — presumably best on a single line by itself –
>

there is an xetex   primitive that switches the encoding as Jonathan
showed, but  guessing a different encoding
if a file fails to decode properly against a specified encoding is a
dangerous game to play.
So I don't think such a switch should be automatic to avoid reporting
encoding errors.

I reported the issue at xstring here
https://framagit.org/unbonpetit/xstring/-/issues/4


David


that says what follows next is to be interpreted in a different way to what
> came previously?
> Until the next switch that returns to UTF-8 or whatever?
>
>
> If XeTeX is based on eTeX, then this should be possible in that setting.
>
>
> Even replacing by U+FFFD
> is being lenient.
>
> David
>
>
>
>
> On Sun, 21 Feb 2021 at 11:04, jfbu  wrote:
>
>> Hi,
>>
>> consider this
>>
>> \documentclass{article}
>> \usepackage{xstring}
>> \begin{document}
>> \end{document}
>>
>> and call it xexstring.tex
>>
>> Then xelatex xexstring triggers 136 warnings of the type
>>
>> Invalid UTF-8 byte or sequence at line 35 replaced by U+FFFD.
>>
>> Looking at file
>>
>> /usr/local/texlive/2020/texmf-dist/tex/generic/xstring/xstring.tex
>>
>> I see that this matches with use of latin-1 encoded characters in
>> comments.
>>
>> Notice that it is a not a user decision here to use a latin-1
>> encoded file.
>>
>> In fact I encountered this in a file I was given where
>> xstring package was loaded by another package.
>>
>> Regards,
>>
>> Jean-François
>>
>
>
> Cheers.
>
> Ross
>
>
> Dr Ross Moore
> Department of Mathematics and Statistics
> 12 Wally’s Walk, Level 7, Room 734
> Macquarie University, NSW 2109, Australia
> T: +61 2 9850 8955  |  F: +61 2 9850 8114
> M:+61 407 288 255  |  E: ross.mo...@mq.edu.au
> http://www.maths.mq.edu.au
>
>
>


Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Ross Moore
Hi David.

On 21 Feb 2021, at 10:12 pm, David Carlisle <d.p.carli...@gmail.com> wrote:

I think that should be taken up with the xstring maintainers.

Is  xstring  intended for use with XeTeX ?
I suspect not.
But anyway, there are still issues with this.

(BTW, I wrote this before Jonathan Kew’s response.)


I don't think there is any reasonable way to say you can comment out parts of a 
file in a different encoding.

I’m not convinced that this ought to be correct for TeX-based software.

TeX (not necessarily XeTeX) has always operated as a finite-state machine.
It *should* be possible to say that this part is encoded as such-and-such,
and a later part encoded differently.

I fully understand that editor software external to TeX might well have 
difficulties
with files that mix encodings this way, but TeX itself has always been 
byte-based
and should remain that way.

A comment character is meant to be viewed as saying that:
 *everything else on this line is to be ignored*
– that’s the impression given by TeX documentation.

If it is the documentation that is incorrect, then it should certainly be 
clarified.

For XeTeX and this particular example, it’s probably just a matter of checking
that the non-UTF8 characters occur *after* a UTF-8  ‘%' , and not issuing
an error message under these conditions.
A warning, maybe, but not an error.


The file encoding specifies the byte stream interpretation before any tex 
tokenization
If the file can not be interpreted as utf-8 then it can't be interpreted at all.

Why not?
Why can you not have a macro — presumably best on a single line by itself –
that says what follows next is to be interpreted in a different way to what 
came previously?
Until the next switch that returns to UTF-8 or whatever?


If XeTeX is based on eTeX, then this should be possible in that setting.


Even replacing by U+FFFD
is being lenient.

David




On Sun, 21 Feb 2021 at 11:04, jfbu <j...@free.fr> wrote:
Hi,

consider this

\documentclass{article}
\usepackage{xstring}
\begin{document}
\end{document}

and call it xexstring.tex

Then xelatex xexstring triggers 136 warnings of the type

Invalid UTF-8 byte or sequence at line 35 replaced by U+FFFD.

Looking at file

/usr/local/texlive/2020/texmf-dist/tex/generic/xstring/xstring.tex

I see that this matches with use of latin-1 encoded characters in comments.

Notice that it is a not a user decision here to use a latin-1
encoded file.

In fact I encountered this in a file I was given where
xstring package was loaded by another package.

Regards,

Jean-François


Cheers.

Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.mo...@mq.edu.au
http://www.maths.mq.edu.au




Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread Jonathan Kew

On 21/02/2021 11:12, David Carlisle wrote:
> I think that should be taken up with the xstring maintainers.

Yes, I would agree this is an xstring problem.

It looks like in an older version the file was utf-8. I suspect someone 
saved it as Latin-1 in the course of editing, probably without realising 
it at the time.


As a workaround you could try

  \documentclass{article}
  \XeTeXdefaultencoding "iso-8859-1"
  \usepackage{xstring}
  \XeTeXdefaultencoding "utf-8"
  \begin{document}
  \end{document}

to change xetex's default while loading the file.
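
(Resetting the default immediately after the \usepackage line is safe
because, as explained elsewhere in the thread, each input file has its
encoding fixed at the time it is opened: xstring.tex keeps the Latin-1
default it was opened under even after the default reverts to utf-8.)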

JK

>
> I don't think there is any reasonable way to say you can comment out
> parts of a file in a different encoding.
>
> The file encoding specifies the byte stream interpretation before any
> tex tokenization
> If the file can not be interpreted as utf-8 then it can't be
> interpreted at all. Even replacing by U+FFFD
> is being lenient.
>
> David
>
>
>
>
> On Sun, 21 Feb 2021 at 11:04, jfbu <j...@free.fr> wrote:

>
> Hi,
>
> consider this
>
> \documentclass{article}
> \usepackage{xstring}
> \begin{document}
> \end{document}
>
> and call it xexstring.tex
>
> Then xelatex xexstring triggers 136 warnings of the type
>
> Invalid UTF-8 byte or sequence at line 35 replaced by U+FFFD.
>
> Looking at file
>
> /usr/local/texlive/2020/texmf-dist/tex/generic/xstring/xstring.tex
>
> I see that this matches with use of latin-1 encoded characters in
> comments.
>
> Notice that it is a not a user decision here to use a latin-1
> encoded file.
>
> In fact I encountered this in a file I was given where
> xstring package was loaded by another package.
>
> Regards,
>
> Jean-François
>
>



Re: [XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

2021-02-21 Thread David Carlisle
I think that should be taken up with the xstring maintainers.

I don't think there is any reasonable way to say you can comment out parts
of a file in a different encoding.

The file encoding specifies the byte stream interpretation before any tex
tokenization
If the file can not be interpreted as utf-8 then it can't be interpreted at
all. Even replacing by U+FFFD
is being lenient.

David




On Sun, 21 Feb 2021 at 11:04, jfbu  wrote:

> Hi,
>
> consider this
>
> \documentclass{article}
> \usepackage{xstring}
> \begin{document}
> \end{document}
>
> and call it xexstring.tex
>
> Then xelatex xexstring triggers 136 warnings of the type
>
> Invalid UTF-8 byte or sequence at line 35 replaced by U+FFFD.
>
> Looking at file
>
> /usr/local/texlive/2020/texmf-dist/tex/generic/xstring/xstring.tex
>
> I see that this matches with use of latin-1 encoded characters in comments.
>
> Notice that it is a not a user decision here to use a latin-1
> encoded file.
>
> In fact I encountered this in a file I was given where
> xstring package was loaded by another package.
>
> Regards,
>
> Jean-François
>
>
>