On Sun, 21 Feb 2021 at 20:27, Ross Moore <ross.mo...@mq.edu.au> wrote:
> Hi David,
>
> Surely the line-end characters are already known, and the bits&bytes
> have been read up to that point *before* tokenisation.

This is not a pdflatex/inputenc-style UTF-8 error, where a stream of tokens fails to map. It happens at the file-reading stage: if you have the file encoding wrong, you do not know reliably where the lines end, and you have not interpreted the input as TeX at all, so the comment character really can't have an effect here. This mapping is invisible to the TeX macro layer, just as in classic TeX you can change the internal character-code mapping to take an EBCDIC stream; if you do that and then read an ASCII file, you get rubbish with no hope of recovery.

> So provided the tokenisation of the comment character has occurred before
> tackling what comes after it, why would there be a problem?
>
> ... just guessing the encoding (which means guessing where the line and
> so the comment ends) is just guesswork.
>
> No guesswork intended.
>
>> The file encoding specifies the byte stream interpretation before any
>> tex tokenization.
>> If the file can not be interpreted as utf-8 then it can't be
>> interpreted at all.
>>
>> Why not?
>> Why can you not have a macro — presumably best on a single line by
>> itself –
>
> there is an xetex primitive that switches the encoding, as Jonathan
> showed, but guessing a different encoding if a file fails to decode
> properly against a specified encoding is a dangerous game to play.
>
> I don’t think anyone is asking for that.
>
> I can imagine situations where coding for packages that used to work
> well without UTF-8 may well be commented, involving non-UTF-8
> characters. (Indeed, there could even be binary bit-mapped images
> within comment sections, having bytes not intended to represent any
> characters at all, in any encoding.)

That really isn't possible.
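To make the ordering concrete, here is a small Python sketch (purely illustrative, nothing to do with TeX's actual implementation): the decode of the raw bytes happens before any notion of a comment character exists, so a "%" earlier on the line cannot protect a byte that is invalid in the declared encoding.

```python
# Hypothetical illustration: decoding raw bytes precedes any tokenisation,
# so the '%' comment character cannot shield an invalid byte.
data = b"% comment: caf\xe9\n\\section{Ok}\n"  # 0xE9 is e-acute in latin-1

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err)  # fails on byte 0xE9, '%' notwithstanding

print(data.decode("latin-1"))                  # fine with the right encoding
print(data.decode("utf-8", errors="replace"))  # lenient: 0xE9 becomes U+FFFD
```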
You are decoding a byte stream as UTF-8. Once you reach a section that does not decode, you could delete it, or replace it byte by byte with the Unicode replacement character, but after that everything is guesswork and heuristics: just because some later section happens to decode without error doesn't mean it was decoded as intended. Imagine the section had been in UTF-16 rather than latin-1. It is quite possible for a stream of bytes to be valid UTF-8 and valid UTF-16 at the same time, so there is no way to step over a commented-out UTF-16 section and know when to switch back to UTF-8.

> If such files are now subjected to constraints that formerly did not
> exist, then this is surely not a good thing.

That is not what happened here: the constraints always existed. It is not the processing that changed. The file, which used to be distributed in UTF-8, is now distributed in latin-1, so it gives warnings if read as UTF-8.

> Besides, not all the information required to build PDFs need be related
> to putting characters onscreen, through the typesetting engine.
>
> For example, when building fully-tagged PDFs, there can easily be more
> information overall within the tagging (both structure and content)
> than in the visual content itself.
> Thank goodness for Heiko’s packages that allow for re-encoding strings
> between different formats that are valid for inclusion within parts of
> a PDF.

But the packages require the files to be read correctly, and that is what is not happening.

> I’m thinking here about how a section-title appears in:
> bookmarks, ToC entries, tag-titles, /Alt strings, annotation text for
> hyperlinking, etc., as well as visually typeset for on-screen.
> These different representations need to be either derivable from a
> common source, or passed in as extra information, encoded appropriately
> (and not necessarily UTF-8).
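The UTF-8/UTF-16 ambiguity is easy to demonstrate; a minimal Python sketch (again only an illustration, not anything a TeX engine does) shows the same bytes decoding without error under both encodings, giving completely different text, so a clean decode of a later section proves nothing about the intended encoding.

```python
# An even-length run of ASCII bytes is simultaneously valid UTF-8 and
# valid UTF-16-LE; both decodes "succeed", but only one is what was meant.
data = b"no comment"                 # ten bytes of plain ASCII
as_utf8 = data.decode("utf-8")       # 'no comment'
as_utf16 = data.decode("utf-16-le")  # five CJK characters, equally "valid"
print(repr(as_utf8), repr(as_utf16))
```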
Sure, but that is not related to the problem here, which is that the source file can not be read, or rather that it is being incorrectly read as UTF-8 when it is latin-1.

> So I don't think such a switch should be automatic, to avoid reporting
> encoding errors.
>
> I reported the issue at xstring here
> https://framagit.org/unbonpetit/xstring/-/issues/4
>
> David

>> that says what follows next is to be interpreted in a different way to
>> what came previously?
>> Until the next switch that returns to UTF-8 or whatever?
>>
>> If XeTeX is based on eTeX, then this should be possible in that
>> setting.
>>
>> Even replacing by U+FFFD is being lenient.
>
> Why has the mouth not realised that this information is to be
> discarded? Then no replacement is required at all.

The file reading has failed before any TeX-accessible processing has happened; see the EBCDIC example in The TeXBook:

  \danger \TeX\ always uses the internal character code of Appendix~C
  for the standard ASCII characters, regardless of what external coding
  scheme actually appears in the files being read. Thus, |b| is 98
  inside of \TeX\ even when your computer normally deals with ^{EBCDIC}
  or some other non-ASCII scheme; the \TeX\ software has been set up to
  convert text files to internal code, and to convert back to the
  external code when writing text files.

The file encoding is failing at the "convert text files to internal code" stage, which is before the line buffer of characters is consulted to produce the stream of tokens based on catcodes.

David
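P.S. The TeXBook's "convert text files to internal code" stage can be mimicked in a few lines of Python (purely illustrative; cp500 is one of IBM's EBCDIC code pages): if the converter is set up for one external code and the file is actually in another, the damage is done before anything downstream ever sees the text.

```python
# Set the "external coding scheme" wrong and the converted text is rubbish
# before any later stage (catcodes, comments) gets a chance to look at it.
ebcdic_bytes = "hello".encode("cp500")  # EBCDIC code page 500
print(ebcdic_bytes)                     # b'\x88\x85\x93\x93\x96' -- not ASCII
print(ebcdic_bytes.decode("latin-1"))   # misread with the wrong converter
print(ebcdic_bytes.decode("cp500"))     # right converter recovers 'hello'
```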