Re: [NTG-context] Unicode question
On 3/12/2015 9:41 PM, luigi scarso wrote:
> On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen wrote:
> > it's actually a bug ...
> >
> > it is ok to map an invalid character in the input to 0xFFFD, halt and
> > continue when permitted, but the method used in luatex thereby
> > obscures a valid 0xFFFD in the input
>
> FFFD REPLACEMENT CHARACTER
> • used to replace an incoming character whose value is unknown or
>   unrepresentable in Unicode

the question is not what to do when an invalid character comes in: in that case luatex can replace it by 0xFFFD and issue an error as now. but when the input itself contains 0xFFFD, luatex should just carry on, as 0xFFFD is a *valid* character

it is quite easy for a macro package to trigger an error, as \catcode"FFFD=15 will do that, but it's impossible for a macro package to intercept the weird interception by luatex's input handler

> The meaning of FFFD is not "typeset a question mark on a black box" as
> in � (which depends on the font in any case, so in principle it's
> possible to see something completely different in a new version of the
> font) but to signal something potentially wrong, with a symbol that
> currently in most cases is �. Misusing the meaning is not bad per se,
> but in this specific case I think luatex is correct to be conservative
> and ask the user what to do; context --batchmode typesets the
> document, writes the messages to the log, and ends with -1, so an
> automatic agent is also alerted.

you cannot force a user to use \batchmode, and -1 would abort a wrapper, thereby leading to an invalid document; it means that luatex can never typeset a document where char 0xFFFD is being typeset, and luatex should not be normative

not accepting 0xFFFD in the input is a bug

Hans

Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
___
If your question is of interest to others as well, please add an entry to the Wiki!
maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context webpage : http://www.pragma-ade.nl / http://tex.aanhet.net archive : http://foundry.supelec.fr/projects/contextrev/ wiki : http://contextgarden.net ___
Re: [NTG-context] Unicode question
On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen wrote:
> it's actually a bug ... it is ok to map an invalid character in the input
> to 0xFFFD, halt and continue when permitted, but the method used in luatex
> thereby obscures a valid 0xFFFD in the input

FFFD REPLACEMENT CHARACTER
• used to replace an incoming character whose value is unknown or
  unrepresentable in Unicode

The meaning of FFFD is not "typeset a question mark on a black box" as in � (which depends on the font in any case, so in principle it's possible to see something completely different in a new version of the font) but to signal something potentially wrong, with a symbol that currently in most cases is �.

Misusing the meaning is not bad per se, but in this specific case I think luatex is correct to be conservative and ask the user what to do; context --batchmode typesets the document, writes the messages to the log, and ends with -1, so an automatic agent is also alerted.

--
luigi
Re: [NTG-context] Unicode question
On 3/12/2015 7:08 PM, Manfred Lotz wrote:
> Hi Arthur,
>
> On Thu, 12 Mar 2015 16:35:47 + Arthur Reutenauer wrote:
> > > The luatex code contains the lines (in unistring.w)
> > >
> > >     if (val == 0xFFFD)
> > >         utf_error();
> > >     return (val);
> > >
> > > in a function str2uni. I didn't really try to understand the code
> > > but it looks as if 0xFFFD is used as "invalid marker":
> >
> > Interesting. This is not actually correct, U+FFFD is a valid Unicode
> > character; it would be better to use U+FFFE or U+ for that.
> >
> > Note that U+FFFD is the recommended character to use when a character
> > can't be recognised while converting to Unicode from another
> > encoding, so its presence is usually a sign that something went wrong
> > upstream, but I assume Manfred is aware of that.
>
> Yes, I'm aware of that. So I also think that it isn't correct to use
> U+FFFD for this. Your suggestion of using either U+FFFE or U+ sounds
> good as both are really invalid.

it's an attempt to recover, but in the process a normal 0xFFFD triggers an error too; recovering to 0xFFFD for a really invalid input is ok, as tex does that in more cases: i expected a } so i insert one here ... cross your fingers etc

Hans
Re: [NTG-context] Unicode question
On 3/12/2015 6:57 PM, Manfred Lotz wrote:
> On Thu, 12 Mar 2015 16:41:59 +0100 Ulrike Fischer wrote:
> > On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:
> > > Hi all,
> > > If I run this minimal example
> > >
> > > \starttext
> > >
> > > �
> > >
> > > \stopluacode
> > >
> > > \stoptext
> > >
> > > I get
> > >
> > > tex error > error on line 3 in file /data/tmp/u1.tex: ! String
> > > contains an invalid utf-8 sequence
> > >
> > > and some more lines.
> > >
> > > The character above is:
> > >
> > > Character: �
> > > Character name: REPLACEMENT CHARACTER
> > > Charblock: Specials
> > > Category: Other symbol
> > > Unicode: U+fffd
> > > UTF8: 0xefbfbd
> > >
> > > which is a valid utf8 character.
> > >
> > > Questions:
> > >
> > > 1. Why is it considered to be invalid?
> >
> > This is not a context question/problem but related to the binary
> > (you would get the same error with lualatex or plain)
>
> Yes, I know.
>
> > The luatex code contains the lines (in unistring.w)
> >
> >     if (val == 0xFFFD)
> >         utf_error();
> >     return (val);
> >
> > in a function str2uni. I didn't really try to understand the code
> > but it looks as if 0xFFFD is used as "invalid marker": If luatex
> > encounters something that isn't valid utf8 it maps val to 0xFFFD and
> > then tests against 0xFFFD to raise an error.
>
> Took me a while to find the repository but finally I got it.
>
> > > 2. Are there other valid utf8 characters which are considered
> > > invalid?
> >
> > The comment in the code says
> >
> > /* the 5- and 6-byte UTF-8 sequences generate integers
> >    that are outside of the valid UCS range, and therefore
> >    unsupported
> >  */
>
> Well, it is called REPLACEMENT CHARACTER, and it seems that this
> character will be used to replace invalid characters. Then
>
>     if (val == 0xFFFD) utf_error();
>
> causes the error message
>
>     tex_error("String contains an invalid utf-8 sequence", hlp);
>
> to be displayed.
>
> Ok, this answers my question. Thanks for the pointer.

it's actually a bug ...

it is ok to map an invalid character in the input to 0xFFFD, halt and continue when permitted, but the method used in luatex thereby obscures a valid 0xFFFD in the input

Hans
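Note: one way to keep the recovery to 0xFFFD while still accepting a literal U+FFFD is to report validity separately from the decoded value. The following is only a minimal sketch of that idea, not a patch against luatex: the names `utf8_status` and `utf8_decode` are hypothetical and do not come from unistring.w.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status type: validity travels next to the value, not inside it. */
typedef enum { UTF8_OK, UTF8_INVALID } utf8_status;

/* Decode one code point from s (n bytes available) into *out.
   On malformed input *out is set to U+FFFD as a recovery value and
   UTF8_INVALID is returned; a well-formed U+FFFD returns UTF8_OK. */
static utf8_status utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    *out = 0xFFFD;                        /* recovery value on every failure path */
    if (n == 0)
        return UTF8_INVALID;

    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1Fu; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0Fu; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07u; }
    else
        return UTF8_INVALID;              /* stray continuation or 5/6-byte lead */

    if (len > n)
        return UTF8_INVALID;              /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return UTF8_INVALID;          /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3Fu);
    }

    if (v < min_value[len])
        return UTF8_INVALID;              /* overlong encoding */
    if (v >= 0xD800 && v <= 0xDFFF)
        return UTF8_INVALID;              /* surrogate code point */
    if (v > 0x10FFFF)
        return UTF8_INVALID;              /* beyond the Unicode range */

    *out = v;
    return UTF8_OK;
}
```

With this shape a caller raises the error only on `UTF8_INVALID`. The bytes EF BF BD then decode to 0xFFFD with `UTF8_OK`, while a stray 0xFF byte also yields 0xFFFD but with `UTF8_INVALID`.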
Re: [NTG-context] Unicode question
Hi Arthur,

On Thu, 12 Mar 2015 16:35:47 + Arthur Reutenauer wrote:
> > The luatex code contains the lines (in unistring.w)
> >
> >     if (val == 0xFFFD)
> >         utf_error();
> >     return (val);
> >
> > in a function str2uni. I didn't really try to understand the code
> > but it looks as if 0xFFFD is used as "invalid marker":
>
> Interesting. This is not actually correct, U+FFFD is a valid Unicode
> character; it would be better to use U+FFFE or U+ for that.
>
> Note that U+FFFD is the recommended character to use when a character
> can't be recognised while converting to Unicode from another
> encoding, so its presence is usually a sign that something went wrong
> upstream, but I assume Manfred is aware of that.

Yes, I'm aware of that. So I also think that it isn't correct to use U+FFFD for this. Your suggestion of using either U+FFFE or U+ sounds good as both are really invalid.

--
Best, Manfred
Re: [NTG-context] Unicode question
On Thu, 12 Mar 2015 16:41:59 +0100 Ulrike Fischer wrote:
> On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:
> > Hi all,
> > If I run this minimal example
> >
> > \starttext
> >
> > �
> >
> > \stopluacode
> >
> > \stoptext
> >
> > I get
> >
> > tex error > error on line 3 in file /data/tmp/u1.tex: ! String
> > contains an invalid utf-8 sequence
> >
> > and some more lines.
> >
> > The character above is:
> >
> > Character: �
> > Character name: REPLACEMENT CHARACTER
> > Charblock: Specials
> > Category: Other symbol
> > Unicode: U+fffd
> > UTF8: 0xefbfbd
> >
> > which is a valid utf8 character.
> >
> > Questions:
> >
> > 1. Why is it considered to be invalid?
>
> This is not a context question/problem but related to the binary
> (you would get the same error with lualatex or plain)

Yes, I know.

> The luatex code contains the lines (in unistring.w)
>
>     if (val == 0xFFFD)
>         utf_error();
>     return (val);
>
> in a function str2uni. I didn't really try to understand the code
> but it looks as if 0xFFFD is used as "invalid marker": If luatex
> encounters something that isn't valid utf8 it maps val to 0xFFFD and
> then tests against 0xFFFD to raise an error.

Took me a while to find the repository but finally I got it.

> > 2. Are there other valid utf8 characters which are considered
> > invalid?
>
> The comment in the code says
>
> /* the 5- and 6-byte UTF-8 sequences generate integers
>    that are outside of the valid UCS range, and therefore
>    unsupported
>  */

Well, it is called REPLACEMENT CHARACTER, and it seems that this character will be used to replace invalid characters. Then

    if (val == 0xFFFD) utf_error();

causes the error message

    tex_error("String contains an invalid utf-8 sequence", hlp);

to be displayed.

Ok, this answers my question. Thanks for the pointer.

--
Manfred
Re: [NTG-context] Unicode question
> The luatex code contains the lines (in unistring.w)
>
>     if (val == 0xFFFD)
>         utf_error();
>     return (val);
>
> in a function str2uni. I didn't really try to understand the code
> but it looks as if 0xFFFD is used as "invalid marker":

Interesting. This is not actually correct, U+FFFD is a valid Unicode character; it would be better to use U+FFFE or U+ for that.

Note that U+FFFD is the recommended character to use when a character can't be recognised while converting to Unicode from another encoding, so its presence is usually a sign that something went wrong upstream, but I assume Manfred is aware of that.

> The comment in the code says
>
> /* the 5- and 6-byte UTF-8 sequences generate integers
>    that are outside of the valid UCS range, and therefore
>    unsupported
>  */

That's correct, the longest valid UTF-8 sequence is 4 bytes.

Best,
Arthur
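Note: the 4-byte limit can be read off the lead byte alone. The sketch below is illustrative only; `utf8_sequence_length` is a hypothetical helper, not a function from luatex. It only classifies the first byte; restrictions on the bytes that follow (overlong forms, surrogates, values above U+10FFFF) still need a separate check.

```c
#include <stddef.h>

/* Return the length of the UTF-8 sequence announced by a lead byte,
   or 0 if the byte cannot start a well-formed sequence. */
static size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                       /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;
    /* Everything else is rejected: continuation bytes 0x80-0xBF,
       the always-overlong leads 0xC0 and 0xC1, and 0xF5-0xFF, which
       includes the leads of the old 5- and 6-byte forms (0xF8-0xFD). */
    return 0;
}
```

For example, 0xEF (the first byte of U+FFFD) announces a 3-byte sequence, while 0xF8 and 0xFC, the leads of the former 5- and 6-byte forms, are rejected.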
Re: [NTG-context] Unicode question
On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:
> Hi all,
> If I run this minimal example
>
> \starttext
>
> �
>
> \stopluacode
>
> \stoptext
>
> I get
>
> tex error > error on line 3 in file /data/tmp/u1.tex: ! String
> contains an invalid utf-8 sequence
>
> and some more lines.
>
> The character above is:
>
> Character: �
> Character name: REPLACEMENT CHARACTER
> Charblock: Specials
> Category: Other symbol
> Unicode: U+fffd
> UTF8: 0xefbfbd
>
> which is a valid utf8 character.
>
> Questions:
>
> 1. Why is it considered to be invalid?

This is not a context question/problem but related to the binary (you would get the same error with lualatex or plain)

The luatex code contains the lines (in unistring.w)

    if (val == 0xFFFD)
        utf_error();
    return (val);

in a function str2uni. I didn't really try to understand the code but it looks as if 0xFFFD is used as "invalid marker": If luatex encounters something that isn't valid utf8 it maps val to 0xFFFD and then tests against 0xFFFD to raise an error.

> 2. Are there other valid utf8 characters which are considered invalid?

The comment in the code says

/* the 5- and 6-byte UTF-8 sequences generate integers
   that are outside of the valid UCS range, and therefore
   unsupported
 */

--
Ulrike Fischer
http://www.troubleshooting-tex.de/
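Note: the "invalid marker" pattern described in this message can be reproduced in a few lines. This is a simplified illustration, not the luatex source: `decode_simple`, `decode_with_sentinel`, `report_error` and `error_count` are hypothetical names, and the decoder only handles sequences of up to three bytes, which is enough to show the effect.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Hypothetical stand-in for an error hook such as utf_error(). */
static int error_count = 0;
static void report_error(void) { error_count++; }

/* Decode one sequence of up to 3 bytes; malformed input becomes U+FFFD. */
static uint32_t decode_simple(const unsigned char *s, size_t n)
{
    if (n >= 1 && s[0] < 0x80)
        return s[0];
    if (n >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        uint32_t v = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return v >= 0x80 ? v : REPLACEMENT;       /* reject overlong forms */
    }
    if (n >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        uint32_t v = ((uint32_t)(s[0] & 0x0F) << 12) |
                     ((uint32_t)(s[1] & 0x3F) << 6) |
                     (uint32_t)(s[2] & 0x3F);
        if (v < 0x800 || (v >= 0xD800 && v <= 0xDFFF))
            return REPLACEMENT;                   /* overlong or surrogate */
        return v;
    }
    return REPLACEMENT;                           /* anything else is malformed */
}

/* The pattern under discussion: validity is inferred from the value itself. */
static uint32_t decode_with_sentinel(const unsigned char *s, size_t n)
{
    uint32_t val = decode_simple(s, n);
    if (val == REPLACEMENT)
        report_error();   /* fires for malformed input AND for a real U+FFFD */
    return val;
}
```

Feeding the well-formed bytes EF BF BD to `decode_with_sentinel` increments `error_count` exactly as the malformed byte FF does, because both paths end with the same value 0xFFFD and the value is the only thing tested.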
[NTG-context] Unicode question
Hi all,

If I run this minimal example

\starttext

�

\stopluacode

\stoptext

I get

tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence

and some more lines.

The character above is:

Character: �
Character name: REPLACEMENT CHARACTER
Charblock: Specials
Category: Other symbol
Unicode: U+fffd
UTF8: 0xefbfbd

which is a valid utf8 character.

Questions:

1. Why is it considered to be invalid?
2. Are there other valid utf8 characters which are considered invalid?

Just wanting to understand.

--
Manfred
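Note: the claim that 0xefbfbd is the well-formed UTF-8 encoding of U+FFFD can be checked by encoding the code point by hand. The sketch below is a generic UTF-8 encoder written for illustration; `utf8_encode` is a hypothetical name and is not taken from luatex or ConTeXt.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-8 into buf (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies beyond U+10FFFF and therefore has no UTF-8 encoding. */
static size_t utf8_encode(uint32_t cp, unsigned char buf[4])
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogates are not encodable */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                 /* beyond the Unicode range */
}
```

For cp = 0xFFFD the three-byte branch applies: 0xFFFD >> 12 is 0xF, giving 0xEF; the middle six bits are 0x3F, giving 0xBF; the low six bits are 0x3D, giving 0xBD. That matches the "UTF8: 0xefbfbd" line in the character listing.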