Re: [NTG-context] Unicode question

2015-03-12 Thread Hans Hagen

On 3/12/2015 9:41 PM, luigi scarso wrote:



> On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen wrote:
>
>> it's actually a bug ... it is ok to map an invalid character in the
>> input to 0xFFFD, halt and continue when permitted, but the method
>> used in luatex thereby obscures a valid 0xFFFD in the input
>
> FFFD  REPLACEMENT CHARACTER
> • used to replace an incoming character whose
>   value is unknown or unrepresentable in
>   Unicode


the question is not what to do when an invalid character comes in: in
that case luatex can replace it by 0xFFFD and issue an error as it does
now; but when the input itself contains a 0xFFFD then luatex should just
carry on, as 0xFFFD is a *valid* character


it is quite easy for a macro package to trigger an error as

  \catcode"FFFD=15

will do that, but it's impossible for a macro package to intercept the
weird interception by luatex's input handler



> The meaning of FFFD is not "typeset a question mark on a black box" as in �
> (which depends on the font in any case, so in principle it's possible to see
> something completely different in a new version of the font)
> but to signal something potentially wrong, with a symbol that currently
> in most cases is �.
> Misusing the meaning is not bad per se, but in this specific case
> I think luatex is correct to be conservative and ask the user what to do;
> context --batchmode
> typesets the document,
> writes the messages to the log,
> and ends with -1, so an automatic agent is also alerted.


you cannot force a user to use \batchmode, and -1 would abort a wrapper,
thereby leading to an invalid document; it means that luatex can never
cleanly typeset a document that contains char 0xFFFD, and luatex should
not be normative


not accepting 0xFFFD in the input is a bug

Hans

-
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
 | www.pragma-pod.nl
-
___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki : http://contextgarden.net
___

Re: [NTG-context] Unicode question

2015-03-12 Thread luigi scarso
On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen  wrote:

> it's actually a bug ... it is ok to map an invalid character in the input
> to 0xFFFD, halt and continue when permitted, but the method used in luatex
> thereby obscures a valid 0xFFFD in the input
>
FFFD  REPLACEMENT CHARACTER
• used to replace an incoming character whose
  value is unknown or unrepresentable in
  Unicode

The meaning of FFFD is not "typeset a question mark on a black box" as in �
(which depends on the font in any case, so in principle it's possible to see
something completely different in a new version of the font)
but to signal something potentially wrong, with a symbol that currently in
most cases is �.
Misusing the meaning is not bad per se, but in this specific case
I think luatex is correct to be conservative and ask the user what to do;
context --batchmode
typesets the document,
writes the messages to the log,
and ends with -1, so an automatic agent is also alerted.




-- 
luigi

Re: [NTG-context] Unicode question

2015-03-12 Thread Hans Hagen

On 3/12/2015 7:08 PM, Manfred Lotz wrote:

> Hi Arthur,
>
> On Thu, 12 Mar 2015 16:35:47 +
> Arthur Reutenauer wrote:
>
>>> The luatex code contains the lines (in unistring.w)
>>>
>>>     if (val == 0xFFFD)
>>>         utf_error();
>>>     return (val);
>>>
>>> in a function str2uni. I didn't really try to understand the code
>>> but it looks as if 0xFFFD is used as "invalid marker":
>>
>> Interesting.  This is not actually correct, U+FFFD is a valid Unicode
>> character; it would be better to use U+FFFE or U+ for that.
>>
>> Note that U+FFFD is the recommended character to use when a character
>> can't be recognised while converting to Unicode from another
>> encoding, so its presence is usually a sign that something went wrong
>> upstream, but I assume Manfred is aware of that.
>
> Yes, I'm aware of that. So I also think that it isn't correct to use
> U+FFFD for this. Your suggestion of using either U+FFFE or U+
> sounds good as both are really invalid.



it's an attempt to recover but in the process a normal 0xFFFD triggers 
an error too; recovering to 0xFFFD for a really invalid input is ok as 
tex does that in more cases: i expected a } so i insert one here ... 
cross your fingers etc


Hans


Re: [NTG-context] Unicode question

2015-03-12 Thread Hans Hagen

On 3/12/2015 6:57 PM, Manfred Lotz wrote:

> On Thu, 12 Mar 2015 16:41:59 +0100
> Ulrike Fischer wrote:
>
>> On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:
>>
>>> Hi all,
>>> If I run this minimal example
>>>
>>> \starttext
>>>
>>> �
>>>
>>> \stopluacode
>>>
>>> \stoptext
>>>
>>> I get
>>>
>>> tex error   > error on line 3 in file /data/tmp/u1.tex: ! String
>>> contains an invalid utf-8 sequence
>>>
>>> and some more lines.
>>>
>>> The character above is:
>>>
>>>      Character: �
>>> Character name: REPLACEMENT CHARACTER
>>>      Charblock: Specials
>>>       Category: Other symbol
>>>        Unicode: U+fffd
>>>           UTF8: 0xefbfbd
>>>
>>> which is a valid utf8 character.
>>>
>>> Questions:
>>>
>>> 1. Why is it considered to be invalid?
>>
>> This is not a context question/problem but related to the binary
>> (you would get the same error with lualatex or plain)
>
> Yes, I know.
>
>> The luatex code contains the lines (in unistring.w)
>>
>>     if (val == 0xFFFD)
>>         utf_error();
>>     return (val);
>>
>> in a function str2uni. I didn't really try to understand the code
>> but it looks as if 0xFFFD is used as "invalid marker": If luatex
>> encounters something that isn't valid utf8 it maps val to 0xFFFD and
>> then tests against 0xFFFD to raise an error.
>
> Took me a while to find the repository but finally I got it.
>
>>> 2. Are there other valid utf8 characters which are considered
>>> invalid?
>>
>> The comment in the code says
>>
>>     /* the 5- and 6-byte UTF-8 sequences generate integers
>>        that are outside of the valid UCS range, and therefore
>>        unsupported
>>      */
>
> Well, it is called REPLACEMENT CHARACTER, and it seems that this
> character will be used to replace invalid characters.  Then it causes
>
>     if (val == 0xFFFD)
>         utf_error();
>
> the error message
>
>     tex_error("String contains an invalid utf-8 sequence", hlp);
>
> to be displayed.
>
> Ok, this answers my question.
>
> Thanks for the pointer.


it's actually a bug ... it is ok to map an invalid character in the 
input to 0xFFFD, halt and continue when permitted, but the method used 
in luatex thereby obscures a valid 0xFFFD in the input
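
A minimal sketch in C of the behaviour described above. It is not the actual
luatex code: the decoder reports failure through a separate status value
instead of comparing the result with 0xFFFD, so a genuine U+FFFD in the input
decodes without error, while invalid bytes are still replaced by 0xFFFD and
flagged. The names utf8_status, UTF8_OK, UTF8_INVALID and utf8_decode are made
up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; not part of luatex. */
typedef enum { UTF8_OK = 0, UTF8_INVALID = 1 } utf8_status;

/*
 * Decode one code point from s[0..len).
 * Success: *cp holds the code point, *used the number of bytes consumed.
 * Failure: *cp holds U+FFFD as a replacement, *used is 1 (0 for empty
 * input) so the caller can resynchronise, and UTF8_INVALID is returned.
 */
utf8_status utf8_decode(const unsigned char *s, size_t len,
                        uint32_t *cp, size_t *used)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t v;

    assert(s != NULL && cp != NULL && used != NULL);

    *cp = 0xFFFD;
    *used = (len > 0) ? 1 : 0;
    if (len == 0)
        return UTF8_INVALID;

    if (s[0] < 0x80)                { n = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
    else
        return UTF8_INVALID;        /* stray continuation or 5/6-byte lead */

    if (n > len)
        return UTF8_INVALID;        /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return UTF8_INVALID;    /* missing continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (v < min_value[n] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return UTF8_INVALID;        /* overlong, out of range, or surrogate */

    *cp = v;
    *used = n;
    return UTF8_OK;
}
```

With this shape a caller would raise "String contains an invalid utf-8
sequence" only when the status is UTF8_INVALID. A returned value of 0xFFFD
alone says nothing about validity.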


Hans


Re: [NTG-context] Unicode question

2015-03-12 Thread Manfred Lotz
Hi Arthur,

On Thu, 12 Mar 2015 16:35:47 +
Arthur Reutenauer  wrote:

> > The luatex code contains the lines (in unistring.w)
> > 
> >     if (val == 0xFFFD)
> >         utf_error();
> >     return (val);
> > 
> > in a function str2uni. I didn't really try to understand the code
> > but it looks as if 0xFFFD is used as "invalid marker":
> 
> Interesting.  This is not actually correct, U+FFFD is a valid Unicode
> character; it would be better to use U+FFFE or U+ for that.
> 
> Note that U+FFFD is the recommended character to use when a character
> can't be recognised while converting to Unicode from another
> encoding, so its presence is usually a sign that something went wrong
> upstream, but I assume Manfred is aware of that.
> 

Yes, I'm aware of that. So I also think that it isn't correct to use
U+FFFD for this. Your suggestion of using either U+FFFE or U+
sounds good as both are really invalid.


-- 
Best, Manfred




Re: [NTG-context] Unicode question

2015-03-12 Thread Manfred Lotz
On Thu, 12 Mar 2015 16:41:59 +0100
Ulrike Fischer  wrote:

> On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:
> 
> > Hi all,
> > If I run this minimal example
> > 
> > \starttext
> > 
> > �
> > 
> > \stopluacode
> > 
> > \stoptext
> > 
> > 
> > I get
> > 
> > tex error   > error on line 3 in file /data/tmp/u1.tex: ! String
> > contains an invalid utf-8 sequence
> > 
> > and some more lines.
> > 
> > 
> > The character above is:
> > 
> >      Character: �
> > Character name: REPLACEMENT CHARACTER
> >      Charblock: Specials
> >       Category: Other symbol
> >        Unicode: U+fffd
> >           UTF8: 0xefbfbd
> > 
> > which is a valid utf8 character.
> > 
> > Questions:
> > 
> > 1. Why is it considered to be invalid?
> 
> This is not a context question/problem but related to the binary
> (you would get the same error with lualatex or plain)
> 

Yes, I know.

> The luatex code contains the lines (in unistring.w)
> 
>     if (val == 0xFFFD)
>         utf_error();
>     return (val);
>
> in a function str2uni. I didn't really try to understand the code
> but it looks as if 0xFFFD is used as "invalid marker": If luatex
> encounters something that isn't valid utf8 it maps val to 0xFFFD and
> then tests against 0xFFFD to raise an error.
> 

Took me a while to find the repository but finally I got it.


> > 2. Are there other valid utf8 characters which are considered
> > invalid?
> 
> The comment in the code says 
> 
>     /* the 5- and 6-byte UTF-8 sequences generate integers
>        that are outside of the valid UCS range, and therefore
>        unsupported
>      */
> 

Well, it is called REPLACEMENT CHARACTER, and it seems that this
character will be used to replace invalid characters.  Then it causes

    if (val == 0xFFFD)
        utf_error();

the error message

    tex_error("String contains an invalid utf-8 sequence", hlp);

to be displayed.

Ok, this answers my question. 

Thanks for the pointer.


-- 
Manfred









Re: [NTG-context] Unicode question

2015-03-12 Thread Arthur Reutenauer
> The luatex code contains the lines (in unistring.w)
> 
>     if (val == 0xFFFD)
>         utf_error();
>     return (val);
> 
> in a function str2uni. I didn't really try to understand the code
> but it looks as if 0xFFFD is used as "invalid marker":

Interesting.  This is not actually correct, U+FFFD is a valid Unicode 
character; it would be better to use U+FFFE or U+ for that.

Note that U+FFFD is the recommended character to use when a character can't be 
recognised while converting to Unicode from another encoding, so its presence 
is usually a sign that something went wrong upstream, but I assume Manfred is 
aware of that.

> The comment in the code says 
> 
>     /* the 5- and 6-byte UTF-8 sequences generate integers
>        that are outside of the valid UCS range, and therefore
>        unsupported
>      */

That's correct, the longest valid UTF-8 sequence is 4 bytes.
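
As a rough illustration of that limit (a sketch, not taken from luatex;
utf8_sequence_length is a hypothetical name), the length of a sequence can be
read off its first byte, and the lead bytes that would start a 5- or 6-byte
sequence are simply rejected:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with *lead, or 0 if that byte cannot start a valid sequence.
 */
int utf8_sequence_length(const unsigned char *lead)
{
    assert(lead != NULL);
    if (*lead < 0x80)
        return 1;                   /* ASCII */
    if (*lead >= 0xC2 && *lead <= 0xDF)
        return 2;
    if (*lead >= 0xE0 && *lead <= 0xEF)
        return 3;                   /* includes EF, the lead byte of U+FFFD */
    if (*lead >= 0xF0 && *lead <= 0xF4)
        return 4;                   /* up to U+10FFFF */
    /* continuation byte (80..BF), overlong lead (C0, C1), or F5..FF,
       which covers the old 5- and 6-byte lead bytes F8..FD */
    return 0;
}
```

Note that the first byte alone does not make a sequence valid; the
continuation bytes still have to be checked separately.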

Best,

Arthur

Re: [NTG-context] Unicode question

2015-03-12 Thread Ulrike Fischer
On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:

> Hi all,
> If I run this minimal example
> 
> \starttext
> 
> �
> 
> \stopluacode
> 
> \stoptext
> 
> 
> I get
> 
> tex error   > error on line 3 in file /data/tmp/u1.tex: ! String
> contains an invalid utf-8 sequence
> 
> and some more lines.
> 
> 
> The character above is:
> 
>      Character: �
> Character name: REPLACEMENT CHARACTER
>      Charblock: Specials
>       Category: Other symbol
>        Unicode: U+fffd
>           UTF8: 0xefbfbd
> 
> which is a valid utf8 character.
> 
> Questions:
> 
> 1. Why is it considered to be invalid?

This is not a context question/problem but related to the binary
(you would get the same error with lualatex or plain)

The luatex code contains the lines (in unistring.w)

    if (val == 0xFFFD)
        utf_error();
    return (val);

in a function str2uni. I didn't really try to understand the code
but it looks as if 0xFFFD is used as "invalid marker": If luatex
encounters something that isn't valid utf8 it maps val to 0xFFFD and
then tests against 0xFFFD to raise an error.

> 2. Are there other valid utf8 characters which are considered invalid?

The comment in the code says 

    /* the 5- and 6-byte UTF-8 sequences generate integers
       that are outside of the valid UCS range, and therefore
       unsupported
     */



-- 
Ulrike Fischer 
http://www.troubleshooting-tex.de/


[NTG-context] Unicode question

2015-03-12 Thread Manfred Lotz
Hi all,
If I run this minimal example

\starttext

�

\stopluacode

\stoptext


I get

tex error   > error on line 3 in file /data/tmp/u1.tex: ! String
contains an invalid utf-8 sequence

and some more lines.


The character above is:

     Character: �
Character name: REPLACEMENT CHARACTER
     Charblock: Specials
      Category: Other symbol
       Unicode: U+fffd
          UTF8: 0xefbfbd

which is a valid utf8 character.
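
A small sketch in C (utf8_encode is a hypothetical helper, not something from
luatex or ConTeXt) that encodes a code point by hand. For U+FFFD it produces
the three bytes EF BF BD listed above, which is an ordinary well-formed 3-byte
sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode cp as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate or lies
 * above U+10FFFF and therefore has no UTF-8 encoding.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                   /* surrogates are not encodable */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                       /* beyond the Unicode range */
}
```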

Questions:

1. Why is it considered to be invalid?

2. Are there other valid utf8 characters which are considered invalid?


Just wanting to understand.


-- 
Manfred

