Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread Sven Van Caekenberghe

> On 29 Jul 2016, at 00:15, monty <mon...@programmer.net> wrote:
> 
> Good for finding one of the fixes, but please use #parseURL:/#onURL: instead 
> of #asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc 
> eagerly decoding the response without looking at the <?xml ...?> declaration as 
> the XML spec requires.
> 
> #parseURL:/#onURL: use Zinc correctly, doing their own XML-aware decoding on 
> top of it.

Yes, you are right. Thanks for implementing all this logic, I know it is quite 
complicated and tricky.





Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread monty
Also, #parseURL:/#onURL: will use WebClient on Squeak (unless Zinc is present, 
of course).




Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread monty
Good for finding one of the fixes, but please use #parseURL:/#onURL: instead of 
#asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc 
eagerly decoding the response without looking at the <?xml ...?> declaration as 
the XML spec requires.

#parseURL:/#onURL: use Zinc correctly, doing their own XML-aware decoding on 
top of it.
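
As a minimal sketch of that advice, reusing the URL posted earlier in the thread, and assuming both messages accept the URL as a plain String (not verified against every XMLParser version):

```smalltalk
"Let XMLParser fetch the document and choose the decoder itself,
 instead of passing it contents that Zinc has already decoded."
doc := XMLDOMParser parseURL: 'http://forum.world.st/file/n4908531/illegal-UTF-sms.xml'.

"Two-step form, useful when the parser needs configuring before parsing."
doc := (XMLDOMParser onURL: 'http://forum.world.st/file/n4908531/illegal-UTF-sms.xml')
	parseDocument.
```

Either form should avoid the #asUrl/#retrieveContents round trip described above.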




Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread Sean P. DeNigris
monty-3 wrote
> You're double decoding

And in public, no less! Thanks. It works now with #parseFileNamed:. On the minus
side, half a day wasted; on the plus side, I wrote a compatibility layer for
Magritte-XMLBinding so that #fromXmlNode: accepts SoupTags.



-
Cheers,
Sean
--
View this message in context: 
http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908555.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.



Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread monty
You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM 
printToFileNamed: family of messages when writing) and let XMLParser take care 
of this for you, or disable XMLParser decoding before parsing with 
#decodesCharacters:.
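
A minimal sketch of the two options, based on the snippet posted earlier in the thread. The file location is the one from that snippet, and #fullName is assumed here to answer the path as a String:

```smalltalk
| path doc |
"Assumption: #fullName answers the resolved path as a String."
path := (FileLocator home / 'illegal-UTF-sms.xml') fullName.

"Option 1: pass the file name, so XMLParser does all the decoding itself."
doc := XMLDOMParser parseFileNamed: path.

"Option 2: keep the decoding file stream, but switch off XMLParser's own decoding."
doc := (XMLDOMParser on: (FileLocator home / 'illegal-UTF-sms.xml') readStream)
	decodesCharacters: false;
	parseDocument.
```

Option 1 is probably the safer default, since the parser can then honor the encoding declaration in the file.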

Longer explanation:

The class-side #on:/#parse: messages take either a string or a stream (read the 
definitions). You gave them a FileReference, but because the argument is tested 
with isString and sent #readStream otherwise, it didn't blow up at that point.

File refs sent #readStream return file streams that do automatic decoding. But 
XMLParser automatically attempts its own decoding too, if:

- The input starts with a BOM, or its encoding can be inferred from null bytes 
  before or after the first non-null byte.

- There is an encoding declaration with a non-UTF-8 encoding.

- There is a UTF-8 encoding declaration but the stream is not a normal 
  ReadStream (your case).

So it gets decoded twice, and the already-decoded value of the char causes the 
error. I'll consider changing the heuristic to make it less eager to decode.
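
To make the double decoding concrete, here is a rough workspace illustration. It reuses #utf8Decoded from elsewhere in the thread. Using #asByteArray to stand in for "treat the already-decoded characters as bytes again" is an assumption made for illustration only, not XMLParser's actual code path:

```smalltalk
| decodedOnce |
"First decoding: the two UTF-8 bytes become a single character, U+00A0."
decodedOnce := #[194 160] utf8Decoded.
decodedOnce first = 160 asCharacter.   "true"

"Second decoding: the character value 160 is treated as a raw byte again.
 A lone byte 160 is a UTF-8 continuation byte, so this signals an error."
decodedOnce asByteArray utf8Decoded.
```

The bytes on disk are valid UTF-8. It is only the second pass over the decoded text that fails.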




Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread Sven Van Caekenberghe
In my older work image, the following just works:

XMLDOMParser parse:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl 
retrieveContents).

But I guess that is because my (older) XML parser version ignores the encoding, 
or is more lenient.

You could try to edit the incoming file, or have a look at #decodesCharacters: 

(XMLDOMParser on:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl 
retrieveContents) readStream) decodesCharacters: false; parseDocument.

But I am no expert in the deeper aspects of XML Support.





Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread Sean P. DeNigris
Sven Van Caekenberghe-2 wrote
> Your XML file is not UTF-8 encoded, it is plain Unicode. At least the way
> it is served from the URL you gave.
> ..
> You see ?

Unfortunately, no! ha ha. I didn't generate the file and I took its
assertion that it was UTF-8 at face value. How do I properly feed the file
into XMLParser?



-
Cheers,
Sean
--
View this message in context: 
http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908539.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.



Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread Sven Van Caekenberghe
Sean,

Your XML file is not UTF-8 encoded, it is plain Unicode. At least the way it is 
served from the URL you gave.

(('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl 
retrieveContents) at: 72 ) = 160 asCharacter. 

  "true"

Like you said,

160 asCharacter asString utf8Encoded. 

  "#[194 160]"

But

#[ 160 ] utf8Decoded.

  Boom!

You specify UTF-8 encoding inside your XML, so I assume the parser then switches 
to that encoding, but your pure Unicode content is not UTF-8 encoded, which 
results in an exception. You see?

Sven





Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread Sean P. DeNigris
monty-3 wrote
> Just to be sure, I manually recreated your file (with the great Bless hex
> editor) and parsed it with no issue.

Thanks!


monty-3 wrote
> Please post your code and attach the actual source as a file separately.

The code is merely:
  messageLog := FileLocator home / 'illegal-UTF-sms.xml'. 
  doc := XMLDOMParser parse: messageLog.

File:  illegal-UTF-sms.xml
<http://forum.world.st/file/n4908531/illegal-UTF-sms.xml>



-
Cheers,
Sean
--
View this message in context: 
http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908531.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.



Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

2016-07-28 Thread monty
Just to be sure, I manually recreated your file (with the great Bless hex 
editor) and parsed it with no issue.

Please post your code and attach the actual source as a file separately.

> Sent: Thursday, July 28, 2016 at 3:12 PM
> From: "Sean P. DeNigris" 
> To: pharo-users@lists.pharo.org
> Subject: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”
>
> Posted to StackOverflow
> (https://stackoverflow.com/questions/38645553/xmlparser-in-pharo-claims-u00a0-is-invalid-utf-8):
> 
> 
> 
> Given the input:
> 
> 
> 
> 
> Where the character after the "." in the body attribute of the sms tag is
> U+00A0;
> 
> I get the error:
> 
> XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)
> 
> IIUC, the UTF-8 representation of that character is 0xC2 0xA0 per Wikipedia.
> Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.
> 
> This seems like a bug in XMLParser, or am I missing something?
> 
> 
> 
> 
> -
> Cheers,
> Sean
> --
> View this message in context: 
> http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525.html
> Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
> 
>