>> To: pharo-users@lists.pharo.org
>> Subject: Re: [Pharo-users] Problem with input to XML Parser - 'Invalid UTF8
>> encoding'
I know what the problem is and will have it fixed shortly. Thanks for the
report.
> Sent: Monday, October 09, 2017 at 9:03 AM
> From: "Peter Kenny" <pe...@pbkresearch.co.uk>
> To: pharo-users@lists.pharo.org
> Subject: Re: [Pharo-users] Problem with input t
Correction - I am misrepresenting Sven. What he said was that Zinc would not
look inside the HTML node to find out about the encoding. It would, of
course, use information in the HTTP headers, if any.
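To make the "use the HTTP headers, if any" idea concrete, here is a minimal sketch in Python (not Pharo, and not Zinc's actual implementation). The helper names `charset_from_content_type` and `decode_body`, and the UTF-8 fallback, are assumptions made for illustration only.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset returns `default` when no charset parameter exists.
    return msg.get_content_charset(default)


def decode_body(body, content_type, fallback="utf-8"):
    """Decode bytes with the header charset, or the fallback if none is given."""
    charset = charset_from_content_type(content_type) or fallback
    return body.decode(charset)


# 0xE9 is "e acute" in ISO-8859-1.
text = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
```

Note that nothing here looks inside the document itself: a charset declared only in an HTML meta tag is invisible to this kind of header-based decoding.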
Peter Kenny wrote
Henry
Thanks for the explanations. It's a bit clearer now. I'm still not sure
about how ZnUrl>>retrieveContents manages to decode correctly in this case;
I'm sure I recall Sven saying it didn't (and in his view shouldn't) look at
the HTTP declarations in the header. There is also the mystery of
In a class named XMLHTMLParser, you may expect that logic to be expanded a
bit beyond the basic XML spec though.
But since there are multiple potentially correct definitions, there will
always be failure cases.
Not to mention, in addition to XML/HTTP, HTML4/5 also define (different)
meta tags for
XML expects a prolog in the document itself defining the encoding; if it is
absent, the standard specifies UTF-8.
So when you use an XML parser to parse an HTML page, it will disregard any
HTTP encodings, interpret the contents as an XML document with a missing
prolog, and try to parse it as UTF-8.
When you
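The failure described above can be reproduced outside Pharo. The following is a rough Python sketch, not the XMLParser code path: it assumes the page bytes are really ISO-8859-1 and that the parser, seeing no prolog, decodes them as strict UTF-8. The function names and the sample text are made up for the example.

```python
def decode_as_xml_default(data):
    """Decode as a parser would with no encoding declaration: strict UTF-8."""
    return data.decode("utf-8")


def decode_with_declared(data, declared):
    """Decode using an encoding obtained elsewhere (HTTP header, meta tag)."""
    return data.decode(declared)


# "C'e un problema" with an accented e (U+00E8), stored as ISO-8859-1.
original = "C'\u00e8 un problema"
latin1_bytes = original.encode("iso-8859-1")  # the accented e becomes 0xE8

try:
    decode_as_xml_default(latin1_bytes)
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE8 opens a three-byte UTF-8 sequence, but a space follows it,
    # so the bytes are rejected as invalid UTF-8.
    utf8_ok = False

recovered = decode_with_declared(latin1_bytes, "iso-8859-1")
```

Here `utf8_ok` ends up `False`, which is the analogue of the 'Invalid UTF8 encoding' error, while decoding with the declared encoding recovers the text.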
Note: This was sent on Sunday at 19.45 but seems to have disappeared on its
way to pharo users. Re-sent just to complete the story.
Paul
Good to have found the charset discrepancy - that may have something to do
with it. But I don't think it
in the HEAD tag of that page with the article they declare it is ISO-8859-1
and not UTF-8. In the page they have a
C’è
The little back-tick next to the C is Unicode code point 8217 (U+2019)
(http://www.codetable.net/decimal/8217)
So their encoding is messed up, and maybe the XMLHTMLParser should throw a
warning
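To see why that character clashes with an ISO-8859-1 declaration, here is a small Python sketch (illustrative only; `can_encode` is a hypothetical helper, and nothing here comes from XMLHTMLParser). It checks which encodings can represent code point 8217 and what happens when its UTF-8 bytes are read with a single-byte decoder.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


quote = chr(8217)               # U+2019 RIGHT SINGLE QUOTATION MARK
text = "C" + quote + "\u00e8"   # C, curly apostrophe, e with grave accent

utf8_bytes = text.encode("utf-8")                 # three bytes for U+2019
in_latin1 = can_encode(text, "iso-8859-1")        # U+2019 is not in Latin-1
cp1252_bytes = text.encode("cp1252")              # Windows-1252 has it at 0x92

# Reading the UTF-8 bytes as Windows-1252 yields one wrong character per byte.
misread = utf8_bytes.decode("cp1252")
```

Strict ISO-8859-1 has no curly apostrophe at all; Windows-1252, which browsers substitute for an ISO-8859-1 label, places it at byte 0x92. So a page declaring ISO-8859-1 but containing that character is either really Windows-1252 or really UTF-8, and a strict parser has grounds to complain in both cases.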
In another thread (on SVG Icons) Sven referred to ways of getting input from
a URL for XMLDOMParser. I have recently had some problems doing this. I have
found a workaround, so it is not urgent, but I thought I should put it on
record in case anyone else is bitten by it, and so maybe Monty can