ICS V8.50 in the overnight zip now includes several new functions that
help determine the character set and codepage of HTML content received
from HTTP servers, and convert it correctly to Delphi unicode strings.

The character set is determined according to these rules:

1 - HTTP Content-Type header, which always states the content type but
more rarely the character set.

2 - HTML content BOM: two, three or four bytes at the front.

3 - HTML content meta charset.

4 - HTML auto-detect for UTF-8. Note that browsers don't do this and
assume ANSI if no charset is specified.
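
The four rules can be sketched as a small stand-alone console program.
This is only an illustration and not the ICS code: it assumes the rules
are tried in the order listed, all of the function names (ExtractCharset,
BomCharset, LooksLikeUtf8, GuessCharset) are hypothetical, the meta search
is a crude text scan of the first 1024 bytes, and the UTF-8 check does not
reject every malformed sequence. It assumes a unicode Delphi (2009 or
later).

```pascal
program CharsetOrderSketch;

{$APPTYPE CONSOLE}

uses SysUtils;

// Rules 1 and 3: token after "charset=" in a header or HTML text, else ''.
function ExtractCharset(const S: string): string;
var
  P: Integer;
begin
  Result := '';
  P := Pos('charset=', LowerCase(S));
  if P = 0 then Exit;
  Inc(P, Length('charset='));
  if (P <= Length(S)) and CharInSet(S[P], ['"', '''']) then
    Inc(P);  // skip an opening quote
  while (P <= Length(S)) and
        CharInSet(S[P], ['A'..'Z', 'a'..'z', '0'..'9', '-', '_']) do
  begin
    Result := Result + S[P];
    Inc(P);
  end;
end;

// Rule 2: recognise a BOM of two, three or four bytes, else ''.
function BomCharset(const B: TBytes): string;
begin
  Result := '';
  // UTF-32 LE is tested before UTF-16 LE because both start with FF FE
  if (Length(B) >= 4) and (B[0] = $FF) and (B[1] = $FE) and (B[2] = 0) and (B[3] = 0) then
    Result := 'utf-32le'
  else if (Length(B) >= 4) and (B[0] = 0) and (B[1] = 0) and (B[2] = $FE) and (B[3] = $FF) then
    Result := 'utf-32be'
  else if (Length(B) >= 3) and (B[0] = $EF) and (B[1] = $BB) and (B[2] = $BF) then
    Result := 'utf-8'
  else if (Length(B) >= 2) and (B[0] = $FF) and (B[1] = $FE) then
    Result := 'utf-16le'
  else if (Length(B) >= 2) and (B[0] = $FE) and (B[1] = $FF) then
    Result := 'utf-16be';
end;

// Rule 4 (simplified): True if all multi-byte sequences are well formed
// and at least one is present. Overlong forms are not rejected.
function LooksLikeUtf8(const B: TBytes): Boolean;
var
  I, Extra: Integer;
  MultiByte: Boolean;
begin
  Result := False;
  MultiByte := False;
  I := 0;
  while I < Length(B) do
  begin
    if B[I] < $80 then Extra := 0
    else if (B[I] and $E0) = $C0 then Extra := 1
    else if (B[I] and $F0) = $E0 then Extra := 2
    else if (B[I] and $F8) = $F0 then Extra := 3
    else Exit;                              // invalid lead byte
    if I + Extra >= Length(B) then Exit;    // truncated sequence
    Inc(I);
    while Extra > 0 do
    begin
      if (B[I] and $C0) <> $80 then Exit;   // not a continuation byte
      MultiByte := True;
      Inc(I);
      Dec(Extra);
    end;
  end;
  Result := MultiByte;
end;

// Tries the four rules in the listed order; '' means "assume ANSI".
function GuessCharset(const ContentType: string; const Body: TBytes): string;
var
  Head: string;
  I: Integer;
begin
  Result := ExtractCharset(ContentType);                 // rule 1
  if Result = '' then Result := BomCharset(Body);        // rule 2
  if Result = '' then
  begin
    Head := '';                                          // rule 3
    for I := 0 to Length(Body) - 1 do
      if I < 1024 then Head := Head + Char(Body[I]);
    Result := ExtractCharset(Head);
  end;
  if (Result = '') and LooksLikeUtf8(Body) then          // rule 4
    Result := 'utf-8';
end;

var
  Body: TBytes;
begin
  Body := TBytes.Create($63, $61, $66, $C3, $A9);        // "cafe" with e-acute, UTF-8
  WriteLn(GuessCharset('text/html', Body));                      // utf-8
  WriteLn(GuessCharset('text/html; charset=iso-8859-1', Body));  // iso-8859-1
end.
```

An empty result means none of the rules matched, in which case the caller
would fall back to ANSI as browsers do.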

I created some unicode test pages that illustrate various characters
represented as symbols and/or entities (like &pound; or &#9741;), using
ANSI, UTF-8 and UTF-16, with and without BOMs and charset. Note that
Firefox has limited UTF-16 support and seems to ignore CSS. The web
site uses the ICS web server.

https://www.telecom-tariffs.co.uk/testing/

The new functions, in OverbyteIcsCharsetUtils.pas and
OverbyteIcsUtils.pas, are IcsFindHtmlCharset, IcsFindHtmlCodepage,
IcsContentCodepage, IcsMoveTBytesToString and IcsHtmlToStr, which take
either a TBytes buffer or a stream as input. There is also
IcsMoveStringToTBytes, which takes a unicode string as input and
creates a TBytes buffer.

To convert the received HTML stream to a unicode string with the
correct codepage, use IcsHtmlToStr in OverbyteIcsCharsetUtils. It is not
yet used in OverbyteIcsHttpProt.pas, to avoid linking in various charset
tables that applications may not need. The last argument determines
whether entities like &amp;, &pound; and &#9741; are converted to
characters for display instead of being left as HTML.

UnicodeStr := IcsHtmlToStr(RcvdStream, HdrContentType, true);
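
To show what converting entities to characters means, here is a minimal
stand-alone sketch. It is not how IcsHtmlToStr does it: DecodeEntities is
a hypothetical helper that only knows two named entities and decimal
numeric references in the Basic Multilingual Plane, and it assumes a
unicode Delphi (2009 or later).

```pascal
program EntitySketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper, not the ICS implementation: decodes &amp;, &pound;
// and decimal &#NNNN; references; anything else is copied unchanged.
function DecodeEntities(const S: string): string;
var
  I, J, Code: Integer;
  Name: string;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    J := 0;
    if S[I] = '&' then
      J := PosEx(';', S, I + 1);          // end of a possible entity
    if J > I + 1 then
    begin
      Name := Copy(S, I + 1, J - I - 1);  // text between '&' and ';'
      Code := -1;
      if Name = 'amp' then
        Code := Ord('&')
      else if Name = 'pound' then
        Code := $A3
      else if Name[1] = '#' then
        Code := StrToIntDef(Copy(Name, 2, MaxInt), -1);
      // accept only BMP code points that are not surrogates
      if (Code > 0) and (Code <= $FFFF) and
         ((Code < $D800) or (Code > $DFFF)) then
      begin
        Result := Result + WideChar(Code);
        I := J + 1;
        Continue;
      end;
    end;
    Result := Result + S[I];              // ordinary character
    Inc(I);
  end;
end;

begin
  WriteLn(DecodeEntities('Fish &amp; chips'));   // Fish & chips
end.
```

With the last argument of IcsHtmlToStr set to false the entities would
stay in the string as HTML, which is what you want if the text is going
to be parsed or saved rather than displayed.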

This is illustrated in the OverbyteIcsHttpsTst sample application,
along with the separate functions that may be used instead.

OverbyteIcsProxy.pas has also been updated to check the body meta for a
charset and to convert HTML TBytes buffers to unicode instead of ANSI
for the events that allow bodies to be examined and updated, as
illustrated in the OverbyteIcsProxySslServer sample.

Angus
 



