Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
On Sat, Oct 11, 2008 at 8:27 PM, Arno Garrels [EMAIL PROTECTED] wrote:
> Fastream Technologies wrote:
>> On Sat, Oct 11, 2008 at 6:24 PM, Arno Garrels [EMAIL PROTECTED] wrote:
>>> Fastream Technologies wrote:
>>>> Hello Arno,
>>>> If the function is ready, I would like to test it in our special unit for HTML folder listings. Can you post it here or send privately?
>>> Yes, I checked it in already. Install the TortoiseSVN client to get access to the ICS SVN repository. In ICSv6, URL encoding/decoding does not use UTF-8. The file names in directory listings are now _displayed_ correctly, but their links have not changed; they are still ANSI. I can easily change URL coding in v6 to UTF-8 as well, however I wonder whether that would break existing applications?
>> Would it not work in BCB2007?
> ICSv7 should work with CB2007 as ICSv6 did before (or even better), so you could test the UTF-8 URL stuff in ICSv7.
>> Or is it just that applications need to be changed?
> IMO possible, though the server checks for valid UTF-8 URLs and decodes them as ANSI if the check fails. However, the server always sends UTF-8 URLs (same as IIS, which always sends UTF-8 URLs regardless of whether a client accepts UTF-8 or not).
>> In my view, since the client parsing the HTML is 99% IE/FF/Opera/Safari/Chrome, which support UTF-8, there should be no problem.
> With those clients there won't be a problem IMO, as long as the listed file names use only characters from the local default ANSI code page. If the server shall also list any possible Unicode file name you need CB2009 and ICSv7.

Do you mean the filepath parameter of the TFileStream should be Unicode/BCB2009 to be able to access files with Chinese names when the local code page is Turkish or English? Is that what you mean?

SZ

--
To unsubscribe or change your settings for TWSocket mailing list please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
Visit our website at http://www.overbyte.be
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Fastream Technologies wrote:
>> With those clients there won't be a problem IMO, as long as the listed file names use only characters from the local default ANSI code page. If the server shall also list any possible Unicode file name you need CB2009 and ICSv7.
> Do you mean the filepath parameter of the TFileStream should be Unicode/BCB2009 to be able to access files with Chinese names when the local code page is Turkish or English? Is that what you mean?

Yes, but that's just one requirement; you also need to call the W versions of the Win32 API to create the file listing, FindFirstFileW() and FindNextFileW().

-- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Hello Arno,

If the function is ready, I would like to test it in our special unit for HTML folder listings. Can you post it here or send privately?

Best Regards,
SZ

On Fri, Oct 10, 2008 at 4:41 PM, Arno Garrels [EMAIL PROTECTED] wrote:
> Arno Garrels wrote:
>> Francois PIETTE wrote:
>>>> But 3 bytes looks like UTF-8 ?
>>> I don't know. You said it was UTF-16 if not encoded.
>> I installed IIS 7 on my Vista box and I found that IIS 7 uses UTF-7 in directory listings.
> Arrgh, typo above, IIS v7 uses UTF-8 of course!
>> The HTTP header contains the charset=UTF-8 content-type extension. However I think the ICS server should continue to use HTML entities. HTML entities represent both iso-8859-1 (Latin1) and Unicode character numbers (in Unicode the first 256 chars are the same as Latin1). So in order to create a _valid_ mapping an AnsiString MUST be converted with the current ANSI code page to a UnicodeString/WideString first! This can be achieved easily in TextToHtmlText() by a local WideString variable that is assigned parameter Src : String. Characters above #255 must then be represented as numerical HTML entities (&#nnnn;). That's all, fully backwards compatible and works in D2009 as well :)
> -- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Fastream Technologies wrote:
> Hello Arno,
> If the function is ready, I would like to test it in our special unit for HTML folder listings. Can you post it here or send privately?

Yes, I checked it in already. Install the TortoiseSVN client to get access to the ICS SVN repository.

In ICSv6, URL encoding/decoding does not use UTF-8. The file names in directory listings are now _displayed_ correctly, but their links have not changed; they are still ANSI. I can easily change URL coding in v6 to UTF-8 as well, however I wonder whether that would break existing applications?

-- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Fastream Technologies wrote:
> On Sat, Oct 11, 2008 at 6:24 PM, Arno Garrels [EMAIL PROTECTED] wrote:
>> Fastream Technologies wrote:
>>> Hello Arno,
>>> If the function is ready, I would like to test it in our special unit for HTML folder listings. Can you post it here or send privately?
>> Yes, I checked it in already. Install the TortoiseSVN client to get access to the ICS SVN repository. In ICSv6, URL encoding/decoding does not use UTF-8. The file names in directory listings are now _displayed_ correctly, but their links have not changed; they are still ANSI. I can easily change URL coding in v6 to UTF-8 as well, however I wonder whether that would break existing applications?
> Would it not work in BCB2007?

ICSv7 should work with CB2007 as ICSv6 did before (or even better), so you could test the UTF-8 URL stuff in ICSv7.

> Or is it just that applications need to be changed?

IMO possible, though the server checks for valid UTF-8 URLs and decodes them as ANSI if the check fails. However, the server always sends UTF-8 URLs (same as IIS, which always sends UTF-8 URLs regardless of whether a client accepts UTF-8 or not).

> In my view, since the client parsing the HTML is 99% IE/FF/Opera/Safari/Chrome, which support UTF-8, there should be no problem.

With those clients there won't be a problem IMO, as long as the listed file names use only characters from the local default ANSI code page. If the server shall also list any possible Unicode file name you need CB2009 and ICSv7.

-- Arno Garrels
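The fallback described above (try the percent-decoded bytes as UTF-8, and treat them as ANSI if that check fails) can be sketched roughly as follows. This is a Python illustration only, not the ICS implementation; the function name `decode_url_path` and the default code page `cp1252` are assumptions made for the example.

```python
from urllib.parse import unquote_to_bytes


def decode_url_path(path: str, ansi_code_page: str = "cp1252") -> str:
    """Hypothetical helper: percent-decode a URL path, preferring UTF-8."""
    raw = unquote_to_bytes(path)  # "%e2%85%94" -> b"\xe2\x85\x94"
    try:
        # Strict decoding doubles as the "is this valid UTF-8?" check.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the client sent bytes in the ANSI code page.
        return raw.decode(ansi_code_page, errors="replace")


print(decode_url_path("/%e2%85%94.txt"))  # UTF-8 encoded name
print(decode_url_path("/caf%E9.txt"))     # lone 0xE9 is not valid UTF-8
```

The weakness of this approach is that some ANSI byte sequences happen to be valid UTF-8 as well, so the check can misclassify a legacy URL; that is one way existing applications could be affected.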
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Francois PIETTE wrote:
>> But 3 bytes looks like UTF-8 ?
> I don't know. You said it was UTF-16 if not encoded.

I installed IIS 7 on my Vista box and I found that IIS 7 uses UTF-7 in directory listings. The HTTP header contains the charset=UTF-8 content-type extension. However I think the ICS server should continue to use HTML entities. HTML entities represent both iso-8859-1 (Latin1) and Unicode character numbers (in Unicode the first 256 chars are the same as Latin1). So in order to create a _valid_ mapping an AnsiString MUST be converted with the current ANSI code page to a UnicodeString/WideString first! This can be achieved easily in TextToHtmlText() by a local WideString variable that is assigned parameter Src : String. Characters above #255 must then be represented as numerical HTML entities (&#nnnn;). That's all, fully backwards compatible and works in D2009 as well :)

-- Arno Garrels

> ----- Original Message -----
> From: Arno Garrels [EMAIL PROTECTED]
> To: ICS support mailing twsocket@elists.org
> Sent: Thursday, October 09, 2008 7:03 PM
> Subject: Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
>
>> Francois PIETTE wrote:
>>>> The twothird character is not 'encoded' either as &#8532; (decimal) or as &#x2154; (hex)? If so, IIS sends plain UTF-16!
>>> Yes, no encoding at all. Just the 3 bytes. So UTF-16.
>> But 3 bytes looks like UTF-8 ?
>> -- Arno Garrels
>>
>>> -- [EMAIL PROTECTED] http://www.overbyte.be
>>>
>>> ----- Original Message -----
>>> From: Arno Garrels [EMAIL PROTECTED]
>>> To: ICS support mailing twsocket@elists.org
>>> Sent: Thursday, October 09, 2008 5:26 PM
>>> Subject: Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
>>>
>>>> Francois Piette wrote:
>>>>>> Yes, if someone has Apache or a newer IIS installed he could help. Create a file name with characters not in the current ANSI code page by copying those characters from the Windows application charmap.exe. Then start a packet sniffer and log a directory listing.
>>>>> Using IIS6 on W2K3.
>>>> Thanks!
>>>>> The twothird character (U+2154) is sent in the dirlist as 3 characters: 0xE2 0x85 0x94. In the href link, the 3 characters are expressed as %e2%85%94
>>>> That's UTF-8 URL-encoded.
>>>>> while they are binary in the text itself.
>>>> The twothird character is not 'encoded' either as &#8532; (decimal) or as &#x2154; (hex)? If so, IIS sends plain UTF-16!
>>>>> There is nothing in the html header to tell which code page or charset is used.
>>>> Browsers seem to be very good at detecting the correct character set nowadays.
>>>> -- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Arno Garrels wrote:
> Francois PIETTE wrote:
>>> But 3 bytes looks like UTF-8 ?
>> I don't know. You said it was UTF-16 if not encoded.
> I installed IIS 7 on my Vista box and I found that IIS 7 uses UTF-7 in directory listings.

Arrgh, typo above, IIS v7 uses UTF-8 of course!

> The HTTP header contains the charset=UTF-8 content-type extension. However I think the ICS server should continue to use HTML entities. HTML entities represent both iso-8859-1 (Latin1) and Unicode character numbers (in Unicode the first 256 chars are the same as Latin1). So in order to create a _valid_ mapping an AnsiString MUST be converted with the current ANSI code page to a UnicodeString/WideString first! This can be achieved easily in TextToHtmlText() by a local WideString variable that is assigned parameter Src : String. Characters above #255 must then be represented as numerical HTML entities (&#nnnn;). That's all, fully backwards compatible and works in D2009 as well :)

-- Arno Garrels
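The two-step idea above (convert the ANSI string to Unicode with the current code page first, then emit numeric character references) can be sketched as below. This is a Python illustration, not the ICS code: the function name `text_to_html_text`, its `ansi_code_page` parameter, and the choice to write every character above #127 numerically (rather than keeping named entities for #160..#255) are assumptions made for the example.

```python
def text_to_html_text(src: bytes, ansi_code_page: str = "cp1252") -> str:
    """Hypothetical Python analogue of TextToHtmlText(), not the ICS code."""
    # Step 1: ANSI bytes -> Unicode, using the code page the bytes came from.
    text = src.decode(ansi_code_page)

    # Step 2: escape markup characters, write everything non-ASCII as a
    # decimal numeric character reference (always a Unicode code point).
    escapes = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}
    out = []
    for ch in text:
        if ch in escapes:
            out.append(escapes[ch])
        elif ord(ch) > 127:
            out.append(f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)


# The same byte yields different references depending on the code page.
print(text_to_html_text(bytes([162]), "cp1252"))
print(text_to_html_text(bytes([162]), "cp1251"))
```

Because the output is pure ASCII, the page renders the same regardless of which charset the HTTP or HTML header declares.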
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
> Or am I missing something?

I think so. Using HTML entities makes sure the correct character is represented whatever character set or character code is used by the browser. The character code shown in the comments is for reference only and is only valid on some platforms.

-- [EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be

----- Original Message -----
From: Arno Garrels [EMAIL PROTECTED]
To: twsocket@elists.org
Sent: Thursday, October 09, 2008 9:47 AM
Subject: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

> In function TextToHtmlText() the HTML encoding of characters above #127 assumes code page iso-8859-1.
>
> const
>     HtmlSpecialChars : array [160..255] of String[6] = (
>         'nbsp'   , { #160 no-break space = non-breaking }
>         'iexcl'  , { #161 inverted exclamation }
>         'cent'   , { #162 cent sign ..
>
> This IMO should be exchanged by simple decimal notation: '&#' + IntToStr(Ord(Char)) + ';' in both ICSv6 and ICSv7. Decimal notation of characters <= #255 seems to work with all browsers, tested even with Netscape 3.01. With word-sized characters above #255 (D2009 and ICSv7) modern browsers render the correct Unicode characters.
>
> Or am I missing something?
>
> -- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
DZ-Jay wrote:
> Actually, I think Arno is correct, but it's a bit more complex than that: The entities conversion depends strictly on the local character set. That is, each character set *may* map differently (as Arno just discovered for the cent character between CP-1252 and CP-1251); there is no universal conversion, because the entities represent semantically equivalent characters in differing representations, not specific character codes. For this reason, the best solution is usually to use Unicode (UTF-8) in HTML output.

Probably correct if speed doesn't matter and internationalization is the goal. I guess that decimal-notated characters below #255 are treated as ANSI in the context of the content charset specified in the HTML header, is that correct? If so, the fastest fix would be just to add the correct content charset to the HTML header and to use decimal notation, provided that internationalization doesn't matter much. Character numbers above #255 are rendered as Unicode code points (tested). I only wonder whether browsers also treat characters below #255 as Unicode code points once they have found one character above #255?

> If you specify UTF-8 as the content character set in the HTML header, then you only need to encode as entities the metacharacters: ampersand, non-breaking-space, and left- and right-angled brackets.

Yep.

-- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Francois Piette wrote:
> In your example, char #162 is replaced by &cent; in the html output. This represents the cent character whatever the code page is.

Actually that is the bug, since #162 is the cent sign in CP 1252 but not in CP 1251. This function is used to generate directory listings; most file names including characters above #128 will be wrong when the server does not run on Windows CP 1252.

-- Arno Garrels
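The mismatch is easy to reproduce outside of ICS. The short Python check below (an illustration only, using Python's standard `cp1252` and `cp1251` codecs) decodes the same byte value under both code pages.

```python
import unicodedata

raw = bytes([162])  # the byte the table hard-codes as the cent sign

as_cp1252 = raw.decode("cp1252")  # Western European ANSI code page
as_cp1251 = raw.decode("cp1251")  # Cyrillic ANSI code page

for label, ch in (("cp1252", as_cp1252), ("cp1251", as_cp1251)):
    # Show the Unicode code point and official name each code page maps to.
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

A table indexed by byte value therefore only produces the right entity when the server's ANSI code page agrees with iso-8859-1 at that position.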
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Arno Garrels wrote:
> Francois Piette wrote:
>> In your example, char #162 is replaced by &cent; in the html output. This represents the cent character whatever the code page is.
> Actually that is the bug, since #162 is the cent sign in CP 1252 but not in CP 1251. This function is used to generate directory listings; most file names including characters above #128 will be wrong when the server does not run on Windows CP 1252.

Looking at the file listing IIS 5.1 returns: links are URL-encoded UTF-8 (as already added to ICSv7), but characters above #127 in file names are plain, non-encoded ANSI characters in the current default ANSI code page. There's no charset specified in either the HTTP or the HTML header.

-- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Fastream Technologies wrote:
> IIS5.1 is very old code (2001). Unfortunately my IIS7 Windows 2008 expired so I cannot check right now. Maybe somebody else can help??

Yes, if someone has Apache or a newer IIS installed he could help. Create a file name with characters not in the current ANSI code page by copying those characters from the Windows application charmap.exe. Then start a packet sniffer and log a directory listing.

-- Arno Garrels

> On Thu, Oct 9, 2008 at 2:59 PM, Arno Garrels [EMAIL PROTECTED] wrote:
>> Arno Garrels wrote:
>>> Francois Piette wrote:
>>>> In your example, char #162 is replaced by &cent; in the html output. This represents the cent character whatever the code page is.
>>> Actually that is the bug, since #162 is the cent sign in CP 1252 but not in CP 1251. This function is used to generate directory listings; most file names including characters above #128 will be wrong when the server does not run on Windows CP 1252.
>> Looking at the file listing IIS 5.1 returns: links are URL-encoded UTF-8 (as already added to ICSv7), but characters above #127 in file names are plain, non-encoded ANSI characters in the current default ANSI code page. There's no charset specified in either the HTTP or the HTML header.
>> -- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Actually, I think Arno is correct, but it's a bit more complex than that: The entities conversion depends strictly on the local character set. That is, each character set *may* map differently (as Arno just discovered for the cent character between CP-1252 and CP-1251); there is no universal conversion, because the entities represent semantically equivalent characters in differing representations, not specific character codes. For this reason, the best solution is usually to use Unicode (UTF-8) in HTML output.

If you specify UTF-8 as the content character set in the HTML header, then you only need to encode as entities the metacharacters: ampersand, non-breaking-space, and left- and right-angled brackets.

As for the HttpSrv.TextToHtmlText() method, it should take the content character set into consideration. However, if the mappings are too different, maintaining many different tables may not be practical.

dZ.

On Oct 9, 2008, at 05:09, Arno Garrels wrote:
> Francois Piette wrote:
>>> Or am I missing something?
>> I think so. Using html entities makes sure the correct character is represented whatever character set or character code is used by the browser.
> That's correct, but the server maps the wrong HTML entities if it doesn't run in a locale that uses CP 1252! For example: Currently char #162 is hard coded to represent the cent sign:
>
> HTML Entity: 'cent' , { #162 cent sign }
>
> In windows-1251 however #162 maps to the small Cyrillic letter U (short).

-- DZ-Jay [TeamICS] http://www.overbyte.be/eng/overbyte/teamics.html
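The UTF-8 route suggested above can be sketched in a few lines of Python (an illustration only, not ICS code). The function name `utf8_listing_page` and the page layout are made up for the example; `html.escape` is the standard-library call that escapes only the markup metacharacters.

```python
import html


def utf8_listing_page(names):
    """Hypothetical helper: build a tiny directory listing as UTF-8 bytes."""
    # Only markup metacharacters are escaped; all other characters are
    # written as-is and end up as plain UTF-8 bytes.
    items = "\n".join(f"<li>{html.escape(name)}</li>" for name in names)
    page = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head><body><ul>\n" + items + "\n</ul></body></html>"
    )
    return page.encode("utf-8")


body = utf8_listing_page(["\u2154 & <x>.txt"])
print(body)
```

The server would also need to send the same charset in the HTTP Content-Type header so that the declaration and the actual bytes agree.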
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
>> Using html entities makes sure the correct character is represented whatever character set or character code is used by the browser.
> That's correct, but the server maps the wrong HTML entities if it doesn't run in a locale that uses CP 1252! For example: Currently char #162 is hard coded to represent the cent sign:
>
> HTML Entity: 'cent' , { #162 cent }
>
> In windows-1251 however #162 maps to the small Cyrillic letter U (short).

TextToHtmlText does not use the character code. It uses the HTML entity. It is the browser which replaces the entity by the [hopefully] correct character.

In your example, char #162 is replaced by &cent; in the html output. This represents the cent character whatever the code page is.

Or is it me who is missing something...

-- [EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
>> In your example, char #162 is replaced by &cent; in the html output. This represents the cent character whatever the code page is.
> Actually that is the bug, since #162 is the cent sign in CP 1252 but not in CP 1251. This function is used to generate directory listings; most file names including characters above #128 will be wrong when the server does not run on Windows CP 1252.

Ah! I understand now what you mean. The table used by TextToHtml should be based on the CP.

I suggest adding an optional second argument to TextToHtml so that the user can specify the code page to be used. This argument could be the array to be used for conversion. Maybe the array of strings should be replaced by an array of records, each element holding the character code and the entity, so that the mechanism can be used for any number of character codes and for Unicode as well.

TextToHtml is used outside of the HTTP server component (at least I use it outside) and has no control over the HTML header where the code page could be specified. In that case it is up to the user to supply the correct table for his own context.

-- [EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be
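One way to read the suggestion above is a conversion table passed in as an optional argument, with each entry pairing a character code with its entity name. The Python sketch below is an illustration only: the names `text_to_html` and `LATIN1_ENTITIES` are hypothetical, the table is a three-entry excerpt, and keying it by Unicode code point (rather than by ANSI byte value) is an assumption of the example.

```python
# Hypothetical excerpt of an entity table: code point -> entity name.
LATIN1_ENTITIES = {0x00A0: "nbsp", 0x00A1: "iexcl", 0x00A2: "cent"}

MARKUP = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def text_to_html(text: str, table=LATIN1_ENTITIES) -> str:
    """Hypothetical helper: escape text using a caller-supplied entity table."""
    out = []
    for ch in text:
        if ch in MARKUP:
            out.append(MARKUP[ch])
        elif ord(ch) in table:
            out.append(f"&{table[ord(ch)]};")
        else:
            out.append(ch)  # anything the table does not cover is passed through
    return "".join(out)


print(text_to_html("5\u00a2 & <b>"))
print(text_to_html("\u20ac", {0x20AC: "euro"}))  # caller-supplied table
```

The caller stays responsible for converting ANSI input to Unicode before the lookup and for declaring a charset that covers any characters the table passes through.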
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
> Yes, if someone has Apache or a newer IIS installed he could help. Create a file name with characters not in the current ANSI code page by copying those characters from the Windows application charmap.exe. Then start a packet sniffer and log a directory listing.

Using IIS6 on W2K3.

The twothird character (U+2154) is sent in the dirlist as 3 characters: 0xE2 0x85 0x94. In the href link, the 3 characters are expressed as %e2%85%94 while they are binary in the text itself.

There is nothing in the html header to tell which code page or charset is used.

-- [EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Francois Piette wrote:
>>> In your example, char #162 is replaced by &cent; in the html output. This represents the cent character whatever the code page is.
>> Actually that is the bug, since #162 is the cent sign in CP 1252 but not in CP 1251. This function is used to generate directory listings; most file names including characters above #128 will be wrong when the server does not run on Windows CP 1252.
> Ah! I understand now what you mean. The table used by TextToHtml should be based on the CP.

But how? As far as I know, and I searched a lot, there are no entity tables available for code pages other than iso-8859-1.

> I suggest adding an optional second argument to TextToHtml so that the user can specify the code page to be used. This argument could be the array to be used for conversion.

That would only work _if_ different entity tables for different code pages existed, which is IMO _not_ the case. That's why HTML entity encoding should not be used for our purpose (mapped to character numbers). Those entities are nice to have when you design web pages manually. Their only purpose was to display particular characters _independent_ of the browser's or the HTML page's ANSI code page.

-- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Francois Piette wrote:
>> Or am I missing something?
> I think so. Using html entities makes sure the correct character is represented whatever character set or character code is used by the browser.

That's correct, but the server maps the wrong HTML entities if it doesn't run in a locale that uses CP 1252! For example: Currently char #162 is hard coded to represent the cent sign:

HTML Entity: 'cent' , { #162 cent sign }

In windows-1251 however #162 maps to the small Cyrillic letter U (short).

-- Arno Garrels
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Francois Piette wrote:
>> Yes, if someone has Apache or a newer IIS installed he could help.
>> Create a file name with characters not in the current ANSI code page
>> by copying those characters from the Windows application charmap.exe.
>> Then start a packet sniffer and log a directory listing.
>
> Using IIS6 on W2K3.

Thanks!

> The twothird character (U+2154) is sent in the dirlist as 3
> characters: 0xE2 0x85 0x94. In the href link, the 3 characters are
> expressed as %e2%85%94

That's UTF-8 URL-encoded.

> while they are binary in the text itself.

The twothird character is not 'encoded' either as &#8532; (decimal) or as &#x2154; (hex)? If so, IIS sends plain UTF-16!

> There is nothing in the html header to tell which code page or
> charset is used.

Browsers seem to be very good at detecting the correct character set nowadays.

--
Arno Garrels
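The bytes reported from the IIS6 capture can be reproduced with a few lines of Python. This is only an illustration of the encoding arithmetic, not of what IIS or ICS does internally; `quote` and `unquote` are the standard `urllib.parse` functions, which use UTF-8 by default.

```python
from urllib.parse import quote, unquote

two_thirds = "\u2154"  # VULGAR FRACTION TWO THIRDS

# The UTF-8 encoding of U+2154 is three bytes long.
utf8_bytes = two_thirds.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in utf8_bytes))  # 0xE2 0x85 0x94

# Percent-encoding those bytes gives the form seen in the href.
href = quote(two_thirds)
print(href)  # %E2%85%94

# Hex digits in percent-escapes are case-insensitive when decoding.
print(unquote("%e2%85%94") == two_thirds)  # True
```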
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
> The twothird character is not 'encoded' either as &#8532; (decimal)
> or as &#x2154; (hex)? If so, IIS sends plain UTF-16!

Yes, no encoding at all. Just the 3 bytes. So UTF-16.

--
[EMAIL PROTECTED]
http://www.overbyte.be
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
Francois PIETTE wrote:
>> The twothird character is not 'encoded' either as &#8532; (decimal)
>> or as &#x2154; (hex)? If so, IIS sends plain UTF-16!
>
> Yes, no encoding at all. Just the 3 bytes. So UTF-16.

But 3 bytes looks like UTF-8?

--
Arno Garrels
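The byte count settles the question raised here. A short Python check, given purely as an illustration (the variable `captured` stands for the three bytes reported from the packet capture):

```python
captured = bytes([0xE2, 0x85, 0x94])  # bytes seen in the directory listing
two_thirds = "\u2154"

# U+2154 takes three bytes in UTF-8 but only two in UTF-16.
print(len(two_thirds.encode("utf-8")))       # 3
print(len(two_thirds.encode("utf-16-le")))   # 2
print(two_thirds.encode("utf-16-be").hex())  # 2154

# The captured bytes decode cleanly as UTF-8 to the expected character.
print(captured.decode("utf-8") == two_thirds)  # True
```

Since the capture shows three bytes that decode to U+2154 as UTF-8, the listing text was sent as raw UTF-8, matching the UTF-8 percent-encoding used in the href.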
Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
> But 3 bytes looks like UTF-8?

I don't know. You said it was UTF-16 if not encoded.