Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-12 Thread Fastream Technologies
On Sat, Oct 11, 2008 at 8:27 PM, Arno Garrels [EMAIL PROTECTED] wrote:
 With those clients there won't be IMO a problem, as long as the
 listed file names use only characters from the local default
 ANSI code page. If the server shall also list any possible Unicode
 file name you need CB2009 and ICSv7.

Do you mean the filepath parameter of the TFileStream should be
Unicode/BCB2009 to be able to access files with Chinese names when the
local code page is Turkish or English? Is that what you mean?

SZ
-- 
To unsubscribe or change your settings for TWSocket mailing list
please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
Visit our website at http://www.overbyte.be


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-12 Thread Arno Garrels
Fastream Technologies wrote:
 
 With those clients there won't be IMO a problem, as long as the
 listed file names use only characters from the local default
 ANSI code page. If the server shall also list any possible Unicode
 file name you need CB2009 and ICSv7.
 
 Do you mean the filepath parameter of the TFileStream should be
 Unicode/BCB2009 to be able to access files with Chinese names when the
 local code page is Turkish or English? Is that what you mean?
 
Yes, but that's just one requirement, you also need to call the 
w-versions of the Win32 API to create the file listing, 
FindFirstFileW() and FindNextFileW(). 

--
Arno Garrels 


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-11 Thread Fastream Technologies
Hello Arno,

If the function is ready, I would like to test it in our special unit for
HTML folder listings. Can you post it here or send privately?

Best Regards,
SZ


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-11 Thread Arno Garrels
Fastream Technologies wrote:
 Hello Arno,
 
 If the function is ready, I would like to test it in our special unit
 for HTML folder listings. Can you post it here or send privately?

Yes, I checked it in already. Install the TortoiseSVN client to get 
access to the ICS SVN repository.

In ICSv6, URL encoding/decoding does not use UTF-8. The file names in
directory listings are now _displayed_ correctly, but their links
have not changed; they are still ANSI. I can easily change URL coding
in v6 to UTF-8 as well, but I wonder whether that would break
existing applications.

--
Arno Garrels


 


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-11 Thread Arno Garrels
Fastream Technologies wrote:
 On Sat, Oct 11, 2008 at 6:24 PM, Arno Garrels [EMAIL PROTECTED]
 wrote: 
 
 Fastream Technologies wrote:
 Hello Arno,
 
 If the function is ready, I would like to test it in our special
 unit for HTML folder listings. Can you post it here or send
 privately? 
 
 Yes, I checked it in already. Install the TortoiseSVN client to get
 access to the ICS SVN repository.
 
 In ICSv6, URL encoding/decoding does not use UTF-8. The file names in
 directory listings are now _displayed_ correctly, but their links
 have not changed; they are still ANSI. I can easily change URL coding
 in v6 to UTF-8 as well, but I wonder whether that would break
 existing applications.
 
 Would it not work in BCB2007? 

ICSv7 should work with CB2007 just as ICSv6 did before (or even better),
so you could test the UTF-8 URL stuff in ICSv7.
 
 Or is it just applications need to be
 changed? 

IMO possible, though the server checks incoming URLs for valid UTF-8
and decodes them as ANSI if the check fails. However, the server
always sends UTF-8 URLs (just as IIS always sends UTF-8 URLs,
regardless of whether a client accepts UTF-8 or not).
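
For readers who want to see the fallback in isolation, here is a minimal
Python sketch of the check described above: percent-decode the URL, try
strict UTF-8, and fall back to an ANSI code page if that fails. The name
decode_url_path and the cp1252 default are assumptions made for this
illustration; this is not the ICS implementation.

```python
from urllib.parse import unquote_to_bytes


def decode_url_path(path: str, ansi_codepage: str = "cp1252") -> str:
    """Percent-decode a URL path, preferring UTF-8 over the ANSI code page."""
    raw = unquote_to_bytes(path)  # "%e2%85%94" -> b"\xe2\x85\x94"
    try:
        # Strict decoding doubles as the "is this valid UTF-8?" check.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume an older client sent ANSI bytes.
        return raw.decode(ansi_codepage)


print(decode_url_path("/%e2%85%94.txt"))  # UTF-8 encoded link
print(decode_url_path("/caf%E9.txt"))     # single ANSI byte 0xE9
```

The fallback is only a heuristic: a byte sequence that happens to be valid
UTF-8 is always read as UTF-8, even if the client meant ANSI.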

 In my view, since the client for parsing the html is 99%
 IE/FF/Opera/Safari/Chrome which support UTF-8, there should be no
 problem. 

With those clients there won't IMO be a problem, as long as the
listed file names use only characters from the local default
ANSI code page. If the server should also list any possible Unicode
file name, you need CB2009 and ICSv7.

--
Arno Garrels


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-10 Thread Arno Garrels
Francois PIETTE wrote:
 But 3 bytes looks like UTF-8 ?
 
 I don't know. You said it was UTF-16 if not encoded.

I installed IIS 7 on my Vista box and I found that IIS 7
uses UTF-7 in directory listings. The HTTP header contains 
the charset=UTF-8 content-type extension.


However, I think the ICS server should continue to use HTML
entities.
HTML entities represent both ISO-8859-1 (Latin-1) and Unicode
character numbers (in Unicode the first 256 chars are the same as
Latin-1). So in order to create a _valid_ mapping, an AnsiString MUST be
converted with the current ANSI code page to a UnicodeString/WideString
first! This can be achieved easily in TextToHtmlText() by a local
WideString variable that is assigned the parameter Src : String.
Characters above #255 must then be represented as numeric HTML
entities (&#NNNN;). That's all, fully backwards compatible, and it
works in D2009 as well :)
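
The approach above can be sketched in a few lines of Python: decode the
ANSI bytes with the code page first, then emit named entities for the
Latin-1 range and numeric entities above #255. The function name
text_to_html_text and its ansi_codepage parameter are hypothetical and
only mirror the idea; the real TextToHtmlText() is Delphi code and may
differ in detail.

```python
from html.entities import codepoint2name


def text_to_html_text(raw: bytes, ansi_codepage: str = "cp1252") -> str:
    """Encode ANSI text as HTML after mapping it to Unicode first."""
    # Step 1: the code page conversion (the WideString assignment in Delphi).
    text = raw.decode(ansi_codepage)
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in '&<>"' or 160 <= cp <= 255:
            # Named entity; the HTML 4 table covers U+00A0..U+00FF completely.
            out.append(f"&{codepoint2name[cp]};")
        elif cp > 255:
            # Numeric entity with the decimal Unicode code point.
            out.append(f"&#{cp};")
        else:
            out.append(ch)
    return "".join(out)


print(text_to_html_text(bytes([162]), "cp1252"))  # &cent;
print(text_to_html_text(bytes([162]), "cp1251"))  # &#1118;
```

With the conversion in place, byte #162 becomes the cent entity only on a
CP 1252 system; on CP 1251 it becomes the numeric entity for U+045E.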

--
Arno Garrels


 


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-10 Thread Arno Garrels
Arno Garrels wrote:
 Francois PIETTE wrote:
 But 3 bytes looks like UTF-8 ?
 
 I don't know. You said it was UTF-16 if not encoded.
 
 I installed IIS 7 on my Vista box and I found that IIS 7
 uses UTF-7 in directory listings. 

Arrgh, typo above, IIS v7 uses UTF-8 of course!



Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Francois Piette
 Or am I missing something?

I think so. Using HTML entities makes sure the correct character is
represented whatever character set or character code is used by the
browser.

The character code shown in the comments is for reference only and is
only valid on some platforms.

--
[EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be

- Original Message - 
From: Arno Garrels [EMAIL PROTECTED]
To: twsocket@elists.org
Sent: Thursday, October 09, 2008 9:47 AM
Subject: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()


 In function TextToHtmlText() the HTML encoding of characters above #127
 assumes code page  iso-8859-1.

 const
   HtmlSpecialChars : array [160..255] of String[6] = (
     'nbsp'   , { #160 no-break space = non-breaking }
     'iexcl'  , { #161 inverted exclamation }
     'cent'   , { #162 cent sign }
     ..

 This IMO should be replaced by simple decimal notation:
 '&#' + IntToStr(Ord(Char)) + ';' in both ICSv6 and ICSv7.

 Decimal notation of characters <= #255 seems to work with all browsers,
 tested even with Netscape 3.01.
 With word-sized characters above #255 (D2009 and ICSv7) modern browsers
 render the correct Unicode characters.

 Or am I missing something?

 --
 Arno Garrels



Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
DZ-Jay wrote:
 Actually, I think Arno is correct, but it's a bit more complex than
 that:
 
 The entity conversion depends strictly on the local character set.
 That is, each character set *may* map differently (as Arno just
 discovered for the cent character between CP-1252 and CP-1251);
 there is no universal conversion because the entities
 represent semantically equivalent characters in differing
 representations, not specific character codes.
 
 For this reason, the best solution is usually to use Unicode (UTF-8)
 in HTML output.  

Probably correct if speed doesn't matter and internationalization
is the goal.

I guess that decimal-notated characters below #255 are treated as ANSI
in the context of the content charset specified in the HTML header, is that
correct? If so, the fastest fix would be just to add the correct content charset
to the HTML header and to use decimal notation, provided that
internationalization doesn't matter much.

Character numbers above #255 are rendered as Unicode code points (tested).
I only wonder whether browsers also treat characters below #255 as Unicode
code points once they have found one character above #255?
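
On that question: as far as the HTML 4.01 specification goes, a numeric
character reference names a Unicode (ISO 10646) code point regardless of
the page's charset. The small Python sketch below uses html.unescape from
the standard library as a stand-in for a browser; decode_ref is a
hypothetical helper, and real browsers of that era are not tested here.

```python
import html


def decode_ref(number: int) -> str:
    """Resolve a decimal numeric character reference such as &#162;."""
    return html.unescape(f"&#{number};")


for number in (162, 1118, 8532):
    ch = decode_ref(number)
    print(f"&#{number}; -> U+{ord(ch):04X}")
```

So &#162; is the cent sign (U+00A2) in every charset, and the Cyrillic
short u has to be written as &#1118; (U+045E), not as &#162;.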

 If you specify UTF-8 as the content character set in
 the HTML header, then you only need to encode as entities the
 metacharacters:  ampersand, non-breaking-space, and left- and
 right-angled brackets.

Yep.

--
Arno Garrels



Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
Francois Piette wrote:
 
 In your example, char #162 is replaced by &cent; in the html
 output. This represents the cent character whatever the code page is.

Actually that is the bug, since #162 is the cent sign in CP 1252 but 
not in CP 1251. This function is used to generate directory listings,
most file names including characters above #128 will be wrong
when the server does not run on Windows CP 1252.

--
Arno Garrels


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
Arno Garrels wrote:
 Francois Piette wrote:
 
 In your example, char #162 is replaced by &cent; in the html
 output. This represents the cent character whatever the code page is.
 
 Actually that is the bug, since #162 is the cent sign in CP 1252 but
 not in CP 1251. This function is used to generate directory listings,
 most file names including characters above #128 will be wrong
 when the server does not run on Windows CP 1252.

Looking at the file listing IIS 5.1 returns:
Links are URL-encoded UTF-8 (as already added to ICSv7), but characters
above #127 in file names are plain, non-encoded ANSI characters in the current
default ANSI code page. No charset is specified in either the HTTP or the HTML
header.

--
Arno Garrels   


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
Fastream Technologies wrote:
 IIS5.1 is very old code (2001). Unfortunately my IIS7 Windows 2008
 expired so I cannot check right now. Maybe somebody else can help??

Yes, if someone has Apache or a newer IIS installed, they could help.
Create a file name with characters not in the current ANSI code page by copying
those characters from the Windows application charmap.exe.
Then start a packet sniffer and log a directory listing.

--
Arno Garrels

 


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread DZ-Jay
Actually, I think Arno is correct, but it's a bit more complex than 
that:

The entity conversion depends strictly on the local character set.
That is, each character set *may* map differently (as Arno just
discovered for the cent character between CP-1252 and CP-1251); there
is no universal conversion because the entities represent
semantically equivalent characters in differing representations, not
specific character codes.

For this reason, the best solution is usually to use Unicode (UTF-8) in 
HTML output.  If you specify UTF-8 as the content character set in the 
HTML header, then you only need to encode as entities the 
metacharacters:  ampersand, non-breaking-space, and left- and 
right-angled brackets.
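
A rough Python sketch of that UTF-8 route, assuming a directory listing
is built from a list of file names: the page declares charset=UTF-8,
only the markup metacharacters are escaped, and everything else is sent
as raw UTF-8 bytes. The function render_listing_utf8 and the page layout
are made up for this illustration.

```python
import html


def render_listing_utf8(names):
    """Build a tiny HTML listing and return it as UTF-8 encoded bytes."""
    # Only &, < and > are escaped; all other characters stay as they are.
    items = "\n".join(
        f"<li>{html.escape(name, quote=False)}</li>" for name in names
    )
    page = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        "</head><body><ul>\n" + items + "\n</ul></body></html>"
    )
    return page.encode("utf-8")


print(render_listing_utf8(["a&b \u2154.txt"]).decode("utf-8"))
```

No entity table is needed at all in this variant, which is what makes it
independent of the server's ANSI code page.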

As for HttpSrv.TextToHtmlText() method, it should take the content 
character set into consideration.  However, if the mappings are too 
different, maintaining many different tables may not be practical.

dZ.


-- 
DZ-Jay [TeamICS]
http://www.overbyte.be/eng/overbyte/teamics.html



Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Francois Piette
  Using html entities make sure the correct character is
  represented whatever the character set or character code is used by
  the browser.

 That's correct, but the server maps the wrong HTML entities if it doesn't
 run in a locale that uses CP 1252!

 For example:
 Currently char #162 is hard coded to represent the cent sign:
 HTML Entity: 'cent'   , { #162 cent }

 In windows-1251 however #162 maps to the small Cyrillic letter U (short).

TextToHtmlText does not use the character code. It uses the HTML entity. It is
the browser which replaces the entity by the [hopefully] correct character.

In your example, char #162 is replaced by &cent; in the html output. This
represents the cent character whatever the code page is. Or is it me who is
missing something...

--
[EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be



Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Francois Piette
  In your example, char #162 is replaced by &cent; in the html
  output. This represents the cent character whatever the code page is.

 Actually that is the bug, since #162 is the cent sign in CP 1252 but
 not in CP 1251. This function is used to generate directory listings,
 most file names including characters above #128 will be wrong
 when the server does not run on Windows CP 1252.

Ah! I understand now what you mean.
The table used by TextToHtml should be based on the CP.
I suggest adding an optional second argument to TextToHtml so that the user
can specify the code page to be used. This argument could be the array to be
used for conversion. Maybe the array of strings should be replaced by an
array of records, each element having the character code and the entity, so
that the mechanism can be used for any number of character codes and for
Unicode as well.

TextToHtml is used outside of the HTTP server component (at least I use it
outside) and has no control over the HTML header where the code page could be
specified. In that case, it's up to the user to supply the correct table
for his own context.

--
[EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be



Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Francois Piette
 Yes, if someone has Apache or a newer IIS installed, they could help.
 Create a file name with characters not in the current ANSI code page by copying
 those characters from the Windows application charmap.exe.
 Then start a packet sniffer and log a directory listing.

Using IIS6 on W2K3.
The two-thirds character (U+2154) is sent in the dirlist as 3 bytes:
0xE2 0x85 0x94. In the href link, the 3 bytes are expressed as
%e2%85%94 while they are binary in the text itself. There is nothing in the
HTML header to tell which code page or charset is used.
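
The captured bytes can be cross-checked with a few lines of Python. This
is only a sketch for verifying the encoding of U+2154; the variable names
are arbitrary and nothing here talks to IIS.

```python
from urllib.parse import quote

two_thirds = "\u2154"  # VULGAR FRACTION TWO THIRDS

utf8_bytes = two_thirds.encode("utf-8")       # three bytes
utf16_bytes = two_thirds.encode("utf-16-le")  # two bytes, for comparison
href_form = quote(two_thirds, safe="").lower()

print(utf8_bytes.hex(" "))   # e2 85 94
print(utf16_bytes.hex(" "))  # 54 21
print(href_form)             # %e2%85%94
```

Three bytes starting with 0xE2 match the UTF-8 form; UTF-16 would need
only two bytes for this character.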
--
[EMAIL PROTECTED]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be




Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
Francois Piette wrote:
 In your example, char #162 is replaced by &cent; in the html
 output. This represents the cent character whatever the code page is.
 
 Actually that is the bug, since #162 is the cent sign in CP 1252 but
 not in CP 1251. This function is used to generate directory listings,
 most file names including characters above #128 will be wrong
 when the server does not run on Windows CP 1252.
 
 Ah ! I understand now what you mean.
 The table used by TextToHtml should be based on the CP.

But how? As far as I know, and I searched a lot, there are no entity tables
available for other code pages than iso-8859-1.

 I suggest adding an optional second argument to TextToHtml so that
 the user can specify the code page to be used. This argument could be
 the array to be used for conversion. 

That would only work _IF_ different entity tables for different
code pages existed, which is IMO _not_ the case. That's why HTML entity
encoding should not be used for our purpose (mapped to character numbers).
Those entities are nice to have when you design web pages manually.
Their only purpose was to display particular characters _independent_ of
the browser's or the HTML page's ANSI code page.

--
Arno Garrels




Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
Francois Piette wrote:
 Or am I missing something?
 
 I think so. Using HTML entities makes sure the correct character is
 represented whatever character set or character code is used by
 the browser.

That's correct, but the server maps the wrong HTML entities if it doesn't run 
in a locale that uses CP 1252!

For example:
Currently char #162 is hard coded to represent the cent sign:
HTML Entity: 'cent'   , { #162 cent sign }

In windows-1251, however, #162 maps to the Cyrillic small letter short U.
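
A small sketch of the mismatch described above, in Python rather than Delphi. It decodes the same byte value under both code pages using Python's standard `cp1252` and `cp1251` codecs; the variable names are illustrative only.

```python
import unicodedata

BYTE_162 = bytes([162])

# The same byte value decodes to different characters per code page.
as_cp1252 = BYTE_162.decode("cp1252")
as_cp1251 = BYTE_162.decode("cp1251")

for label, ch in (("cp1252", as_cp1252), ("cp1251", as_cp1251)):
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

A table indexed by the byte value alone therefore cannot be right for both locales.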

--
Arno Garrels

 
 
 - Original Message -
 From: Arno Garrels [EMAIL PROTECTED]
 To: twsocket@elists.org
 Sent: Thursday, October 09, 2008 9:47 AM
 Subject: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()
 
 
 In function TextToHtmlText() the HTML encoding of characters above
 #127 assumes code page  iso-8859-1.
 
 const
     HtmlSpecialChars : array [160..255] of String[6] = (
         'nbsp'   , { #160 no-break space = non-breaking }
         'iexcl'  , { #161 inverted exclamation }
         'cent'   , { #162 cent sign }
 
 ..
 
 This IMO should be replaced by simple decimal notation:
 '&#' + IntToStr(Ord(Char)) + ';' in both Icsv6 and Icsv7.
 
 Decimal notation of characters <= #255 seems to work with all
 browsers, tested even with Netscape 3.01.
 With word-sized characters above #255 (D2009 and ICSv7) modern
 browsers render the correct Unicode characters.
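
The decimal notation proposed above might look like the following sketch. It is written in Python rather than Delphi, and `text_to_html_numeric` is a hypothetical helper, not the ICS TextToHtmlText() function. It assumes the input is already a Unicode string, because numeric character references denote Unicode code points rather than ANSI byte values.

```python
import html

def text_to_html_numeric(text: str) -> str:
    """Escape markup characters, then write every non-ASCII character
    as a decimal numeric character reference."""
    escaped = html.escape(text, quote=True)
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in escaped)

print(text_to_html_numeric("5\u00a2 <b>"))
print(text_to_html_numeric("\u2154.txt"))
```

Text that arrives as ANSI bytes would first have to be decoded with the server's code page, so that each reference carries the intended code point.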
 
 Or am I missing something?
 
 --
 Arno Garrels
 


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
Francois Piette wrote:
 Yes, if someone has Apache or a newer IIS installed he could help.
 Create a file name with characters not in the current ANSI code page by
 copying those characters from the Windows application charmap.exe.
 Then start a packet sniffer and log a directory listing.
 
 Using IIS6 on W2K3.

Thanks!

 The two-thirds character (U+2154) is sent in the dirlist as 3
 bytes: 0xE2 0x85 0x94. In the href link, the 3 bytes are
 expressed as %e2%85%94

That's UTF-8 URL-encoded.

 while they are binary in the text itself.

The two-thirds character is not 'encoded' either as &#8532; (decimal) or
as &#x2154; (hex)? If so, IIS sends plain UTF-16!

 There is nothing in the html header to tell which code page or
 charset is used. --

Browsers seem to be very good at detecting the correct character set
nowadays.

--
Arno Garrels


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Francois PIETTE
 The two-thirds character is not 'encoded' either as &#8532; (decimal) or
 as &#x2154; (hex)? If so, IIS sends plain UTF-16!

Yes, no encoding at all. Just the 3 bytes. So UTF-16.

-- 
[EMAIL PROTECTED]
http://www.overbyte.be




Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Arno Garrels
Francois PIETTE wrote:
 The two-thirds character is not 'encoded' either as &#8532;
 (decimal) or as &#x2154; (hex)? If so, IIS sends plain UTF-16!
 
 Yes, no encoding at all. Just the 3 bytes. So UTF-16.

But 3 bytes looks like UTF-8 ?
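
One quick way to check the byte count is the sketch below, in Python rather than Delphi. It encodes U+2154 both ways and compares the result against the three bytes seen in the packet trace; the names are illustrative only.

```python
TWO_THIRDS = "\u2154"  # VULGAR FRACTION TWO THIRDS

utf8 = TWO_THIRDS.encode("utf-8")
utf16 = TWO_THIRDS.encode("utf-16-le")  # little-endian, no byte order mark

# Bytes reported from the directory listing in the trace.
observed = bytes([0xE2, 0x85, 0x94])

print("UTF-8 :", len(utf8), "bytes:", utf8.hex(" "))
print("UTF-16:", len(utf16), "bytes:", utf16.hex(" "))
print("trace matches UTF-8 :", observed == utf8)
print("trace matches UTF-16:", observed == utf16)
```

A character in the Basic Multilingual Plane takes one 16-bit code unit in UTF-16, so a three-byte sequence for it points to UTF-8.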

--
Arno Garrels

 


Re: [twsocket] HTML encoding in HttpSrv func. TextToHtmlText()

2008-10-09 Thread Francois PIETTE
 But 3 bytes looks like UTF-8 ?

I don't know. You said it was UTF-16 if not encoded.
