Re: [twsocket] URL encoding

2008-09-28 Thread Francois PIETTE
 Can somebody confirm that characters above #127 have to be
 encoded UTF-8 first before they are percent-encoded?
 If that's correct, Url.pas was and is currently buggy.

When I use IE to get the URL http://www.myhost.com/Fête (note the lowercase 
e with circumflex), it sends GET /F%C3%AAte to the web server. This 
probably answers your question, if we assume IE follows the standard for URLs.
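
The observation above can be reproduced with a minimal sketch, written in Python rather than Delphi and not taken from Url.pas. It assumes only the standard-library function urllib.parse.quote; the variable names are illustrative.

```python
from urllib.parse import quote

path = "/Fête"

# Encode the characters as UTF-8 first, then percent-encode each octet.
# This matches the request line the browser was seen to send.
utf8_encoded = quote(path, encoding="utf-8")

# Percent-encoding the single Latin-1 octet instead gives a different URL.
latin1_encoded = quote(path, encoding="latin-1")

print(utf8_encoded)    # /F%C3%AAte
print(latin1_encoded)  # /F%EAte
```

The two results differ only in the escaped part, which is why an encoder that skips the UTF-8 step produces URLs that a UTF-8-expecting server cannot match.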

--
[EMAIL PROTECTED]
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be


-- 
To unsubscribe or change your settings for TWSocket mailing list
please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
Visit our website at http://www.overbyte.be


Re: [twsocket] URL encoding

2008-09-28 Thread Arno Garrels
Francois PIETTE wrote:
 Can somebody confirm that characters above #127 have to be
 encoded UTF-8 first before they are percent-encoded?
 If that's correct, Url.pas was and is currently buggy.
 
 When I use IE to get the url http://www.myhost.com/Fête (note the
 lowercase e with circumflex), it sends GET /F%C3%AAte to the
 webserver. This probably answers your question if we assume IE is
 standard with URL. 

Firefox sends URLs UTF-8 encoded as well. 

--
Arno Garrels




Re: [twsocket] URL encoding

2008-09-28 Thread DZ-Jay

On Sep 27, 2008, at 12:14, Arno Garrels wrote:

 Can somebody confirm that characters above #127 have to be
 encoded UTF-8 first before they are percent-encoded?
 If that's correct, Url.pas was and is currently buggy.

I can't find anything in the HTTP and URI RFCs about this specific 
scenario.  The HTTP protocol definition defers the syntax of the URL to 
RFC 2396 (Uniform Resource Identifiers).  But this RFC in turn does not 
mandate a specific character set; in fact it says that each transport 
may use whatever character set it wants, and that if more than one is 
allowed, it should provide a mechanism for selection.  However, as I 
mentioned, the HTTP RFC seems to be silent about this.  Older versions 
of the URI RFC allowed only 7-bit ASCII, but this is not the case any 
more.

 From RFC 2396: http://www.ietf.org/rfc/rfc2396.txt

2.1 URI and non-ASCII characters

The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a character
(as a distinguishable semantic entity) and an octet (an 8-bit
byte). There are two mappings, one from URI characters to octets, and
a second from octets to original characters:

URI character sequence->octet sequence->original character sequence

A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URI might be transported by means that
are not through a computer network, e.g., printed on paper, read over
the radio, etc.

A URI scheme may define a mapping from URI characters to octets;
whether this is done depends on the scheme. Commonly, within a
delimited component of a URI, a sequence of characters may be used to
represent a sequence of octets. For example, the character "a"
represents the octet 97 (decimal), while the character sequence "%",
"0", "a" represents the octet 10 (decimal).

There is a second translation for some resources: the sequence of
octets defined by a component of the URI is subsequently used to
represent a sequence of characters. A 'charset' defines this mapping.
There are many charsets in use in Internet protocols. For example,
UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
of characters in the repertoire of ISO 10646.

In the simplest case, the original character sequence contains only
characters that are defined in US-ASCII, and the two levels of
mapping are simple and easily invertible: each 'original character'
is represented as the octet for the US-ASCII code for it, which is,
in turn, represented as either the US-ASCII character, or else the
"%" escape sequence for that octet.

For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences
are expected to provide some way of identifying the charset used, if
there might be more than one [RFC2277].  However, there is currently
no provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single
charset, define a default charset, or provide a way to indicate the
charset used.
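
The two mappings in the quoted section can be made concrete with a small sketch. It is in Python and uses only urllib.parse.unquote_to_bytes from the standard library; the sample strings are illustrative, and the choice of Latin-1 as the "wrong" charset is an assumption made for the example.

```python
from urllib.parse import unquote_to_bytes

# Mapping 1: URI characters -> octets. No charset is involved here.
octet_for_a = unquote_to_bytes("a")      # the character "a"
octet_for_0a = unquote_to_bytes("%0a")   # the characters "%", "0", "a"

# Mapping 2: octets -> original characters. This step needs a charset,
# and the generic URI syntax does not say which one.
octets = unquote_to_bytes("F%C3%AAte")
as_utf8 = octets.decode("utf-8")
as_latin1 = octets.decode("latin-1")

print(list(octet_for_a))   # [97]
print(list(octet_for_0a))  # [10]
print(as_utf8)             # Fête
print(as_latin1)           # FÃªte
```

The first mapping is unambiguous; the second gives different text depending on the charset the receiver assumes.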

The idea is that a URI can be used in print and other media, not only 
in computer transport systems, so the character set is defined by the 
target medium (scheme).  In the example that Francois gave,
http://www.myhost.com/Fête
that URI is perfectly valid (according to the URI RFC), precisely 
because I should be able to print that text in a book or poster without 
having to encode it further.  The semantics (i.e. the meaning of the 
characters) are applied by the target client:  a French reader in this 
example, who knows that the character set is the one allowed by his 
written language.

However, the question is what representation the HTTP protocol 
specifically needs, and I can't seem to find anything about this in the 
RFCs.  RFC 2616 goes to great lengths in defining character-encoding 
mechanisms for the content, but I can't find anything for the request 
URI itself.

As the quote above describes, there is a distinction between the 
semantics and the syntax of a URI.  Syntactically, an HTTP URL allows 
only a subset of the visible characters of the US-ASCII set, and all 
other characters must be encoded using %HEX encoding, including any 
reserved characters.  Semantically, however, I can't find any 
specification.  What I mean is: what character set does the HTTP 
protocol use outside the transport encoding?

For example, suppose you have a URL in Japanese, and your application 
transforms it into a URL-encoded string and gives it to the HTTP 
server.  When the server receives it and decodes it, it still 

Re: [twsocket] URL encoding

2008-09-28 Thread Arno Garrels
DZ-Jay wrote:
 I've seen UTF-8 used all the time (and that's what I've used, too),
 and in fact that's probably what IE uses--but I can't find it anywhere
 specified as the HTTP protocol character set--unless I'm missing
 something.  It may be that UTF-8, by convention or tradition, is the
 de facto character set, but is this the rule?
 
 Can anybody find anything else?

It doesn't seem to be mandatory, but RFC 3986 (January 2005) suggests 
using UTF-8.

In my local copy I changed UrlEncode() to produce correct UTF-8, and 
UrlDecode() to assume UTF-8 if the byte sequence to be decoded passes a 
check for valid UTF-8; otherwise the function assumes the local default 
code page in D2009, or leaves the encoding unchanged in older Delphi 
versions. The fact that both IE and Firefox send UTF-8 URLs seems to 
confirm this change. 
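
The decoding rule described above might look roughly like the following sketch. It is in Python, not Delphi, and is not the ICS code: the function name url_decode, its fallback_encoding parameter, and the cp1252 default standing in for "local default code page" are all assumptions made for illustration.

```python
from urllib.parse import unquote_to_bytes


def url_decode(text: str, fallback_encoding: str = "cp1252") -> str:
    """Percent-decode text, preferring UTF-8 over a local code page.

    Hypothetical helper: the name and the cp1252 default are assumptions.
    """
    raw = unquote_to_bytes(text)
    try:
        # Strict decoding doubles as the "is this valid UTF-8?" check.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the local code page instead.
        return raw.decode(fallback_encoding, errors="replace")


print(url_decode("/F%C3%AAte"))  # /Fête  (valid UTF-8)
print(url_decode("/F%EAte"))     # /Fête  (falls back to cp1252)
```

One limit of this heuristic is that a byte sequence written in a local code page can happen to be valid UTF-8 as well, in which case it is decoded as UTF-8.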

--
Arno Garrels


Re: [twsocket] URL encoding

2008-09-28 Thread Fastream Technologies
I can confirm that both browsers also send non-ANSI Turkish characters
UTF-8 encoded:

ğ = %C4%9F
This is the soft g, a letter of the Turkish alphabet.
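
The value %C4%9F follows from the two-byte UTF-8 layout applied to the code point of ğ (U+011F). A short worked sketch in Python, with illustrative variable names:

```python
code_point = ord("\u011f")  # ğ, U+011F

# Two-byte UTF-8 form, valid for U+0080..U+07FF: 110xxxxx 10xxxxxx
lead = 0b11000000 | (code_point >> 6)            # upper 5 bits
trail = 0b10000000 | (code_point & 0b00111111)   # lower 6 bits

percent_encoded = f"%{lead:02X}%{trail:02X}"
print(percent_encoded)  # %C4%9F
```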

Regards,

SZ

Re: [twsocket] URL encoding

2008-09-28 Thread DZ-Jay

On Sep 28, 2008, at 09:49, Arno Garrels wrote:

 It doesn't seem to be mandatory, however suggested to use UTF-8 since
 January 2005, RFC 3986

Thank you!  For some reason I missed that 3986 obsoletes 2396.

dZ.

-- 
DZ-Jay [TeamICS]
http://www.overbyte.be/eng/overbyte/teamics.html



Re: [twsocket] URL encoding

2008-09-28 Thread Arno Garrels
DZ-Jay wrote:
 On Sep 28, 2008, at 09:49, Arno Garrels wrote:
 
 It doesn't seem to be mandatory, however suggested to use UTF-8 since
 January 2005, RFC 3986
 
 Thank you!  For some reason I missed that 3986 obsoletes 2396.

If you are interested, I just checked in my UTF-8 changes (v7).
The web server demo now lists all kinds of Arabic files :) The links
do work; however, TextToHtmlText() still shows a ? when a character
cannot be HTML encoded.
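
One common way around the "?" symptom is to emit numeric character references for anything the page charset cannot represent. The sketch below is in Python and is only an illustration of that idea: text_to_html_text is a hypothetical stand-in and not the ICS TextToHtmlText() implementation, and cp1252 is an assumed page charset.

```python
import html


def text_to_html_text(text: str) -> str:
    """Hypothetical stand-in, not the ICS TextToHtmlText() routine."""
    escaped = html.escape(text)  # handles &, <, > and quotes
    # Every non-ASCII character becomes a numeric character reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


file_name = "\u0645\u0644\u0641.txt"  # an Arabic file name

# Forcing the name into an ANSI code page loses the characters.
lossy = file_name.encode("cp1252", "replace").decode("cp1252")

print(lossy)                         # ???.txt
print(text_to_html_text(file_name))  # &#1605;&#1604;&#1601;.txt
```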

--
Arno


Re: [twsocket] URL encoding

2008-09-27 Thread Arno Garrels
Arno Garrels wrote:
 Hi,
 
 Can somebody confirm that characters above #127 have to be
 encoded UTF-8 first before they are percent-encoded?
 If that's correct, Url.pas was and is currently buggy.

The same or similar functions are used in the HTTP server.
I have a fix for OverbyteIcsUrl.pas but won't check it in 
until somebody confirms.

--
Arno


 
 Sources:
 rfc3986 http://tools.ietf.org/html/rfc3986
 http://en.wikipedia.org/wiki/Percent-encoding
 
 --
 Arno Garrels


Re: [twsocket] URL encoding

2008-09-27 Thread Arno Garrels
Fastream Technologies wrote:
 I think this bug could be a reason why our web server customers have
 been complaining about HTML file manipulation in non-ANSI...

Can you confirm that it's a bug?

--
Arno Garrels

 



Re: [twsocket] URL encoding

2008-09-27 Thread Fastream Technologies
I am not sure there is a test scenario with 100% coverage. Our server is
now free, so why don't you test it yourself?
(http://www.fastream.com/iqwebftpserver.php)

Regards,

Gorkem




-- 
Gorkem Ates
Fastream Technologies
Software IQ: Innovation & Quality
www.fastream.com | Email: [EMAIL PROTECTED] | Tel: +90-312-223-2830 |
MSN: [EMAIL PROTECTED]
Join IQWF Server Yahoo group at http://groups.yahoo.com/group/IQWFServer
Join IQ Reverse Proxy Yahoo group at
http://groups.yahoo.com/group/IQReverseProxy