Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-14 Thread Jason Dusek
2012/3/12 Jeremy Shaw jer...@n-heptane.com:
 On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek jason.du...@gmail.com wrote:
 Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

 Right. This describes how to convert an octet into a sequence of characters,
 since the only thing that can appear in a URI is sequences of characters.

 The syntax of URIs is a mechanism for describing data octets,
 not Unicode code points. It is at variance to describe URIs in
 terms of Unicode code points.


 Not sure what you mean by this. As the RFC says, a URI is defined entirely
 by the identity of the characters that are used. There is definitely no
 single, correct byte sequence for representing a URI. If I give you a
 sequence of bytes and tell you it is a URI, the only way to decode it is to
 first know what encoding the byte sequence represents.. ascii, utf-16, etc.
 Once you have decoded the byte sequence into a sequence of characters, only
 then can you parse the URI.

Mr. Shaw,

Thanks for taking the time to explain all this. It's really
helped me to understand a lot of parts of the URI spec a lot
better. I have deprecated my module in the latest release

  http://hackage.haskell.org/package/URLb-0.0.1

because a URL parser working on bytes instead of characters
stands out to me now as a confused idea.

--
Jason Dusek
pgp  ///  solidsnack  1FD4C6C1 FED18A2B

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-14 Thread Graham Klyne

Hi,

I only just noticed this discussion.  Essentially, I think you have arrived at 
the right conclusion regarding URIs.


For more background, the IRI document makes interesting reading in this context: 
http://tools.ietf.org/html/rfc3987, esp. sections 2 and 2.1.


The IRI is defined in terms of Unicode characters, which themselves may be 
described/referenced in terms of their code points, but the character encoding 
is not prescribed.


In practice, I think systems are increasingly using UTF-8 for transmitting IRIs 
and URIs, and using either UTF-8 or UTF-16 for internal storage.  There is still 
a legacy of ISO-8859-1 being defined as the default charset for HTML (cf. 
http://www.w3.org/International/O-HTTP-charset for further discussion).


#g
--



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-12 Thread Joey Hess
Jason Dusek wrote:
  :info System.Posix.Env.getEnvironment
 System.Posix.Env.getEnvironment :: IO [(String, String)]
 -- Defined in System.Posix.Env
 
 But there is no law that environment variables must be made of
 characters:

The recent ghc release provides
System.Posix.Env.ByteString.getEnvironment :: IO [(ByteString, ByteString)]
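A minimal sketch of what the byte-level interface buys you (requires the `unix` package, POSIX only; the variable name is just an example):

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified System.Posix.Env.ByteString as Env

main :: IO ()
main = do
  -- Set a value containing a raw 0xFF byte, which is not valid UTF-8.
  Env.setEnv (BC.pack "x") (BC.singleton '\xFF') True
  env <- Env.getEnvironment
  -- The 0xFF byte survives the round trip, where the String interface
  -- would have to guess at a character decoding.
  print (lookup (BC.pack "x") env)
```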

-- 
see shy jo


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-11 Thread Jason Dusek
2012/3/11 Jeremy Shaw jer...@n-heptane.com:
 Also, URIs are not defined in terms of octets.. but in terms
 of characters.  If you write a URI down on a piece of paper --
 what octets are you using?  None.. it's some scribbles on a
 paper. It is the characters that are important, not the bit
 representation.

Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

The syntax of URIs is a mechanism for describing data octets,
not Unicode code points. It is at variance to describe URIs in
terms of Unicode code points.

 If you render a URI in a utf-8 encoded document versus a
 utf-16 encoded document.. the octets will be different, but
 the meaning will be the same. Because it is the characters
 that are important. For a URI Text would be a more compact
 representation than String.. but ByteString is a bit dodgy
 since it is not well defined what those bytes represent.
 (though if you use a newtype wrapper around ByteString to
 declare that it is Ascii, then that would be fine).

This is all well and good for what a URI is parsed from
and what it is serialized to; but once parsed, the major
components of a URI are all octets, pure and simple. Like the
host part of the authority:

  host= IP-literal / IPv4address / reg-name
  ...
  reg-name= *( unreserved / pct-encoded / sub-delims )

The reg-name production is enough to show that, once the host
portion is parsed, it could contain any bytes whatever.
ByteString is the only correct representation for a parsed host
and userinfo, as well as a parsed path, query or fragment.
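That argument can be made concrete with a small sketch (hypothetical helper,
not part of any package): percent-decoding an already-parsed component yields
raw octets, including ones like 0xFF for which no character interpretation is
implied.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)
import qualified Data.ByteString.Char8 as BC

-- Decode %XX escapes in an already-parsed component to raw bytes.
pctDecode :: String -> BC.ByteString
pctDecode = BC.pack . go
  where
    go ('%':h:l:rest)
      | isHexDigit h && isHexDigit l =
          chr (16 * digitToInt h + digitToInt l) : go rest
    go (c:rest) = c : go rest
    go []       = []
```

For example, pctDecode "a%FFb" contains the byte 0xFF: a legal reg-name octet,
but not a decodable character without knowing an encoding.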

--
Jason Dusek
pgp  ///  solidsnack  1FD4C6C1 FED18A2B

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-11 Thread Brandon Allbery
On Sun, Mar 11, 2012 at 14:33, Jason Dusek jason.du...@gmail.com wrote:

 The syntax of URIs is a mechanism for describing data octets,
 not Unicode code points. It is at variance to describe URIs in
 terms of Unicode code points.


You might want to take a glance at RFC 3492, though.

-- 
brandon s allbery  allber...@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-11 Thread Jason Dusek
2012/3/11 Brandon Allbery allber...@gmail.com:
 On Sun, Mar 11, 2012 at 14:33, Jason Dusek jason.du...@gmail.com wrote:
  The syntax of URIs is a mechanism for describing data octets,
  not Unicode code points. It is at variance to describe URIs in
  terms of Unicode code points.

 You might want to take a glance at RFC 3492, though.

RFC 3492 covers Punycode, an approach to internationalized
domain names. The relationship of RFC 3986 to the restrictions
on the syntax of host names, as given by the DNS, is not simple.
On the one hand, we have:

   This specification does not mandate a particular registered
   name lookup technology and therefore does not restrict the
   syntax of reg-name beyond what is necessary for
   interoperability.

The production for reg-name is very liberal about allowable
octets:

  reg-name= *( unreserved / pct-encoded / sub-delims )

However, we also have:

  The reg-name syntax allows percent-encoded octets in order to
  represent non-ASCII registered names in a uniform way that is
  independent of the underlying name resolution technology.
  Non-ASCII characters must first be encoded according to
  UTF-8...

The argument for representing reg-names as Text is pretty strong
since the only representable data under these rules is, indeed,
Unicode code points.
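Under that reading, a percent-decoded reg-name can be handed straight to the
text package; a sketch (hypothetical helper name, assuming the
percent-decoding has already been done):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- A percent-decoded reg-name is mandated to be UTF-8, so decoding
-- to Text either succeeds or signals a spec violation.
regNameText :: B.ByteString -> Either String T.Text
regNameText = either (Left . show) Right . decodeUtf8'
```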

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-11 Thread Jason Dusek
2012/3/11 Thedward Blevins thedw...@barsoom.net:
 On Sun, Mar 11, 2012 at 13:33, Jason Dusek jason.du...@gmail.com wrote:
  The syntax of URIs is a mechanism for describing data octets,
  not Unicode code points. It is at variance to describe URIs in
  terms of Unicode code points.

 This claim is at odds with the RFC you quoted:

 2. Characters

 The URI syntax provides a method of encoding data, presumably for the sake
 of identifying a resource, as a sequence of characters. The URI characters
 are, in turn, frequently encoded as octets for transport or presentation.
 This specification does not mandate any particular character encoding for
 mapping between URI characters and the octets used to store or transmit
 those characters.

 (Emphasis is mine)

 The RFC is specifically agnostic about serialization. I generally agree that
 there are a lot of places where ByteString should be used, but I'm not
 convinced this is one of them.

Hi Thedward,

I am CC'ing the list since you raise a good point that, I think,
reflects on the discussion broadly. It is true that the intent of
the spec is to allow encoding of characters and not of bytes: I
misread its intent, attending only to the productions. But due
to the way URIs interact with character encoding, a general URI
parser is constrained to work with ByteStrings, just the same.

The RFC "...does not mandate any particular character encoding
for mapping between URI characters and the octets used to store
or transmit those characters..." and in Section 1.2.1 it is
allowed that the encoding may depend on the scheme:

   In local or regional contexts and with improving technology, users
   might benefit from being able to use a wider range of characters;
   such use is not defined by this specification.  Percent-encoded
   octets (Section 2.1) may be used within a URI to represent characters
   outside the range of the US-ASCII coded character set if this
   representation is allowed by the scheme or by the protocol element in
   which the URI is referenced.

It seems possible for any octet, 0x00..0xFF, to show up in a
URI, and it is only after parsing the scheme that we can say
whether the octet belongs there or not. Thus a general URI
parser can only go as far as splitting into components and
percent decoding before handing off to scheme specific
validation rules (but that's a big help already!). I've
implemented a parser under these principles that handles
specifically URLs:

  http://hackage.haskell.org/package/URLb
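The split-then-decode idea can be sketched as follows (hypothetical helper,
not the URLb API): peel off components as bytes first, and leave
interpretation to the scheme.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Split "scheme:rest" without interpreting the rest; a real parser
-- would go on to split authority, path, query and fragment the same way.
splitScheme :: BC.ByteString -> Maybe (BC.ByteString, BC.ByteString)
splitScheme bs =
  case BC.break (== ':') bs of
    (scheme, rest)
      | not (BC.null scheme), not (BC.null rest) ->
          Just (scheme, BC.drop 1 rest)
    _ -> Nothing
```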

Although the intent of the spec is to represent characters, I
contend it does not succeed in doing so. Is it wise to assume
more semantics than are actually there? The Internet and UNIX
are full of broken junk; but faithful representation would seem
to be better than idealization for those occasions where we must
deal with them. I'm not sure the assumption of textiness
really helps much in practice since the Universal Character Set
contains control codes and bidi characters -- data that isn't
really text.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-11 Thread Brandon Allbery
On Sun, Mar 11, 2012 at 23:05, Jason Dusek jason.du...@gmail.com wrote:

 Although the intent of the spec is to represent characters, I
 contend it does not succeed in doing so. Is it wise to assume
 more semantics than are actually there?


It is not; one of the reasons that many experts protested the acceptance of
this RFC is because of its incomplete specification (and as a result there
are a lot of implementations currently which *do* assume more semantics,
not always compatibly with each other).

Punycode is out there now, but it's a mess and a minefield.

-- 
brandon s allbery  allber...@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-11 Thread Jeremy Shaw
Argh. Email fail.

Hopefully this time I have managed to reply-all to the list *and* keep the
unicode properly intact.

Sorry about any duplicates you may have received.

On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek jason.du...@gmail.com wrote:

 2012/3/11 Jeremy Shaw jer...@n-heptane.com:
  Also, URIs are not defined in terms of octets.. but in terms
  of characters.  If you write a URI down on a piece of paper --
  what octets are you using?  None.. it's some scribbles on a
  paper. It is the characters that are important, not the bit
  representation.



To quote RFC1738:

   URLs are sequences of characters, i.e., letters, digits, and special
   characters. A URL may be represented in a variety of ways: e.g., ink
   on paper, or a sequence of octets in a coded character set. The
   interpretation of a URL depends only on the identity of the
   characters used.


Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.


Right. This describes how to convert an octet into a sequence of
characters, since the only thing that can appear in a URI is sequences of
characters.


 The syntax of URIs is a mechanism for describing data octets,
 not Unicode code points. It is at variance to describe URIs in
 terms of Unicode code points.


Not sure what you mean by this. As the RFC says, a URI is defined entirely
by the identity of the characters that are used. There is definitely no
single, correct byte sequence for representing a URI. If I give you a
sequence of bytes and tell you it is a URI, the only way to decode it is to
first know what encoding the byte sequence represents.. ascii, utf-16, etc.
Once you have decoded the byte sequence into a sequence of characters, only
then can you parse the URI.


  If you render a URI in a utf-8 encoded document versus a
 utf-16 encoded document.. the octets will be different, but

  the meaning will be the same. Because it is the characters
  that are important. For a URI Text would be a more compact
  representation than String.. but ByteString is a bit dodgy
  since it is not well defined what those bytes represent.
  (though if you use a newtype wrapper around ByteString to
  declare that it is Ascii, then that would be fine).

 This is all well and good for what a URI is parsed from
 and what it is serialized to; but once parsed, the major
 components of a URI are all octets, pure and simple.


Not quite. We can not, for example, change uriPath to be a ByteString and
decode any percent encoded characters for the user, because that would
change the meaning of the path and break applications.

For example, let's say we have the path segments ["foo", "bar/baz"] and we
wish to use them in the path info of a URI. Because / is a special
character it must be percent encoded as %2F. So, the path info for the url
would be:

 foo/bar%2Fbaz

If we had the path segments ["foo", "bar", "baz"], however, that would be
encoded as:

 foo/bar/baz

Now let's look at decoding the path. If we simply decode the percent
encoded characters and give the user a ByteString then both urls will
decode to:

 pack "foo/bar/baz"

Which is incorrect. ["foo", "bar/baz"] and ["foo", "bar", "baz"] represent
different paths. The percent encoding there is required to distinguish
between the two unique paths.
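This encoding rule can be sketched in a few lines (hypothetical names; a
real implementation would escape more characters than these two):

```haskell
import Data.List (intercalate)

-- Escape the characters that would be confused with structure,
-- then join segments with a literal '/'.
encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc '%' = "%25"
    enc '/' = "%2F"
    enc c   = [c]

pathInfo :: [String] -> String
pathInfo = intercalate "/" . map encodeSegment
```

Here pathInfo ["foo", "bar/baz"] gives "foo/bar%2Fbaz" while
pathInfo ["foo", "bar", "baz"] gives "foo/bar/baz", so the two paths stay
distinct.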

Let's look at another example. Let's say we want to encode the path
segments:

 ["I❤λ"]

How do we do that?

Well.. the RFCs do not mandate a specific way. While a URL is a sequence of
characters -- the set of allowed characters is pretty restricted. So, we must
use some application specific way to transform that string into something
that is allowed in a uri path. We could do it by converting all characters
to their unicode character numbers like:

 \u73\u2764\u03BB

Since the string now only contains acceptable characters, we can easily
convert it to a valid uri path. Later when someone requests that url, our
application can convert it back to a unicode character sequence.

Of course, no one actually uses that method. The commonly used (and I
believe, officially endorsed, but not required) method is a bit more
complicated.

 1. first we take the string "I❤λ" and utf-8 encode it to get an octet
sequence:

   49 e2 9d a4 ce bb

 2. next we percent encode the bytes to get *back* to a character sequence
(such as a String, Text, or Ascii)

 I%E2%9D%A4%CE%BB

So, that is character sequence that would appear in the URI. *But* we do
not yet have octets that we can transmit over the internet. We only have a
sequence of characters. We must now convert those characters into octets.
For example, let's say we put the url as an 'href' in an <a> tag in a web
page that is UTF-16 encoded.

 3. Now we must convert the character 
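The two numbered steps above can be sketched directly (hypothetical helper
name, using the text and bytestring packages):

```haskell
import Data.Char (chr, intToDigit, isAlphaNum, toUpper)
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- Step 1: UTF-8 encode the segment; step 2: percent-encode each
-- byte outside the unreserved set, yielding plain characters again.
pctEncodeUtf8 :: T.Text -> String
pctEncodeUtf8 = concatMap encByte . B.unpack . encodeUtf8
  where
    encByte w
      | unreserved c = [c]
      | otherwise    = '%' : hex w
      where c = chr (fromIntegral w)
    unreserved c = (isAlphaNum c && c < '\x80') || c `elem` "-._~"
    hex w = map (toUpper . intToDigit . fromIntegral)
                [w `div` 16, w `mod` 16]
```

Applied to "I❤λ" this yields the character sequence I%E2%9D%A4%CE%BB shown
above.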

Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-11 Thread Jason Dusek
2012/3/12 Jeremy Shaw jer...@n-heptane.com:
  The syntax of URIs is a mechanism for describing data octets,
  not Unicode code points. It is at variance to describe URIs in
  terms of Unicode code points.

 Not sure what you mean by this. As the RFC says, a URI is defined entirely
 by the identity of the characters that are used. There is definitely no
 single, correct byte sequence for representing a URI. If I give you a
 sequence of bytes and tell you it is a URI, the only way to decode it is to
 first know what encoding the byte sequence represents.. ascii, utf-16, etc.
 Once you have decoded the byte sequence into a sequence of characters, only
 then can you parse the URI.

Hmm. Well, I have been reading the spec the other way around:
first you parse the URI to get the bytes, then you use encoding
information to interpret the bytes. I think this curious passage
from Section 2.5 is interesting to consider here:

   For most systems, an unreserved character appearing within a URI
   component is interpreted as representing the data octet corresponding
   to that character's encoding in US-ASCII.  Consumers of URIs assume
   that the letter "X" corresponds to the octet 01011000, and even
   when that assumption is incorrect, there is no harm in making it.  A
   system that internally provides identifiers in the form of a
   different character encoding, such as EBCDIC, will generally perform
   character translation of textual identifiers to UTF-8 [STD63] (or
   some other superset of the US-ASCII character encoding) at an
   internal interface, thereby providing more meaningful identifiers
   than those resulting from simply percent-encoding the original
   octets.

I am really not sure how to interpret this. I have been reading
'%' in productions as '0b00100101' and I have written my parser
this way; but that is probably backwards thinking.

 ...let's say we have the path segments ["foo", "bar/baz"] and we wish to use
 them in the path info of a URI. Because / is a special character it must be
 percent encoded as %2F. So, the path info for the url would be:

  foo/bar%2Fbaz

 If we had the path segments ["foo", "bar", "baz"], however, that would be
 encoded as:

  foo/bar/baz

 Now let's look at decoding the path. If we simply decode the percent encoded
 characters and give the user a ByteString then both urls will decode to:

  pack "foo/bar/baz"

 Which is incorrect. ["foo", "bar/baz"] and ["foo", "bar", "baz"] represent
 different paths. The percent encoding there is required to distinguish
 between the two unique paths.

I read the section on paths differently: a path is a sequence of
bytes, wherein slash runs are not permitted, among other rules.
However, re-reading the section, a big todo is made about
hierarchical data and path normalization; it really seems your
interpretation is the correct one. I tried it out in cURL, for
example:

  http://www.ietf.org/rfc%2Frfc3986.txt # 404 Not Found
  http://www.ietf.org/rfc/rfc3986.txt   # 200 OK

My recently released URL parser/pretty-printer is
actually wrong in its handling of paths and, when corrected,
will only amount to a parser of URLs that are encoded in
US-ASCII and supersets thereof.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-10 Thread Jason Dusek
The content of URIs is defined in terms of octets in the RFC,
and all Posix interfaces are byte streams and C strings, not
character strings. Yet in Haskell, we find these objects exposed
with String interfaces:

 :info Network.URI.URI
data URI
  = URI {uriScheme :: String,
 uriAuthority :: Maybe URIAuth,
 uriPath :: String,
 uriQuery :: String,
 uriFragment :: String}
-- Defined in Network.URI

 :info System.Posix.Env.getEnvironment
System.Posix.Env.getEnvironment :: IO [(String, String)]
-- Defined in System.Posix.Env

But there is no law that environment variables must be made of
characters:

 :; export x=$'\xFF' ; echo -n $x | xxd -p
  ff
 :; locale
  LANG=en_US.UTF-8

That the relationship between bytes and characters can be
confusing, both in working with UNIX and in dealing with web
protocols, is undeniable -- but it seems unwise to limit the
options available to Haskell programmers in dealing with these
systems.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012-03-10 Thread Jeremy Shaw
It is mostly because those libraries are far older than Text and
ByteString, so String was the only choice at the time. Modernizing them is
good.. but would also break a lot of code. And in many core libraries, the
functions are required to have String types in order to be Haskell 98
compliant.

So, modernization is good. But also requires significant effort, and
someone willing to make that effort.

Also, URIs are not defined in terms of octets.. but in terms of characters.
If you write a URI down on a piece of paper -- what octets are you using?
None.. it's some scribbles on a paper. It is the characters that are
important, not the bit representation. If you render a URI in a utf-8
encoded document versus a utf-16 encoded document.. the octets will be
different, but the meaning will be the same. Because it is the characters
that are important. For a URI Text would be a more compact representation
than String.. but ByteString is a bit dodgy since it is not well defined
what those bytes represent. (though if you use a newtype wrapper around
ByteString to declare that it is Ascii, then that would be fine).
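The newtype idea at the end can be sketched like this (hypothetical names,
with a smart constructor enforcing the ASCII invariant):

```haskell
import qualified Data.ByteString as B

-- A ByteString that is known to contain only ASCII octets.
newtype Ascii = Ascii B.ByteString
  deriving (Eq, Show)

-- Smart constructor: accept only bytes below 0x80.
ascii :: B.ByteString -> Maybe Ascii
ascii bs
  | B.all (< 0x80) bs = Just (Ascii bs)
  | otherwise         = Nothing
```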

- jeremy

On Sat, Mar 10, 2012 at 9:24 PM, Jason Dusek jason.du...@gmail.com wrote:

 The content of URIs is defined in terms of octets in the RFC,
 and all Posix interfaces are byte streams and C strings, not
 character strings. Yet in Haskell, we find these objects exposed
 with String interfaces:

  :info Network.URI.URI
 data URI
  = URI {uriScheme :: String,
 uriAuthority :: Maybe URIAuth,
 uriPath :: String,
 uriQuery :: String,
 uriFragment :: String}
-- Defined in Network.URI

  :info System.Posix.Env.getEnvironment
 System.Posix.Env.getEnvironment :: IO [(String, String)]
-- Defined in System.Posix.Env

 But there is no law that environment variables must be made of
 characters:

  :; export x=$'\xFF' ; echo -n $x | xxd -p
  ff
  :; locale
  LANG=en_US.UTF-8

 That the relationship between bytes and characters can be
 confusing, both in working with UNIX and in dealing with web
 protocols, is undeniable -- but it seems unwise to limit the
 options available to Haskell programmers in dealing with these
 systems.

 --
 Jason Dusek
 pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe