[Catalyst] Re: decoding in core

2009-02-23 Thread Aristotle Pagaltzis
* Neo [GC]  [2009-02-23 16:45]:
> Does anyone know a _safe_ method to convert _any_ string-scalar
> to utf8?

There isn’t. Strings in Perl are untyped. They are simply
sequences of arbitrarily large integers.

If a string only contains values between 0 and 255, then it can
be stored in an optimised form that uses exactly one byte per
integer and the UTF8 flag is off. Otherwise, it is stored in a
variable-width format that is identical to UTF-8 encoding, but
is not actually UTF-8. (There is no particular meaning implied
for these integers, and Perl strings can store integer values
that are undefined in Unicode.) The UTF8 flag simply means “this
is an unoptimised string”. It will sometimes be enabled on octet
strings (even though no integer value in the string is > 255) and
it will frequently be disabled on character strings. It tells you
nothing useful *at all* about the content of the string and you
should just forget that it exists. [^1]

If you have a string that corresponds to a sequence of octets
which store the encoded form of a string according to some
encoding, you have to manually keep track of this encoding,
because there is nothing about the string that tells you this.

The best approach is to simply decode strings as soon after input
as possible and encode them as late before output as possible. In
the middle of your code, then, you only have strings containing
Unicode codepoints.
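For illustration, the decode-early/encode-late pattern with the core
Encode module looks roughly like this (the file names and the choice of
UTF-8 are just placeholders):

    use Encode qw(decode encode);

    # Input boundary: octets come in, characters from here on.
    open my $in, '<', 'input.txt' or die $!;
    my $octets = do { local $/; <$in> };
    my $text   = decode( 'UTF-8', $octets, Encode::FB_CROAK );

    # ... work with $text as a sequence of codepoints ...

    # Output boundary: characters go back out as octets.
    open my $out, '>', 'output.txt' or die $!;
    print {$out} encode( 'UTF-8', $text );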


[^1]: Almost. Unfortunately, there is quite a bit of broken XS
  code in modules out there which means you will have to
  `utf8::downgrade` strings to make sure they are stored in
  byte-wise optimised format before passing them in to such
  modules.
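  For the record, that workaround is a one-liner (the XS call here is a
  made-up placeholder):

      utf8::downgrade($octets);               # dies if any value is > 255
      Some::Broken::XS::Module::munge($octets);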

Regards,
-- 
Aristotle Pagaltzis // 



Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Octavian Râşniţă

From: "Bill Moseley" 
n Mon, Feb 23, 2009 at 06:45:40PM +0200, Octavian Râşniţă wrote:

I understand that there are reasons for not transforming all the
encodings to UTF-8 in core, even though it seems to be not very
complicated, because maybe there are some tables that contain ISO-8859-2
chars and other tables that contain ISO-8859-1 chars, and when the data
need to be saved, it should keep its original encoding.


Don't think about transforming encodings to UTF-8.

In the vast majority of cases people expect to work with characters,
and that's what Perl works with internally.  UTF-8 is an encoding, not
characters.

The HTTP request is octets.  The HTTP request specifies what encoding
those octets represent and it's that encoding that is used to decode
the octets into characters.  The fact that Perl uses UTF-8 internally
is best ignored -- it's just characters inside Perl once decoded.

Conceptually it's not that much different than a request with
"Content-Encoding: gzip" -- before using the request body parameters
the gzipped octets must obviously be decoded.  Likewise, the body must
be url-decoded into separate parameters.  And again, the resulting
octets must be decoded into characters if the parameters are to be
used as character.  That last step has often been ignored.

Then when sending a response of (abstract) characters that are inside
Perl they must first be encoded into octets.

Those things should be handled at the edge of the application, and
that would be in the Engine (or the code the Engine uses).

Yes, the same thing has to happen with templates, the database, and
all external data sources.  Those are separate issues.  HTTP provides
a standard way to determine how to encode and decode.

OK, but wouldn't it be possible to specify this encoding only once, in a
single place?
Or, better said: if the app uses the C::P::Unicode module, it could assume
by default that the templates, controllers and other parts of the app use
UTF-8, and use a different encoding for one or some of them only if that
encoding is specified explicitly.
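Purely as a sketch of the idea, with a made-up key -- nothing like this
exists in Catalyst today:

    # myapp.conf -- hypothetical single knob
    encoding UTF-8

    # or, equivalently, in lib/MyApp.pm:
    __PACKAGE__->config( encoding => 'UTF-8' );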


Octavian






Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Bill Moseley
On Mon, Feb 23, 2009 at 06:45:40PM +0200, Octavian Râşniţă wrote:
> I understand that there are reasons for not transforming all the 
> encodings to UTF-8 in core, even though it seems to be not very 
> complicated, because maybe there are some tables that contain ISO-8859-2 
> chars and other tables that contain ISO-8859-1 chars, and when the data 
> need to be saved, it should keep its original encoding.

Don't think about transforming encodings to UTF-8.

In the vast majority of cases people expect to work with characters,
and that's what Perl works with internally.  UTF-8 is an encoding, not
characters.

The HTTP request is octets.  The HTTP request specifies what encoding
those octets represent and it's that encoding that is used to decode
the octets into characters.  The fact that Perl uses UTF-8 internally
is best ignored -- it's just characters inside Perl once decoded.
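For illustration, a rough sketch of that decode step using the core
Encode module -- the charset parsing is simplified and $octets stands in
for one raw parameter value:

    use Encode qw(decode);

    # e.g. "application/x-www-form-urlencoded; charset=ISO-8859-2"
    my $ct = $c->req->headers->header('Content-Type') || '';
    my ($charset) = $ct =~ /charset=["']?([\w.-]+)/i;
    $charset ||= 'UTF-8';    # assumed fallback when the request is silent

    my $chars = decode( $charset, $octets, Encode::FB_CROAK );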

Conceptually it's not that much different than a request with
"Content-Encoding: gzip" -- before using the request body parameters
the gzipped octets must obviously be decoded.  Likewise, the body must
be url-decoded into separate parameters.  And again, the resulting
octets must be decoded into characters if the parameters are to be
used as characters.  That last step has often been ignored.

Then when sending a response of (abstract) characters that are inside
Perl they must first be encoded into octets.
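And the mirror image on the way out, again only a sketch:

    use Encode qw(encode);

    $c->res->content_type('text/html; charset=UTF-8');
    $c->res->body( encode( 'UTF-8', $characters ) );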

Those things should be handled at the edge of the application, and
that would be in the Engine (or the code the Engine uses).

Yes, the same thing has to happen with templates, the database, and
all external data sources.  Those are separate issues.  HTTP provides
a standard way to determine how to encode and decode.


-- 
Bill Moseley
mose...@hank.org
Sent from my iMutt




Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Octavian Râşniţă

From: "Peter Karman" 
Neo [GC] wrote on 02/23/2009 09:41 AM:



Does anyone know a _safe_ method to convert _any_ string-scalar to utf8?
Something like
anything_to_utf8($s)
, regardless if $s contains ascii, latin1, utf8, tasty hodgepodge or hot
fn0rd, utf8-flag is set or not and is neither affected by full moon nor
my horrorscope, _without_ doing double-encoding (there MUST be some way
to determine if it already is utf8... my silly java editor can do it and
perl makes difficult things at least possible).


I would greatly appreciate this philosophers stone and will send my hero
a bottle of finest bavarian (munich!) beer called Edelstoff ("precious
stuff" - tasty).



Search::Tools::UTF8::to_utf8() comes close. It won't handle mixed
encoding in a single string (which would be garbage anyway) but it does
try to prevent double-encoding and uses the Encode goodness under the 
hood.


--
Peter Karman  .  pe...@peknet.com  .  http://peknet.com/


I understand that there are reasons for not transforming all the encodings 
to UTF-8 in core, even though it seems to be not very complicated, because 
maybe there are some tables that contain ISO-8859-2 chars and other tables 
that contain ISO-8859-1 chars, and when the data need to be saved, it should 
keep its original encoding.


But if somebody wants to create a new Catalyst app, with a new database, new
templates, controllers, etc., I think it could be very helpful if the
programmer only needed to specify once that he wants to use UTF-8
everywhere - in the database, in the templates, in the configuration files
of HTML::FormFu, in the controllers - rather than in several places in the
configuration file, or by specifying UTF8Columns in DBIC classes...

It could be a kind of default.
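For context, a rough sketch of the per-component settings that currently
have to be repeated -- the column names are placeholders and the exact
options vary by module and version:

    # In the TT view class:
    __PACKAGE__->config( ENCODING => 'UTF-8' );

    # In each DBIC result class that stores text:
    __PACKAGE__->load_components(qw( UTF8Columns Core ));
    __PACKAGE__->utf8_columns(qw( title body ));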

Octavian










Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Peter Karman
Neo [GC] wrote on 02/23/2009 09:41 AM:

> Does anyone know a _safe_ method to convert _any_ string-scalar to utf8?
> Something like
> anything_to_utf8($s)
> , regardless if $s contains ascii, latin1, utf8, tasty hodgepodge or hot
> fn0rd, utf8-flag is set or not and is neither affected by full moon nor
> my horrorscope, _without_ doing double-encoding (there MUST be some way
> to determine if it already is utf8... my silly java editor can do it and
> perl makes difficult things at least possible).
> 
> 
> I would greatly appreciate this philosopher's stone and will send my hero
> a bottle of finest Bavarian (Munich!) beer called Edelstoff ("precious
> stuff" - tasty).
> 

Search::Tools::UTF8::to_utf8() comes close. It won't handle mixed
encoding in a single string (which would be garbage anyway) but it does
try to prevent double-encoding and uses the Encode goodness under the hood.
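A usage sketch from memory -- check the Search::Tools::UTF8 docs for the
exact signature; the fallback-charset argument is an assumption:

    use Search::Tools::UTF8 qw( to_utf8 is_sane_utf8 );

    my $utf8 = to_utf8( $string, 'ISO-8859-1' );   # assumed fallback charset
    warn "still looks suspicious" unless is_sane_utf8($utf8);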

-- 
Peter Karman  .  pe...@peknet.com  .  http://peknet.com/




Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Neo [GC]
Oh, I forgot something... or more precisely, my boss named it while
having a smoke. Maybe somewhat OT, but definitely interesting (maybe it
could be used to simplify the problem of double-encoding):


Does anyone know a _safe_ method to convert _any_ string-scalar to utf8?
Something like
anything_to_utf8($s)
, regardless if $s contains ascii, latin1, utf8, tasty hodgepodge or hot 
fn0rd, utf8-flag is set or not and is neither affected by full moon nor 
my horrorscope, _without_ doing double-encoding (there MUST be some way 
to determine if it already is utf8... my silly java editor can do it and 
perl makes difficult things at least possible).



I would greatly appreciate this philosopher's stone and will send my hero
a bottle of finest Bavarian (Munich!) beer called Edelstoff ("precious
stuff" - tasty).



Greets and thanks!
Tom Weber



Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Neo [GC]



Zbigniew Lukasiak schrieb:
> Hmm - in my understanding it only changes literals in the code ( $var
> = 'ą' ).  So I looked into the pod and it says:
>
>     Bytes in the source text that have their high-bit set will be
>     treated as being part of a literal UTF-8 character.  This includes
>     most literals such as identifier names, string constants, and
>     constant regular expression patterns.
  

Ah, SORRY! In my confusion I've confused it again...
So if I get it right, "use utf8" means you can do stuff like $s =~
s/a/ä/; (as the plain ä in the source will be treated as one character
and not two octets), while the magical utf8-flag for $s tells perl that
the ä in the scalar really is an ä and not two strange octets.

Am I right or am I completely lost again?
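A small sketch of the distinction, assuming the source file itself is
saved as UTF-8:

    # Without "use utf8" the two octets of the literal "ä" are treated
    # as two separate characters:
    my $bytes = "ä";
    print length($bytes), "\n";   # 2

    use utf8;                     # from here on, literals are decoded
    my $chars = "ä";
    print length($chars), "\n";   # 1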

> Hmm - maybe I'll add UTF-8 handling in InstantCRUD.  I am waiting for
> good sentences showing off the national characters.

Does it have to be a complete sentence? My favourite test-string is
something like

äöüÄÖÜß"'+ (UTF-8)
C3 A4 C3 B6 C3 BC C3 84 C3 96 C3 9C C3 9F 22 27 2B (Hex)
If I can put this string into some html-form, post/get it, process it, 
save to and read from db, output it to browser _and_ still have exactly 
10 characters, the application _might_ work as it should.
The Umlauts and the Eszett are a pain of unicode, the " and ' are 
fun-with-html and escaping and the + ... well, URI-encoding, you know...
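A minimal check along those lines, as a sketch -- $roundtripped is
assumed to hold whatever came back out of the form/db/browser cycle:

    use Test::More tests => 2;
    use Encode qw(decode);

    my $expected = decode( 'UTF-8',
        "\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x84\xC3\x96\xC3\x9C\xC3\x9F\x22\x27\x2B" );

    is( length($expected), 10, 'test string has exactly 10 characters' );
    is( $roundtripped, $expected, 'value survived the round trip' );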


For even more fun, one should do a regex in the application using utf8 
(give me all those äÄs) and select it from the DB, first with "blahfield 
LIKE 'ä'", maybe "upper(blahfield) LIKE upper('ä')" and finally an 
"ORDER BY blahfield", where blahfield should contain one row starting 
with "a", one with "ä" and one with "b" and the output should have 
exactly this order and _not_ "a,b,ä" (hint hint: utf-8 treated as ascii
or latin1).



Greets and regards,
Tom Weber



Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Zbigniew Lukasiak
On Mon, Feb 23, 2009 at 2:58 PM, Neo [GC]  wrote:
> Zbigniew Lukasiak schrieb:
>>
>> Some more things to consider.
>>
>> - 'use utf8' in the code generated by the helpers?
>>
>
> Reasonable, but only if documented. It took weeks for us until we learned
> that this changes _nothing_ but the behaviour of several perl functions like
> regexp, sort and so on.

Hmm - in my understanding it only changes literals in the code ( $var
= 'ą' ).  So I looked into the pod and it says:

    Bytes in the source text that have their high-bit set will be
    treated as being part of a literal UTF-8 character.  This includes
    most literals such as identifier names, string constants, and
    constant regular expression patterns.

>>
>> - ENCODING: UTF-8 for the TT view helper?
>>
>> Maybe a global config option to choose the byte or character semantics?
>>
>> But with the DB it becomes a bit more complex - because BLOB columns
>> probably need to use byte semantics.
>>
>
> Uhm, of course, as BLOB is Binary and CLOB is Character. ;) This is even
> more complex, as the databases treat these datatypes differently and
> some of Perl's DBI drivers are somewhat broken when it comes to unicode
> (according to our perl-saves-our-souls guru).
> UTF-8 is ok in Perl itself (not easy, not coherent, but ok); but in
> combination of many modules (and as far as I learned, Perl is all about
> reusing modules) it is _hell_. Try to read UTF-8 from an HTTP request, store it
> in a database, select it with correct order, write it to XLS, convert it to CSV,
> reimport it into the DB and output it to the browser, all with different subs in
> the same controller... and you know what I mean.
> Even our most euphoric Perl gurus don't have any clue how to handle UTF-8
> from the beginning to the end without hours of trial & error in their
> programs (and remember - we Germans only have those bloody Umlauts - try
> to imagine this in China >_<).
>
> Maybe the best thing for all average-and-below users would be a _really_
> good tutorial about Catalyst+UTF-8. What to do, what not to do. How to read
> UTF-8 from HTTP-request / uploaded file / local file / database, how to
> write it to client / downloadable file / local file / database. What
> catalystish variable is UTF-8-encoded when and why. How to determine what
> encoding a given scalar has and how to encode/decode/whatevercode it to a
> bloody nice scalar with shiny UTF-8 chars in it.
> Short: -- Umlauts with Catalyst for dummies --
>

Hmm - maybe I'll add UTF-8 handling in InstantCRUD.  I am waiting for
good sentences showing off the national characters.


-- 
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/



Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Neo [GC]

Zbigniew Lukasiak schrieb:
> Some more things to consider.
>
> - 'use utf8' in the code generated by the helpers?

Reasonable, but only if documented. It took weeks for us until we
learned that this changes _nothing_ but the behaviour of several
perl functions like regexp, sort and so on.

> - ENCODING: UTF-8 for the TT view helper?
>
> Maybe a global config option to choose the byte or character semantics?
>
> But with the DB it becomes a bit more complex - because BLOB columns
> probably need to use byte semantics.
  
Uhm, of course, as BLOB is Binary and CLOB is Character. ;) This is even
more complex, as the databases treat these datatypes differently and
some of Perl's DBI drivers are somewhat broken when it comes to unicode
(according to our perl-saves-our-souls guru).
UTF-8 is ok in Perl itself (not easy, not coherent, but ok); but in
combination of many modules (and as far as I learned, Perl is all about
reusing modules) it is _hell_. Try to read UTF-8 from an HTTP request,
store it in a database, select it with correct order, write it to XLS,
convert it to CSV, reimport it into the DB and output it to the browser,
all with different subs in the same controller... and you know what I mean.
Even our most euphoric Perl gurus don't have any clue how to handle
UTF-8 from the beginning to the end without hours of trial & error in
their programs (and remember - we Germans only have those bloody
Umlauts - try to imagine this in China >_<).


Maybe the best thing for all average-and-below users would be a _really_ 
good tutorial about Catalyst+UTF-8. What to do, what not to do. How to 
read UTF-8 from HTTP-request / uploaded file / local file / database, 
how to write it to client / downloadable file / local file / database. 
What catalystish variable is UTF-8-encoded when and why. How to 
determine what encoding a given scalar has and how to 
encode/decode/whatevercode it to a bloody nice scalar with shiny UTF-8 
chars in it.

Short: -- Umlauts with Catalyst for dummies --
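As a down payment on such a tutorial, a hedged sketch of two of those
items -- a local file read and written as UTF-8, and a request parameter
decoded by hand (the parameter name is made up, and this assumes no
plugin has decoded it already):

    use Encode qw(decode);

    # Local files: let the I/O layer do the decoding/encoding.
    open my $in,  '<:encoding(UTF-8)', 'in.txt'  or die $!;
    open my $out, '>:encoding(UTF-8)', 'out.txt' or die $!;
    print {$out} $_ while <$in>;

    # HTTP request parameter, inside a controller action:
    my $name = decode( 'UTF-8', $c->req->param('name') );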



(sorry for sounding so emotional, but afaik our company burned man-weeks
on solving minor encoding bugs :-/ every tutorial we found was like "you
can do it so or so or another way 'round the house, so it's perfect and
if you don't understand it, you're a retard and should use 7bit-ASCII"...
while lately even a colleague sounds like this - as he is enlightened
by CPAN literature like "UTF-8 vs. utf8 vs. UTF8" ;)).



Greets and regards,
Tom Weber



Re: [Catalyst] Re: decoding in core

2009-02-23 Thread Zbigniew Lukasiak
On Fri, Feb 20, 2009 at 6:57 PM, Jonathan Rockway  wrote:
>
> Braindump follows.

snip
snip

>
> One last thing, if this becomes core, it will definitely break people's
> apps.  Many, many apps are blissfully unaware of characters and treat
> text as binary... and their apps kind-of appear to work.  As soon as
> they get some real characters in their app, though, they will have
> double-encoded nonsense all over the place, and will blame you for this.
> ("I loaded Catalyst::Plugin::Unicode, and my app broke!  It's all your
> fault."  Yup, people mail that to me privately all the time.  For some
> reason, they think I am going to personally fix their app, despite
> having written volumes of documentation about this.  Wrong.)
>

Some more things to consider.

- 'use utf8' in the code generated by the helpers?

- ENCODING: UTF-8 for the TT view helper?

Maybe a global config option to choose the byte or character semantics?

But with the DB it becomes a bit more complex - because BLOB columns
probably need to use byte semantics.
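For the TT point, a sketch of what a generated view might set --
ENCODING is a Template Toolkit option for decoding template files;
whether a given Catalyst::View::TT version passes it through unchanged
should be double-checked:

    package MyApp::View::TT;
    use strict;
    use base 'Catalyst::View::TT';

    __PACKAGE__->config(
        ENCODING => 'UTF-8',    # decode .tt files as UTF-8
    );

    1;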

-- 
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/



Re: [Catalyst] Re: decoding in core

2009-02-22 Thread Bill Moseley
On Fri, Feb 20, 2009 at 11:57:29AM -0600, Jonathan Rockway wrote:
> 
> The problem with writing a plugin or making this core is that people
> really really want to misuse Unicode, and will whine when you try to
> force correctness upon them.

I'm not sure what you mean by wanting to misuse Unicode.  You mean
like decode using a different encoding than what the charset is in the
HTTP headers?

> The only places where you are really allowed to use non-ASCII characters
> are in the request and response.  (HTTP has a way of representing the
> character encoding of its payload -- URLs and Cookies don't.)
> 
> C::P::Unicode handles this correct usage correctly.

I disagree there.  First, it assumes utf8 instead of what the
request states as the encoding.  That is generally okay (where you set
accept-charset in your forms), but why not decode as the request
states?

Second, it only decodes the request parameters.  The body_parameters
and query_parameters are left undecoded.

Is that by design?  That is, is it expected that in a POST
$c->req->parameters->{foo} would be characters where
$c->req->body_parameters->{foo} is undecoded octets?  I would not want
or expect that.


> The problem is that
> people want Unicode to magically work where it's not allowed.  This
> includes HTTP headers (WTF!?), and URLs.  (BTW, when I say Unicode, I
> don't necessarily mean Unicode... I mean non-ASCII characters.  The
> Japanese character sets contain non-Unicode characters, and some people
> want to put these characters in their URLs or HTTP headers.  I wish I
> was making this up, but I am not.  The Unicode process really fucked over
> the Asian languages.)

I'm not sure we want to go down that path.  Maybe a plugin for doing
crazy stuff with HTTP header encoding, but my initial email was really
just about moving decoding of the body (when we have a charset in the
request) and encoding on sending (again if there's a charset in the
response headers) into core.

Trying to do more than that is probably asking for headaches (and
whining).


I think there's reasonable debate at what point in the request
decoding should happen, though.  Frankly, I'm not sure Catalyst should
decode, rather HTTP::Body should.  HTTP::Body looks at the content
type header and if it's application/x-www-form-urlencoded it will
decode the body into separate parameters.  But, why should it ignore
decoding the charset also specified?
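To make that concrete, a rough sketch at the HTTP::Body level -- the
charset parsing is simplified, multi-valued parameters are ignored, and
$raw_content stands in for the octets read from the request:

    use Encode qw(decode);
    use HTTP::Body;

    my $ct   = 'application/x-www-form-urlencoded; charset=ISO-8859-1';
    my $body = HTTP::Body->new( $ct, length $raw_content );
    $body->add($raw_content);

    # Post-processing that HTTP::Body does not do itself today:
    my ($charset) = $ct =~ /charset=["']?([\w.-]+)/i;
    my $params = $body->param;                 # assumed: hashref when no args
    $params->{$_} = decode( $charset || 'UTF-8', $params->{$_} )
        for keys %$params;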



The query parameters are more troublesome, of course.  Seems the
common case is to use utf8 in URLs as the encoding, and in the end the
encoding just has to be assumed (or specified as a separate
parameter).  uri_for()'s current behavior of encoding to utf8 is
probably a good way to go and to just always decode the query
parameters as utf8 in Catalyst.  I suppose uri_for() could add an
additional "_enc=utf8" parameter to allow for different encodings, but
I can't imagine where just assuming utf8 would not be fine.

Of course, someone will want to mix encodings in different query
parameters.
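The always-UTF-8 treatment of query strings would look roughly like this
sketch, with $raw_value standing in for one undecoded query value:

    use Encode qw(decode);
    use URI::Escape qw(uri_unescape);

    # "f%C3%B6o" -> "föo": unescape to octets, then decode as UTF-8
    my $value = decode( 'UTF-8', uri_unescape($raw_value) );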


> There are subtle issues, like knowing not to touch XML (it's binary),
> dealing with $c->res->body(  ), and so on.

The layer can be set on the file handle.  XML will be decoded as
application/octet-stream by HTTP::Body, so that should be ok.
Although, if there's a charset in the request I would still
probably argue that decoding would be the correct thing to do.

For custom processing I currently extend HTTP::Body.  For example:

$HTTP::Body::TYPES->{'text/xml'} = 'My::XML::Parser';

which does stream parsing of the XML and thus handles the XML
charset decoding.

> One last thing, if this becomes core, it will definitely break people's
> apps.  Many, many apps are blissfully unaware of characters and treat
> text as binary... and their apps kind-of appear to work.  As soon as
> they get some real characters in their app, though, they will have
> double-encoded nonsense all over the place, and will blame you for this.

That may be true for some.  For most they probably have simply ignored
encoding and don't realize they are working with octets instead of
characters, and thanks to Perl it just all works.  So working with
real characters instead will likely be transparent for them.

Catalyst::Plugin::Unicode blindly decodes using utf8::decode() and I
think that's a no-op if the content has already been decoded (utf8
flag is already set).  Likewise, it only encodes if the utf8 flag is
set.  So, users of that plugin should be ok if character encoding
was handled in core and they don't remove the plugin.
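From memory, the plugin's behaviour amounts to roughly the sketch below
-- check the actual Catalyst::Plugin::Unicode source before relying on it:

    # Request side: flip parameter octets to characters in place.
    for my $value ( values %{ $c->request->params } ) {
        next if ref $value;          # skip uploads / multi-valued params
        utf8::decode($value);
    }

    # Response side, at finalize time:
    utf8::encode( $c->response->{body} )
        if utf8::is_utf8( $c->response->{body} );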

-- 
Bill Moseley
mose...@hank.org
Sent from my iMutt




Re: [Catalyst] Re: decoding in core

2009-02-20 Thread Jonathan Rockway
* On Fri, Feb 20 2009, Jonathan Rockway wrote:
> Braindump follows.

Oh yeah, one other thing.  IDNs will need to be decoded/encoded,
probably.  ($c->req->host should contain perl characters, but links
should probably be punycoded.  Fun!)
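For instance, a sketch using Net::IDN::Encode -- assuming that module
would be an acceptable dependency:

    use utf8;
    use Net::IDN::Encode qw(domain_to_ascii domain_to_unicode);

    my $ascii   = domain_to_ascii('bücher.example');           # "xn--bcher-kva.example"
    my $unicode = domain_to_unicode('xn--bcher-kva.example');  # "bücher.example"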

--
print just => another => perl => hacker => if $,=$"



Re: [Catalyst] Re: decoding in core

2009-02-20 Thread Jonathan Rockway

Braindump follows.

* On Fri, Feb 20 2009, Tomas Doran wrote:
> On 6 Feb 2009, at 17:36, Bill Moseley wrote:
>>
>> Sure.  IIRC, I think there's already been some patches and code posted
>> so maybe I can dig that up again off the archives.
>
> Please do.
>
>> But, sounds like
>> it's not that important of an issue.
>
> The fact that nobody is working on it currently is not an indication
> that it isn't an important problem to try to solve.

I meant to write a plugin to do this a long time ago, but I guess I
never cared enough.

The problem with writing a plugin or making this core is that people
really really want to misuse Unicode, and will whine when you try to
force correctness upon them.

The only places where you are really allowed to use non-ASCII characters
are in the request and response.  (HTTP has a way of representing the
character encoding of its payload -- URLs and Cookies don't.)

C::P::Unicode handles this correct usage correctly.  The problem is that
people want Unicode to magically work where it's not allowed.  This
includes HTTP headers (WTF!?), and URLs.  (BTW, when I say Unicode, I
don't necessarily mean Unicode... I mean non-ASCII characters.  The
Japanese character sets contain non-Unicode characters, and some people
want to put these characters in their URLs or HTTP headers.  I wish I
was making this up, but I am not.  The Unicode process really fucked over
the Asian languages.)

So anyway, the plugin basically needs to have the following config
options, so users can specify what they want.  Inside Catalyst, only
Perl characters should be allowed, unless you mark the string as binary
(there is a CPAN module that does this, Something::BLOB).

  * Input HTTP header encoding (ASCII default)
(this is for data in $c->req->headers, cookies, etc.)
(perhaps cookies should be separately configured)

  * Input URI encoding (probably UTF-8 default)
(the dispatcher will dispatch on the decoded characters)
(source code encoding is handled by Perl, hopefully)

  * Input request body encoding (read HTTP headers and decide)

  * Output HTTP headers encoding (maybe die if this happens, because
it's totally illegal to have non-ascii in the headers)

  * Output URI encoding ($c->uri_for and friends will use this to
translate the names of actions that are named with wide characters)

  * Output response body encoding (this needs to update the HTTP
headers, namely the charset= part of Content-type)

I think that is everything.
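As a strawman, such a config could look something like the sketch below
-- every key name here is made up for illustration:

    __PACKAGE__->config(
        encoding => {
            request_headers  => 'ASCII',         # $c->req->headers, cookies
            request_uri      => 'UTF-8',         # dispatch on decoded characters
            request_body     => 'from-headers',  # read charset from the request
            response_headers => 'ASCII',         # die on anything non-ASCII
            response_uri     => 'UTF-8',         # uri_for and friends
            response_body    => 'UTF-8',         # also sets charset= in Content-Type
        },
    );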

There are subtle issues, like knowing not to touch XML (it's binary),
dealing with $c->res->body(  ), and so on.

One last thing, if this becomes core, it will definitely break people's
apps.  Many, many apps are blissfully unaware of characters and treat
text as binary... and their apps kind-of appear to work.  As soon as
they get some real characters in their app, though, they will have
double-encoded nonsense all over the place, and will blame you for this.
("I loaded Catalyst::Plugin::Unicode, and my app broke!  It's all your
fault."  Yup, people mail that to me privately all the time.  For some
reason, they think I am going to personally fix their app, despite
having written volumes of documentation about this.  Wrong.)

Anyway, I just wanted to get this out of my head and onto paper, for
someone else to look at and perhaps implement. :)

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"



Re: [Catalyst] Re: decoding in core (Was: [Announce] Catalyst-Runtime-5.8000_05)

2009-02-20 Thread Tomas Doran


On 6 Feb 2009, at 17:36, Bill Moseley wrote:
> Sure.  IIRC, I think there's already been some patches and code posted
> so maybe I can dig that up again off the archives.

Please do.

> But, sounds like
> it's not that important of an issue.

The fact that nobody is working on it currently is not an indication
that it isn't an important problem to try to solve.


Cheers
t0m




Re: [Catalyst] Re: decoding in core (Was: [Announce] Catalyst-Runtime-5.8000_05)

2009-02-06 Thread Bill Moseley
On Fri, Feb 06, 2009 at 03:16:14PM +, Tomas Doran wrote:
>
> On 6 Feb 2009, at 14:46, Bill Moseley wrote:
>> Nobody responded to the main point of this email -- if Catalyst should
>> handle encoding in core instead of with a plugin.  Nobody has an
>> opinion about that?  Or was it just ignored -- which is often how
>> people handle character encoding in applications. ;)
>
> Does it make a difference if its in core or in a plugin?
>
> In your original email you said that the existing plugins don't do it  
> right.. Which is quite possibly fair criticism, however I don't see how 
> moving the functionality into core would help the code be more correct.. 
> Saying 'Plugin X is broken', 'Let's move Plugin X into core' doesn't sound 
> very convincing from where I'm sat. :_)

Two different issues, although I would assume if you moved it into
core there would be more careful consideration and discussion about
how to do it.  Which is why I posted -- for a discussion.

The question is should encoding be a core function.  A plugin works,
but not everyone uses it.  My argument for doing it in core is that
inside Perl is character data so therefore it must be decoded at
some point, and it's Catalyst (and the engines) that load the
parameters.  And if it's character data on the inside it has to be
encoded when writing.

> Code speaks louder than words, so if you'd like to provide some failing 
> tests for what you think encoding _should_ be doing, that'd probably be a 
> better basis for further discussion.

Sure.  IIRC, I think there's already been some patches and code posted
so maybe I can dig that up again off the archives.  But, sounds like
it's not that important of an issue.




-- 
Bill Moseley
mose...@hank.org
Sent from my iMutt




Re: [Catalyst] Re: decoding in core (Was: [Announce] Catalyst-Runtime-5.8000_05)

2009-02-06 Thread Tomas Doran


On 6 Feb 2009, at 14:46, Bill Moseley wrote:
> Nobody responded to the main point of this email -- if Catalyst should
> handle encoding in core instead of with a plugin.  Nobody has an
> opinion about that?  Or was it just ignored -- which is often how
> people handle character encoding in applications. ;)

Does it make a difference if it's in core or in a plugin?

In your original email you said that the existing plugins don't do it
right.. Which is quite possibly fair criticism, however I don't see
how moving the functionality into core would help the code be more
correct.. Saying 'Plugin X is broken', 'Let's move Plugin X into core'
doesn't sound very convincing from where I'm sat. :_)

Code speaks louder than words, so if you'd like to provide some
failing tests for what you think encoding _should_ be doing, that'd
probably be a better basis for further discussion.


Cheers
t0m




Re: [Catalyst] Re: decoding in core (Was: [Announce] Catalyst-Runtime-5.8000_05)

2009-02-06 Thread Bill Moseley
On Fri, Jan 30, 2009 at 11:44:57PM +0100, Aristotle Pagaltzis wrote:
> * Bill Moseley  [2009-01-29 17:05]:
> > Neither of the existing plugins do it correctly (IMO), as
> > they only decode parameters leaving body_parameters as octets,
> > and don't look at the request for the charset, IIRC. […]
> > uri_for() rightly encodes to octets before escaping, but it
> > always encodes to utf-8. Is it assumed that query parameters
> > are always utf-8 or should they be decoded with the charset
> > specified in the request?
> 
> The URI should always be assumed to be UTF-8 encoded octets.
> The body should be decoded according to the charset declared
> in the header by the browser.

Assume UTF-8 because that's how the application encoded the
URL in the first place?  Is UTF-8 specified in an RFC?  I thought
URIs were defined as characters with ASCII encoding for transmitting.


Nobody responded to the main point of this email -- if Catalyst should
handle encoding in core instead of with a plugin.  Nobody has an
opinion about that?  Or was it just ignored -- which is often how
people handle character encoding in applications. ;)

-- 
Bill Moseley
mose...@hank.org
Sent from my iMutt




[Catalyst] Re: decoding in core (Was: [Announce] Catalyst-Runtime-5.8000_05)

2009-01-30 Thread Aristotle Pagaltzis
* Bill Moseley  [2009-01-29 17:05]:
> Neither of the existing plugins do it correctly (IMO), as
> they only decode parameters leaving body_parameters as octets,
> and don't look at the request for the charset, IIRC. […]
> uri_for() rightly encodes to octets before escaping, but it
> always encodes to utf-8. Is it assumed that query parameters
> are always utf-8 or should they be decoded with the charset
> specified in the request?

The URI should always be assumed to be UTF-8 encoded octets.
The body should be decoded according to the charset declared
in the header by the browser.

Regards,
-- 
Aristotle Pagaltzis // 

___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/