Re: [Python-Dev] Encoding detection in the standard library?

2008-05-11 Thread Greg Wilson
Bill Janssen: Since the site that receives the POST doesn't necessarily have access to the Web page that originally contained the form, that's not really helpful. However, POSTs can use the MIME type "multipart/form-data" for non-Latin-1 content, and should. That contains facilities for indi

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-23 Thread Martin v. Löwis
>> For web forms, I always encode the pages in UTF-8, and that always >> works. > > Should work, if you use the "multipart/form-data" format. Right - I was implicitly assuming that. Regards, Martin

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-23 Thread Bill Janssen
[EMAIL PROTECTED] writes: > > When a web browser POSTs data, there is no standard way of communicating > > which encoding it's using. > > That's just not true. Web browsers should and do use the encoding of the > web page that originally contained the form. I wonder if the discussion is confusing

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-23 Thread M.-A. Lemburg
On 2008-04-23 07:26, Terry Reedy wrote: "Martin v. Löwis" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] |> I certainly agree that if the target set of documents is small enough it | | Ok. What advantage would you (or somebody working on a similar project) | gain if chardet was pa

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Terry Reedy
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] |> I certainly agree that if the target set of documents is small enough it | | Ok. What advantage would you (or somebody working on a similar project) | gain if chardet was part of the standard library? What if it wa

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
> I certainly agree that if the target set of documents is small enough it > is possible to hand-code the encoding. There are many applications, > however, that need to examine the content of an arbitrary, or at least > non-small set of web documents. To name a few such applications: > > - web

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
> Yup, but DrProject (the target application) also serves as a relay and > archive for email. We have no control over the agent used for > composition, and AFAIK there's no standard way to include encoding > information. That's not at all the case. MIME defines that in full detail, since 1993. R

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Stephen J. Turnbull
Guido van Rossum writes: > To the contrary, an encoding-guessing module is often needed, and > guessing can be done with a pretty high success rate. Other Unicode > libraries (e.g. ICU) contain guessing modules. I suppose the API could > return two values: the guessed encoding and a confidence

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > In any case, I'm very skeptical that a general "guess encoding" > module would do a meaningful thing when applied to incorrectly > encoded HTML pages. That depends on whether you can get meaningful information about the language from the fact that you're looking at

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Stephen J. Turnbull
Bill Janssen writes: > Internet-compliant email actually has well-specified mechanisms for > including encoding information; see RFCs 2047 and 2231. There's no > need to guess; you can just look. You must be very special to get only compliant email. About half my colleagues use RFC 2047 to e

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
> Yup, but DrProject (the target application) also serves as a relay and > archive for email. We have no control over the agent used for > composition, and AFAIK there's no standard way to include encoding > information. Greg, Internet-compliant email actually has well-specified mechanisms fo
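For mail that does follow the standards, "just looking" can be sketched with the standard `email` package. This is a minimal illustration, not code from the thread: the sample message `RAW` and the helper names are made up, and, as other replies point out, real-world mail may omit the charset or declare it wrongly, so `declared_charset` can return `None` or a misleading value.

```python
from email import message_from_bytes
from email.header import decode_header, make_header

# Hypothetical sample message: an RFC 2047 encoded Subject and a body whose
# charset is declared in the Content-Type header.
RAW = (
    b"Subject: =?iso-8859-1?q?Gr=FC=DFe?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xfc\xdfe aus M\xfcnchen\r\n"
)


def declared_charset(raw):
    """Return the charset named in Content-Type, lowercased, or None."""
    return message_from_bytes(raw).get_content_charset()


def decoded_subject(raw):
    """Decode RFC 2047 encoded words in the Subject header to text."""
    msg = message_from_bytes(raw)
    return str(make_header(decode_header(msg["Subject"])))


def decoded_body(raw):
    """Decode the body with the declared charset (assumes ASCII if absent)."""
    msg = message_from_bytes(raw)
    payload = msg.get_payload(decode=True)
    return payload.decode(msg.get_content_charset() or "ascii")
```

A message without a `charset` parameter makes `declared_charset` return `None`, which is exactly the case where a guessing step would have to take over.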

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Mike Klaas
On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote: Any program that needs to examine the contents of documents/feeds/whatever on the web needs to deal with incorrectly-specified encodings That's not true. Most programs that need to examine the contents of a web page don't need to guess the enc

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
> Unless you're using a very broad scope, I don't think that > you'd need more than a few hundred LSEs for a typical > application - nothing you'd want to put in the Python stdlib, > though. I tend to agree with this (and I'm generally in favor of putting everything in the standard library!). For

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
> > When a web browser POSTs data, there is no standard way of communicating > > which encoding it's using. > > That's just not true. Web browsers should and do use the encoding of the > web page that originally contained the form. Since the site that receives the POST doesn't necessarily have acc

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread M.-A. Lemburg
On 2008-04-22 18:33, Bill Janssen wrote: The 2002 paper "A language and character set determination method based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go about this. Thanks for the reference. Looks like

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
>> Can you please explain why that is? Web programs should not normally >> have the need to detect the encoding; instead, it should be specified >> always - unless you are talking about browsers specifically, which >> need to support web pages that specify the encoding incorrectly. > > Any program

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread M.-A. Lemburg
[CCing python-dev again] On 2008-04-22 12:38, Greg Wilson wrote: I don't think that should be part of the standard library. People will mistake what it tells them for certain. [etc] These are all good arguments, but the fact remains that we can't control our inputs (e.g., we're archiving mail

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Mike Klaas
On 22-Apr-08, at 3:31 AM, M.-A. Lemburg wrote: I don't think that should be part of the standard library. People will mistake what it tells them for certain. +1 I also think that it's better to educate people to add (correct) encoding information to their text data, rather than give them a

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
> When a web browser POSTs data, there is no standard way of communicating > which encoding it's using. That's just not true. Web browsers should and do use the encoding of the web page that originally contained the form. > There are some hints which make it easier > (accept-charset attributes, th

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
The 2002 paper "A language and character set determination method based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go about this. They're looking at "LSE"s, language-script-encoding triples; a "script" is a way o
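To make the idea of scoring text against per-LSE statistics concrete, here is a toy sketch. It is not the method from the paper: it simply builds byte-bigram frequency profiles from labelled samples and picks the label whose profile best matches the input. The function names, the use of bigrams, and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(data, n=2):
    """Return all overlapping byte n-grams of a bytes object."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_profile(sample, n=2):
    """Map each n-gram in a training sample to its relative frequency."""
    counts = Counter(ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def classify(data, profiles, n=2):
    """Return the label whose profile gives the input the highest score.

    The score is the sum of the profile frequencies of the input's n-grams;
    n-grams the profile has never seen contribute nothing.
    """
    def score(profile):
        return sum(profile.get(gram, 0.0) for gram in ngrams(data, n))

    return max(profiles, key=lambda label: score(profiles[label]))
```

In use, `profiles` would map each (language, script, encoding) triple to a profile built from a large training corpus; a single short sample per triple, as in a quick experiment, is far too little for real detection.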

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread David Wolever
On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote: IMO, encoding estimation is something that many web programs will have to deal with Can you please explain why that is? Web programs should not normally have the need to detect the encoding; instead, it should be specified always - unless you a

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
> IMHO, more research has to be done into this area before a > "standard" module can be added to the Python's stdlib... and > who knows, perhaps we're lucky and by the time everyone is > using UTF-8 anyway :-) I walked over to our computational linguistics group and asked. This is often combined

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread M.-A. Lemburg
On 2008-04-21 23:31, Martin v. Löwis wrote: This is useful when you get a hunk of data which _should_ be some sort of intelligible text from the Big Scary Internet (say, a posted web form or email message), and you want to do something useful with it (say, search the content). I don't thin

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Martin v. Löwis
> IMO, encoding estimation is something that many web programs will have > to deal with Can you please explain why that is? Web programs should not normally have the need to detect the encoding; instead, it should be specified always - unless you are talking about browsers specifically, which need

[Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Jim Jewett
David Wolever wrote: > IMO, encoding estimation is something that > many web programs will have to deal with, > so it might as well be built in; I would prefer > the option to run `text=input.encode('guess')` > (or something similar) than relying on an external > dependency or worse yet using a ha

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread David Wolever
On 21-Apr-08, at 5:31 PM, Martin v. Löwis wrote: This is useful when you get a hunk of data which _should_ be some sort of intelligible text from the Big Scary Internet (say, a posted web form or email message), and you want to do something useful with it (say, search the content). I don't think

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Oleg Broytmann
On Mon, Apr 21, 2008 at 06:37:20PM -0300, Rodrigo Bernardo Pimentel wrote: > On Mon, Apr 21 2008 at 06:31:06PM BRT, "\"Martin v. Löwis\"" <[EMAIL PROTECTED]> wrote: > > > This is useful when you get a hunk of data which _should_ be some > > > sort of intelligible text from the Big Scary Inter

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Rodrigo Bernardo Pimentel
On Mon, Apr 21 2008 at 06:31:06PM BRT, "\"Martin v. Löwis\"" <[EMAIL PROTECTED]> wrote: > > This is useful when you get a hunk of data which _should_ be some > > sort of intelligible text from the Big Scary Internet (say, a posted > > web form or email message), and you want to do something us

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Martin v. Löwis
> This is useful when you get a hunk of data which _should_ be some > sort of intelligible text from the Big Scary Internet (say, a posted > web form or email message), and you want to do something useful with > it (say, search the content). I don't think that should be part of the standard

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Tony Nelson
At 1:14 PM -0400 4/21/08, David Wolever wrote: >On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote: >> >> David> Is there some sort of text encoding detection module in the >> David> standard library? And, if not, is there any reason not >> to add >> David> one? >> No, there's not. I

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread skip
Guido> Note that the locale settings might figure in the guess. Alas, locale settings in a web server have little or nothing to do with the locale settings in the client submitting the form. Skip

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread skip
Michael> The only approach I know of is a heuristic based approach. e.g. Michael> http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml Michael> (Which was 'borrowed' from docutils in the first place.) Yes, I implemented a heuristic approach for the Musi-Cal web server

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Guido van Rossum
To the contrary, an encoding-guessing module is often needed, and guessing can be done with a pretty high success rate. Other Unicode libraries (e.g. ICU) contain guessing modules. I suppose the API could return two values: the guessed encoding and a confidence indicator. Note that the locale setti
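The two-value API suggested here could look roughly like the sketch below. Everything in it is hypothetical: the name `guess_encoding`, the candidate list, and the confidence numbers are made up for illustration, and it only tries strict decoders in order rather than doing any statistical detection of the kind ICU or chardet perform.

```python
def guess_encoding(data, candidates=(("ascii", 1.0), ("utf-8", 0.9)),
                   fallback=("latin-1", 0.1)):
    """Return (encoding_name, confidence) for a bytes object.

    Hypothetical API, not part of the standard library. Each candidate is
    tried with a strict decoder; the first one that succeeds wins. The
    confidence values are illustrative placeholders, not measured rates.
    """
    for name, confidence in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name, confidence
    # latin-1 decodes any byte sequence, so success says almost nothing;
    # the low confidence is meant to signal that to the caller.
    return fallback
```

Returning the confidence alongside the name lets the caller decide whether to trust the guess, ask the user, or fall back to a default, which addresses the worry that people will mistake a guess for a certainty.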

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Georg Brandl
Christian Heimes wrote: > David Wolever wrote: >> Is there some sort of text encoding detection module in the standard >> library? >> And, if not, is there any reason not to add one? > > You cannot detect the encoding unless it's explicitly defined through a > header (e.g. the UTF BOM). It's

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread David Wolever
On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote: > > David> Is there some sort of text encoding detection module in the > David> standard library? And, if not, is there any reason not > to add > David> one? > No, there's not. I suspect the fact that you can't correctly > determ

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Jean-Paul Calderone
On Mon, 21 Apr 2008 17:50:43 +0100, Michael Foord <[EMAIL PROTECTED]> wrote: >[EMAIL PROTECTED] wrote: >> David> Is there some sort of text encoding detection module in the >> David> standard library? And, if not, is there any reason not to add >> David> one? >> >> No, there's not. I

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Christian Heimes
David Wolever wrote: > Is there some sort of text encoding detection module in the standard > library? > And, if not, is there any reason not to add one? You cannot detect the encoding unless it's explicitly defined through a header (e.g. the UTF BOM). It's technically impossible. The best you
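The one case mentioned here where the encoding is explicitly marked, a leading byte order mark, can be checked with constants from the standard `codecs` module. This is a small sketch; the helper name `encoding_from_bom` is made up, and a missing BOM tells you nothing about the encoding.

```python
import codecs

# UTF-32 BOMs are listed before UTF-16 ones because BOM_UTF32_LE
# (FF FE 00 00) begins with the same two bytes as BOM_UTF16_LE (FF FE).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def encoding_from_bom(data):
    """Return a codec name if data starts with a known BOM, else None.

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
    themselves when decoding.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

One ambiguity remains even here: UTF-16 little-endian text that starts with a NUL character after its BOM looks identical to a UTF-32 BOM, so this check is still a heuristic in rare cases.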

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Michael Foord
[EMAIL PROTECTED] wrote: > David> Is there some sort of text encoding detection module in the > David> standard library? And, if not, is there any reason not to add > David> one? > > No, there's not. I suspect the fact that you can't correctly determine the > encoding of a chunk of te

Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread skip
David> Is there some sort of text encoding detection module in the David> standard library? And, if not, is there any reason not to add David> one? No, there's not. I suspect the fact that you can't correctly determine the encoding of a chunk of text 100% of the time mitigates again

[Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread David Wolever
Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one? After some googling, I've come across this: http://mail.python.org/pipermail/python-3000/2006-September/003537.html But I can't find any changes that resulted from that