Bill Janssen:
Since the site that receives the POST doesn't necessarily have access to
the Web page that originally contained the form, that's not really
helpful. However, POSTs can use the MIME type "multipart/form-data" for
non-Latin-1 content, and should. That contains facilities for
indicating the encoding of each field.
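A rough sketch of the facility being referred to (the boundary, field name,
and sample text below are invented for illustration): a multipart/form-data
part can declare its own charset in a per-part Content-Type header, and the
stdlib email parser can read it back.

    import email

    # Hand-built POST body whose single form field declares charset=utf-8.
    body = (b'Content-Type: multipart/form-data; boundary=XX\r\n'
            b'\r\n'
            b'--XX\r\n'
            b'Content-Disposition: form-data; name="comment"\r\n'
            b'Content-Type: text/plain; charset=utf-8\r\n'
            b'\r\n'
            b'na\xc3\xafve caf\xc3\xa9\r\n'
            b'--XX--\r\n')

    part = email.message_from_bytes(body).get_payload()[0]
    charset = part.get_param('charset')                   # 'utf-8'
    print(part.get_payload(decode=True).decode(charset))  # naïve café

In practice browsers often omit the per-part charset, which is why the rest
of the thread keeps coming back to the page encoding and to guessing.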
>> For web forms, I always encode the pages in UTF-8, and that always
>> works.
>
> Should work, if you use the "multipart/form-data" format.
Right - I was implicitly assuming that.
Regards,
Martin
[EMAIL PROTECTED] writes:
> > When a web browser POSTs data, there is no standard way of communicating
> > which encoding it's using.
>
> That's just not true. Web browsers should and do use the encoding of the
> web page that originally contained the form.
I wonder if the discussion is confusing
On 2008-04-23 07:26, Terry Reedy wrote:
""Martin v. Löwis"" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
|> I certainly agree that if the target set of documents is small enough it
|
| Ok. What advantage would you (or somebody working on a similar project)
| gain if chardet was part of the standard library?
""Martin v. Löwis"" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
|> I certainly agree that if the target set of documents is small enough it
|
| Ok. What advantage would you (or somebody working on a similar project)
| gain if chardet was part of the standard library? What if it wa
> I certainly agree that if the target set of documents is small enough it
> is possible to hand-code the encoding. There are many applications,
> however, that need to examine the content of an arbitrary, or at least
> non-small set of web documents. To name a few such applications:
>
> - web
> Yup, but DrProject (the target application) also serves as a relay and
> archive for email. We have no control over the agent used for
> composition, and AFAIK there's no standard way to include encoding
> information.
That's not at all the case. MIME has defined that in full detail since
1993.
R
Guido van Rossum writes:
> To the contrary, an encoding-guessing module is often needed, and
> guessing can be done with a pretty high success rate. Other Unicode
> libraries (e.g. ICU) contain guessing modules. I suppose the API could
> return two values: the guessed encoding and a confidence indicator.
"Martin v. Löwis" writes:
> In any case, I'm very skeptical that a general "guess encoding"
> module would do a meaningful thing when applied to incorrectly
> encoded HTML pages.
That depends on whether you can get meaningful information about the
language from the fact that you're looking at
Bill Janssen writes:
> Internet-compliant email actually has well-specified mechanisms for
> including encoding information; see RFCs 2047 and 2231. There's no
> need to guess; you can just look.
You must be very special to get only compliant email.
About half my colleagues use RFC 2047 to e
> Yup, but DrProject (the target application) also serves as a relay and
> archive for email. We have no control over the agent used for
> composition, and AFAIK there's no standard way to include encoding
> information.
Greg,
Internet-compliant email actually has well-specified mechanisms for
including encoding information; see RFCs 2047 and 2231.
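For the compliant case, "just looking" is mechanical with the stdlib email
package; a small sketch (the sample message is invented):

    from email import message_from_bytes
    from email.header import decode_header, make_header

    raw = (b'Subject: =?iso-8859-1?q?r=E9sum=E9?=\r\n'
           b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
           b'\r\n'
           b'na\xefve\r\n')

    msg = message_from_bytes(raw)
    charset = msg.get_content_charset()                         # 'iso-8859-1'
    body = msg.get_payload(decode=True).decode(charset)         # 'naïve'
    subject = str(make_header(decode_header(msg['Subject'])))   # 'résumé' (RFC 2047)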
On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote:
Any program that needs to examine the contents of
documents/feeds/whatever on the web needs to deal with
incorrectly-specified encodings
That's not true. Most programs that need to examine the contents of
a web page don't need to guess the encoding.
> Unless you're using a very broad scope, I don't think that
> you'd need more than a few hundred LSEs for a typical
> application - nothing you'd want to put in the Python stdlib,
> though.
I tend to agree with this (and I'm generally in favor of putting
everything in the standard library!). For
> > When a web browser POSTs data, there is no standard way of communicating
> > which encoding it's using.
>
> That's just not true. Web browsers should and do use the encoding of the
> web page that originally contained the form.
Since the site that receives the POST doesn't necessarily have access to
the Web page that originally contained the form, that's not really helpful.
On 2008-04-22 18:33, Bill Janssen wrote:
The 2002 paper "A language and character set determination method
based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and
Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go
about this.
Thanks for the reference.
Looks like
>> Can you please explain why that is? Web programs should not normally
>> have the need to detect the encoding; instead, it should be specified
>> always - unless you are talking about browsers specifically, which
>> need to support web pages that specify the encoding incorrectly.
>
> Any program that needs to examine the contents of documents/feeds/whatever
> on the web needs to deal with incorrectly-specified encodings
[CCing python-dev again]
On 2008-04-22 12:38, Greg Wilson wrote:
I don't think that should be part of the standard library. People
will mistake what it tells them for certain.
[etc]
These are all good arguments, but the fact remains that we can't control
our inputs (e.g., we're archiving mail).
On 22-Apr-08, at 3:31 AM, M.-A. Lemburg wrote:
I don't think that should be part of the standard library. People
will mistake what it tells them for certain.
+1
I also think that it's better to educate people to add (correct)
encoding information to their text data, rather than give them a
> When a web browser POSTs data, there is no standard way of communicating
> which encoding it's using.
That's just not true. Web browsers should and do use the encoding of the
web page that originally contained the form.
> There are some hints which make it easier
> (accept-charset attributes, th
The 2002 paper "A language and character set determination method
based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and
Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go
about this. They're looking at "LSE"s, language-script-encoding
triples; a "script" is a way o
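Not the paper's code, but a rough sketch of the n-gram idea it builds on:
score the input bytes against per-LSE trigram profiles trained offline and
pick the best-matching triple (the profile table below is a placeholder).

    from collections import Counter

    def trigrams(data: bytes) -> Counter:
        """Count overlapping byte trigrams in the input."""
        return Counter(data[i:i + 3] for i in range(len(data) - 2))

    # Real profiles would be trained from known-good corpora; empty Counters
    # here are placeholders.
    PROFILES = {
        ('ja', 'Kanji', 'EUC-JP'): Counter(),
        ('ru', 'Cyrillic', 'KOI8-R'): Counter(),
    }

    def guess_lse(data: bytes):
        """Return the (language, script, encoding) triple whose profile
        overlaps most with the input's trigram counts."""
        seen = trigrams(data)
        def overlap(profile):
            return sum(min(n, profile[g]) for g, n in seen.items())
        return max(PROFILES, key=lambda lse: overlap(PROFILES[lse]))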
On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:
IMO, encoding estimation is something that many web programs will
have
to deal with
Can you please explain why that is? Web programs should not normally
have the need to detect the encoding; instead, it should be specified
always - unless you are talking about browsers specifically, which
need to support web pages that specify the encoding incorrectly.
> IMHO, more research has to be done into this area before a
> "standard" module can be added to Python's stdlib... and
> who knows, perhaps we're lucky and by then everyone will be
> using UTF-8 anyway :-)
I walked over to our computational linguistics group and asked. This
is often combined
On 2008-04-21 23:31, Martin v. Löwis wrote:
This is useful when you get a hunk of data which _should_ be some
sort of intelligible text from the Big Scary Internet (say, a posted
web form or email message), and you want to do something useful with
it (say, search the content).
I don't think that should be part of the standard library.
> IMO, encoding estimation is something that many web programs will have
> to deal with
Can you please explain why that is? Web programs should not normally
have the need to detect the encoding; instead, it should be specified
always - unless you are talking about browsers specifically, which
need to support web pages that specify the encoding incorrectly.
David Wolever wrote:
> IMO, encoding estimation is something that
> many web programs will have to deal with,
> so it might as well be built in; I would prefer
> the option to run `text=input.encode('guess')`
> (or something similar) than relying on an external
> dependency or worse yet using a ha
On 21-Apr-08, at 5:31 PM, Martin v. Löwis wrote:
This is useful when you get a hunk of data which _should_ be some
sort of intelligible text from the Big Scary Internet (say, a posted
web form or email message), and you want to do something useful with
it (say, search the content).
I don't think that should be part of the standard library.
On Mon, Apr 21, 2008 at 06:37:20PM -0300, Rodrigo Bernardo Pimentel wrote:
> On Mon, Apr 21 2008 at 06:31:06PM BRT, "Martin v. Löwis" <[EMAIL
> PROTECTED]> wrote:
> > > This is useful when you get a hunk of data which _should_ be some
> > > sort of intelligible text from the Big Scary Internet (say, a posted
> > > web form or email message), and you want to do something useful with
> > > it (say, search the content).
On Mon, Apr 21 2008 at 06:31:06PM BRT, "Martin v. Löwis" <[EMAIL
PROTECTED]> wrote:
> > This is useful when you get a hunk of data which _should_ be some
> > sort of intelligible text from the Big Scary Internet (say, a posted
> > web form or email message), and you want to do something useful with
> > it (say, search the content).
> This is useful when you get a hunk of data which _should_ be some
> sort of intelligible text from the Big Scary Internet (say, a posted
> web form or email message), and you want to do something useful with
> it (say, search the content).
I don't think that should be part of the standard library.
At 1:14 PM -0400 4/21/08, David Wolever wrote:
>On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote:
>>
>> David> Is there some sort of text encoding detection module in the
>> David> standard library? And, if not, is there any reason not
>> to add
>> David> one?
>> No, there's not. I
Guido> Note that the locale settings might figure in the guess.
Alas, locale settings in a web server have little or nothing to do with the
locale settings in the client submitting the form.
Skip
Michael> The only approach I know of is a heuristic based approach. e.g.
Michael> http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml
Michael> (Which was 'borrowed' from docutils in the first place.)
Yes, I implemented a heuristic approach for the Musi-Cal web server
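That kind of heuristic is essentially a decode cascade; a minimal sketch
(the candidate order here is only an example, not the docutils code):

    def heuristic_decode(data, candidates=('utf-8', 'utf-16')):
        """Try candidate encodings in order; fall back to latin-1, which
        accepts any byte sequence."""
        for enc in candidates:
            try:
                return data.decode(enc), enc
            except UnicodeDecodeError:
                continue
        return data.decode('latin-1'), 'latin-1'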
To the contrary, an encoding-guessing module is often needed, and
guessing can be done with a pretty high success rate. Other Unicode
libraries (e.g. ICU) contain guessing modules. I suppose the API could
return two values: the guessed encoding and a confidence indicator.
Note that the locale settings might figure in the guess.
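The third-party chardet package (not in the stdlib) already has roughly this
shape; a sketch of wrapping it into the two-value API described above:

    import chardet  # third-party: pip install chardet

    def guess_encoding(data):
        """Return (encoding, confidence) for a chunk of undeclared bytes."""
        result = chardet.detect(data)
        return result['encoding'], result['confidence']

    enc, conf = guess_encoding('détente in Genève'.encode('latin-1'))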
Christian Heimes schrieb:
> David Wolever schrieb:
>> Is there some sort of text encoding detection module in the standard
>> library?
>> And, if not, is there any reason not to add one?
>
> You cannot detect the encoding unless it's explicitly defined through a
> header (e.g. the UTF BOM). It's technically impossible.
On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote:
>
> David> Is there some sort of text encoding detection module in the
> David> standard library? And, if not, is there any reason not
> to add
> David> one?
> No, there's not. I suspect the fact that you can't correctly
> determine the encoding of a chunk of text 100% of the time mitigates
> against it.
On Mon, 21 Apr 2008 17:50:43 +0100, Michael Foord <[EMAIL PROTECTED]> wrote:
>[EMAIL PROTECTED] wrote:
>> David> Is there some sort of text encoding detection module in the
>> David> standard library? And, if not, is there any reason not to add
>> David> one?
>>
>> No, there's not. I
David Wolever schrieb:
> Is there some sort of text encoding detection module in the standard
> library?
> And, if not, is there any reason not to add one?
You cannot detect the encoding unless it's explicitly defined through a
header (e.g. the UTF BOM). It's technically impossible. The best you can
do is guess.
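The BOM check itself is the one reliable part; a small sketch using the
constants that codecs ships (UTF-32 must be tested before UTF-16, since the
UTF-16-LE BOM is a prefix of the UTF-32-LE one):

    import codecs

    _BOMS = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]

    def sniff_bom(data):
        """Return a codec name if the data starts with a Unicode BOM, else
        None.  The names are chosen so that decoding with them strips the BOM."""
        for bom, name in _BOMS:
            if data.startswith(bom):
                return name
        return None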
[EMAIL PROTECTED] wrote:
> David> Is there some sort of text encoding detection module in the
> David> standard library? And, if not, is there any reason not to add
> David> one?
>
> No, there's not. I suspect the fact that you can't correctly determine the
> encoding of a chunk of text 100% of the time mitigates against it.
David> Is there some sort of text encoding detection module in the
David> standard library? And, if not, is there any reason not to add
David> one?
No, there's not. I suspect the fact that you can't correctly determine the
encoding of a chunk of text 100% of the time mitigates against it.
Is there some sort of text encoding detection module in the standard
library?
And, if not, is there any reason not to add one?
After some googling, I've come across this:
http://mail.python.org/pipermail/python-3000/2006-September/003537.html
But I can't find any changes that resulted from that discussion.