Re: [Python-Dev] Encoding detection in the standard library?

2008-05-11 Thread Greg Wilson

Bill Janssen:
Since the site that receives the POST doesn't necessarily have access to 
the Web page that originally contained the form, that's not really 
helpful.  However, POSTs can use the MIME type multipart/form-data for 
non-Latin-1 content, and should.  That contains facilities for 
indicating the encoding and other things as well.


Yup, but DrProject (the target application) also serves as a relay and 
archive for email.  We have no control over the agent used for 
composition, and AFAIK there's no standard way to include encoding 
information.


Thanks,
Greg
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-23 Thread M.-A. Lemburg

On 2008-04-23 07:26, Terry Reedy wrote:
Martin v. Löwis [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]

| I certainly agree that if the target set of documents is small enough it
|
| Ok. What advantage would you (or somebody working on a similar project)
| gain if chardet was part of the standard library? What if it was not
| chardet, but some other algorithm?

It seems to me that since there is not a 'correct' algorithm but only 
competing heuristics, encoding detection modules should be made available 
via PyPI and only be considered for stdlib after a best of breed emerges 
with community support. 


+1

Though in practice, determining the best of breed often becomes a
problem (see e.g. the JSON implementation discussion).

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 23 2008)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-23 Thread Bill Janssen
[EMAIL PROTECTED] writes:
  When a web browser POSTs data, there is no standard way of communicating
  which encoding it's using.
 
 That's just not true. Web browsers should and do use the encoding of the
 web page that originally contained the form.

I wonder if the discussion is confusing two different things.  Take a
look at
http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4.

There are two prescribed ways of sending form data:
application/x-www-form-urlencoded, which can only be used with ASCII
data, and multipart/form-data.  ``The content type
multipart/form-data should be used for submitting forms that contain
files, non-ASCII data, and binary data.''

It's true that the page containing the form may specify which of these
two forms to use, but the character encodings are determined by the
choice.
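To make the distinction concrete, here is a minimal sketch of a multipart/form-data body carrying a per-part charset declaration, which application/x-www-form-urlencoded cannot do. The boundary, field name, and payload are invented for the example:

```python
import re

# Illustrative multipart/form-data body: each part can declare its own
# Content-Type with a charset parameter.
body = (
    b"--BOUNDARY\r\n"
    b'Content-Disposition: form-data; name="comment"\r\n'
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"\r\n"
    b"Gr\xc3\xbc\xc3\x9fe\r\n"   # "Gr\u00fc\u00dfe" encoded as UTF-8
    b"--BOUNDARY--\r\n"
)

# A receiver that honours the declared charset can decode reliably:
match = re.search(rb"charset=([\w-]+)", body)
charset = match.group(1).decode("ascii")
payload = body.split(b"\r\n\r\n", 1)[1].split(b"\r\n--BOUNDARY--")[0]
print(payload.decode(charset))  # -> Grüße
```

A real receiver would of course use a proper multipart parser rather than string splitting; the point is only that the encoding travels with the data.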

 For web forms, I always encode the pages in UTF-8, and that always
 works.

Should work, if you use the multipart/form-data format.

Bill



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-23 Thread Martin v. Löwis
 For web forms, I always encode the pages in UTF-8, and that always
 works.
 
 Should work, if you use the multipart/form-data format.

Right - I was implicitly assuming that.

Regards,
Martin


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread M.-A. Lemburg

On 2008-04-21 23:31, Martin v. Löwis wrote:
This is useful when you get a hunk of data which _should_ be some  
sort of intelligible text from the Big Scary Internet (say, a posted  
web form or email message), and you want to do something useful with  
it (say, search the content).


I don't think that should be part of the standard library. People
will mistake what it tells them for certain.


+1

I also think that it's better to educate people to add (correct)
encoding information to their text data, rather than give them a
guess mechanism...

http://chardet.feedparser.org/docs/faq.html#faq.yippie

chardet is based on the Mozilla algorithm and at least in
my experience that algorithm doesn't work too well.

The Mozilla algorithm may work for Asian encodings due to the fact
that those encodings are usually also bound to a specific language
(and you can then use character and word frequency analysis), but
for encodings which can encode far more than just a single language
(e.g. UTF-8 or Latin-1), the correct detection rate is rather low.

The problem becomes even more difficult when leaving
the normal text domain or when mixing languages in the same
text, e.g. when trying to detect source code with comments using
a non-ASCII encoding.

The trick to just pass the text through a codec and see whether
it roundtrips also doesn't necessarily help: Latin-1, for example,
will always round-trip, since Latin-1 is a subset of Unicode.
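The Latin-1 pitfall is easy to demonstrate in a few lines of Python (illustrative only; the sample string is made up):

```python
# Every byte sequence decodes under Latin-1, so a successful decode
# proves nothing about the true encoding.
data = "Grüße".encode("utf-8")        # UTF-8 bytes

decoded_latin1 = data.decode("latin-1")   # never raises...
print(repr(decoded_latin1))               # ...but yields mojibake

# A stricter codec at least *can* fail on wrong input:
try:
    b"\xff\xfe\x41".decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```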

IMHO, more research has to be done in this area before a
standard module can be added to the Python stdlib... and
who knows, perhaps we'll be lucky and by then everyone will be
using UTF-8 anyway :-)

--
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
 IMHO, more research has to be done in this area before a
 standard module can be added to the Python stdlib... and
 who knows, perhaps we'll be lucky and by then everyone will be
 using UTF-8 anyway :-)

I walked over to our computational linguistics group and asked.  This
is often combined with language guessing (which uses a similar
approach, but using characters instead of bytes), and apparently can
usually be done with high confidence.  Of course, they're usually
looking at clean texts, not random stuff.  I'll see if I can get
some references and report back -- most of the research on this was
done in the 90's.

Bill


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread David Wolever

On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:
IMO, encoding estimation is something that many web programs will have
to deal with

Can you please explain why that is? Web programs should not normally
have the need to detect the encoding; instead, it should be specified
always - unless you are talking about browsers specifically, which
need to support web pages that specify the encoding incorrectly.

Two cases come immediately to mind: email and web forms.
When a web browser POSTs data, there is no standard way of
communicating which encoding it's using.  There are some hints which
make it easier (accept-charset attributes, the encoding used to send
the page to the browser), but no guarantees.
Email is a smaller problem, because it usually has a helpful
content-type header, but that's no guarantee.


Now, at the moment, the only data I have to support this claim is my  
experience with DrProject in non-English locations.
If I'm the only one who has had these sorts of problems, I'll go back  
to Unicode for Dummies.



so it might as well be built in; I would prefer the option
to run `text=input.encode('guess')` (or something similar) than
relying on an external dependency or worse yet using a hand-rolled
algorithm.

Ok, let me try differently then. Please feel free to post a patch to
bugs.python.org, and let other people rip it apart.
For example, I don't think it should be a codec, as I can't imagine it
working on streams.


As things frequently are, it seems like this is a much larger problem
than I originally believed.


I'll go back and take another look at the problem, then come back if  
new revelations appear.



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
 When a web browser POSTs data, there is no standard way of communicating
 which encoding it's using.

That's just not true. Web browsers should and do use the encoding of the
web page that originally contained the form.

 There are some hints which make it easier
 (accept-charset attributes, the encoding used to send the page to the
 browser), but no guarantees.

Not true. The latter is guaranteed (unless you assume bugs - but if
you do, can you present a specific browser that has that bug?)

 Email is a smaller problem, because it usually has a helpful
 content-type header, but that's no guarantee.

Then assume windows-1252. Mailers that don't use MIME for non-ASCII
characters mostly died out 10 years ago; those people who continue to
use them can likely accept occasional mojibake (or else they would
have switched long ago).
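The fallback described above can be sketched in a few lines using the stdlib email package; the raw message here is a made-up example with no declared charset:

```python
from email import message_from_bytes

# Hypothetical raw message whose text part declares no charset.
raw = (b"From: someone@example.org\r\n"
       b"Subject: hello\r\n"
       b"\r\n"
       b"Caf\xe9\r\n")          # 0xE9 = 'e'-acute in windows-1252

msg = message_from_bytes(raw)
# Use the declared MIME charset when present, else assume windows-1252:
charset = msg.get_content_charset() or "windows-1252"
text = msg.get_payload(decode=True).decode(charset, errors="replace")
print(text)   # -> Café
```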

 Now, at the moment, the only data I have to support this claim is my
 experience with DrProject in non-English locations.
 If I'm the only one who has had these sorts of problems, I'll go back to
 Unicode for Dummies.

For web forms, I always encode the pages in UTF-8, and that always
works.

For email, I once added encoding processing to the pipermail (the
mailman archiver), and that also always works.

 I'll go back and take another look at the problem, then come back if new
 revelations appear.

Good luck!

Martin


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Mike Klaas


On 22-Apr-08, at 3:31 AM, M.-A. Lemburg wrote:



I don't think that should be part of the standard library. People
will mistake what it tells them for certain.


+1

I also think that it's better to educate people to add (correct)
encoding information to their text data, rather than give them a
guess mechanism...


That is a fallacious alternative: the programmers who need encoding
detection are not the same people who are omitting encoding information.


I only have a small opinion on whether charset detection should appear
in the stdlib, but I am somewhat perplexed by the arguments in this
thread.  I don't see how inclusion in the stdlib would make people
more inclined to think that the algorithm is always correct.  In terms
of the need for this functionality:


Martin wrote:

Can you please explain why that is? Web programs should not normally
have the need to detect the encoding; instead, it should be specified
always - unless you are talking about browsers specifically, which
need to support web pages that specify the encoding incorrectly.


Any program that needs to examine the contents of documents/feeds/whatever
on the web needs to deal with incorrectly-specified encodings
(which, sadly, is rather common).  The set of programs that need this
functionality is probably the same set that needs BeautifulSoup--I
think that set is larger than just browsers <grin>


-Mike


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread M.-A. Lemburg

[CCing python-dev again]

On 2008-04-22 12:38, Greg Wilson wrote:

I don't think that should be part of the standard library. People
will mistake what it tells them for certain.
[etc]


These are all good arguments, but the fact remains that we can't control 
our inputs (e.g., we're archiving mail messages sent to lists managed by 
DrProject), and some of those inputs *don't* tell us how they're encoded.

Under those circumstances, what would you recommend?


I haven't done much research into this, but in general, I think it's
better to:

 * first try to look at other characteristics of a text
   message, e.g. language, origin, topic, etc.,

 * then narrow down the number of encodings which could apply,

 * rank them to try to avoid ambiguities and

 * then try to see what percentage of the text you can decode using
   each of the encodings in reverse ranking order (ie. more specialized
   encodings should be tested first, latin-1 last).
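The steps above can be sketched as a ranked trial decode. The candidate list here is an assumption for the example; a real application would derive it from the language/origin hints described above:

```python
# Try candidate codecs from most specific to least, ending with
# latin-1, which accepts any byte sequence and so never fails.
# (Note that windows-1252 is *not* a safe catch-all: it has a few
# undefined bytes and can raise.)
CANDIDATES = ["ascii", "utf-8", "shift_jis", "windows-1252", "latin-1"]

def rank_decode(data: bytes):
    for codec in CANDIDATES:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 decodes anything")

text, codec = rank_decode("Grüße".encode("utf-8"))
print(codec)   # -> utf-8
```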

--
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
 Can you please explain why that is? Web programs should not normally
 have the need to detect the encoding; instead, it should be specified
 always - unless you are talking about browsers specifically, which
 need to support web pages that specify the encoding incorrectly.
 
 Any program that needs to examine the contents of
 documents/feeds/whatever on the web needs to deal with
 incorrectly-specified encodings

That's not true. Most programs that need to examine the contents of
a web page don't need to guess the encoding. In most such programs,
the encoding can be hard-coded if the declared encoding is not
correct. Most such programs *know* what page they are webscraping,
or else they couldn't extract the information out of it that they
want to get at.

As for feeds - can you give examples of incorrectly encoded ones?
(I don't ever use feeds, so I honestly don't know whether they
are typically encoded incorrectly. I've heard they are often XML,
in which case I strongly doubt they are incorrectly encoded.)

As for "whatever" - can you give specific examples?

 (which, sadly, is rather common). The
 set of programs that need this functionality is probably the
 same set that needs BeautifulSoup--I think that set is larger than just
 browsers <grin>

Again, can you give *specific* examples that are not web browsers?
Programs needing BeautifulSoup may still not need encoding guessing,
since they still might be able to hard-code the encoding of the web
page they want to process.

In any case, I'm very skeptical that a general guess encoding
module would do a meaningful thing when applied to incorrectly
encoded HTML pages.

Regards,
Martin


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread M.-A. Lemburg

On 2008-04-22 18:33, Bill Janssen wrote:

The 2002 paper "A language and character set determination method
based on N-gram statistics" by Izumi Suzuki, Yoshiki Mikami,
Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go
about this.


Thanks for the reference.

Looks like the existing research on this just hasn't made it into the
mainstream yet.

Here's their current project: http://www.language-observatory.org/
Looks like they are focusing more on language detection.

Another interesting paper using n-grams:
"Language Identification in Web Pages" by Bruno Martins and Mário J. Silva
http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf

And one using compression:
"Text Categorization Using Compression Models" by
Eibe Frank, Chang Chui and Ian H. Witten
http://portal.acm.org/citation.cfm?id=789742
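A toy illustration of the n-gram idea behind these papers: build byte n-gram frequency profiles per encoding and pick the nearest one by cosine similarity. The tiny training strings here are stand-ins; real systems train on large corpora per language/script/encoding:

```python
from collections import Counter

def profile(data: bytes, n: int = 3) -> Counter:
    """Byte n-gram frequency profile of a sample."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def similarity(p: Counter, q: Counter) -> float:
    """Cosine similarity between two n-gram profiles."""
    shared = set(p) & set(q)
    num = sum(p[g] * q[g] for g in shared)
    den = (sum(v * v for v in p.values()) ** 0.5 *
           sum(v * v for v in q.values()) ** 0.5)
    return num / den if den else 0.0

# "Registered" profiles would come from training data; these minimal
# samples are only for the example.
utf8_profile = profile("grüße grüße straße".encode("utf-8"))
latin1_profile = profile("grüße grüße straße".encode("latin-1"))

unknown = "große füße".encode("utf-8")
scores = {"utf-8": similarity(profile(unknown), utf8_profile),
          "latin-1": similarity(profile(unknown), latin1_profile)}
print(max(scores, key=scores.get))   # -> utf-8
```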


They're looking at LSEs, language-script-encoding
triples; a script is a way of using a particular character set to
write in a particular language.

Their system has these requirements:

R1. the response must be either "correct answer" or "unable to detect",
where "unable to detect" includes "other than registered" [the
registered set of LSEs];

R2. applicable to multi-LSE texts;

R3. never accept a wrong answer, even when the program does not have
enough data on an LSE; and

R4. applicable to any LSE text.

So, no wrong answers.

The biggest disadvantage would seem to be that the registration data
for a particular LSE is kind of bulky; on the order of 10,000
shift-codons, each of three bytes, about 30K uncompressed.

http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf


For a server based application that doesn't sound too large.

Unless you're using a very broad scope, I don't think that
you'd need more than a few hundred LSEs for a typical
application - nothing you'd want to put in the Python stdlib,
though.


Bill


IMHO, more research has to be done in this area before a
standard module can be added to the Python stdlib... and
who knows, perhaps we'll be lucky and by then everyone will be
using UTF-8 anyway :-)

I walked over to our computational linguistics group and asked.  This
is often combined with language guessing (which uses a similar
approach, but using characters instead of bytes), and apparently can
usually be done with high confidence.  Of course, they're usually
looking at clean texts, not random stuff.  I'll see if I can get
some references and report back -- most of the research on this was
done in the 90's.

Bill


--
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
  When a web browser POSTs data, there is no standard way of communicating
  which encoding it's using.
 
 That's just not true. Web browsers should and do use the encoding of the
 web page that originally contained the form.

Since the site that receives the POST doesn't necessarily have access
to the Web page that originally contained the form, that's not really
helpful.  However, POSTs can use the MIME type multipart/form-data
for non-Latin-1 content, and should.  That contains facilities for
indicating the encoding and other things as well.

Bill


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
 Unless you're using a very broad scope, I don't think that
 you'd need more than a few hundred LSEs for a typical
 application - nothing you'd want to put in the Python stdlib,
 though.

I tend to agree with this (and I'm generally in favor of putting
everything in the standard library!).  For those of us doing
document-processing applications (Martin, it's not just about Web
browsers), this would be a very useful package to have up on PyPI.

Bill


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Mike Klaas


On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote:



Any program that needs to examine the contents of
documents/feeds/whatever on the web needs to deal with
incorrectly-specified encodings


That's not true. Most programs that need to examine the contents of
a web page don't need to guess the encoding. In most such programs,
the encoding can be hard-coded if the declared encoding is not
correct. Most such programs *know* what page they are webscraping,
or else they couldn't extract the information out of it that they
want to get at.


I certainly agree that if the target set of documents is small enough
it is possible to hand-code the encoding.  There are many
applications, however, that need to examine the content of an
arbitrary, or at least non-small, set of web documents.  To name a few
such applications:


 - web search engines
 - translation software
 - document/bookmark management systems
 - other kinds of document analysis (market research, seo, etc.)


As for feeds - can you give examples of incorrectly encoded one
(I don't ever use feeds, so I honestly don't know whether they
are typically encoded incorrectly. I've heard they are often XML,
in which case I strongly doubt they are incorrectly encoded)


I also don't have much experience with feeds.  My statement is based  
on the fact that chardet, the tool that has been cited most in this  
thread, was written specifically for use with the author's feed  
parsing package.



As for whatever - can you give specific examples?


Not that I can substantiate.  Documents & feeds cover a lot of what
is on the web--I was only trying to make the point that on the web,
whenever an encoding can be specified, it will be specified
incorrectly for a significant chunk of exemplars.



(which, sadly, is rather common). The
set of programs that need this functionality is probably the
same set that needs BeautifulSoup--I think that set is larger than
just browsers <grin>


Again, can you give *specific* examples that are not web browsers?
Programs needing BeautifulSoup may still not need encoding guessing,
since they still might be able to hard-code the encoding of the web
page they want to process.


Indeed, if it is only one site it is pretty easy to work around.  My  
main use of python is processing and analyzing hundreds of millions of  
web documents, so it is pretty easy to see applications (which I have  
listed above).  I think that libraries like Mark Pilgrim's FeedParser  
and BeautifulSoup are possible consumers of guessing as well.



In any case, I'm very skeptical that a general guess encoding
module would do a meaningful thing when applied to incorrectly
encoded HTML pages.


Well, it does.  I wish I could easily provide data on how often it is  
necessary over the whole web, but that would be somewhat difficult to  
generate.  I can say that it is much more important to be able to  
parse all the different kinds of encoding _specification_ on the web  
(Content-Type/Content-Encoding/meta http-equiv tags, etc), and the  
malformed cases of these.


I can also think of good arguments for excluding encoding detection  
for maintenance reasons: is every case of the algorithm guessing wrong  
a bug that needs to be fixed in the stdlib?  That is an unbounded  
commitment.


-Mike


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Bill Janssen
 Yup, but DrProject (the target application) also serves as a relay and 
 archive for email.  We have no control over the agent used for 
 composition, and AFAIK there's no standard way to include encoding 
 information.

Greg,

Internet-compliant email actually has well-specified mechanisms for
including encoding information; see RFCs 2047 and 2231.  There's no
need to guess; you can just look.

Bill


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Stephen J. Turnbull
Bill Janssen writes:

  Internet-compliant email actually has well-specified mechanisms for
  including encoding information; see RFCs 2047 and 2231.  There's no
  need to guess; you can just look.

You must be very special to get only compliant email.

About half my colleagues use RFC 2047 to encode Japanese file names in
MIME attachments (a MUST NOT behavior according to RFC 2047), and a
significant fraction of the rest end up with binary Shift JIS or EUC
or MacRoman in there.

And those are just the most widespread violations I can think of off
the top of my head.

Not to mention that I find this:

=?X-UNKNOWN?Q?Martin_v=2E_L=F6wis?= [EMAIL PROTECTED],

in the header I got from you.  (I'm not ragging on you, I get Martin's
name wrong a significant portion of the time myself. :-( )
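For reference, a well-formed RFC 2047 encoded word (using a registered charset instead of X-UNKNOWN) decodes cleanly with the stdlib; this variant of the header above is constructed for the example:

```python
from email.header import decode_header

# A well-formed variant of the header quoted above:
raw = "=?iso-8859-1?q?Martin_v=2E_L=F6wis?="
parts = decode_header(raw)
# decode_header returns (bytes-or-str, charset) pairs:
text = "".join(
    b.decode(enc or "ascii") if isinstance(b, bytes) else b
    for b, enc in parts
)
print(text)   # -> Martin v. Löwis
```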


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Stephen J. Turnbull
Martin v. Löwis writes:

  In any case, I'm very skeptical that a general guess encoding
  module would do a meaningful thing when applied to incorrectly
  encoded HTML pages.

That depends on whether you can get meaningful information about the
language from the fact that you're looking at the page.  In the
browser context, for one, 99.44% of users are monolingual, so you only
have to distinguish among the encodings for their language.  In this
context a two stage process of determining a category of encoding (eg,
ISO 8859, ISO 2022 7-bit, ISO 2022 8-bit multibyte, UTF-8, etc), and
then picking an encoding from the category according to a
user-specified configuration has served Emacs/MULE users very well for
about 20 years.

It does *not* work in a context where multiple encodings from the same
category are in use (eg, the email folder of a Polish Gastarbeiter in
Berlin).

Nonetheless it is pretty useful for user agents like mail clients, web
browsers, and editors.
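A toy version of the two-stage idea: first classify the byte stream into a broad category, then map the category to a concrete encoding via a (hypothetical) user preference table, standing in for the Emacs/MULE configuration described above:

```python
def category(data: bytes) -> str:
    """Stage 1: coarse classification of the byte stream."""
    if b"\x1b" in data:
        return "iso-2022"            # escape sequences present
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit"               # some single-byte charset

# Stage 2: per-user preference, one encoding per category (assumed values).
USER_PREF = {"ascii": "ascii", "utf-8": "utf-8",
             "iso-2022": "iso-2022-jp", "8-bit": "iso-8859-1"}

def pick(data: bytes) -> str:
    return USER_PREF[category(data)]

print(pick("Grüße".encode("utf-8")))     # -> utf-8
print(pick("Grüße".encode("latin-1")))   # -> iso-8859-1
```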


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Stephen J. Turnbull
Guido van Rossum writes:

  To the contrary, an encoding-guessing module is often needed, and
  guessing can be done with a pretty high success rate. Other Unicode
  libraries (e.g. ICU) contain guessing modules. I suppose the API could
  return two values: the guessed encoding and a confidence indicator.
  Note that the locale settings might figure in the guess.

Not locale settings, but user configuration.  A Bayesian detector
(CodeBayes? hi, Skip!) might be a good way to go for servers, while a
simple language preference might really up the probability for user
agents.
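A sketch of the two-value API suggested above, returning (encoding, confidence). The confidence numbers are illustrative, not calibrated, and the fallback to windows-1252 is an assumption:

```python
import codecs

def guess_encoding(data: bytes):
    # A byte-order mark is as close to certainty as guessing gets:
    for bom, enc in ((codecs.BOM_UTF8, "utf-8"),
                     (codecs.BOM_UTF16_LE, "utf-16-le"),
                     (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            return enc, 1.0
    try:
        data.decode("ascii")
        return "ascii", 0.9
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8", 0.8      # valid multi-byte UTF-8 rarely occurs by chance
    except UnicodeDecodeError:
        return "windows-1252", 0.3   # weak last resort

print(guess_encoding(codecs.BOM_UTF8 + b"hi"))    # -> ('utf-8', 1.0)
print(guess_encoding("Grüße".encode("latin-1")))  # -> ('windows-1252', 0.3)
```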



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
 Yup, but DrProject (the target application) also serves as a relay and
 archive for email.  We have no control over the agent used for
 composition, and AFAIK there's no standard way to include encoding
 information.

That's not at all the case. MIME defines that in full detail, since
1993.

Regards,
Martin


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Martin v. Löwis
 I certainly agree that if the target set of documents is small enough it
 is possible to hand-code the encoding.  There are many applications,
 however, that need to examine the content of an arbitrary, or at least
 non-small set of web documents.  To name a few such applications:
 
  - web search engines
  - translation software

I'll question whether these are many programs. Web search engines
and translation software have many more challenges to master, and
they are fairly special-cased, so I would expect they need to find
their own answer to character set detection, anyway (see Bill Janssen's
answer on machine translation, also).

  - document/bookmark management systems
  - other kinds of document analysis (market research, seo, etc.)

Not sure what specifically you have in mind, however, I expect that
these also have their own challenges. For example, I would expect
that MS-Word documents are frequent. You don't need character set
detection there (Word is all Unicode), but you need an API to look
into the structure of .doc files.

 Not that I can substantiate.  Documents & feeds cover a lot of what is
 on the web--I was only trying to make the point that on the web,
 whenever an encoding can be specified, it will be specified incorrectly
 for a significant chunk of exemplars.

I firmly believe this assumption is false. If the encoding comes out of
software (which it often does), it will be correct most of the time.
It's incorrect only if the content editor has to type it.

 Indeed, if it is only one site it is pretty easy to work around.  My
 main use of python is processing and analyzing hundreds of millions of
 web documents, so it is pretty easy to see applications (which I have
 listed above).

Ok. What advantage would you (or somebody working on a similar project)
gain if chardet was part of the standard library? What if it was not
chardet, but some other algorithm?

 I can also think of good arguments for excluding encoding detection for
 maintenance reasons: is every case of the algorithm guessing wrong a bug
 that needs to be fixed in the stdlib?  That is an unbounded commitment.

Indeed, that's what I meant by my initial remark. People will expect
that it works correctly - with the consequence that they unknowingly
proceed with an incorrect result, and then complain when they
find out that it produced a wrong answer.

For chardet specifically, my usual standard-library remark applies:
it can't become part of the standard library unless the original
author contributes it, anyway. I would then hope that he or a group
of people would volunteer to maintain it, with the threat of removing
it from the stdlib again if these volunteers go away and too many
problems show up.

Regards,
Martin

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-22 Thread Terry Reedy

Martin v. Löwis [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
| I certainly agree that if the target set of documents is small enough it
|
| Ok. What advantage would you (or somebody working on a similar project)
| gain if chardet was part of the standard library? What if it was not
| chardet, but some other algorithm?

It seems to me that since there is not a 'correct' algorithm but only 
competing heuristics, encoding detection modules should be made available 
via PyPI and only be considered for stdlib after a best of breed emerges 
with community support. 





Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread skip

David Is there some sort of text encoding detection module in the
David standard library?  And, if not, is there any reason not to add
David one?

No, there's not.  I suspect the fact that you can't correctly determine the
encoding of a chunk of text 100% of the time militates against it.

Skip


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Christian Heimes
David Wolever schrieb:
 Is there some sort of text encoding detection module in the standard
 library?
 And, if not, is there any reason not to add one?

You cannot detect the encoding unless it's explicitly defined through a
header (e.g. the UTF BOM). It's technically impossible. The best you can
do is an educated guess.
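The one exact case is a byte-order mark at the start of the data; a minimal sketch of that check (the helper name `sniff_bom` is mine, not a stdlib API):

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None."""
    # Order matters: check UTF-32 before UTF-16, because the
    # UTF-32-LE BOM begins with the same bytes as the UTF-16-LE BOM.
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ):
        if data.startswith(bom):
            return encoding
    return None

print(sniff_bom("\ufefftext".encode("utf-8")))  # utf-8-sig
print(sniff_bom(b"plain ascii"))                # None
```

Everything beyond this check is, as the post says, an educated guess.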

Christian


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Jean-Paul Calderone
On Mon, 21 Apr 2008 17:50:43 +0100, Michael Foord [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
 David Is there some sort of text encoding detection module in the
 David standard library?  And, if not, is there any reason not to add
 David one?

 No, there's not.  I suspect the fact that you can't correctly determine the
 encoding of a chunk of text 100% of the time militates against it.


The only approach I know of is a heuristic-based approach, e.g.

http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml

(Which was 'borrowed' from docutils in the first place.)

This isn't the only approach, although you're right that in general you
have to rely on heuristics.  See the charset detection features of ICU:

  http://www.icu-project.org/userguide/charsetDetection.html

I think OSAF's pyicu exposes these APIs:

  http://pyicu.osafoundation.org/

Jean-Paul


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Michael Foord
[EMAIL PROTECTED] wrote:
 David Is there some sort of text encoding detection module in the
 David standard library?  And, if not, is there any reason not to add
 David one?

 No, there's not.  I suspect the fact that you can't correctly determine the
 encoding of a chunk of text 100% of the time militates against it.
   

The only approach I know of is a heuristic-based approach, e.g.

http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml

(Which was 'borrowed' from docutils in the first place.)
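The docutils-style heuristic is essentially a trial-decode cascade; a minimal sketch under that assumption (the function name and candidate list are illustrative, not the actual docutils code):

```python
def guess_decode(data: bytes, candidates=("ascii", "utf-8", "latin-1")):
    """Try each candidate encoding in turn; return (text, encoding).

    latin-1 maps every byte to a character, so with it last in the
    chain decoding can never fail -- which is also why a detector can
    only ever *fall back* to latin-1, never prove it.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the candidate list.
    raise ValueError("no candidate encoding matched")

print(guess_decode(b"caf\xc3\xa9"))  # ('café', 'utf-8')
print(guess_decode(b"caf\xe9"))      # ('café', 'latin-1')
```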

Michael Foord
 Skip



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread David Wolever
On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote:

 David Is there some sort of text encoding detection module in the
 David standard library?  And, if not, is there any reason not  
 to add
 David one?
 No, there's not.  I suspect the fact that you can't correctly  
 determine the
 encoding of a chunk of text 100% of the time militates against it.
Sorry, I wasn't very clear what I was asking.

I was thinking about making an educated guess -- just like chardet  
(http://chardet.feedparser.org/).

This is useful when you get a hunk of data which _should_ be some  
sort of intelligible text from the Big Scary Internet (say, a posted  
web form or email message), and you want to do something useful with  
it (say, search the content).


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Guido van Rossum
To the contrary, an encoding-guessing module is often needed, and
guessing can be done with a pretty high success rate. Other Unicode
libraries (e.g. ICU) contain guessing modules. I suppose the API could
return two values: the guessed encoding and a confidence indicator.
Note that the locale settings might figure in the guess.
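A two-value API of that shape could look like the sketch below. The function name, the thresholds, and the scoring are all hypothetical; real detectors such as ICU or chardet use byte-frequency statistics rather than this simple cascade:

```python
import locale

def guess_encoding(data: bytes):
    """Return (encoding, confidence) with confidence in [0.0, 1.0]."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig", 1.0   # a BOM is the only certain case
    try:
        data.decode("ascii")
        return "ascii", 0.9       # pure ASCII is a very safe bet
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8", 0.8       # valid multi-byte UTF-8 rarely occurs by accident
    except UnicodeDecodeError:
        pass
    # Last resort: the locale's preferred encoding, with low confidence.
    return locale.getpreferredencoding(False), 0.2

enc, conf = guess_encoding(b"caf\xc3\xa9")
print(enc, conf)  # utf-8 0.8
```

The caller can then decide what confidence level is acceptable for its application instead of trusting the guess blindly.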

On Mon, Apr 21, 2008 at 10:28 AM, Georg Brandl [EMAIL PROTECTED] wrote:
 Christian Heimes schrieb:

  David Wolever schrieb:
   Is there some sort of text encoding detection module in the standard
   library?
   And, if not, is there any reason not to add one?
  
   You cannot detect the encoding unless it's explicitly defined through a
   header (e.g. the UTF BOM). It's technically impossible. The best you can
   do is an educated guess.

  Exactly, and in light of that, I'm -1 for such a standard module.
  We've enough issues with modules implementing (apparently) fully
  specified standards. :)

  Georg







-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread skip

Michael The only approach I know of is a heuristic-based approach, e.g.

Michael http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml

Michael (Which was 'borrowed' from docutils in the first place.)

Yes, I implemented a heuristic approach for the Musi-Cal web server.  I was
able to rely on domain knowledge to guess correctly almost all the time.
The heuristic was that almost all form submissions came from the US and the
rest came from Western Europe.  Python could never embed such a
narrowly focused heuristic into its core distribution.

Skip



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread skip

Guido Note that the locale settings might figure in the guess.

Alas, locale settings in a web server have little or nothing to do with the
locale settings in the client submitting the form.

Skip


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Tony Nelson
At 1:14 PM -0400 4/21/08, David Wolever wrote:
On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote:

 David Is there some sort of text encoding detection module in the
 David standard library?  And, if not, is there any reason not
 to add
 David one?
 No, there's not.  I suspect the fact that you can't correctly
 determine the
 encoding of a chunk of text 100% of the time militates against it.
Sorry, I wasn't very clear what I was asking.

I was thinking about making an educated guess -- just like chardet
(http://chardet.feedparser.org/).

This is useful when you get a hunk of data which _should_ be some
sort of intelligible text from the Big Scary Internet (say, a posted
web form or email message), and you want to do something useful with
it (say, search the content).

Feedparser.org's chardet can't guess 'latin1', so it should be used as a
last resort, just as the docs say.
-- 

TonyN.:'   mailto:[EMAIL PROTECTED]
  '  http://www.georgeanelson.com/


Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread David Wolever

On 21-Apr-08, at 5:31 PM, Martin v. Löwis wrote:

This is useful when you get a hunk of data which _should_ be some
sort of intelligible text from the Big Scary Internet (say, a posted
web form or email message), and you want to do something useful with
it (say, search the content).

I don't think that should be part of the standard library. People
will take what it tells them as certain.
As Oleg mentioned, if the method is called something like  
'guess_encoding', I think we could live with clear consciences.


IMO, encoding estimation is something that many web programs will
have to deal with, so it might as well be built in; I would prefer
the option to run `text = input.decode('guess')` (or something similar)
to relying on an external dependency or, worse yet, using a
hand-rolled algorithm.



Re: [Python-Dev] Encoding detection in the standard library?

2008-04-21 Thread Martin v. Löwis
 IMO, encoding estimation is something that many web programs will have
 to deal with

Can you please explain why that is? Web programs should not normally
have the need to detect the encoding; instead, it should be specified
always - unless you are talking about browsers specifically, which
need to support web pages that specify the encoding incorrectly.

 so it might as well be built in; I would prefer the option
 to run `text = input.decode('guess')` (or something similar) to relying
 on an external dependency or, worse yet, using a hand-rolled algorithm.

Ok, let me try differently then. Please feel free to post a patch to
bugs.python.org, and let other people rip it apart.

For example, I don't think it should be a codec, as I can't imagine it
working on streams.

Regards,
Martin