
Re: [Python-Dev] Fixing the XML batteries

2012-02-07 Thread Eli Bendersky

 On one hand I agree that ET should be emphasized since it's the better
 API with a much faster implementation. But I also understand Martin's
 point of view that minidom has its place, so IMHO some sort of
 compromise should be reached. Perhaps we can recommend using ET for
 those not specifically interested in the DOM interface, but for those
 who *are*, minidom is still a good stdlib option (?).


 If you can, go ahead and write a patch saying something like that. It should
 not be hard to come up with something that is a definite improvement. Create
 a tracker issue for comment, but don't let it sit forever.



A tracker issue already exists for this -
http://bugs.python.org/issue11379 - I see no reason to open a new one.
I will add my opinion there - feel free to do that too.

 Since the current policy seems to be to hide C behind Python when there is
 both, I assume that finishing the transition here is something just not
 gotten around to yet. Open another issue if there is not one.


I will open a separate discussion on this.

Eli
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Fixing the XML batteries

2012-02-06 Thread Eli Bendersky

On Fri, Dec 9, 2011 at 10:02, Stefan Behnel stefan...@behnel.de wrote:
 Hi everyone,

 I think Py3.3 would be a good milestone for cleaning up the stdlib support
 for XML. Note upfront: you may or may not know me as the maintainer of lxml,
 the de-facto non-stdlib standard Python XML tool. This (lengthy) post was
 triggered by the following kind of conversation that I keep having with new
 XML users in Python (mostly on c.l.py), which hints at some serious flaw in
 the stdlib.

snip

AFAIU nothing really happened with this. The discussion started with a
lot of +1s but then got derailed. The related Issue 11379 also got
stuck nearly two months ago. It would be great if some sort of
consensus could be reached here, since this is an important issue :-)

Eli

Re: [Python-Dev] Fixing the XML batteries

2012-02-06 Thread Calvin Spealman

On Dec 9, 2011 3:04 AM, Stefan Behnel stefan...@behnel.de wrote:

 Hi everyone,

 I think Py3.3 would be a good milestone for cleaning up the stdlib
support for XML. Note upfront: you may or may not know me as the maintainer
of lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy)
post was triggered by the following kind of conversation that I keep having
with new XML users in Python (mostly on c.l.py), which hints at some
serious flaw in the stdlib.

 User: I'm trying to do XML stuff XYZ in Python and have problem ABC.
 Me: What library are you using? Could you show us some code?
 User: My code looks like this snippet: ...
 Me: You are using minidom which is known to be hard to use, slow and uses
lots of memory. Use the xml.etree.ElementTree package instead, or rather
its C implementation cElementTree, also in the stdlib.
 User (coming back after a while): thanks, that was exactly what [I didn't
know] I was looking for.

 What does this tell us?

 1) MiniDOM is what new users find first. It's highly visible because
there are still lots of ancient Python and XML web pages out there that
date back to the time before Python 2.5 (or rather something like 2.2),
when it was the only XML tree library in the stdlib. It's also the first
hit from the top when you search for XML on the stdlib docs page and
contains the (to some people) familiar word "DOM", which lets users stop
their search and start writing code, not expecting to find a separate
alternative in the same stdlib, way further down. And the description as
"mini", "simple" and "lightweight" suggests to users that it's going to be
easy to use and efficient.

 2) MiniDOM is not what users want. It leads to complicated, unpythonic
code and lots of problems. It is neither easy to use, nor efficient, nor
lightweight, simple or mini, not in absolute numbers (see
http://bugs.python.org/issue11379#msg148584 and following for a recent
discussion). It's also badly maintained in the sense that its performance
characteristics could likely be improved, but no-one is seriously
interested in doing that, because it would not lead to something that
actually *is* fast or memory friendly compared to any of the 'real'
alternatives that are available right now.

 3) ElementTree is what users should use, MiniDOM is not. ElementTree was
added to the stdlib in Py2.5 by popular demand, exactly because it is very
easy to use, very fast, and very memory friendly. And because users did not
want to use MiniDOM any more. Today, ElementTree has a fairly straightforward
upgrade path towards lxml.etree if more XML features like validation or
XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.

 4) In the stdlib, cElementTree is independent of ElementTree, but totally
hidden in the documentation. In conversations like the above, it's
unnecessarily complex to explain to users that there is ElementTree (which
is documented in the stdlib), but that what they want to use is really
cElementTree, which has the same API but does not have a stdlib
documentation page that I can send them to. Note that the other Python
implementations simply provide cElementTree as an alias for ElementTree.
That leaves CPython as the only Python implementation that really has these
two separate modules.
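To make the usability claim in points 2) and 3) concrete, here is a minimal sketch that pulls the same data out of a small document with both libraries. The sample document and the function names are made up for illustration; only the stdlib calls themselves are real.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample document, used only for this comparison.
DOC = (
    "<library>"
    "<book id='1'><title>Dive In</title></book>"
    "<book id='2'><title>Fluent</title></book>"
    "</library>"
)

def titles_minidom(text):
    """Collect (id, title) pairs using the W3C DOM style API."""
    dom = minidom.parseString(text)
    pairs = []
    for book in dom.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Text lives in a separate child node, not on the element itself.
        pairs.append((book.getAttribute("id"), title.firstChild.data))
    return pairs

def titles_etree(text):
    """Collect the same pairs using ElementTree."""
    root = ET.fromstring(text)
    return [(book.get("id"), book.findtext("title")) for book in root.iter("book")]
```

Both functions return the same list; the difference is only in how much tree navigation the caller has to spell out.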

 So, there are many problems here. And I think they make it unnecessarily
complicated for users to process XML in Python and that the current
situation helps in turning away new users from Python as a language for XML
processing. Python does have impressively great tools for working with XML.
It's just that the stdlib and its documentation do not reflect or even
appreciate that.

 What should change?

 a) The stdlib documentation should help users to choose the right tool
right from the start. Instead of using the totally misleading wording that
it uses now, it should be honest about the performance characteristics of
MiniDOM and should actively suggest that those who don't know what to
choose (or even *that* they can choose) should not use MiniDOM in the first
place. I created a ticket (issue11379) for a minor step in this direction,
but given the responses, I'm rather convinced that there's a lot more that
can be done and should be done, and that it should be done now, right for
the next release.

 b) cElementTree should finally lose its special status as a separate
library and disappear as an accelerator module behind ElementTree. This has
been suggested a couple of times already, and AFAIR, there was some
opposition because 1) ET was maintained outside of the stdlib and 2) the
APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2
was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent
Xicluna (who is doing a good job with it), and ET 1.3 has basically made
the APIs of both implementations compatible again. So, 3.3 would be the
right milestone for fixing the "two libs for one" quirk.

 Given that this is the third time during the last couple

Re: [Python-Dev] Fixing the XML batteries

2012-02-06 Thread Eli Bendersky

 What should change?

 a) The stdlib documentation should help users to choose the right tool right
 from the start. Instead of using the totally misleading wording that it uses
 now, it should be honest about the performance characteristics of MiniDOM
 and should actively suggest that those who don't know what to choose (or
 even *that* they can choose) should not use MiniDOM in the first place. I
 created a ticket (issue11379) for a minor step in this direction, but given
 the responses, I'm rather convinced that there's a lot more that can be done
 and should be done, and that it should be done now, right for the next
 release.

On one hand I agree that ET should be emphasized since it's the better
API with a much faster implementation. But I also understand Martin's
point of view that minidom has its place, so IMHO some sort of
compromise should be reached. Perhaps we can recommend using ET for
those not specifically interested in the DOM interface, but for those
who *are*, minidom is still a good stdlib option (?).

Tying this doc clarification with an optimization in minidom is not
something that makes sense. This is just delaying a much needed change
forever.


 b) cElementTree should finally lose its special status as a separate
 library and disappear as an accelerator module behind ElementTree. This has
 been suggested a couple of times already, and AFAIR, there was some
 opposition because 1) ET was maintained outside of the stdlib and 2) the
 APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2
 was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent
 Xicluna (who is doing a good job with it), and ET 1.3 has basically made the
 APIs of both implementations compatible again. So, 3.3 would be the right
 milestone for fixing the "two libs for one" quirk.

This, at least in my view, is the more important point which
unfortunately got much less attention in the thread. I was a bit
shocked to see that in 3.3 trunk we still have both the Python and C
versions exposed and only formally document ElementTree (the Python
version). The only reference to cElementTree is an un-emphasized note:

  A C implementation of this API is available as xml.etree.cElementTree.

Is there anything that *really* blocks providing cElementTree on
import ElementTree and removing the explicit cElementTree for 3.3
(or at least leaving it with a deprecation warning)?
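As long as the two modules stay separate, user code has to pick the accelerated one itself. A common idiom is sketched below, assuming that either module may be missing on a given Python implementation or version; the helper function is hypothetical and only there to show that both modules expose the same API.

```python
# Prefer the C implementation where it exists as a separate module,
# and fall back to the documented pure-Python module otherwise.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

def count_children(text):
    """Parse an XML string and return the number of direct children of the root."""
    return len(ET.fromstring(text))
```

If ElementTree picked up the accelerator on its own, this try/except would collapse into a single plain import.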

Eli

Re: [Python-Dev] Fixing the XML batteries

2012-02-06 Thread Terry Reedy


On 2/6/2012 8:01 AM, Eli Bendersky wrote:


On one hand I agree that ET should be emphasized since it's the better
API with a much faster implementation. But I also understand Martin's
point of view that minidom has its place, so IMHO some sort of
compromise should be reached. Perhaps we can recommend using ET for
those not specifically interested in the DOM interface, but for those
who *are*, minidom is still a good stdlib option (?).


If you can, go ahead and write a patch saying something like that. It 
should not be hard to come up with something that is a definite 
improvement. Create a tracker issue for comment, but don't let it sit 
forever.



Tying this doc clarification with an optimization in minidom is not
something that makes sense. This is just delaying a much needed change
forever.


Right.


This, at least in my view, is the more important point which
unfortunately got much less attention in the thread. I was a bit
shocked to see that in 3.3 trunk we still have both the Python and C
versions exposed and only formally document ElementTree (the Python
version). The only reference to cElementTree is an un-emphasized note:

   A C implementation of this API is available as xml.etree.cElementTree.


Since the current policy seems to be to hide C behind Python when there 
is both, I assume that finishing the transition here is something just 
not gotten around to yet. Open another issue if there is not one.



Is there anything that *really* blocks providing cElementTree on
import ElementTree and removing the explicit cElementTree for 3.3
(or at least leaving it with a deprecation warning)?


If cElementTree were renamed _ElementTree for import from ElementTree, 
then a new cElementTree.py could raise the warning and then import 
_ElementTree also.
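A rough sketch of what such a compatibility module could look like follows. The warning text is an assumption, and the sketch re-exports from xml.etree.ElementTree as a stand-in, because the _ElementTree module named above is only a proposal. The shim source is kept in a string and executed into a fresh module object purely so that the example is self-contained; in practice it would simply be the content of cElementTree.py.

```python
import types

# Hypothetical content of a deprecated cElementTree.py shim.
SHIM_SOURCE = '''
import warnings

warnings.warn(
    "cElementTree is deprecated; import xml.etree.ElementTree instead",
    DeprecationWarning,
    stacklevel=2,
)

# Stand-in for the proposed accelerator import.
from xml.etree.ElementTree import *
'''

def load_shim(name="cElementTree_shim"):
    """Execute the shim source in a new module object and return it."""
    module = types.ModuleType(name)
    exec(SHIM_SOURCE, module.__dict__)
    return module
```

Loading the shim emits one DeprecationWarning and still exposes the full public ElementTree API, so existing imports keep working while users are nudged towards the documented module.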


--
Terry Jan Reedy


Re: [Python-Dev] Fixing the XML batteries

2011-12-16 Thread Nick Coghlan

On Fri, Dec 16, 2011 at 4:53 PM, Stefan Behnel stefan...@behnel.de wrote:
 If these changes are considered acceptable, I'll copy the above over to the
 documentation bug I opened at

 http://bugs.python.org/issue11379

 Can these doc changes go into both 2.7 and 3.3? Given that there is no
 important difference between the implementations, I don't see why the
 documentation should differ in Py2.

Your suggested tweaks look good to me and could go into all of 2.7, 3.2 and 3.3

 b) cElementTree should finally lose its special status as a separate
 library and disappear as an accelerator module behind ElementTree.

 There was no opposition and a general agreement on this in the thread,
 except for the warning that Fredrik Lundh should have a word in this. I
 wrote him an e-mail and didn't get a response so far. We can wait a little
 longer, I guess, there's still time before 3.3beta.

Having ElementTree implicitly do from _elementtree import * is a 3.3
only change, though. (Note that xml.etree.cElementTree isn't the
actual acceleration module - that honor already goes to
_elementtree. The only bit missing is the automatic import in
xml.etree.ElementTree and the appropriate test updates to ensure the
Python version still gets tested)
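The automatic import described here can be pictured as overlaying the accelerator's public names on top of the pure-Python definitions, while keeping the pure-Python ones when the accelerator is missing. The helper below is a hypothetical illustration of that pattern and not the code in xml.etree.ElementTree; it takes the accelerator name as a parameter so that both branches can be exercised.

```python
import importlib

def overlay_accelerator(pure_names, accel_module_name):
    """Return (names, accelerated).

    names starts as a copy of the pure-Python namespace; if the accelerator
    module can be imported, its public attributes replace the same-named
    pure-Python ones. accelerated tells the caller which branch was taken,
    which is what a test suite needs in order to cover both versions.
    """
    names = dict(pure_names)
    try:
        accel = importlib.import_module(accel_module_name)
    except ImportError:
        return names, False
    for attr in dir(accel):
        if not attr.startswith("_"):
            names[attr] = getattr(accel, attr)
    return names, True
```

Inside a module, the same effect is usually spelled as a try/except ImportError around a star import from the accelerator, placed after the pure-Python definitions.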

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Fixing the XML batteries

2011-12-16 Thread Stefan Behnel


Stefan Behnel, 14.12.2011 20:41:

It's clear from the
discussion that there are still users and that new code is still being
written that uses MiniDOM. However, I would argue that this cannot possibly
be performance critical code and that it only deals with somewhat small
documents. I say that because MiniDOM is evidently not suitable for large
documents or performance critical applications, so this is the only
explanation I have why the performance problems would not be obvious in the
cases where it is still being used. And if they do show, it appears to be
much more likely that users rewrite their code using ElementTree or lxml
than that they try to fix MiniDOM's performance issues.


Out of curiosity, I reran my benchmarks under PyPy 1.7.

http://blog.behnel.de/index.php?p=210

In short: MiniDOM performs substantially better there, both in terms of 
time and space. That by itself doesn't make PyPy an interesting platform 
for XML processing (using lxml in CPython is way faster), but I found it 
interesting to note that the problem is not strictly inherent in MiniDOM. 
It also depends a lot on the runtime environment, even when it comes to 
memory usage.
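For readers who want to repeat this kind of comparison on their own runtime, a rough measurement sketch follows. It is not the benchmark from the blog post: the generated document, its size and the function names are assumptions, and tracemalloc only sees allocations made through Python's allocators, so the memory figures are indicative at best.

```python
import time
import tracemalloc
import xml.etree.ElementTree as ET
from xml.dom import minidom

def make_document(n_items=2000):
    """Build a synthetic XML string with n_items child elements."""
    items = "".join("<item id='%d'>value %d</item>" % (i, i) for i in range(n_items))
    return "<root>%s</root>" % items

def measure(parse, text):
    """Return (seconds, peak_bytes) for one call of parse(text)."""
    tracemalloc.start()
    started = time.perf_counter()
    tree = parse(text)  # keep the tree alive until the peak has been read
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del tree
    return elapsed, peak

def compare(n_items=2000):
    """Measure both parsers on the same synthetic document."""
    text = make_document(n_items)
    return {
        "minidom": measure(minidom.parseString, text),
        "ElementTree": measure(ET.fromstring, text),
    }
```

Running compare() under different interpreters gives a feel for how much of the gap is due to the runtime rather than the library.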


Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-16 Thread Baptiste Carvello

Le 16/12/2011 07:53, Stefan Behnel a écrit :

 Additionally, the documentation on the xml.sax page would benefit from
 the following paragraph:
 
 
 [[Note: The xml.sax package provides an implementation of the SAX
 interface whose API is similar to that in other programming languages.
 Users who are unfamiliar with the SAX interface or who would like to
 write less code for efficient stream processing of XML files should
 consider using the iterparse() function in the xml.etree.ElementTree
 module instead.]]
 
 

A small caveat to note about iterparse(), which I otherwise like a lot:
when processing very big data (I encountered this with a region-wide
openstreetmap XML dump), you have to remove the processed nodes from the
root element. Otherwise, its memory footprint increases with the size of
the document.
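A minimal sketch of that clean-up, assuming the interesting elements are direct children of the root and using a made-up tag name, could look like this:

```python
import io
import xml.etree.ElementTree as ET

def iter_item_texts(source, tag="item"):
    """Yield the text of each element named tag while keeping memory flat."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.text
            # Drop the children processed so far, so the tree hanging off
            # the root does not grow with the size of the document.
            root.clear()
```

For example, feeding it io.BytesIO(b"<root><item>1</item><item>2</item></root>") yields "1" and then "2", with the root holding no processed children afterwards.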


Re: [Python-Dev] Fixing the XML batteries

2011-12-15 Thread Stefan Behnel


Stefan Behnel, 09.12.2011 09:02:

I think Py3.3 would be a good milestone for cleaning up the stdlib support
for XML.
[...]


I still think it is, so let me sum up the current discussion here.



What should change?

a) The stdlib documentation should help users to choose the right tool
right from the start.


It looks like there's agreement on this part.



Instead of using the totally misleading wording that
it uses now, it should be honest about the performance characteristics of
MiniDOM and should actively suggest that those who don't know what to
choose (or even *that* they can choose) should not use MiniDOM in the first
place.


There was some disagreement on whether MiniDOM should publicly disclose its 
performance characteristics in the documentation, and whether its use 
should be discouraged, even just for new users.


However, it seemed that there was enough consensus to settle on Nick 
Coghlan's proposal for a compromise to move ElementTree up to the top of 
the list, and to add a visible note to the top of each of the XML modules 
like this:


Note: The
whatever module is a yada, yada, DOM based, whatever. If all you
are trying to do is read and write XML files, consider using the
xml.etree.ElementTree module instead

That template could (with a bit of peeking into the getopt documentation) 
be expanded into the following.



[[Note: The xml.dom.minidom module provides an implementation of the 
W3C-DOM whose API is similar to that in other programming languages. Users 
who are unfamiliar with the W3C-DOM interface or who would like to write 
less code for processing XML files should consider using the 
xml.etree.ElementTree module instead.]]



I think this should go on the xml.dom.minidom page as well as the xml.dom 
package page. Hand-wavingly, users who are new to the DOM are more likely 
to hit the package page first, whereas those who know it already will 
likely find the MiniDOM page directly.


Note that I'd still encourage the removal of the misleading word 
"lightweight" until it makes sense to put it back in a meaningful way. I 
therefore propose the following minimalistic changes to the first paragraph 
on the minidom page:



xml.dom.minidom is a [-XXX: light-weight] implementation of the Document 
Object Model interface. It is intended to be simpler than the full DOM and 
also [+XXX: provide a] significantly smaller [+XXX: API].



@Martin: note how the original paragraph does not refer to 4DOM or 
PyXML. It only generically mentions the DOM interface. It is certainly 
not true that MiniDOM is more light-weight and significantly smaller 
than (most) other DOM interface implementations outside of the Python 
world, for example. So the current wording actually makes no sense at all.


Additionally, the documentation on the xml.sax page would benefit from the 
following paragraph:



[[Note: The xml.sax package provides an implementation of the SAX interface 
whose API is similar to that in other programming languages. Users who are 
unfamiliar with the SAX interface or who would like to write less code for 
efficient stream processing of XML files should consider using the 
iterparse() function in the xml.etree.ElementTree module instead.]]
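To illustrate the "less code" part of the proposed note, here is a small side-by-side sketch that collects the text of all title elements from a byte string. The tag name and both function names are made up for the example.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

class TitleHandler(xml.sax.ContentHandler):
    """SAX handler that buffers character data inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None

def titles_sax(data):
    handler = TitleHandler()
    xml.sax.parseString(data, handler)
    return handler.titles

def titles_iterparse(data):
    # By default iterparse() reports only "end" events, when elem.text is complete.
    return [elem.text for _, elem in ET.iterparse(io.BytesIO(data)) if elem.tag == "title"]
```

Both functions stream through the input; the iterparse() version simply lets the library do the state tracking that the SAX handler has to do by hand.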



If these changes are considered acceptable, I'll copy the above over to the 
documentation bug I opened at


http://bugs.python.org/issue11379

Can these doc changes go into both 2.7 and 3.3? Given that there is no 
important difference between the implementations, I don't see why the 
documentation should differ in Py2.




b) cElementTree should finally lose its special status as a separate
library and disappear as an accelerator module behind ElementTree.


There was no opposition and a general agreement on this in the thread, 
except for the warning that Fredrik Lundh should have a word in this. I 
wrote him an e-mail and didn't get a response so far. We can wait a little 
longer, I guess, there's still time before 3.3beta.


Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-14 Thread Martin v. Löwis

Am 12.12.2011 10:04, schrieb Stefan Behnel:
 Martin v. Löwis, 11.12.2011 23:39:
 I can't recall anyone working on any substantial improvements during the
 last six years or so, and the reason for that seems obvious to me.

 What do you think is the reason? It's not at all obvious to me.
 
 Just to repeat myself for the third time here: lack of interest.

Ah, that's certainly wrong. I am interested in these libraries.

Regards,
Martin

Re: [Python-Dev] Fixing the XML batteries

2011-12-14 Thread Martin v. Löwis

 Just look through the xml-sig page, basically all requests regarding
 PyXML during the last five years deal with problems in installing it,
 i.e. *before* even starting to use it. So you can't use this to claim
 that people really *are* still using it.

I'm not so sure. In many of these cases, it turned out that they were
trying to run some software that uses PyXML, and that they tried
installing PyXML to satisfy the prerequisite. So while they may not
be software developers, they are indirectly users of PyXML.

Regards,
Martin

Re: [Python-Dev] Fixing the XML batteries

2011-12-14 Thread Stefan Behnel


Martin v. Löwis, 14.12.2011 19:14:

Am 12.12.2011 10:04, schrieb Stefan Behnel:

Martin v. Löwis, 11.12.2011 23:39:

I can't recall anyone working on any substantial improvements during the
last six years or so, and the reason for that seems obvious to me.


What do you think is the reason? It's not at all obvious to me.


Just to repeat myself for the third time here: lack of interest.


Ah, that's certainly wrong. I am interested in these libraries.


I meant: lack of interest in improving them. It's clear from the 
discussion that there are still users and that new code is still being 
written that uses MiniDOM. However, I would argue that this cannot possibly 
be performance critical code and that it only deals with somewhat small 
documents. I say that because MiniDOM is evidently not suitable for large 
documents or performance critical applications, so this is the only 
explanation I have why the performance problems would not be obvious in the 
cases where it is still being used. And if they do show, it appears to be 
much more likely that users rewrite their code using ElementTree or lxml 
than that they try to fix MiniDOM's performance issues.


Now, read my first quote above again (and preferably also its context, 
which I already emphasized in a previous post), it should be clearer now.


Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-14 Thread Xavier Morel

On 2011-12-14, at 20:41 , Stefan Behnel wrote:
 I meant: lack of interest in improving them. It's clear from the discussion 
 that there are still users and that new code is still being written that uses 
 MiniDOM. However, I would argue that this cannot possibly be performance 
 critical code and that it only deals with somewhat small documents. I say 
 that because MiniDOM is evidently not suitable for large documents or 
 performance critical applications, so this is the only explanation I have why 
 the performance problems would not be obvious in the cases where it is still 
 being used. And if they do show, it appears to be much more likely that users 
 rewrite their code using ElementTree or lxml than that they try to fix 
 MiniDOM's performance issues.
Could also be because "XML is slow (and sucks)" is part of the global 
consciousness at this point, so the fact that minidom is slow and verbose 
doesn't surprise anyone much.

Re: [Python-Dev] Fixing the XML batteries

2011-12-14 Thread Stefan Behnel


Xavier Morel, 14.12.2011 20:54:

On 2011-12-14, at 20:41 , Stefan Behnel wrote:

I meant: lack of interest in improving them. It's clear from the
discussion that there are still users and that new code is still being
written that uses MiniDOM. However, I would argue that this cannot
possibly be performance critical code and that it only deals with
somewhat small documents. I say that because MiniDOM is evidently not
suitable for large documents or performance critical applications, so
this is the only explanation I have why the performance problems would
not be obvious in the cases where it is still being used. And if they
do show, it appears to be much more likely that users rewrite their
code using ElementTree or lxml than that they try to fix MiniDOM's
performance issues.


Could also be because "XML is slow (and sucks)" is part of the global
consciousness at this point, so the fact that minidom is slow and verbose
doesn't surprise anyone much.


Possibly, yes. Or that "Python is slow and sucks". But I think there are 
good counter arguments against both.


Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-14 Thread Martin v. Löwis

Am 14.12.2011 20:41, schrieb Stefan Behnel:
 Martin v. Löwis, 14.12.2011 19:14:
 Am 12.12.2011 10:04, schrieb Stefan Behnel:
 Martin v. Löwis, 11.12.2011 23:39:
 I can't recall anyone working on any substantial improvements
 during the
 last six years or so, and the reason for that seems obvious to me.

 What do you think is the reason? It's not at all obvious to me.

 Just to repeat myself for the third time here: lack of interest.

 Ah, that's certainly wrong. I am interested in these libraries.
 
 I meant: lack of interest in improving them.

That's also what I meant. I'm interested in improving them.

 Now, read my first quote above again (and preferably also its context,
 which I already emphasized in a previous post), it should be clearer now.

I (now) know what you mean - but you are incorrect.

Regards,
Martin

Re: [Python-Dev] Fixing the XML batteries

2011-12-14 Thread Stefan Behnel


Martin v. Löwis, 14.12.2011 22:20:

Am 14.12.2011 20:41, schrieb Stefan Behnel:

Martin v. Löwis, 14.12.2011 19:14:

Am 12.12.2011 10:04, schrieb Stefan Behnel:

Martin v. Löwis, 11.12.2011 23:39:

I can't recall anyone working on any substantial improvements
during the
last six years or so, and the reason for that seems obvious to me.


What do you think is the reason? It's not at all obvious to me.


Just to repeat myself for the third time here: lack of interest.


Ah, that's certainly wrong. I am interested in these libraries.


I meant: lack of interest in improving them.


That's also what I meant. I'm interested in improving them.


Then please do. I posted the numbers, so you know what the baseline is, 
both in terms of speed and memory usage. If you need further benchmarks of 
other areas of the API (e.g. tag search or whatever), just ask.


Note, however, that even an improvement by an order of magnitude wouldn't 
solve the API issue for new users, so I'd still suggest adding an 
appropriate link towards ET to the MiniDOM documentation.


Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-12 Thread Stefan Behnel


Martin v. Löwis, 11.12.2011 23:39:

I can't recall anyone working on any substantial improvements during the
last six years or so, and the reason for that seems obvious to me.


What do you think is the reason? It's not at all obvious to me.


Just to repeat myself for the third time here: lack of interest.

Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-12 Thread Stefan Behnel


Martin v. Löwis, 11.12.2011 23:03:

Am 09.12.2011 10:09, schrieb Xavier Morel:

On 2011-12-09, at 09:41 , Martin v. Löwis wrote:

a) The stdlib documentation should help users to choose the right
tool right from the start. Instead of using the totally
misleading wording that it uses now, it should be honest about
the performance characteristics of MiniDOM and should actively
suggest that those who don't know what to choose (or even *that*
they can choose) should not use MiniDOM in the first place.



[...]


Minidom is inferior in interface flow and pythonicity, in terseness,
in speed, in memory consumption (even more so using cElementTree, and
that's not something which can be fixed unless minidom gets a C
accelerator), etc… Even after fixing minidom (if anybody has the time
and drive to commit to it), ET/cET should be preferred over it.


I don't mind pointing people to ElementTree, even though I disagree
that the ET interface is superior to DOM.


Yes, that's clearly a point where we agree to disagree, and I understand 
that you are as biased towards minidom as I am biased towards ElementTree.


However, I think I made it clear that the implementation of cElementTree 
(and lxml.etree as well, for that matter) is far superior to MiniDOM 
in terms of performance, for any sensible meaning of the word performance.


And I'm also convinced that the API is far superior in terms of 
usability. ET certainly matches Python as a language much better than 
MiniDOM. But that's just my personal opinion.




It's Stefan's reasoning
as to *why* people should be pointed to ET, and what words should be
used to do that. IOW, I detest bashing some part of the standard
library, just to urge users to use some other part of the standard library.


I'm all for finding a good way of putting it into words, as long as it 
keeps uninformed users from making the wrong decision and getting the wrong 
idea of how complicated and slow Python is.




People are still using PyXML, despite its not being maintained anymore.


My experience with that is that it's only *new* users that are still 
running into PyXML by accident, because they didn't see that it's a dead 
project and they find it through ancient web pages that tell them that they 
need it because it's "the way to do XML in Python" and "if minidom is not 
enough, use PyXML". Maybe we should misuse the stdlib documentation to 
clear that up as well. PyXML is just too attractive a name for a dead 
project.


Just look through the xml-sig page, basically all requests regarding PyXML 
during the last five years deal with problems in installing it, i.e. 
*before* even starting to use it. So you can't use this to claim that 
people really *are* still using it.




Telling them to replace 4DOM with minidom is much more appropriate


Do you actually have any evidence that anyone is still actively using 4DOM?



than telling them to rewrite in ET.


I usually encourage people to rewrite minidom code for ET. It makes the 
code simpler, more readable, more maintainable and much faster.
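
To make the readability comparison concrete, here is a minimal sketch that extracts the same data once with minidom and once with ElementTree. The sample document and the function names are invented for illustration and do not come from the thread; only standard library calls are used.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this comparison.
DOC = (
    "<library>"
    "<book id='1'><title>Dive In</title></book>"
    "<book id='2'><title>Stay Dry</title></book>"
    "</library>"
)

def titles_minidom(text):
    """Collect (id, title) pairs through the DOM interface."""
    dom = xml.dom.minidom.parseString(text)
    result = []
    for book in dom.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Text lives in child text nodes, so it is gathered by hand.
        value = "".join(
            node.data for node in title.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        result.append((book.getAttribute("id"), value))
    return result

def titles_etree(text):
    """Collect the same pairs through the ElementTree interface."""
    root = ET.fromstring(text)
    return [(book.get("id"), book.findtext("title"))
            for book in root.iter("book")]
```

Both functions return the same list; whether the second one reads better is exactly the matter of taste being argued here.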


Stefan

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Fixing the XML batteries

2011-12-12 Thread Stefan Behnel


Stefan Behnel, 12.12.2011 10:59:

Just look through the xml-sig page


Hmm, I meant xml-sig mailing list archive here ...

Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-11 Thread Martin v. Löwis

Am 09.12.2011 10:09, schrieb Xavier Morel:
 On 2011-12-09, at 09:41 , Martin v. Löwis wrote:
 a) The stdlib documentation should help users to choose the right
 tool right from the start. Instead of using the totally
 misleading wording that it uses now, it should be honest about
 the performance characteristics of MiniDOM and should actively
 suggest that those who don't know what to choose (or even *that*
 they can choose) should not use MiniDOM in the first place.
 
[...]
 
 Minidom is inferior in interface flow and pythonicity, in terseness,
 in speed, in memory consumption (even more so using cElementTree, and
 that's not something which can be fixed unless minidom gets a C
 accelerator), etc… Even after fixing minidom (if anybody has the time
 and drive to commit to it), ET/cET should be preferred over it.

I don't mind pointing people to ElementTree, despite that I disagree
whether the ET interface is superior to DOM. It's Stefan's reasoning
as to *why* people should be pointed to ET, and what words should be
used to do that. IOW, I detest bashing some part of the standard
library, just to urge users to use some other part of the standard library.

People are still using PyXML, despite it not being maintained anymore.
Telling them to replace 4DOM with minidom is much more appropriate than
telling them to rewrite in ET.

Regards,
Martin

Re: [Python-Dev] Fixing the XML batteries

2011-12-11 Thread Martin v. Löwis

Am 09.12.2011 16:09, schrieb Dirkjan Ochtman:
 On Fri, Dec 9, 2011 at 09:02, Stefan Behnel stefan...@behnel.de wrote:
 a) The stdlib documentation should help users to choose the right tool right
 from the start.
 b) cElementTree should finally lose its special status as a separate
 library and disappear as an accelerator module behind ElementTree.
 
 An at least somewhat informed +1 from me. The ElementTree API is a
 very good way to deal with XML from Python, and it deserves to be
 promoted over the included alternatives.
 
 Let's deprecate the NiCad batteries and try to guide users toward the
 Li-Ion ones.

If you are proposing to deprecate minidom: -1

Regards,
Martin

Re: [Python-Dev] Fixing the XML batteries

2011-12-11 Thread Martin v. Löwis

 I can't recall anyone working on any substantial improvements during the
 last six years or so, and the reason for that seems obvious to me.

What do you think is the reason? It's not at all obvious to me.

Regards,
Martin

Re: [Python-Dev] Fixing the XML batteries

2011-12-11 Thread Xavier Morel

On 2011-12-11, at 23:03 , Martin v. Löwis wrote:
 People are still using PyXML, despite it not being maintained anymore.
 Telling them to replace 4DOM with minidom is much more appropriate than
 telling them to rewrite in ET.

From my understanding, Stefan's suggestion is mostly aimed at new
python users trying to manipulate XML and not knowing what to use
(yet). It's not about telling people to rewrite existing codebase
(it's a good idea as well when possible, as far as I'm concerned, but
it's a different issue).

Re: [Python-Dev] Fixing the XML batteries

2011-12-11 Thread Ethan Furman


Martin,

You seem heavily invested in minidom.

In the near future I will need to parse and rewrite parts of an xml file 
created by a third-party program (PrintShopMail, for the curious).

It contains both binary and textual data.

Would you recommend minidom for this purpose?  What other purposes would 
you recommend minidom for?
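
For readers with a similar task, here is a minimal sketch of a parse, modify and rewrite round trip with ElementTree. Nothing is known here about the actual PrintShopMail format, so the element names are hypothetical, and the sketch assumes that binary data is stored as base64 text, which is a common convention but not something the thread confirms.

```python
import base64
import io
import xml.etree.ElementTree as ET

# Hypothetical input; "aGVsbG8=" is the base64 encoding of b"hello".
SAMPLE = b"<job><name>old</name><blob>aGVsbG8=</blob></job>"

def rename_job(xml_bytes, new_name):
    """Parse the document, change one text field, and serialize it again."""
    tree = ET.parse(io.BytesIO(xml_bytes))
    tree.getroot().find("name").text = new_name
    out = io.BytesIO()
    # Write bytes back out, including an XML declaration.
    tree.write(out, encoding="utf-8", xml_declaration=True)
    return out.getvalue()

updated = rename_job(SAMPLE, "new")
# The untouched (assumed base64) payload survives the round trip.
payload = base64.b64decode(ET.fromstring(updated).findtext("blob"))
```

Note that ElementTree does not preserve everything byte for byte (attribute quoting or a DOCTYPE, for example), so a real file should be checked after rewriting.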


xml-confused-ly yours,

~Ethan~

(Comments by others are, of course, also welcome. :)

Re: [Python-Dev] Fixing the XML batteries

2011-12-10 Thread Tim Wintle

On Fri, 2011-12-09 at 19:39 +0100, Xavier Morel wrote:
 On 2011-12-09, at 19:15 , Bill Janssen wrote:
  I use ElementTree for parsing valid XML, but minidom for producing it.
 Could you expand on your reasons to use minidom for producing XML?

To throw my 2c in here:

I personally normally use minidom for manipulating (x)html data (through
html5lib), and for writing XML.

I think it's primarily because DOM:

a) matches the way I think about XML documents.

b) Provides the same API as I use in other languages. (FWIW, I do a lot
of DOM manipulation in javascript)

c) Feels (to me) more similar to other formats I work with.


All three may be because I haven't spent enough time with ElementTree -
again I've found the documentation lacking.

Tim


Re: [Python-Dev] Fixing the XML batteries

2011-12-10 Thread Bill Janssen

Stefan Behnel stefan...@behnel.de wrote:

 Bill Janssen, 09.12.2011 19:15:
  I think another thing that might go into refreshing the batteries is a
  feature comparison of BeautifulSoup and HTML5lib against the stdlib
  competition, to see what needs to be added/revised.  Having to switch to
  an outside package for parsing possibly invalid HTML is a pain.
 
 Such a feature request should be worth a separate thread.
 
 Note, however, that html5lib is likely way too big to add it to the
 stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
 in Python 3, which would be the target release series for better HTML
 support. So, whatever library or API you would want to use for HTML
 processing is currently only the second question as long as Py3 lacks
 a real-world HTML parser in the stdlib, as well as a robust character
 detection mechanism. I don't think that can be fixed all that easily.

Sounds like it needs a PEP.

I'm only advocating spending some thought on what needs to be done --
whether outside libraries need to be adopted into the stdlib would be a
step after that.  But understanding *why* those libraries exist and are
widely used should be a prerequisite to refreshing the stdlib's support.

Bill

Re: [Python-Dev] Fixing the XML batteries

2011-12-10 Thread Glyph Lefkowitz

On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:

 Note, however, that html5lib is likely way too big to add it to the stdlib, 
 and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, 
 which would be the target release series for better HTML support. So, 
 whatever library or API you would want to use for HTML processing is 
 currently only the second question as long as Py3 lacks a real-world HTML 
 parser in the stdlib, as well as a robust character detection mechanism. I 
 don't think that can be fixed all that easily.


Here's the problem in a nutshell, I think:

Everybody wants an HTML parser in the stdlib, because it's inconvenient to pull 
in a dependency for such a simple task.
Everybody wants the stdlib to remain small, stable, and simple and not get 
overcomplicated.
Parsing arbitrary HTML5 is a monstrously complex problem, for which there exist 
rapidly-evolving standards and libraries to deal with it.  Parsing 'the web' 
(which is rapidly growing to include stuff like SVG, MathML etc) is even harder.

My personal opinion is that HTML5Lib gets this problem almost completely right, 
and so it should be absorbed by the stdlib.  Trying to re-invent this from 
scratch, or even use something like BeautifulSoup which uses a bunch of 
heuristics and hacks rather than reference to the laboriously-crafted standard 
that says exactly how parsing malformed stuff has to go to be like a browser, 
seems like it will just give the stdlib solution a reputation for working on 
the test input but not working in the real world.

(No disrespect to BeautifulSoup: it was a great attempt in the pre-HTML5 world 
which it was born into, and I've used it numerous times to implement useful 
things.  But much more effort has been poured into this problem since then, and 
the problems are better understood now.)

-glyph


Re: [Python-Dev] Fixing the XML batteries

2011-12-10 Thread Terry Reedy

On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:

On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:

Note, however, that html5lib is likely way too big to add it to the
stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
in Python 3, which would be the target release series for better HTML
support. So, whatever library or API you would want to use for HTML
processing is currently only the second question as long as Py3 lacks
a real-world HTML parser in the stdlib, as well as a robust character
detection mechanism. I don't think that can be fixed all that easily.

Here's the problem in a nutshell, I think:

1. Everybody wants an HTML parser in the stdlib, because it's
inconvenient to pull in a dependency for such a simple task.
2. Everybody wants the stdlib to remain small, stable, and simple and
not get overcomplicated.
3. Parsing arbitrary HTML5 is a monstrously complex problem, for which
there exist rapidly-evolving standards and libraries to deal with
it. Parsing 'the web' (which is rapidly growing to include stuff
like SVG, MathML etc) is even harder.

My personal opinion is that HTML5Lib gets this problem almost completely
right, and so it should be absorbed by the stdlib.

A little data: the HTML5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years
old. Since there is a separate Python3 repository, and there is no
mention on Python3 compatibility elsewhere that I saw, including the
pypi listing, I assume that is for Python2 only.

A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems: "Merged in now,
though still lots of errors and failures in the testsuite."

--
Terry Jan Reedy


Re: [Python-Dev] Fixing the XML batteries

2011-12-10 Thread Glyph Lefkowitz

On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:

A little data: the HTML5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years old.
Since there is a separate Python3 repository, and there is no mention on
Python3 compatibility elsewhere that I saw, including the pypi listing, I
assume that is for Python2 only.

I believe that you are correct.

A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems: "Merged in now, though
still lots of errors and failures in the testsuite."

I don't see what bearing this has on the discussion. There are three possible
ways I can imagine to interpret this information.

First, you could believe that porting a codebase from Python 2 to Python 3 is
much easier than solving a difficult domain-specific problem. In that case,
html5lib has done the hard part and someone interested in html-in-the-stdlib
should do the rest.

Second, you could believe that porting a codebase from Python 2 to Python 3 is
harder than solving a difficult domain-specific problem, in which case
something is seriously wrong with Python 3 or its attendant migration tools and
that needs to be fixed, so someone should fix that rather than worrying about
parsing HTML right now. (I doubt that many subscribers to this list would
share this opinion, though.)

Third, you could believe that parsing HTML is not a difficult domain-specific
problem. But only a crazy person would believe that, so you're left with one
of the previous options :).

-glyph


Re: [Python-Dev] Fixing the XML batteries

2011-12-10 Thread Terry Reedy

On 12/10/2011 9:25 PM, Glyph Lefkowitz wrote:

On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:

A little data: the HTML5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

If there really are 4 'owners' rather than 4 people with admin access to
the site, then there are 4 people to negotiate with.

I believe that you are correct.

There are issues pointing to a 1.0 release, but I could not find any
current timetable. The project looks a bit stagnant. That does not bode
well for a commitment to future active maintenance.

A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems: "Merged in now,
though still lots of errors and failures in the testsuite."

I don't see what bearing this has on the discussion.

I think both points above show that 'absorbing HTML5Lib in the stdlib'
will involve more sociological and technical problems than doing so with
an active one-person module that already runs on 3.2. One is that the
multiple-version Python 2.x codebase is the reference version and that
will not be incorporated. A serious plan will have to address the real
situation.

---
Terry Jan Reedy


Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Martin v. Löwis

 a) The stdlib documentation should help users to choose the right tool
 right from the start. Instead of using the totally misleading wording
 that it uses now, it should be honest about the performance
 characteristics of MiniDOM and should actively suggest that those who
 don't know what to choose (or even *that* they can choose) should not
 use MiniDOM in the first place.

I disagree. The right approach is not to document performance problems,
but to fix them.

 b) cElementTree should finally lose its special status as a separate
 library and disappear as an accelerator module behind ElementTree. This
 has been suggested a couple of times already, and AFAIR, there was some
 opposition because 1) ET was maintained outside of the stdlib and 2) the
 APIs of both were not identical. However, getting ET 1.3 into Py2.7 and
 3.2 was a U-turn.

Unfortunately (?), there is a near-contract-like agreement with Fredrik
Lundh that any significant changes to ElementTree in the standard
library have to be agreed by him. So whatever change you plan: make sure
Fredrik gives his explicit support.
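
For context on proposal (b) quoted above: while cElementTree is a separate module, user code typically selects the accelerator by hand with an import fallback. The following is a minimal sketch of that common idiom, not code from the thread; the sample XML string is invented for illustration.

```python
# Prefer the C accelerator when it exists, otherwise use the pure-Python
# module. Both expose the same names for the calls used below.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    # Taken on interpreters that do not ship a separate cElementTree module.
    import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b/></a>")
child = root.find("b")
```

Hiding the accelerator behind ElementTree, as proposed, would make this boilerplate unnecessary: a plain import of xml.etree.ElementTree would pick the fast implementation when it is available.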

Regards,
Martin

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Stefan Behnel


Martin v. Löwis, 09.12.2011 09:41:

a) The stdlib documentation should help users to choose the right tool
right from the start. Instead of using the totally misleading wording
that it uses now, it should be honest about the performance
characteristics of MiniDOM and should actively suggest that those who
don't know what to choose (or even *that* they can choose) should not
use MiniDOM in the first place.


I disagree. The right approach is not to document performance problems,
but to fix them.


Here's the relevant part of my mail that you stripped:


It's also badly maintained in the sense that its performance
characteristics could likely be improved, but no-one is seriously
interested in doing that, because it would not lead to something that
actually *is* fast or memory friendly compared to any of the 'real'
alternatives that are available right now.


I can't recall anyone working on any substantial improvements during the 
last six years or so, and the reason for that seems obvious to me.




b) cElementTree should finally lose its special status as a separate
library and disappear as an accelerator module behind ElementTree. This
has been suggested a couple of times already, and AFAIR, there was some
opposition because 1) ET was maintained outside of the stdlib and 2) the
APIs of both were not identical. However, getting ET 1.3 into Py2.7 and
3.2 was a U-turn.


Unfortunately (?), there is a near-contract-like agreement with Fredrik
Lundh that any significant changes to ElementTree in the standard
library have to be agreed by him. So whatever change you plan: make sure
Fredrik gives his explicit support.


Ok, I'll try to contact him.

Stefan


Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Xavier Morel

On 2011-12-09, at 09:41 , Martin v. Löwis wrote:
 a) The stdlib documentation should help users to choose the right tool
 right from the start. Instead of using the totally misleading wording
 that it uses now, it should be honest about the performance
 characteristics of MiniDOM and should actively suggest that those who
 don't know what to choose (or even *that* they can choose) should not
 use MiniDOM in the first place.
 
 I disagree. The right approach is not to document performance problems,
 but to fix them.

Even if performance problems should not be documented, I think Stefan's point 
that users should be steered away from minidom and towards ET and cET is 
completely valid and worthy of support: the *only* advantage minidom has over 
ET is that it uses an interface familiar to Java users[0]. They are about the 
only people using the actual W3C DOM: while the DOM exists in JavaScript, I'd 
say most code out there actively tries not to touch it with anything less than 
a 10-foot library pole like jQuery. That interface is also, of course, 
absolutely dreadful.

Minidom is inferior in interface flow and pythonicity, in terseness, in speed, 
in memory consumption (even more so using cElementTree, and that's not 
something which can be fixed unless minidom gets a C accelerator), etc… Even 
after fixing minidom (if anybody has the time and drive to commit to it), 
ET/cET should be preferred over it.

And that's not even considering the ease of switching to lxml (if only for 
validators), which Stefan outlined.

[0] not 100% true now that I think about it: handling mixed content is simpler 
in minidom as there is no .text/.tail duality and text nodes are nodes like 
every other, but I really can't think of another reason to prefer minidom
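
The .text/.tail duality from footnote [0] can be seen in a minimal sketch like the one below. The sample markup is invented for illustration; the calls are plain standard library ones.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Mixed content: text and child elements interleaved inside <p>.
MIXED = "<p>Hello <b>big</b> world</p>"

# ElementTree: text before the first child is p.text, text following
# a child element is stored on that child as .tail.
root = ET.fromstring(MIXED)
bold = root.find("b")
et_view = (root.text, bold.text, bold.tail)

# minidom: every piece of text is an ordinary child node of <p>.
dom = xml.dom.minidom.parseString(MIXED)
p = dom.documentElement
dom_names = [node.nodeName for node in p.childNodes]
trailing_text = p.childNodes[2].data
```

In ElementTree the string " world" belongs to the b element as its tail, whereas in minidom it is simply the third child of p.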

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Nick Coghlan

On Fri, Dec 9, 2011 at 6:41 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 a) The stdlib documentation should help users to choose the right tool
 right from the start. Instead of using the totally misleading wording
 that it uses now, it should be honest about the performance
 characteristics of MiniDOM and should actively suggest that those who
 don't know what to choose (or even *that* they can choose) should not
 use MiniDOM in the first place.

 I disagree. The right approach is not to document performance problems,
 but to fix them.

When we offer a better way to do something that new users want to
do, we generally redirect them to the more recent alternative. I
believe the redirection from the getopt module to the argparse module
strikes the right tone for that kind of thing:
http://docs.python.org/library/getopt

For the various XML libraries, I'd suggest a message along the lines of
"Note: The whatever module is a yada, yada, DOM based, whatever. If all
you are trying to do is read and write XML files, consider using the
xml.etree.ElementTree module instead."

I'd also be +1 on adjusting the order of the XML pages in the main
index such that xml.etree.ElementTree appeared before xml.parser.expat
and all the others slid down one entry.

These are simple changes that don't harm current users of the modules
in the least, while being up front and very helpful for beginners.
Again, I think argparse vs getopt is a good comparison: argparse
appears first in the main index, and there's a redirection from getopt
to argparse that says if you don't have a specific reason to be using
getopt, you probably want argparse instead.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Antoine Pitrou



Mostly uninformed +1 to Stefan's suggestions from me.

Regards

Antoine.


On Fri, 09 Dec 2011 09:02:35 +0100
Stefan Behnel stefan...@behnel.de wrote:
 Hi everyone,
 
 I think Py3.3 would be a good milestone for cleaning up the stdlib support 
 for XML. Note upfront: you may or may not know me as the maintainer of 
 lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post 
 was triggered by the following kind of conversation that I keep having with 
 new XML users in Python (mostly on c.l.py), which hints at some serious 
 flaw in the stdlib.
[etc.]



Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Dirkjan Ochtman

On Fri, Dec 9, 2011 at 09:02, Stefan Behnel stefan...@behnel.de wrote:
 a) The stdlib documentation should help users to choose the right tool right
 from the start.
 b) cElementTree should finally lose its special status as a separate
 library and disappear as an accelerator module behind ElementTree.

An at least somewhat informed +1 from me. The ElementTree API is a
very good way to deal with XML from Python, and it deserves to be
promoted over the included alternatives.

Let's deprecate the NiCad batteries and try to guide users toward the
Li-Ion ones.

Cheers,

Dirkjan

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Matt Joiner

+1

On Sat, Dec 10, 2011 at 2:09 AM, Dirkjan Ochtman dirk...@ochtman.nl wrote:
 On Fri, Dec 9, 2011 at 09:02, Stefan Behnel stefan...@behnel.de wrote:
 a) The stdlib documentation should help users to choose the right tool right
 from the start.
 b) cElementTree should finally lose its special status as a separate
 library and disappear as an accelerator module behind ElementTree.

 An at least somewhat informed +1 from me. The ElementTree API is a
 very good way to deal with XML from Python, and it deserves to be
 promoted over the included alternatives.

 Let's deprecate the NiCad batteries and try to guide users toward the
 Li-Ion ones.

 Cheers,

 Dirkjan



-- 
ಠ_ಠ

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Mike Meyer

On Fri, 09 Dec 2011 09:02:35 +0100
Stefan Behnel stefan...@behnel.de wrote:

 a) The stdlib documentation should help users to choose the right
 tool right from the start.
 b) cElementTree should finally lose its special status as a
 separate library and disappear as an accelerator module behind
 ElementTree.

+1 and +1.

I've done a lot of xml work in Python, and unless you've got a
particular reason for wanting to use the dom, ElementTree is the only
sane way to go.

I recently converted a middling-sized app from using the dom to using
ElementTree, and wrote up some guidelines for the process for the
client. I can try and shake it out of my client's lawyers if it would
help with this or others are interested.

 mike

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Bill Janssen

Mike Meyer m...@mired.org wrote:

 On Fri, 09 Dec 2011 09:02:35 +0100
 Stefan Behnel stefan...@behnel.de wrote:
 
  a) The stdlib documentation should help users to choose the right
  tool right from the start.
  b) cElementTree should finally lose its special status as a
  separate library and disappear as an accelerator module behind
  ElementTree.
 
 +1 and +1.
 
 I've done a lot of xml work in Python, and unless you've got a
 particular reason for wanting to use the dom, ElementTree is the only
 sane way to go.

I use ElementTree for parsing valid XML, but minidom for producing it.
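
As a point of comparison for producing XML, here is a minimal sketch that builds the same small document with both APIs. The element names are made up for illustration, and the `encoding="unicode"` argument assumes Python 3.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

def build_with_minidom():
    """Build <feed><entry id="1">hello</entry></feed> via the DOM API."""
    doc = xml.dom.minidom.Document()
    root = doc.createElement("feed")
    doc.appendChild(root)
    entry = doc.createElement("entry")
    entry.setAttribute("id", "1")
    entry.appendChild(doc.createTextNode("hello"))
    root.appendChild(entry)
    # Serialize the root element only, without an XML declaration.
    return root.toxml()

def build_with_etree():
    """Build the same document via the ElementTree API."""
    root = ET.Element("feed")
    entry = ET.SubElement(root, "entry", id="1")
    entry.text = "hello"
    # encoding="unicode" asks for a str rather than bytes (Python 3).
    return ET.tostring(root, encoding="unicode")
```

Both functions produce the same string, so for simple output the choice between them comes down to familiarity with the API.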

I think another thing that might go into refreshing the batteries is a
feature comparison of BeautifulSoup and HTML5lib against the stdlib
competition, to see what needs to be added/revised.  Having to switch to
an outside package for parsing possibly invalid HTML is a pain.

Bill

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Paul Moore

On 9 December 2011 18:15, Bill Janssen jans...@parc.com wrote:
 I use ElementTree for parsing valid XML, but minidom for producing it.

 I think another thing that might go into refreshing the batteries is a
 feature comparison of BeautifulSoup and HTML5lib against the stdlib
 competition, to see what needs to be added/revised.  Having to switch to
 an outside package for parsing possibly invalid HTML is a pain.

For what little use I make of XML/HTML parsing, I use lxml, simply
because it has a parser that covers the sort of HTML I have to deal
with in real life. As I have lxml installed, I use it for any XML
parsing tasks, just because I'm used to it.

Paul

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Xavier Morel

On 2011-12-09, at 19:15 , Bill Janssen wrote:
 I use ElementTree for parsing valid XML, but minidom for producing it.
Could you expand on your reasons to use minidom for producing XML?

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Bill Janssen

Xavier Morel python-...@masklinn.net wrote:

 On 2011-12-09, at 19:15 , Bill Janssen wrote:
  I use ElementTree for parsing valid XML, but minidom for producing it.
 Could you expand on your reasons to use minidom for producing XML?

Inertia, I guess.  I tried that first, and it seems to work.

I tend to use html5lib and/or BeautifulSoup instead of ElementTree, and
that's mainly because I find the documentation for ElementTree is
confusing and partial and inconsistent.  Having various undated but
obsolete tutorials and documentation still up on effbot.org doesn't
help.


Bill

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Matt Joiner

I second this. The doco is very bad.
On Dec 10, 2011 6:34 AM, Bill Janssen jans...@parc.com wrote:

 Xavier Morel python-...@masklinn.net wrote:

  On 2011-12-09, at 19:15 , Bill Janssen wrote:
   I use ElementTree for parsing valid XML, but minidom for producing it.
  Could you expand on your reasons to use minidom for producing XML?

 Inertia, I guess.  I tried that first, and it seems to work.

 I tend to use html5lib and/or BeautifulSoup instead of ElementTree, and
 that's mainly because I find the documentation for ElementTree is
 confusing and partial and inconsistent.  Having various undated but
 obsolete tutorials and documentation still up on effbot.org doesn't
 help.


 Bill


Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Eli Bendersky

On Sat, Dec 10, 2011 at 00:43, Matt Joiner anacro...@gmail.com wrote:

 I second this. The doco is very bad.


It would be constructive to open issues for specific problems in the
documentation. I'm sure this won't be hard to fix. Documentation should not
be the roadblock for using a library.
Eli

Re: [Python-Dev] Fixing the XML batteries

2011-12-09 Thread Stefan Behnel


Bill Janssen, 09.12.2011 19:15:

I think another thing that might go into refreshing the batteries is a
feature comparison of BeautifulSoup and HTML5lib against the stdlib
competition, to see what needs to be added/revised.  Having to switch to
an outside package for parsing possibly invalid HTML is a pain.


Such a feature request should be worth a separate thread.

Note, however, that html5lib is likely way too big to add it to the stdlib, 
and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, 
which would be the target release series for better HTML support. So, 
whatever library or API you would want to use for HTML processing is 
currently only the second question as long as Py3 lacks a real-world HTML 
parser in the stdlib, as well as a robust character detection mechanism. I 
don't think that can be fixed all that easily.


Stefan

