Re: Ignoring XML Namespaces with cElementTree

2010-05-01 Thread Stefan Behnel

dmtr, 30.04.2010 23:59:

I think that's your main mistake: don't remove them. Instead, use the fully
qualified names when comparing.


Yes. That's what I'm forced to do. Pre-calculating tags like tagChild
= "{%s}child" % uri and using them instead of "child".


Exactly. Keeps you from introducing typos in your code. And keeps you from 
having to deal with namespace-prefix mappings. Big features.
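
A minimal sketch of the approach described above, written against xml.etree.ElementTree on Python 3 rather than the Python 2 cElementTree used in this thread. The names NS, TAG_ROOT, TAG_CHILD and count_children are illustrative assumptions, not part of any library; the URI is the one from the example document posted earlier in the thread.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI taken from the example document in this thread.
NS = "http://www.very_long_url.com"

# Build each qualified tag name once, so no string formatting
# happens inside the parsing loop.
TAG_ROOT = "{%s}root" % NS
TAG_CHILD = "{%s}child" % NS


def count_children(source):
    """Count <child> elements in namespace NS while streaming the input."""
    count = 0
    for event, elem in ET.iterparse(source):  # default events: ("end",)
        if elem.tag == TAG_CHILD:
            count += 1
        elem.clear()  # drop content of elements that are already handled
    return count


xml = b'<root xmlns="http://www.very_long_url.com"><child/><child/></root>'
print(count_children(io.BytesIO(xml)))  # prints 2
```

Because the qualified names live in module-level constants, a misspelled constant fails with a NameError instead of silently never matching, which is the typo protection mentioned above.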




As a result the
code looks ugly and there is extra overhead concatenating/comparing
these repeating and redundant prefixes.


The overhead is really small, though. In many cases, a pointer comparison 
will do.




I don't understand why
cElementTree forces users to do that. So far I couldn't find any way
around that without rebuilding cElementTree from source.


Then don't do it.



Apparently somebody hard-coded the namespace_separator parameter in
the cElementTree.c (what a dumb thing to do!!!, it should have been a
parameter in the cElementTree.XMLParser() arguments):
===
self->parser = EXPAT(ParserCreate_MM)(encoding, &memory_handler, "}");
===

Simply replacing "}" with NULL gives me desired tags without stinking
URIs.


You should try to calm down and embrace this feature.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Ignoring XML Namespaces with cElementTree

2010-05-01 Thread Carl Banks
On Apr 27, 6:42 pm, dmtr dchich...@gmail.com wrote:
 Is there any way to configure cElementTree to ignore the XML root
 namespace?  Default cElementTree (Python 2.6.4) appears to add the XML
 root namespace URI to _every_ single tag.  I know that I can strip
 URIs manually, from every tag, but it is a rather idiotic thing to do
 (performance wise).

Perhaps upgrade to lxml.  Not sure if it gives you control over namespace
expansion but if it doesn't it should at least be faster.

For this and some other reasons, I find ElementTree not quite as handy
when processing files from another source as when I'm saving and
retrieving my own data.


Carl Banks


Re: Ignoring XML Namespaces with cElementTree

2010-05-01 Thread Carl Banks
On Apr 29, 10:12 pm, Stefan Behnel stefan...@behnel.de wrote:
 dmtr, 30.04.2010 04:57:



  I'm referring to xmlns/URI prefixes. Here's a code example:
    from xml.etree.cElementTree import iterparse
    from cStringIO import StringIO
    xml = '<root xmlns="http://www.very_long_url.com"><child/></root>'
    for event, elem in iterparse(StringIO(xml)): print event, elem

  The output is:
    end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
    end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40>

  I don't want these "{http://www.very_long_url.com}" in front of my
  tags.

  They create performance disaster on large files

 I seriously doubt that they do.

I don't know what kind of XML files you deal with, but for me a large
XML file is gigabyte-sized (obviously I don't use Element Tree for
those).

Even for tens-of-megabyte files string ops to expand tags with
namespaces is going to be a pretty decent penalty--remember
ElementTree does nothing lazily.


  (first cElementTree
  adds them, then I have to remove them in python).

 I think that's your main mistake: don't remove them. Instead, use the fully
 qualified names when comparing.

Unless you have multiple namespaces or are working with defined schema
or something, it's useless boilerplate.

It'd be a nice feature if ElementTree could let users optionally
ignore a namespace, unfortunately it doesn't have it.


Carl Banks


Re: Ignoring XML Namespaces with cElementTree

2010-05-01 Thread Stefan Behnel

Carl Banks, 01.05.2010 12:33:

On Apr 29, 10:12 pm, Stefan Behnel wrote:

dmtr, 30.04.2010 04:57:

I don't want these "{http://www.very_long_url.com}" in front of my
tags.  They create performance disaster on large files


I seriously doubt that they do.


I don't know what kind of XML files you deal with, but for me a large
XML file is gigabyte-sized (obviously I don't use Element Tree for
those).


Why not? I used cElementTree for files of that size (1-1.5GB unpacked) a 
couple of times, and it was never a problem.




Even for tens-of-megabyte files string ops to expand tags with
namespaces is going to be a pretty decent penalty--remember
ElementTree does nothing lazily.


So? Did you run a profiler on it to know that there is a penalty due to the 
string concatenation? cElementTree's parser (expat) and its tree builder 
are blazingly fast, especially the iterparse() implementation.


http://codespeak.net/lxml/performance.html#parsing-and-serialising
http://codespeak.net/lxml/performance.html#a-longer-example
http://effbot.org/zone/celementtree.htm#benchmarks
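
One way to check the claim for a given workload is to time both styles directly. The following is a rough sketch only, assuming Python 3 and xml.etree.ElementTree; the synthetic document, its size (20000 elements) and the function names are arbitrary assumptions, and the numbers it prints will vary by machine and Python version.

```python
import io
import timeit
import xml.etree.ElementTree as ET

NS = "http://www.very_long_url.com"  # URI from the example in this thread
TAG_CHILD = "{%s}child" % NS

# Synthetic test document; the element count is an arbitrary assumption.
DOC = ('<root xmlns="%s">' % NS + "<child/>" * 20000 + "</root>").encode("ascii")


def parse_qualified():
    """Compare every tag against the precomputed qualified name."""
    n = 0
    for _, elem in ET.iterparse(io.BytesIO(DOC)):
        if elem.tag == TAG_CHILD:
            n += 1
    return n


def parse_stripped():
    """Strip the '{uri}' prefix from every tag before comparing."""
    n = 0
    for _, elem in ET.iterparse(io.BytesIO(DOC)):
        if elem.tag.rpartition("}")[2] == "child":
            n += 1
    return n


for fn in (parse_qualified, parse_stripped):
    seconds = timeit.timeit(fn, number=5)
    print("%-16s %.3f s for 5 runs" % (fn.__name__, seconds))
```

Both functions include the full parse, so the difference between the two timings is the cost of the per-tag string handling relative to parsing itself. A profiler such as cProfile would give a finer breakdown.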



(first cElementTree adds them, then I have to remove them in python).


I think that's your main mistake: don't remove them. Instead, use the fully
qualified names when comparing.


Unless you have multiple namespaces or are working with defined schema
or something, it's useless boilerplate.

It'd be a nice feature if ElementTree could let users optionally
ignore a namespace, unfortunately it doesn't have it.


I agree that that would make for a nice parser option, e.g. when dealing 
with HTML and XHTML in the same code.
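
Without such a parser option, code that has to accept both namespaced and non-namespaced input can compare local names instead. This is a sketch under assumptions: local_name and paragraphs are hypothetical helpers, the HTML input is assumed to be well-formed XML, and Python 3 with xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its '{uri}' prefix, if it has one."""
    return tag.rpartition("}")[2]


def paragraphs(root):
    """Collect the text of every <p>, whatever namespace it lives in."""
    return [el.text for el in root.iter() if local_name(el.tag) == "p"]


html = ET.fromstring("<html><body><p>plain</p></body></html>")
xhtml = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>namespaced</p></body></html>"
)

print(paragraphs(html))   # prints ['plain']
print(paragraphs(xhtml))  # prints ['namespaced']
```

The helper leaves the tree untouched, so serialising it later still produces correctly namespaced output. The trade-off is that elements with the same local name in different namespaces are no longer told apart.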


Stefan



Re: Ignoring XML Namespaces with cElementTree

2010-05-01 Thread dmtr
 Unless you have multiple namespaces or are working with defined schema
 or something, it's useless boilerplate.

 It'd be a nice feature if ElementTree could let users optionally
 ignore a namespace, unfortunately it doesn't have it.


Yep. Exactly my point. Here's a link to the patch addressing this:
http://bugs.python.org/issue8583


Re: Ignoring XML Namespaces with cElementTree

2010-04-30 Thread dmtr
 I think that's your main mistake: don't remove them. Instead, use the fully
 qualified names when comparing.

 Stefan

Yes. That's what I'm forced to do. Pre-calculating tags like tagChild
= "{%s}child" % uri and using them instead of "child". As a result the
code looks ugly and there is extra overhead concatenating/comparing
these repeating and redundant prefixes. I don't understand why
cElementTree forces users to do that. So far I couldn't find any way
around that without rebuilding cElementTree from source.

Apparently somebody hard-coded the namespace_separator parameter in
the cElementTree.c (what a dumb thing to do!!!, it should have been a
parameter in the cElementTree.XMLParser() arguments):
===
self->parser = EXPAT(ParserCreate_MM)(encoding, &memory_handler, "}");
===

Simply replacing "}" with NULL gives me desired tags without stinking
URIs.


Re: Ignoring XML Namespaces with cElementTree

2010-04-30 Thread dmtr
Here's a link to the patch exposing this parameter: 
http://bugs.python.org/issue8583


Re: Ignoring XML Namespaces with cElementTree

2010-04-29 Thread dmtr
I'm referring to xmlns/URI prefixes. Here's a code example:
 from xml.etree.cElementTree import iterparse
 from cStringIO import StringIO
 xml = '<root xmlns="http://www.very_long_url.com"><child/></root>'
 for event, elem in iterparse(StringIO(xml)): print event, elem

The output is:
 end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
 end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40>


I don't want these "{http://www.very_long_url.com}" in front of my
tags.

They create performance disaster on large files (first cElementTree
adds them, then I have to remove them in python). Is there any way to
tell cElementTree not to mess with my tags? I need that in the
standard python distribution, not my custom cElementTree build...


Re: Ignoring XML Namespaces with cElementTree

2010-04-29 Thread Stefan Behnel

dmtr, 30.04.2010 04:57:

I'm referring to xmlns/URI prefixes. Here's a code example:
  from xml.etree.cElementTree import iterparse
  from cStringIO import StringIO
  xml = '<root xmlns="http://www.very_long_url.com"><child/></root>'
  for event, elem in iterparse(StringIO(xml)): print event, elem

The output is:
  end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
  end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40>


I don't want these "{http://www.very_long_url.com}" in front of my
tags.

They create performance disaster on large files


I seriously doubt that they do.



(first cElementTree
adds them, then I have to remove them in python).


I think that's your main mistake: don't remove them. Instead, use the fully 
qualified names when comparing.


Stefan



Re: Ignoring XML Namespaces with cElementTree

2010-04-28 Thread Stefan Behnel

dmtr, 28.04.2010 03:42:

Is there any way to configure cElementTree to ignore the XML root
namespace?  Default cElementTree (Python 2.6.4) appears to add the XML
root namespace URI to _every_ single tag.


Certainly not in the serialised XML. Are you referring to the qualified 
names it uses?


Stefan



Ignoring XML Namespaces with cElementTree

2010-04-27 Thread dmtr
Is there any way to configure cElementTree to ignore the XML root
namespace?  Default cElementTree (Python 2.6.4) appears to add the XML
root namespace URI to _every_ single tag.  I know that I can strip
URIs manually, from every tag, but it is a rather idiotic thing to do
(performance wise).
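
For reference, the manual stripping mentioned above might look like the sketch below. It assumes Python 3 and xml.etree.ElementTree instead of the Python 2.6 cElementTree in question; iterparse_without_ns is a hypothetical helper, not a library function, and it handles element tags only, not namespaced attributes.

```python
import io
import xml.etree.ElementTree as ET


def iterparse_without_ns(source, uri):
    """Yield (event, elem) pairs with the '{uri}' prefix removed from tags."""
    prefix = "{%s}" % uri
    for event, elem in ET.iterparse(source):  # default events: ("end",)
        if elem.tag.startswith(prefix):
            elem.tag = elem.tag[len(prefix):]
        yield event, elem


uri = "http://www.very_long_url.com"
xml = b'<root xmlns="http://www.very_long_url.com"><child/></root>'
tags = [elem.tag for _, elem in iterparse_without_ns(io.BytesIO(xml), uri)]
print(tags)  # prints ['child', 'root']
```

Elements in any other namespace keep their qualified names, so the helper only hides the one namespace it is told about.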