Re: Parsing unicode (devanagari) text with xml.dom.minidom

2009-03-08 Thread Martin v. Löwis
 Regarding minidom, you might be happier with the xml.etree package that
 comes with Python 2.5 and later (it's also available for older versions).
 It's a lot easier to use, more memory friendly and also much faster.

OTOH, choice of XML library is completely irrelevant for the issue at
hand. If the OP is happy with minidom, we shouldn't talk him into using
something else.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list



Re: Parsing unicode (devanagari) text with xml.dom.minidom

2009-03-08 Thread Stefan Behnel
Martin v. Löwis wrote:
 Regarding minidom, you might be happier with the xml.etree package that
 comes with Python 2.5 and later (it's also available for older versions).
 It's a lot easier to use, more memory friendly and also much faster.
 
 OTOH, choice of XML library is completely irrelevant for the issue at
 hand.

For the described problem, maybe. But certainly not for the application.
The background was parsing the XML dump of an entire web site, which I
would expect to be larger than what minidom is designed to handle
gracefully. Switching to cElementTree before major code gets written is
almost certainly a good idea here.

Stefan


Re: Parsing unicode (devanagari) text with xml.dom.minidom

2009-03-08 Thread Martin v. Löwis
 For the described problem, maybe. But certainly not for the application.
 The background was parsing the XML dump of an entire web site, which I
 would expect to be larger than what minidom is designed to handle
 gracefully. Switching to cElementTree before major code gets written is
 almost certainly a good idea here.

I think minidom is designed to handle the very same documents that
cElementTree is designed to handle (namely, documents that fit into
memory).

Regards,
Martin


comparing (c)ElementTree and minidom (was: Parsing unicode (devanagari) text with xml.dom.minidom)

2009-03-08 Thread Stefan Behnel
Martin v. Löwis wrote:
 The background was parsing the XML dump of an entire web site, which I
 would expect to be larger than what minidom is designed to handle
 gracefully. Switching to cElementTree before major code gets written is
 almost certainly a good idea here.
 
 I think minidom is designed to handle the very same documents that
 cElementTree is designed to handle (namely, documents that fit into
 memory).

I do not doubt that a machine running a cElementTree application can handle
exactly the same documents as a machine with, say, ten times as much memory
that runs a minidom application. However, when deciding which library to
choose for a new application, it does matter what hardware you want to use
it on. And if you can handle documents several times larger on the same
hardware, that might be as much of a reason to choose cElementTree as the
(likely) shorter and more readable code (which usually translates into
shorter development and debugging times) and the higher execution speed.
Honestly, I haven't seen a reason in a while why preferring minidom over
any of the ElementTree derivatives would be a good idea when starting a new
application.
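
A minimal sketch of how the memory argument can play out in practice,
assuming the export keeps its titles in plain `title` elements:
ElementTree's `iterparse` hands over each element as soon as it is complete,
so the program can read it and discard its contents right away. `SAMPLE` and
`iter_titles` are hypothetical names used for illustration only, not anything
from the original poster's code.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a large export file (assumption, not the real dump).
SAMPLE = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u"<rss><channel><title>Sanskrit Notes</title>"
    u"<item><title>\u0905\u0928\u094d</title></item>"
    u"<item><title></title></item></channel></rss>"
).encode("utf-8")


def iter_titles(source):
    """Yield the text of every <title> element while parsing incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "title":
            # An empty element has text None; normalise that to an empty string.
            yield elem.text or u""
        # Drop the finished element's text, attributes and children.
        elem.clear()


titles = list(iter_titles(io.BytesIO(SAMPLE)))
```

On Python 2.5, `xml.etree.cElementTree` offers an `iterparse` function with
the same interface. Note that `clear()` leaves an empty element attached to
its parent, so this reduces memory use rather than eliminating it.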

Stefan


Re: Parsing unicode (devanagari) text with xml.dom.minidom

2009-03-08 Thread rparimi
On Mar 8, 12:42 am, Stefan Behnel stefan...@behnel.de wrote:
 rpar...@gmail.com wrote:
  I am trying to process an xml file that contains unicode characters
  (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
  entire content of the website into an xml file. Using
  xml.dom.minidom,  I wrote a few lines of python code to parse out the
  xml file, but am stuck with the following error:

  >>> import xml.dom.minidom
  >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
  >>> titles = dom.getElementsByTagName("title")
  >>> for title in titles:
  ...     print "childNode = ", title.childNodes
  ...
  childNode =  [<DOM Text node "Sanskrit N...">]
  childNode =  [<DOM Text node "Sanskrit N...">]
  childNode =  []
  childNode =  []
  childNode =  [<DOM Text node "1-1-1">]
  childNode =  Traceback (most recent call last):
    File "<stdin>", line 2, in <module>
  UnicodeEncodeError: 'ascii' codec can't encode characters in position
  16-18: ordinal not in range(128)

 That's because you are printing it out to your console, in which case you
 need to make sure it's encoded properly for printing. repr() might also help.

 Regarding minidom, you might be happier with the xml.etree package that
 comes with Python 2.5 and later (it's also available for older versions).
 It's a lot easier to use, more memory friendly and also much faster.

 Stefan

Thanks for the reply. I didn't realize that printing to console was
causing the problem. I am now able to parse out the relevant portions
of my xml file. Will also look at the xml.etree module.

Regards


Parsing unicode (devanagari) text with xml.dom.minidom

2009-03-07 Thread rparimi
Hello,

I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom,  I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:

>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...     print "childNode = ", title.childNodes
...
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  []
childNode =  []
childNode =  [<DOM Text node "1-1-1">]
childNode =  Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)


Python exited when it was trying to parse the following node:
<title>अन् </title>

The xml header tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

I am running python 2.5.1 on Mac OSX 10.5.6 and my local settings are
as below:
$ locale
LANG=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_ALL=


I googled around for similar errors, and tried using unicode but that
didn't help either:
>>> foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

I'm a novice with unicode, and am not sure how best to
handle the unicode text I'm dealing with (devanagari). Any
suggestions will be helpful.
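
A minimal sketch of one way to look at the encoding Python associates with
an output stream and to make text printable whatever that encoding is.
Characters the stream cannot represent come out as backslash escapes instead
of raising UnicodeEncodeError. `safe_for_console` and `FakeStream` are
hypothetical names introduced for this illustration.

```python
import sys


def safe_for_console(text, stream=None):
    """Return text with characters the stream cannot encode replaced by escapes."""
    if stream is None:
        stream = sys.stdout
    # Streams that are not attached to a terminal may report no encoding at all.
    encoding = getattr(stream, "encoding", None) or "ascii"
    return text.encode(encoding, "backslashreplace").decode(encoding)


class FakeStream(object):
    """Hypothetical stand-in for a console that only accepts ASCII."""
    encoding = "ascii"


escaped = safe_for_console(u"\u0905\u0928\u094d", FakeStream())
print(escaped)
```

With a stream whose encoding can represent Devanagari, such as UTF-8, the
text passes through unchanged.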

Thanks


Re: Parsing unicode (devanagari) text with xml.dom.minidom

2009-03-07 Thread Stefan Behnel
rpar...@gmail.com wrote:
 I am trying to process an xml file that contains unicode characters
 (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
 entire content of the website into an xml file. Using
 xml.dom.minidom,  I wrote a few lines of python code to parse out the
 xml file, but am stuck with the following error:
 
 >>> import xml.dom.minidom
 >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
 >>> titles = dom.getElementsByTagName("title")
 >>> for title in titles:
 ...     print "childNode = ", title.childNodes
 ...
 childNode =  [<DOM Text node "Sanskrit N...">]
 childNode =  [<DOM Text node "Sanskrit N...">]
 childNode =  []
 childNode =  []
 childNode =  [<DOM Text node "1-1-1">]
 childNode =  Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode characters in position
 16-18: ordinal not in range(128)

That's because you are printing it out to your console, in which case you
need to make sure it's encoded properly for printing. repr() might also help.
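
A minimal sketch of what encoding explicitly could look like with minidom:
take the text from each node's `data` attribute rather than printing the node
list itself, then encode it before it leaves the program. `SAMPLE` and
`title_texts` are hypothetical names, and UTF-8 as the target encoding is an
assumption based on the locale shown in the question.

```python
import xml.dom.minidom

# Hypothetical stand-in for the export file (assumption, not the real dump).
SAMPLE = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u"<rss><channel><title>Sanskrit Notes</title>"
    u"<item><title>\u0905\u0928\u094d</title></item>"
    u"<item><title></title></item></channel></rss>"
).encode("utf-8")


def title_texts(xml_bytes):
    """Return the text content of each <title> element as a unicode string."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Collect only the text nodes; an empty element contributes "".
        parts = [node.data for node in title.childNodes
                 if node.nodeType == node.TEXT_NODE]
        texts.append(u"".join(parts))
    return texts


texts = title_texts(SAMPLE)
# Encode explicitly instead of relying on an implicit ASCII conversion.
encoded = [text.encode("utf-8") for text in texts]
for raw in encoded:
    # repr() of the encoded bytes is plain ASCII, so it is safe on any console.
    print(repr(raw))
```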

Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.
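
For comparison, a minimal sketch of the same title extraction with
ElementTree, assuming the titles live in plain `title` elements. `SAMPLE` is
a hypothetical stand-in for the export file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the export file (assumption, not the real dump).
SAMPLE = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u"<rss><channel><title>Sanskrit Notes</title>"
    u"<item><title>\u0905\u0928\u094d</title></item>"
    u"<item><title></title></item></channel></rss>"
).encode("utf-8")

root = ET.fromstring(SAMPLE)
# elem.text is None for an empty element, so normalise it to an empty string.
titles = [elem.text or u"" for elem in root.iter("title")]
```

The `iter` method is available from Python 2.7 on; the ElementTree shipped
with Python 2.5 spells it `getiterator`.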

Stefan