Re: Parsing unicode (devanagari) text with xml.dom.minidom
> Regarding minidom, you might be happier with the xml.etree package that comes with Python 2.5 and later (it's also available for older versions). It's a lot easier to use, more memory friendly and also much faster.

OTOH, choice of XML library is completely irrelevant for the issue at hand. If the OP is happy with minidom, we shouldn't talk him into using something else.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
Re: Parsing unicode (devanagari) text with xml.dom.minidom
Martin v. Löwis wrote:
> > Regarding minidom, you might be happier with the xml.etree package that comes with Python 2.5 and later (it's also available for older versions). It's a lot easier to use, more memory friendly and also much faster.
>
> OTOH, choice of XML library is completely irrelevant for the issue at hand.

For the described problem, maybe. But certainly not for the application. The background was parsing the XML dump of an entire web site, which I would expect to be larger than what minidom is designed to handle gracefully. Switching to cElementTree before major code gets written is almost certainly a good idea here.

Stefan
Re: Parsing unicode (devanagari) text with xml.dom.minidom
> For the described problem, maybe. But certainly not for the application. The background was parsing the XML dump of an entire web site, which I would expect to be larger than what minidom is designed to handle gracefully. Switching to cElementTree before major code gets written is almost certainly a good idea here.

I think minidom is designed to handle the very same documents that cElementTree is designed to handle (namely, documents that fit into memory).

Regards,
Martin
comparing (c)ElementTree and minidom (was: Parsing unicode (devanagari) text with xml.dom.minidom)
Martin v. Löwis wrote:
> > The background was parsing the XML dump of an entire web site, which I would expect to be larger than what minidom is designed to handle gracefully. Switching to cElementTree before major code gets written is almost certainly a good idea here.
>
> I think minidom is designed to handle the very same documents that cElementTree is designed to handle (namely, documents that fit into memory).

I do not doubt that a machine running a cElementTree application can handle exactly the same documents as a machine with, say, ten times as much memory that runs a minidom application.

However, when deciding which library to choose for a new application, it does matter what hardware you want to use it on. If you can handle documents several times larger on the same hardware, that might be as much of a reason to choose cElementTree as the (likely) shorter and more readable code (which usually translates into shorter development and debugging times) and the higher execution speed.

Honestly, I haven't seen a reason in a while why preferring minidom over any of the ElementTree derivatives would be a good idea when starting a new application.

Stefan
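For comparison, the title extraction from the original post might look roughly like this with ElementTree. This is only a sketch, assuming Python 3's `xml.etree.ElementTree` (the `cElementTree` module named in the thread was the C-accelerated variant shipped with Python 2.5). `SAMPLE`, `titles_with_etree` and `titles_incremental` are hypothetical names, and the sample document is a made-up stand-in for the WordPress export. The second function uses `iterparse` and clears each element once it has been seen, which is one common way to keep memory use down on large files. The sketch does not measure the memory or speed claims made in the thread.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the WordPress export (the real file is not shown).
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<rss><channel>'
    '<title>Sanskrit Notes</title>'
    '<item><title></title></item>'
    '<item><title>\u0905\u0928\u094d</title></item>'
    '</channel></rss>'
).encode("utf-8")


def titles_with_etree(source):
    """Parse the whole document, then collect the text of every <title>."""
    tree = ET.parse(source)
    # el.text is None for an empty element, so fall back to "".
    return [el.text or "" for el in tree.iter("title")]


def titles_incremental(source):
    """Collect <title> text while parsing, discarding elements as we go."""
    titles = []
    for _event, el in ET.iterparse(source, events=("end",)):
        if el.tag == "title":
            titles.append(el.text or "")
        # Drop the element's text, attributes and children once it is complete.
        el.clear()
    return titles


whole = titles_with_etree(io.BytesIO(SAMPLE))
streamed = titles_incremental(io.BytesIO(SAMPLE))
```

Both functions return ordinary strings, so nothing is encoded until the caller decides where the text should go.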
Re: Parsing unicode (devanagari) text with xml.dom.minidom
On Mar 8, 12:42 am, Stefan Behnel stefan...@behnel.de wrote:
> rpar...@gmail.com wrote:
> > I am trying to process an xml file that contains unicode characters (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the entire content of the website into an xml file. Using xml.dom.minidom, I wrote a few lines of python code to parse out the xml file, but am stuck with the following error:
> >
> > >>> import xml.dom.minidom
> > >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
> > >>> titles = dom.getElementsByTagName("title")
> > >>> for title in titles:
> > ...     print "childNode = ", title.childNodes
> > ...
> > childNode = [<DOM Text node "Sanskrit N...">]
> > childNode = [<DOM Text node "Sanskrit N...">]
> > childNode = []
> > childNode = []
> > childNode = [<DOM Text node "1-1-1">]
> > childNode = Traceback (most recent call last):
> >   File "<stdin>", line 2, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
>
> That's because you are printing it out to your console, in which case you need to make sure it's encoded properly for printing. repr() might also help.
>
> Regarding minidom, you might be happier with the xml.etree package that comes with Python 2.5 and later (it's also available for older versions). It's a lot easier to use, more memory friendly and also much faster.
>
> Stefan

Thanks for the reply. I didn't realize that printing to console was causing the problem. I am now able to parse out the relevant portions of my xml file. Will also look at the xml.etree module.

Regards
Parsing unicode (devanagari) text with xml.dom.minidom
Hello,

I am trying to process an xml file that contains unicode characters (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the entire content of the website into an xml file. Using xml.dom.minidom, I wrote a few lines of python code to parse out the xml file, but am stuck with the following error:

    >>> import xml.dom.minidom
    >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
    >>> titles = dom.getElementsByTagName("title")
    >>> for title in titles:
    ...     print "childNode = ", title.childNodes
    ...
    childNode = [<DOM Text node "Sanskrit N...">]
    childNode = [<DOM Text node "Sanskrit N...">]
    childNode = []
    childNode = []
    childNode = [<DOM Text node "1-1-1">]
    childNode = Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)

Python exited when it was trying to parse the following node:

    <title>अन् </title>

The xml header tells me that the document is UTF-8:

    <?xml version="1.0" encoding="UTF-8"?>

I am running python 2.5.1 on Mac OSX 10.5.6 and my locale settings are as below:

    $ locale
    LANG=en_US.UTF-8
    LC_COLLATE=en_US.UTF-8
    LC_CTYPE=en_US.UTF-8
    LC_MESSAGES=en_US.UTF-8
    LC_MONETARY=en_US.UTF-8
    LC_NUMERIC=en_US.UTF-8
    LC_TIME=en_US.UTF-8
    LC_ALL=

I googled around for similar errors, and tried using unicode but that didn't help either:

    >>> foo = unicode(titles[5].childNodes)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)

I'm a novice with unicode, and am not sure how best to handle the unicode text I'm dealing with (devanagari). Any suggestions will be helpful.

Thanks
Re: Parsing unicode (devanagari) text with xml.dom.minidom
rpar...@gmail.com wrote:
> I am trying to process an xml file that contains unicode characters (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the entire content of the website into an xml file. Using xml.dom.minidom, I wrote a few lines of python code to parse out the xml file, but am stuck with the following error:
>
> >>> import xml.dom.minidom
> >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
> >>> titles = dom.getElementsByTagName("title")
> >>> for title in titles:
> ...     print "childNode = ", title.childNodes
> ...
> childNode = [<DOM Text node "Sanskrit N...">]
> childNode = [<DOM Text node "Sanskrit N...">]
> childNode = []
> childNode = []
> childNode = [<DOM Text node "1-1-1">]
> childNode = Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)

That's because you are printing it out to your console, in which case you need to make sure it's encoded properly for printing. repr() might also help.

Regarding minidom, you might be happier with the xml.etree package that comes with Python 2.5 and later (it's also available for older versions). It's a lot easier to use, more memory friendly and also much faster.

Stefan
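One way to sidestep the error might be to read the text out of each node through its `data` attribute and encode it explicitly, rather than printing the list of node objects. The reported positions 16-18 line up with the three code points of the Devanagari title that follow the 16-character prefix `<DOM Text node "`, which suggests the failure occurs while the node's printable representation is being built. The following is a minimal sketch, assuming Python 3 (the thread used Python 2.5, where the same minidom calls exist). `SAMPLE` and `title_texts` are hypothetical names, and the sample document is a made-up stand-in for the WordPress export.

```python
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export (the real file is not shown).
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<rss>'
    '<title>Sanskrit Notes</title>'
    '<title></title>'
    '<title>\u0905\u0928\u094d</title>'
    '</rss>'
).encode("utf-8")


def title_texts(xml_bytes):
    """Return the text content of every <title> element as a list of strings."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Use the .data of text and CDATA nodes, not the node objects themselves.
        parts = [
            node.data
            for node in title.childNodes
            if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
        ]
        texts.append("".join(parts))
    return texts


texts = title_texts(SAMPLE)

# Encode explicitly when bytes are needed, e.g. for a file or a byte stream.
encoded = [text.encode("utf-8") for text in texts]
```

An empty `<title>` has no child nodes, so it yields an empty string rather than an error.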