Re: [CODE4LIB] XML Parsing and Python

2013-03-07 Thread Michael Beccaria
I'll note that 0x is a UTF-8 non-character, and these noncharacters should never be included in text interchange between implementations. [1] I assume the OCR engine may be using 0x when it can't recognize a character? So, it's not wrong for a parser

Re: [CODE4LIB] XML Parsing and Python

2013-03-07 Thread Jay Luker
On Thu, Mar 7, 2013 at 10:49 AM, Michael Beccaria mbecca...@paulsmiths.edu wrote: I ended up doing a regular expression find and replace function to replace all illegal xml characters with a dash or something. :( A string translation map might be a better approach. Here's what I do as one
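
The message above is cut off before its example, so the following is only a minimal sketch of what a string translation map could look like, not the poster's actual code. It assumes Python 3 and the character ranges that XML 1.0 excludes; the names `build_illegal_xml_map` and `scrub` are hypothetical.

```python
def build_illegal_xml_map(replacement="-"):
    """Map each code point that XML 1.0 forbids to a replacement string.

    XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and
    #x10000-#x10FFFF; everything else is listed here.
    """
    illegal = list(range(0x00, 0x09)) + [0x0B, 0x0C] + list(range(0x0E, 0x20))
    illegal += list(range(0xD800, 0xE000))  # surrogate code points
    illegal += [0xFFFE, 0xFFFF]             # noncharacters at the end of the BMP
    return dict.fromkeys(illegal, replacement)


# Build the map once and reuse it for every string.
ILLEGAL_XML_MAP = build_illegal_xml_map()


def scrub(text):
    """Replace illegal XML characters in text (hypothetical helper)."""
    return text.translate(ILLEGAL_XML_MAP)


cleaned = scrub("page\x0btext\uffff")
```

Compared with a regular expression, the map is built once and `str.translate` does a single pass over the text; tab, newline, and carriage return are left untouched because they are legal in XML.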

Re: [CODE4LIB] XML Parsing and Python

2013-03-07 Thread Al Matthews
-Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Chris Beer Sent: Tuesday, March 05, 2013 1:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] XML

Re: [CODE4LIB] XML Parsing and Python

2013-03-05 Thread Jon Stroop
Mike, I haven't used minidom extensively but my guess is that doc.toprettyxml(indent=" ", encoding="utf-8") isn't actually changing the encoding because it can't parse the string in your content variable. I'm surprised that you're not getting tossed a UnicodeError, but The docs for Node.toxml()
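
To make the `encoding` argument concrete, here is a small sketch of how `toprettyxml` behaves in Python 3 (an assumption; the thread dates from 2013 and may well have used Python 2, where string handling differs). The element content is made up for illustration.

```python
from xml.dom.minidom import Document

# Build a tiny document with one non-ASCII character in the text.
doc = Document()
page = doc.createElement("Page")
page.appendChild(doc.createTextNode("caf\u00e9"))
doc.appendChild(page)

# Without encoding, the result is a str and no encoding is declared.
as_text = doc.toprettyxml(indent="  ")

# With encoding, the result is bytes in that encoding, and the XML
# declaration names the encoding.
as_bytes = doc.toprettyxml(indent="  ", encoding="utf-8")
```

Note that `encoding` only controls how the output is serialized; it does not validate or repair characters that were already in the text nodes.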

Re: [CODE4LIB] XML Parsing and Python

2013-03-05 Thread Chris Beer
I'll note that 0x is a UTF-8 non-character, and these noncharacters should never be included in text interchange between implementations. [1] I assume the OCR engine may be using 0x when it can't recognize a character? So, it's not wrong for a parser to complain (or, not complain) about
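
As a rough illustration of a parser complaining, the sketch below checks whether Python's standard-library parser accepts a string. The code point in the message above is truncated, so it is not reproduced here; a vertical tab (U+000B), which XML 1.0 also forbids, is used as an assumed stand-in. The helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Ordinary text parses cleanly.
clean_ok = is_well_formed("<Page>plain text</Page>")

# A control character that XML 1.0 forbids makes the parser raise ParseError.
control_ok = is_well_formed("<Page>bad\x0bchar</Page>")
```

A check like this can be run on OCR output before it is written to the XML file, so that bad characters are caught at the source.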

[CODE4LIB] XML Parsing and Python

2013-03-04 Thread Michael Beccaria
I'm working on a project that takes the OCR data found in a PDF and places it in a custom XML file. I use Python scripts to create the XML file. Something like this (trimmed down a bit): from xml.dom.minidom import Document doc = Document() Page = doc.createElement("Page")

Re: [CODE4LIB] XML Parsing and Python

2013-03-04 Thread Stuart Myles
It sounds like your code isn't recognizing the XML file as UTF-8 (even though the encoding is correctly marked in your example). You could try telling the parser explicitly to use UTF-8, like this: parser = XMLParser(encoding="utf-8") As discussed in
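
The suggestion above does not say which library `XMLParser` comes from. Assuming it is `xml.etree.ElementTree` from the Python standard library, a minimal sketch of passing an explicit encoding might look like this (the sample bytes are made up):

```python
import xml.etree.ElementTree as ET

# UTF-8 encoded bytes with no XML declaration, as they might be read from a file.
raw = "<Page>caf\u00e9</Page>".encode("utf-8")

# The encoding argument overrides whatever the document itself declares.
parser = ET.XMLParser(encoding="utf-8")
root = ET.fromstring(raw, parser=parser)

text = root.text
```

This only helps when the bytes really are valid UTF-8; characters that XML forbids still cause a parse error regardless of the declared encoding.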