[CODE4LIB] XML Parsing and Python
On Thu, Mar 7, 2013 at 10:49 AM, Michael Beccaria
mbecca...@paulsmiths.edu wrote:
I ended up writing a regular-expression find-and-replace function that
replaces every illegal XML character with a dash.
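For readers who want to try the same thing, here is a minimal sketch of such a find-and-replace. It is not the poster's actual function; the names ILLEGAL_XML_CHARS and replace_illegal_xml_chars are made up for illustration, and the character class is derived from the XML 1.0 "Char" production.

```python
import re

# Code points that the XML 1.0 "Char" production does NOT allow:
# C0 controls other than tab, LF and CR, the surrogate range,
# and U+FFFE / U+FFFF.
ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def replace_illegal_xml_chars(text, replacement="-"):
    """Return text with every illegal XML 1.0 character replaced."""
    return ILLEGAL_XML_CHARS.sub(replacement, text)


# A form feed and U+FFFF are replaced; the tab is legal and kept.
cleaned = replace_illegal_xml_chars("ab\x0cc\uffffd\te")
```

Passing replacement="" drops the characters instead of marking them with a dash.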
:(
A string translation map might be a better approach. Here's what I do as
one
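The example that followed is not preserved here. As a rough sketch of the translation-map idea only (not the poster's code; build_illegal_xml_map and clean_for_xml are hypothetical names), the same set of illegal XML 1.0 code points can be handed to str.translate:

```python
def build_illegal_xml_map(replacement="-"):
    """Map every code point that XML 1.0 forbids to a replacement.

    Use replacement=None to delete the characters instead.
    """
    illegal = list(range(0x00, 0x09)) + [0x0B, 0x0C]
    illegal += list(range(0x0E, 0x20))       # remaining C0 controls
    illegal += list(range(0xD800, 0xE000))   # surrogates
    illegal += [0xFFFE, 0xFFFF]
    return dict.fromkeys(illegal, replacement)


# Build the table once and reuse it for every string.
ILLEGAL_XML_MAP = build_illegal_xml_map()


def clean_for_xml(text):
    """Return text with illegal XML characters replaced via str.translate."""
    return text.translate(ILLEGAL_XML_MAP)


cleaned = clean_for_xml("a\x00b\uffffc\td")
```

The table is a plain dict keyed by code point, so it is easy to extend with other characters you want to drop or rewrite.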
-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Chris Beer
Sent: Tuesday, March 05, 2013 1:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] XML
Mike,
I haven't used minidom extensively, but my guess is that
doc.toprettyxml(indent=" ", encoding="utf-8") isn't actually changing the
encoding because it can't parse the string in your content variable. I'm
surprised that you're not getting tossed a UnicodeError, but the docs
for Node.toxml()
I'll note that 0x is a Unicode noncharacter, and these noncharacters
should never be included in text interchange between implementations. [1] I
assume the OCR engine may be using 0x when it can't recognize a character?
So, it's not wrong for a parser to complain (or, not complain) about
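To check whether OCR output contains any noncharacters at all, something like the sketch below could be used. It relies on the Unicode definition (U+FDD0 through U+FDEF, plus the last two code points of every plane); the function names are made up for illustration.

```python
def is_noncharacter(ch):
    """Return True if ch is one of the 66 Unicode noncharacters."""
    cp = ord(ch)
    # U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE/FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text):
    """Return (index, "U+XXXX") pairs for each noncharacter in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if is_noncharacter(ch)
    ]


hits = find_noncharacters("ab\uffffc")
```

Running this over the extracted text before building the XML shows exactly where the OCR engine emitted a placeholder.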
I'm working on a project that takes the OCR data found in a PDF and places it
in a custom XML file.
I use Python scripts to create the XML file. Something like this (trimmed down
a bit):
from xml.dom.minidom import Document
doc = Document()
Page = doc.createElement("Page")
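A self-contained version of the trimmed-down snippet above might look like the following. This is a sketch, not the original script: the build_page_xml helper, the single text node and the sample string are assumptions added so the example runs on its own under Python 3.

```python
from xml.dom.minidom import Document


def build_page_xml(ocr_text):
    """Wrap ocr_text in a <Page> element and serialize it as UTF-8 bytes."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    page.appendChild(doc.createTextNode(ocr_text))
    # With an encoding argument, toprettyxml() returns bytes and writes
    # the matching encoding into the XML declaration.
    return doc.toprettyxml(indent=" ", encoding="utf-8")


xml_bytes = build_page_xml("caf\u00e9")
```

Note that the result is a bytes object, so it should be written to a file opened in binary mode.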
It sounds like your code isn't recognizing the XML file as UTF-8 (even
though the encoding is correctly marked in your example).
You could try telling the parser explicitly to use UTF-8, like this:
parser = XMLParser(encoding="utf-8")
As discussed in
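The message does not say which library XMLParser comes from. Assuming it is the standard library's xml.etree.ElementTree (lxml offers a class with the same name), a minimal sketch of parsing with an explicit encoding could look like this; parse_as_utf8 and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_as_utf8(xml_bytes):
    """Parse xml_bytes, telling the parser explicitly to use UTF-8."""
    # The encoding given here overrides any encoding declared in the document.
    parser = ET.XMLParser(encoding="utf-8")
    return ET.fromstring(xml_bytes, parser=parser)


root = parse_as_utf8("<Page>caf\u00e9</Page>".encode("utf-8"))
```

This only helps with encoding detection. Characters that XML forbids outright will still make the parser raise an error, so they need to be cleaned out first.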