Ankit Rastogi wrote:
> I am having problem with parsing CDATA section. I am using PyXml and minidom
> for parsing the xml document.
> My motive is to get the data back in the same format in one string as it is
> writen in xml file. Here is the sample:
> --
> <StateChg>
> <![CDATA[
> check.. its cdata section
>
> all data is printed in it s format
> ]]>
> <StateChg>
> --
> put when I print all its childs using:
> ---
> print StChg.childNodes #StChg. is instance to <StateChg> element
> --
> It gives following output
> --
> [<DOM Text node "\n">, <DOM Text node "\t\t\t\t \t">, <DOM Text node "\n">,
> <DOM Text node "\t\t\t\t\t chec...">, <DOM Text node "\n">, <DOM Text node
> "\t\t\t\t\t all ...">, <DOM Text node "\n">, <DOM Text node "\t\t\t\t\t ">,
> <DOM Text node "\n">, <DOM Text node "\t\t\t\t\t">]
> --
>
> The output Shows it is text node but we had declare it as
> CDATA_SECTION_NODE.
>
> and also the output is not desired ( format lost and some data is lost),
>
> Why its happening. What I have to do to get the same output as in xml with
> the format and indentation.
>
> Please,correct me, where I am wrong

In XML, you have the logical constructs: elements, attributes, character data, processing instructions, and comments. You use markup to represent these constructs. And you are currently using the DOM API to access an abstract representation of them: an implicit tree of nodes.

At the markup level, a span of character data can be written using either (1) literal characters, numeric character references, and entity references, or (2) a CDATA section, consisting of literal characters only, bounded by start and end markers. There is no semantic difference between the two; they are just two different ways of writing the same thing. Thus "1 &amp; 2 are &lt; 3" in regular markup is exactly the same as "1 & 2 are < 3" in a CDATA section.

It is common for a parser to report each span of character data separately. It may say "this character data was written using a CDATA section" and "this character data was written with regular markup"; or it may just say "I saw this character data, and then I saw this other character data". It is also possible that character and entity references in the markup will be treated as separate spans of character data. Very long spans might be split as well.

Consequently, in both the SAX and DOM APIs, these separate reports from the parser *may* manifest as separate, consecutive 'characters' events (in SAX) or as separate Text nodes (in DOM). You must be prepared to see them in chunks.

You must also realize that it is not incorrect to see CDATA sections as Text nodes in an implementation that only supports the Core Interfaces of DOM. DOM does have a CDATASection node, which is a subclass of Text, but it is only in the Extended Interfaces, which are optional. So if an implementation chooses to support DOM's Extended Interfaces, then CDATA will manifest as CDATASection instead of Text.

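To make the "be prepared to see chunks" point concrete, here is a minimal sketch that reads the character data of an element no matter how the parser chose to split it or which node type it reported. It assumes a current Python 3 with the standard-library minidom; the helper name character_data is made up for illustration and is not part of any API.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def character_data(element):
    """Join the data of every Text and CDATASection child of `element`.

    Hypothetical helper: it only looks at direct children and ignores
    child elements, comments and processing instructions.
    """
    parts = []
    for child in element.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
    return "".join(parts)


# The same character data written two ways: regular markup vs. CDATA section.
regular = parseString("<test>1 &amp; 2 are &lt; 3</test>")
cdata = parseString("<test><![CDATA[1 & 2 are < 3]]></test>")

regular_text = character_data(regular.documentElement)
cdata_text = character_data(cdata.documentElement)

# Both spellings carry the same logical content.
print(regular_text == cdata_text)
```

Because the helper concatenates whatever chunks it finds, it does not matter whether a given parser or version hands back one node or several.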
CDATASection nodes are in fact supported in newer versions of minidom, despite the docs at http://python.org/doc/2.4.2/lib/minidom-and-dom.html which say otherwise. These nodes and some of the other extended interfaces blur the distinction between lexical markup and the logical constructs that the markup is intended to represent, so they typically make things more difficult for users, which is why they're optional.

As Dieter Maurer pointed out, you can merge adjacent Text nodes by calling the normalize() method on any ancestor of the nodes. However, by design, this only works on Text nodes, not CDATASection nodes, as per DOM requirements.

Python 2.2.3:

>>> from xml.dom.minidom import getDOMImplementation, parseString
>>> impl = getDOMImplementation()
>>> impl.hasFeature('Core', '2.0') # core interfaces?
1
>>> impl.hasFeature('XML', '2.0') # extended interfaces?
0
>>> doc = parseString('<test>1 &amp; 2 are &lt; 3 ... <![CDATA[1 & 2 are < 3]]></test>')
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 ">, <DOM Text node "&">, <DOM Text node " 2 are ">, <DOM Text node "<">, <DOM Text node " 3 ... ">, <DOM Text node "1 & 2 are ...">]
>>> doc.normalize()
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 & 2 are ...">]
>>> doc.childNodes[0].childNodes[0].data
u'1 & 2 are < 3 ... 1 & 2 are < 3'

Python 2.4.2:

>>> from xml.dom.minidom import getDOMImplementation, parseString
>>> impl = getDOMImplementation()
>>> impl.hasFeature('Core', '2.0') # core interfaces?
True
>>> impl.hasFeature('XML', '2.0') # extended interfaces?
True
>>> doc = parseString('<test>1 &amp; 2 are &lt; 3 ... <![CDATA[1 & 2 are < 3]]></test>')
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 & 2 are ...">, <DOM CDATASection node "1 & 2 are ...">]
>>> doc.normalize()
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 & 2 are ...">, <DOM CDATASection node "1 & 2 are ...">]
>>> doc.childNodes[0].childNodes[0].data
u'1 & 2 are < 3 ... '
>>> doc.childNodes[0].childNodes[1].data
u'1 & 2 are < 3'

If you need to merge adjacent CDATASection nodes and/or mixed Text and CDATASection nodes, there is no built-in function to do that. You'll have to roll your own. There's no way to disable the creation of CDATASection nodes in minidom.

Anyway, you should not expect to be able to precisely reproduce, or even know exactly, what was in the lexical markup of your original, unparsed document once you have run it through a parser, and especially not after you access the parser's reports through a relatively abstract API like DOM or SAX or the XPath tree model. You can produce XML that *means* the same thing as the original, but you're not going to get XML that *looks* exactly like the original. If you expect to do that, then you shouldn't be running your XML through a real XML parser at all.

Mike