Does anybody have an idea? Solr seems to ignore the DTD definition and therefore does not recognize entities such as &uuml; or &auml; that are defined in the DTD. Is that the problem? If so, how can I tell Solr to take the DTD definition into account?
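In case it helps anyone who hits the same error: one possible workaround, independent of any Solr setting, is to preprocess the file so that named entities become numeric character references, which an XML parser accepts without a DTD. The following is only a minimal sketch under assumptions: it treats the entity names in the file as the standard HTML 4 / Latin-1 names known to Python's `html.entities` (which I believe covers `uuml` and `auml`), and the function names and the output file name are made up for illustration.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as &uuml; but not numeric ones such as &#252;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_named_entities(text):
    """Replace known named entities with numeric character references."""
    def repl(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])
    return ENTITY_RE.sub(repl, text)


def convert_file(src, dst, encoding="ISO-8859-1"):
    """Stream src line by line so a huge file does not have to fit in memory."""
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(expand_named_entities(line))
```

Calling something like `convert_file("documents/dblp.xml", "documents/dblp-expanded.xml")` and pointing the `url` attribute of the entity at the new file should then avoid the entity lookup entirely. This sidesteps the DTD question rather than answering it, so I would still like to know the proper way.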
On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki <v...@belenki.name> wrote:
> Dear community,
>
> I am experiencing a strange problem while trying to import an XML
> document into Solr via the DataImportHandler. The XML document contains
> some special characters (e.g. the German ü) that are represented as XML
> entities such as &uuml; or &auml;. There is also a DTD file that defines
> these entities (<!ENTITY uuml "ü" >); I tried both referencing the DTD
> file and including the DTD definition in the XML itself. After I start
> the full-import command, the import process throws an exception as soon
> as it tries to parse &uuml;: "Undeclared general entity "uuml"".
> Has anyone already faced such a problem?
>
> best regards,
>
> Michael
>
> My data-config for importing is:
>
> <dataConfig>
>   <dataSource type="FileDataSource" encoding="ISO-8859-1" />
>   <document>
>     <!-- stream should be true since huge xml document is being parsed -->
>     <entity name="article"
>             processor="XPathEntityProcessor"
>             stream="true"
>             forEach="/dblp/article"
>             url="documents/dblp.xml">
>       <field column="key" xpath="/dblp/article/@key" />
>       <field column="title" xpath="/dblp/article/title" />
>     </entity>
>   </document>
> </dataConfig>
>
> The XML file looks, for example, like this:
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
>
> <!DOCTYPE dblp [
>
> <!ENTITY uuml "ü" ><!-- small u, dieresis or umlaut mark -->
> ]>
> <dblp>
>
> <article key="journals/fm/Riccardi09" mdate="2011-10-27">
> <author>Marco Riccardi</author>
> <title>Solution of Cubic and Quartic Equations.&uuml;</title>
> <pages>117-122</pages>
> <year>2009</year>
> <volume>17</volume>
>
> <journal>Formalized Mathematics</journal>
>
> <number>1-4</number>
> <ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee><url>db/journals/fm/fm17.html#Riccardi09</url>
> </article></dblp>
>
> The stack-trace is:
>
> 05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
> 05.07.2012 17:37:19 org.apache.solr.common.SolrException log
> SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
> Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
>     ... 3 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
>     at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
>     ... 5 more
> Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
>  at [row,col {unknown-source}]: [26,42]
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:187)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor$2.run(XPathEntityProcessor.java:427)
> Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
>  at [row,col {unknown-source}]: [26,42]
>     at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
>     at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:467)
>     at com.ctc.wstx.sr.BasicStreamReader.handleUndeclaredEntity(BasicStreamReader.java:5431)
>     at com.ctc.wstx.sr.StreamScanner.expandUnresolvedEntity(StreamScanner.java:1661)
>     at com.ctc.wstx.sr.StreamScanner.expandEntity(StreamScanner.java:1555)
>     at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1523)
>     at com.ctc.wstx.sr.BasicStreamReader.skipTokenText(BasicStreamReader.java:3568)
>     at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3342)
>     at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2622)
>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:376)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$200(XPathRecordReader.java:202)
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:184)
>     ... 1 more
>
> 05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
> 05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: end_rollback