Does anybody have an idea? Solr seems to ignore the DTD definition and
therefore does not understand entities like &uuml; or &auml; that are
defined in the DTD. Is that the problem? If so, how can I tell Solr to take
the DTD definition into account?
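
To narrow this down outside of Solr, here is a minimal sketch that parses a
tiny document with an internal DTD subset twice, once with the StAX property
SUPPORT_DTD enabled and once with it disabled. It is only an assumption that
the DataImportHandler's parser behaves as if DTD support were switched off;
nothing here confirms that. The class name EntityDemo and the inline sample
XML are made up for illustration, and the sketch uses whatever StAX
implementation the JDK provides rather than Woodstox (which appears in the
stack trace), so the exact behaviour may differ.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class EntityDemo {
    // Hypothetical miniature of dblp.xml: the entity is declared in the
    // internal DTD subset and referenced inside <title>.
    static final String XML =
        "<!DOCTYPE dblp [ <!ENTITY uuml \"&#252;\"> ]>"
        + "<dblp><article><title>Equations.&uuml;</title></article></dblp>";

    // Returns all character data of the document, or null if the parser
    // rejects it (e.g. because the entity is treated as undeclared).
    static String parseTitle(boolean supportDtd) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, supportDtd);
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(XML));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            return null; // parsing failed
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("SUPPORT_DTD=true:  " + parseTitle(true));
        System.out.println("SUPPORT_DTD=false: " + parseTitle(false));
    }
}
```

If the second call fails or loses the umlaut while the first one resolves
it, that would at least be consistent with the parser not reading the DTD.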

On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki <v...@belenki.name>
wrote:
> Dear community,
> 
> I am experiencing a strange problem while trying to import an XML
> document into Solr via the DataImportHandler. The XML document contains
> some special characters (e.g. German ü) that are represented as XML
> entities &uuml; or &auml;. There is also a DTD file that defines these
> entities (<!ENTITY uuml    "ü" >); I tried using a separate DTD file as
> well as including the DTD definition in the XML itself. After I start the
> full-import command, the import process throws an exception as soon as it
> tries to parse &uuml;: "Undeclared general entity "uuml"". Has anyone
> already faced such a problem?
> 
> best regards,
> 
> Michael
> 
> 
> My data-config for importing is:
> 
> 
> <dataConfig>
>         <dataSource type="FileDataSource" encoding="ISO-8859-1" />
>         <document>
>               <!-- stream should be true since a huge XML document is being parsed -->
>         <entity name="article"
>                 processor="XPathEntityProcessor"
>                 stream="true"
>                 forEach="/dblp/article"
>                 url="documents/dblp.xml"
> 
>                 >
>             <field column="key"        xpath="/dblp/article/@key" />
>             <field column="title"     xpath="/dblp/article/title" />
> 
> 
>        </entity>
>         </document>
> </dataConfig>
> 
> The XML file looks e.g. like this:
> 
> <?xml version="1.0" encoding="ISO-8859-1"?>
> 
> <!DOCTYPE dblp [
> 
>     <!ENTITY uuml    "ü" ><!-- small u, dieresis or umlaut mark -->
> ]>
> <dblp>
> 
> <article key="journals/fm/Riccardi09" mdate="2011-10-27">
> <author>Marco Riccardi</author>
> <title>Solution of Cubic and Quartic Equations.&uuml;</title>
> <pages>117-122</pages>
> <year>2009</year>
> <volume>17</volume>
> 
> <journal>Formalized Mathematics</journal>
> 
> <number>1-4</number>
> <ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee><url>db/journals/fm/fm17.html#Riccardi09</url>
> </article></dblp>
> 
> The stack-trace is:
> 
> 05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
> 05.07.2012 17:37:19 org.apache.solr.common.SolrException log
> SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
>         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
>         at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
>         at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
> Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
>         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
>         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
>         ... 3 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>         at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
>         at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
>         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
>         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
>         ... 5 more
> Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
>  at [row,col {unknown-source}]: [26,42]
>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:187)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor$2.run(XPathEntityProcessor.java:427)
> Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
>  at [row,col {unknown-source}]: [26,42]
>         at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
>         at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:467)
>         at com.ctc.wstx.sr.BasicStreamReader.handleUndeclaredEntity(BasicStreamReader.java:5431)
>         at com.ctc.wstx.sr.StreamScanner.expandUnresolvedEntity(StreamScanner.java:1661)
>         at com.ctc.wstx.sr.StreamScanner.expandEntity(StreamScanner.java:1555)
>         at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1523)
>         at com.ctc.wstx.sr.BasicStreamReader.skipTokenText(BasicStreamReader.java:3568)
>         at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3342)
>         at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2622)
>         at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:376)
>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$200(XPathRecordReader.java:202)
>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:184)
>         ... 1 more
> 
> 05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
> 05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: end_rollback
