Re: Problem while indexing XML file with special characters represented uuml
I think the issue here is that DIH uses Woodstox's BasicStreamReader (see http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html), which has only minimal DTD support. It might be best to use ValidatingStreamReader (http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/ValidatingStreamReader.html) instead. I think you could get this by requesting a validating XmlReader; that is a setting exposed at the factory level, and the factory returns the parser (i.e. an XmlReader). But then you would probably also get validation turned on, which might not be desirable in all cases. This should probably be a user setting for XPathEntityProcessor somewhere?

-Mike

On 07/10/2012 07:10 PM, Chris Hostetter wrote:

: Does anybody have an idea? Solr seems to ignore the DTD definition and therefore
: does not understand entities like &uuml; or &auml; that are defined in the
: DTD. Is that the problem? If yes, how can I tell Solr to consider the DTD
: definition?

Solr is just utilizing the built-in Java XML parser for this, so there's nothing you can tell Solr to make it consider the DTD, but it is odd that this isn't working by default with Java's parser. I suspect there is some hint XPathEntityProcessor should be giving the parser to ask it to look at these ENTITY declarations.

I've filed a Jira issue to try and track this (and included a test case), but unfortunately I don't really know what the fix is...

https://issues.apache.org/jira/browse/SOLR-3614

-Hoss
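To make the "factory level" setting above concrete, here is a minimal sketch using only the standard StAX API (javax.xml.stream), which Woodstox implements. It assumes the standard XMLInputFactory properties SUPPORT_DTD, IS_REPLACING_ENTITY_REFERENCES and IS_VALIDATING are the relevant knobs; whether DIH sets them this way, or whether this is the eventual fix for SOLR-3614, is not established in this thread. The class name EntityExpansionSketch and the method readText are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of Solr or DIH.
public class EntityExpansionSketch {

    // Parses the XML with a StAX reader and returns all character data concatenated.
    static String readText(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to process the DOCTYPE so ENTITY declarations are seen.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
        // Ask for entity references to be replaced by their declared text.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        // Leave DTD validation off: only entity handling is wanted here.
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Cut-down version of the document from the original report.
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>"
                + "<dblp><article><title>Solution of Cubic and Quartic Equations.&uuml;"
                + "</title></article></dblp>";
        System.out.println(readText(xml));
    }
}
```

The point of the sketch is that entity replacement and validation are separate factory properties in StAX, so in principle the first can be requested without the second.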
Re: Problem while indexing XML file with special characters represented uuml
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't use a true XML parser? You might want to try passing your documents through xmllint --noent (basically parse and reserialize); that should inline the characters as UTF-8?

On 07/09/2012 03:18 PM, Michael Belenki wrote:

Does anybody have an idea? Solr seems to ignore the DTD definition and therefore does not understand entities like &uuml; or &auml; that are defined in the DTD. Is that the problem? If yes, how can I tell Solr to consider the DTD definition?

On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki v...@belenki.name wrote:

Dear community,

I am experiencing a strange problem while trying to import an XML document into Solr via DataImportHandler. The XML document contains some special characters (e.g. German ü) that are represented as XML entities, &uuml; or &auml;. There is also a DTD file that defines these entities (<!ENTITY uuml "&#252;">). I tried using the DTD file as well as including the DTD definition in the XML itself. After I start the full-import command, the import process throws an exception as soon as it tries to parse &uuml;: Undeclared general entity "uuml". Has anyone faced such a problem already?

Best regards,
Michael

My data-config for importing is:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="ISO-8859-1" />
      <document>
        <!-- stream should be true since huge xml document is being parsed -->
        <entity name="article" processor="XPathEntityProcessor" stream="true"
                forEach="/dblp/article" url="documents/dblp.xml">
          <field column="key" xpath="/dblp/article/@key" />
          <field column="title" xpath="/dblp/article/title" />
        </entity>
      </document>
    </dataConfig>

The XML file looks e.g. like this:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE dblp [
    <!ENTITY uuml "&#252;"><!-- small u, dieresis or umlaut mark -->
    ]>
    <dblp>
    <article key="journals/fm/Riccardi09" mdate="2011-10-27">
    <author>Marco Riccardi</author>
    <title>Solution of Cubic and Quartic Equations.&uuml;</title>
    <pages>117-122</pages>
    <year>2009</year>
    <volume>17</volume>
    <journal>Formalized Mathematics</journal>
    <number>1-4</number>
    <ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee>
    <url>db/journals/fm/fm17.html#Riccardi09</url>
    </article>
    </dblp>

The stack trace is:

    05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
    05.07.2012 17:37:19 org.apache.solr.common.SolrException log
    SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
    Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
        ... 3 more
    Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
        at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
        at
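The "parse and reserialize" workaround suggested at the top of this message can also be sketched in Java with the JDK's DOM parser and an identity Transformer. This is only an illustration of the idea, not an xmllint replacement: the class name InlineEntitiesSketch and the method inlineEntities are made up, and because it builds a DOM it loads the whole document into memory, so it would not suit a file as large as the full dblp.xml.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Solr or DIH.
public class InlineEntitiesSketch {

    // Parses the XML (expanding declared entities) and writes it back out
    // with UTF-8 as the declared output encoding.
    static String inlineEntities(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Expand entity references into plain text nodes while parsing.
        dbf.setExpandEntityReferences(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // A Transformer created without a stylesheet copies the input unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Cut-down version of the document from the original report.
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>"
                + "<dblp><article><title>Solution of Cubic and Quartic Equations.&uuml;"
                + "</title></article></dblp>";
        System.out.println(inlineEntities(xml));
    }
}
```

The reserialized document carries the character itself instead of the &uuml; reference, so a parser that ignores the DTD no longer has anything to resolve.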
Re: Problem while indexing XML file with special characters represented uuml
: Does anybody have an idea? Solr seems to ignore the DTD definition and therefore
: does not understand entities like &uuml; or &auml; that are defined in the
: DTD. Is that the problem? If yes, how can I tell Solr to consider the DTD
: definition?

Solr is just utilizing the built-in Java XML parser for this, so there's nothing you can tell Solr to make it consider the DTD, but it is odd that this isn't working by default with Java's parser. I suspect there is some hint XPathEntityProcessor should be giving the parser to ask it to look at these ENTITY declarations.

I've filed a Jira issue to try and track this (and included a test case), but unfortunately I don't really know what the fix is...

https://issues.apache.org/jira/browse/SOLR-3614

-Hoss
Re: Problem while indexing XML file with special characters represented uuml
Does anybody have an idea? Solr seems to ignore the DTD definition and therefore does not understand entities like &uuml; or &auml; that are defined in the DTD. Is that the problem? If yes, how can I tell Solr to consider the DTD definition?

On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki v...@belenki.name wrote:

Dear community,

I am experiencing a strange problem while trying to import an XML document into Solr via DataImportHandler. The XML document contains some special characters (e.g. German ü) that are represented as XML entities, &uuml; or &auml;. There is also a DTD file that defines these entities (<!ENTITY uuml "&#252;">). I tried using the DTD file as well as including the DTD definition in the XML itself. After I start the full-import command, the import process throws an exception as soon as it tries to parse &uuml;: Undeclared general entity "uuml". Has anyone faced such a problem already?

Best regards,
Michael

My data-config for importing is:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="ISO-8859-1" />
      <document>
        <!-- stream should be true since huge xml document is being parsed -->
        <entity name="article" processor="XPathEntityProcessor" stream="true"
                forEach="/dblp/article" url="documents/dblp.xml">
          <field column="key" xpath="/dblp/article/@key" />
          <field column="title" xpath="/dblp/article/title" />
        </entity>
      </document>
    </dataConfig>

The XML file looks e.g. like this:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE dblp [
    <!ENTITY uuml "&#252;"><!-- small u, dieresis or umlaut mark -->
    ]>
    <dblp>
    <article key="journals/fm/Riccardi09" mdate="2011-10-27">
    <author>Marco Riccardi</author>
    <title>Solution of Cubic and Quartic Equations.&uuml;</title>
    <pages>117-122</pages>
    <year>2009</year>
    <volume>17</volume>
    <journal>Formalized Mathematics</journal>
    <number>1-4</number>
    <ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee>
    <url>db/journals/fm/fm17.html#Riccardi09</url>
    </article>
    </dblp>

The stack trace is:

    05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
    05.07.2012 17:37:19 org.apache.solr.common.SolrException log
    SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
    Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
        ... 3 more
    Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
        at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
        at