Re: Problem while indexing XML file with special characters represented uuml

2012-07-11 Thread Mike Sokolov
I think the issue here is that DIH uses Woodstox BasicStreamReader 
(see 
http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html) 
which has only minimal DTD support.  It might be best to use 
ValidatingStreamReader 
(http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/ValidatingStreamReader.html) 
instead.


I think you could get this by requesting a validating XMLStreamReader; 
validation is a setting exposed at the factory level, and the factory is 
what returns the parser (i.e. an XMLStreamReader).  But then you would 
probably also get validation turned on, which might not be so great in 
all cases.  This should probably be a user setting for 
XPathEntityProcessor somewhere?
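
To make the factory-level idea concrete, here is a minimal sketch using only the standard StAX API (javax.xml.stream). It is not DIH's actual code and has not been tried against Woodstox specifically. It assumes the parser on the classpath honours the standard SUPPORT_DTD and IS_REPLACING_ENTITY_REFERENCES properties, and it leaves validation off. The class name EntityExpansionSketch and the helper readTitle are made up for illustration.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class EntityExpansionSketch {

    // Hypothetical helper: returns the text of the first <title> element, or null.
    static String readTitle(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to read the DTD so that ENTITY declarations are known.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
        // Replace entity references such as &uuml; with their declared text.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        // Keep validation off: we only want entity expansion, not DTD validation.
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Cut-down version of the document from this thread, with an internal DTD subset.
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE dblp [ <!ENTITY uuml \"&#252;\"> ]>\n"
                + "<dblp><article><title>Equations.&uuml;</title></article></dblp>";
        System.out.println(readTitle(xml));
    }
}
```

If something like this works with the parser DIH ends up using, the remaining question is only where XPathEntityProcessor should expose the switch.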


-Mike

On 07/10/2012 07:10 PM, Chris Hostetter wrote:

: Somebody any idea? Solr seems to ignore the DTD definition and therefore
: does not understand the entities like &uuml; or &auml; that are defined in
: dtd. Is it the problem? If yes how can I tell SOLR to consider the DTD
: definition?

Solr is just utilizing the builtin java XML parser for this, so there's
nothing you can set to tell solr to consider the DTD, but it is odd that
this isn't working by default with java's parser -- i suspect there is some
hint XPathEntityProcessor should be giving the parser to ask it to look
at these ENTITY declarations.

I've filed a Jira issue to try and track this (and included a test case)
but unfortunately i don't really know what the fix is...

https://issues.apache.org/jira/browse/SOLR-3614



-Hoss
   


Re: Problem while indexing XML file with special characters represented uuml

2012-07-10 Thread Mike Sokolov
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't 
use a true XML parser?


You might want to try passing your documents through xmllint --noent 
(basically parse and reserialize); that should inline the characters as 
UTF-8.
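
If xmllint is not at hand, the same parse-and-reserialize idea can be sketched in Java with the JDK's identity transformer. This is an assumption-laden illustration rather than something tested against the full dblp.xml: it relies on the default JAXP parser expanding entities declared in the internal DTD subset, and it works on an in-memory string, so a real run on a huge file would use file streams instead. The names InlineEntitiesSketch and inlineEntities are made up for this example.

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class InlineEntitiesSketch {

    // Hypothetical helper: parses the XML and writes it back out, so that
    // entity references declared in the DTD are replaced by their text.
    static String inlineEntities(String xml) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not reserialize XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE dblp [ <!ENTITY uuml \"&#252;\"> ]>\n"
                + "<dblp><article><title>Equations.&uuml;</title></article></dblp>";
        // The output should no longer contain the &uuml; reference.
        System.out.println(inlineEntities(xml));
    }
}
```

The reserialized file no longer depends on the DTD, so it can be fed to DIH whether or not the parser there looks at ENTITY declarations.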


On 07/09/2012 03:18 PM, Michael Belenki wrote:

Somebody any idea? Solr seems to ignore the DTD definition and therefore
does not understand the entities like &uuml; or &auml; that are defined in
dtd. Is it the problem? If yes how can I tell SOLR to consider the DTD
definition?

Re: Problem while indexing XML file with special characters represented uuml

2012-07-10 Thread Chris Hostetter

: Somebody any idea? Solr seems to ignore the DTD definition and therefore
: does not understand the entities like &uuml; or &auml; that are defined in
: dtd. Is it the problem? If yes how can I tell SOLR to consider the DTD
: definition?

Solr is just utilizing the builtin java XML parser for this, so there's 
nothing you can set to tell solr to consider the DTD, but it is odd that 
this isn't working by default with java's parser -- i suspect there is some 
hint XPathEntityProcessor should be giving the parser to ask it to look 
at these ENTITY declarations.

I've filed a Jira issue to try and track this (and included a test case) 
but unfortunately i don't really know what the fix is...

https://issues.apache.org/jira/browse/SOLR-3614



-Hoss


Re: Problem while indexing XML file with special characters represented uuml

2012-07-09 Thread Michael Belenki
Somebody any idea? Solr seems to ignore the DTD definition and therefore
does not understand the entities like &uuml; or &auml; that are defined in
dtd. Is it the problem? If yes how can I tell SOLR to consider the DTD
definition?

On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki v...@belenki.name
wrote:
 Dear community,
 
 I am experiencing a strange problem while trying to import an XML
 document into Solr via DataImportHandler. The XML document contains some
 special characters (e.g. German ü) that are represented as XML entities
 such as &uuml; or &auml;. There is also a DTD file that defines these
 entities (<!ENTITY uuml "&#252;">). I tried using the DTD file as well as
 including the DTD definition in the XML itself. After I start the
 full-import command, the import process throws an exception as soon as it
 tries to parse &uuml;: Undeclared general entity "uuml". Has anyone
 already faced such a problem?
 
 best regards,
 
 Michael
 
 
 My data-config for importing is:
 
 
 <dataConfig>
   <dataSource type="FileDataSource" encoding="ISO-8859-1" />
   <document>
     <!-- stream should be true since huge xml document is being parsed -->
     <entity name="article"
             processor="XPathEntityProcessor"
             stream="true"
             forEach="/dblp/article"
             url="documents/dblp.xml">
       <field column="key" xpath="/dblp/article/@key" />
       <field column="title" xpath="/dblp/article/title" />
     </entity>
   </document>
 </dataConfig>
 
 The XML file looks e.g. like this:
 
 <?xml version="1.0" encoding="ISO-8859-1"?>
 <!DOCTYPE dblp [
   <!ENTITY uuml "&#252;"><!-- small u, dieresis or umlaut mark -->
 ]>
 <dblp>
 <article key="journals/fm/Riccardi09" mdate="2011-10-27">
 <author>Marco Riccardi</author>
 <title>Solution of Cubic and Quartic Equations.&uuml;</title>
 <pages>117-122</pages>
 <year>2009</year>
 <volume>17</volume>
 <journal>Formalized Mathematics</journal>
 <number>1-4</number>
 <ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee>
 <url>db/journals/fm/fm17.html#Riccardi09</url>
 </article>
 </dblp>
 
 The stack-trace is:
 
 05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
 INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
 05.07.2012 17:37:19 org.apache.solr.common.SolrException log
 SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
 Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
     ... 3 more
 Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
     at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
     at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
     at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
 at