Re: Problem while indexing XML file with special characters represented uuml

2012-07-11 Thread Mike Sokolov
I think the issue here is that DIH uses the Woodstox BasicStreamReader (see http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html), which has only minimal DTD support. It might be best to use ValidatingStreamReader instead.
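Why the parser's level of DTD support matters can be sketched with a stand-in parser. The example below uses Python's standard-library ElementTree (backed by expat), not Woodstox or DIH, so it only illustrates the general behaviour: a named entity such as &uuml; is accepted when the parser sees its declaration and rejected when it does not. The sample documents and the helper name parses_cleanly are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: no DTD at all, so &uuml; is never declared.
UNDECLARED = "<doc><title>M&uuml;nchen</title></doc>"

# Hypothetical sample: the same content, with the entity declared in an
# internal DTD subset that the parser reads as part of the document.
DECLARED = (
    '<!DOCTYPE doc [ <!ENTITY uuml "&#252;"> ]>'
    "<doc><title>M&uuml;nchen</title></doc>"
)


def parses_cleanly(xml_text: str) -> bool:
    """Return True if xml_text parses, False on an XML parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses_cleanly(UNDECLARED))
print(parses_cleanly(DECLARED))
```

ElementTree does not fetch external DTD files, so this sketch only covers declarations in the internal subset; entities defined in a separate .dtd file need a parser that loads the DTD.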

Re: Problem while indexing XML file with special characters represented uuml

2012-07-10 Thread Mike Sokolov
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't use a true XML parser? You might want to try passing your documents through xmllint -noent (basically parse and reserialize); that should inline the characters as UTF-8.
On 07/09/2012 03:18 PM, Michael Belenki wrote:
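The parse-and-reserialize suggestion in this reply can also be sketched in Python, as a rough stand-in for the xmllint step. This is a minimal sketch under assumptions: it uses the standard-library ElementTree, the sample document is hypothetical, and the entities are declared in an internal DTD subset because ElementTree does not load external DTD files. The helper name inline_entities is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The entities are declared in an internal
# DTD subset, since ElementTree's parser does not fetch external DTDs.
SOURCE = (
    '<?xml version="1.0"?>\n'
    "<!DOCTYPE doc [\n"
    '  <!ENTITY uuml "&#252;">\n'
    '  <!ENTITY auml "&#228;">\n'
    "]>\n"
    "<doc><title>M&uuml;nchen und K&auml;se</title></doc>"
)


def inline_entities(xml_text: str) -> bytes:
    """Parse xml_text and reserialize it as UTF-8 bytes.

    The parser expands declared entities while reading, so the output
    contains the literal characters instead of entity references.
    """
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="utf-8")


result = inline_entities(SOURCE)
print(result.decode("utf-8"))
```

The reserialized output no longer depends on the DTD, so a downstream consumer with limited DTD support would see plain UTF-8 text. For documents whose entities live in an external DTD, a tool that actually loads the DTD (such as the xmllint invocation mentioned above) is still needed.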

Re: Problem while indexing XML file with special characters represented uuml

2012-07-10 Thread Chris Hostetter
: Somebody any idea? Solr seems to ignore the DTD definition and therefore
: does not understand the entities like &uuml; or &auml; that are defined in
: dtd. Is it the problem? If yes how can I tell SOLR to consider the DTD
: definition?

Solr is just utilizing the builtin java XML parser for

Re: Problem while indexing XML file with special characters represented uuml

2012-07-09 Thread Michael Belenki
Does anybody have an idea? Solr seems to ignore the DTD definition and therefore does not understand entities like &uuml; or &auml; that are defined in the DTD. Is that the problem? If so, how can I tell Solr to consider the DTD definition?
On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki