Christian,

This is interesting. I have always thought that Solr shouldn't be in the business of parsing; that is the responsibility of the Solr client. But what Peter suggested, adding a parsing capability to Solr as a request handler, does make sense.
One thing I noticed this approach can't do (or won't fit nicely with), however, is crawling docs. If that is a requirement for you, then using Nutch as a crawler and parser for Solr may be an answer. This is the link Otis provided in a discussion a while ago:

http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html

I was going to try this but haven't done so yet. I am also aware that Solr's plugin architecture is different from, and in certain aspects superior to, Nutch's. I recall that Nutch has had an issue handling non-European languages in its parsing code, but that might not be an issue here since Solr provides the search capability.

-kuro