Back in November, Shalin and Grant were discussing integrating DataImportHandler and Tika. Shalin's suggestion for the best way to do this was as follows:
** I think the best way would be a TikaEntityProcessor which knows how to handle documents. I guess a typical use-case would be FileListEntityProcessor->TikaEntityProcessor as parent-child entities. Also see SOLR-833 which adds a FieldReaderDataSource using which you can pass any field's content to an entity for processing. So you can have a [SqlEntityProcessor, JdbcDataSource] producing a blob and a [FieldReaderDataSource, TikaEntityProcessor] consuming it. **
(http://www.nabble.com/DataImportHandler-and-Blobs-td20464891.html)

Has there been any work on something like this? Alternatively, has anyone else put together another way to get DataImportHandler to extract body text from PDFs, Word files, etc.?

Thanks,
Chris