Back in November, Shalin and Grant were discussing integrating
DataImportHandler and Tika. Shalin's suggestion for the best way to
do this was as follows:

**

I think the best way would be a TikaEntityProcessor which knows how to
handle documents. I guess a typical use-case would be
FileListEntityProcessor->TikaEntityProcessor as parent-child entities.

Also see SOLR-833 which adds a FieldReaderDataSource using which you can
pass any field's content to an entity for processing. So you can have a
[SqlEntityProcessor, JdbcDataSource] producing a blob and a
[FieldReaderDataSource, TikaEntityProcessor] consuming it.

(http://www.nabble.com/DataImportHandler-and-Blobs-td20464891.html)

**

Has there been any work on something like this? Alternatively, has
anyone else put together another way to get DataImportHandler
to extract body text from PDFs, Word files, etc.?

Thanks,
Chris