Sending tika extracted metadata, text and original binary content from a document to elasticsearch

Mike Caceres Sun, 09 Aug 2015 12:07:38 -0700

Hello,

Let's say I have a single MS-Word document and would like to use ManifoldCF to
crawl it and send it to elasticsearch using the Attachment plugin.


Currently I am able to crawl the MS-Word document successfully and send two
things to elasticsearch: a) the metadata extracted from the MS-Word document
by the Tika Content Extractor and b) the plain text detected by the
Boilerplate "Extract Everything".

If I remove the Tika Content Extractor from the pipeline, I am able to send the
actual binary data from the MS-Word document, and the elasticsearch Attachment
plugin is able to index it, but the metadata associated with the document is
not as rich as when I use the Tika Content Extractor.
  
Now I would like to combine both, so that in the end I have three things in
elasticsearch: a) the metadata extracted by the Tika Content Extractor,
b) the plain text of the document and c) the binary data from the MS-Word
document (so I can do downstream processing as needed).
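
To make the goal concrete, the combined document could look roughly like the
sketch below. This is only an illustration of the target shape, not ManifoldCF
or plugin code: the function name and the field names "content" and "file" are
hypothetical, and it assumes the binary part is carried as a base64 string in
the JSON body.

```python
import base64
import json


def build_combined_document(metadata, plain_text, binary_content):
    """Merge metadata, extracted text and raw bytes into one JSON-ready dict.

    The field names "content" and "file" are hypothetical placeholders.
    """
    doc = dict(metadata)              # a) metadata fields from the extractor
    doc["content"] = plain_text       # b) plain text of the document
    # c) original bytes, base64-encoded so they fit in a JSON string
    doc["file"] = base64.b64encode(binary_content).decode("ascii")
    return doc


if __name__ == "__main__":
    # Placeholder values standing in for a real crawl result.
    example = build_combined_document(
        {"author": "Mike", "title": "Example"},
        "Plain text of the document.",
        b"\x00\x01 raw bytes of the .doc file",
    )
    print(json.dumps(example, indent=2))
```

Downstream processing could then recover the original file by base64-decoding
the "file" field.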

How can I achieve that?

Thank you!

Mike
