Hello,

Let's say I have a single MS-Word document and would like to use ManifoldCF to 
crawl it and send it to elasticsearch using the Attachment plugin.

Currently I am able to crawl the MS-Word document successfully and send two 
things to elasticsearch: a) the metadata extracted from the MS-Word document 
by the Tika Content Extractor and b) the plain text produced with the 
Boilerplate setting "Extract Everything".

If I remove the Tika Content Extractor from the pipeline, I am able to send the 
actual binary data of the MS-Word document, and the elasticsearch Attachment 
plugin is able to index it, but the metadata associated with the document is 
not as rich as when I use the Tika Content Extractor.
  
Now I would like to combine both, so that in the end I have three things in 
elasticsearch: a) the metadata extracted by the Tika Content Extractor, 
b) the plain text of the document, and c) the binary data of the MS-Word 
document (so I can do downstream processing as needed).
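
To make the goal concrete, here is a rough sketch in Python of the kind of 
document I would like to end up with. The field names (metadata, content, 
file) and the helper build_target_document are made up for illustration and 
are not what ManifoldCF actually emits; the only assumption taken from the 
Attachment plugin is that the binary part is sent base64-encoded.

```python
import base64
import json


def build_target_document(metadata, plain_text, raw_bytes):
    """Combine the three parts into one JSON-serializable dict.

    All field names here are hypothetical placeholders.
    """
    return {
        # a) metadata as extracted by the Tika Content Extractor
        "metadata": dict(metadata),
        # b) plain text of the document
        "content": plain_text,
        # c) original binary, base64-encoded for the Attachment plugin
        "file": base64.b64encode(raw_bytes).decode("ascii"),
    }


# Placeholder values only; a real run would use the crawled document.
doc = build_target_document(
    {"title": "Example", "author": "Mike"},
    "Plain text of the document.",
    b"\xd0\xcf\x11\xe0 placeholder bytes",
)
print(json.dumps(doc, indent=2))
```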

How can I achieve that?

Thank you!

Mike
