Hello, Let's say I have a single MS-Word document and would like to use ManifoldCF to crawl it and send it to Elasticsearch using the Attachment plugin.
Currently I am able to crawl the MS-Word document and send two things to Elasticsearch: a) the metadata extracted from the document by the Tika Content Extractor and b) the plain text detected by the Boilerplate "Extract Everything".

If I remove the Tika Content Extractor from the pipeline, I can send the actual binary data of the MS-Word document, and the Elasticsearch Attachment plugin is able to index it. However, the metadata associated with the document is then not as rich as when I use the Tika Content Extractor.

Now I would like to combine both, so that in the end I have three things in Elasticsearch: a) the metadata extracted by the Tika Content Extractor, b) the plain text of the document, and c) the binary data of the MS-Word document (so I can do downstream processing as needed). How can I achieve that?

Thank you!
Mike
