I wish to do two things when processing OUTLOOK MSG files to get all the metadata. They seem simple enough.
Goal 1. IGNORE all attachments.

I was able to ignore the attachments by setting a FilenameFilter on the ParseContext, ONLY while parsing MSG files, so that _all_ attachments are ignored.
myParseContext.set(FilenameFilter.class, new MyFilenameFilter());
myParseContext.set(Parser.class, parser);
parser.parse(tinput, handler, metadata, myParseContext);
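
MyFilenameFilter is not shown above. A minimal sketch of what such a filter might look like, assuming it implements java.io.FilenameFilter and simply rejects every name (the class name RejectAllFilenameFilter is hypothetical, not the class actually used):

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical stand-in for MyFilenameFilter: rejects every name it is asked about.
class RejectAllFilenameFilter implements FilenameFilter {
    @Override
    public boolean accept(File dir, String name) {
        // Never accept, so no attachment name passes the filter.
        return false;
    }
}
```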

But I was not able to accomplish my second goal.

Goal 2. Get the BODY text separated out by Tika. I need both the subject, which is available in the resulting metadata, and the body text, so that I can send them off to Lucene/ElasticSearch.

Apparently a simple BodyContentHandler() isn't sufficient, as mentioned in various old (?) comments.
How do I do one of the following?
a. get an XHTML marked up version of the results, so the body is more obvious
or
b. configure it to just get the body text
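
To make option (b) concrete: Tika reports parse results as XHTML SAX events to whatever org.xml.sax.ContentHandler is passed to parse(), so one possible approach is a custom handler that keeps only the character data inside the body element. The sketch below uses only the JDK; the class name BodyTextCollector and the extract() helper are hypothetical, and the helper feeds a plain XHTML string through the JDK SAX parser purely for demonstration. Whatever the parser writes inside body would still be included. For option (a), Tika's ToXMLContentHandler may be worth a look, since it serializes the events as XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects only the text that appears inside <body>.
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, then track nesting of its descendants.
        if (bodyDepth > 0 || "body".equals(localName) || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text in <head> (for example the title) is skipped.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String bodyText() {
        return text.toString().trim();
    }

    // Demo helper (assumption: input is well-formed XHTML held in a String).
    static String extract(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            BodyTextCollector handler = new BodyTextCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.bodyText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

With Tika, an instance of such a handler would be passed as the handler argument of parser.parse(...), and bodyText() read afterwards, while the subject would still come from the metadata.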

I'm thinking I'm missing something obvious.
thanks,
-Paul



