I wish to do two things when processing OUTLOOK MSG files to get all the
metadata. They seem simple enough.
Goal 1. IGNORE all attachments.
I was able to ignore the attachments by adding a FilenameFilter to the
ParseContext, ONLY while parsing MSG files, so that _all_ attachments are
skipped.
myParseContext.set(FilenameFilter.class, new MyFilenameFilter());
myParseContext.set(Parser.class, parser);
parser.parse(tinput, handler, metadata, myParseContext);
But I was not able to accomplish my second goal.
Goal 2. Get the BODY text separated out by Tika. I need the subject,
which is available in the resulting metadata, and the body text, so that
I can send them off to Lucene/ElasticSearch.
Apparently a simple BodyContentHandler() isn't sufficient, as mentioned
in various old (?) comments.
How do I do one of the following:
a. get an XHTML marked-up version of the results, so the body is more
obvious
or
b. configure it to just get the body text
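
For illustration, here is a minimal sketch of both options. It relies on the
fact that Tika parsers report their output as XHTML SAX events to whatever
ContentHandler is passed to parse(). Everything below uses only the JDK:
the names BodySketch, BodyTextHandler, xhtmlHandler and feed are
hypothetical, and feed() merely replays a hand-written XHTML string in place
of a real parser.parse(tinput, handler, metadata, myParseContext) call. It is
not Tika's own implementation, and whether it gives the desired body for MSG
files is untested here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class BodySketch {

    /** Option b: keep only the character data found inside <body>. */
    public static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // > 0 while we are inside a <body> element

        private static boolean isBody(String localName, String qName) {
            // localName is set by namespace-aware sources, qName otherwise
            return "body".equals(localName) || "body".equals(qName);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (isBody(localName, qName)) bodyDepth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isBody(localName, qName)) bodyDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Option a: a handler that serializes incoming SAX events as markup into 'out'. */
    public static TransformerHandler xhtmlHandler(StringWriter out) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new StreamResult(out));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for parser.parse(...): replays an XHTML string as SAX events. */
    public static void feed(String xhtml, ContentHandler handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample; a real run would get these events from Tika.
        String xhtml = "<html><head><title>Subject line</title></head>"
                + "<body><p>Message body</p></body></html>";

        BodyTextHandler bodyOnly = new BodyTextHandler();
        feed(xhtml, bodyOnly);
        System.out.println(bodyOnly); // text from inside <body> only

        StringWriter markup = new StringWriter();
        feed(xhtml, xhtmlHandler(markup));
        System.out.println(markup); // full markup, <body> included
    }
}
```

With Tika itself, either handler would be passed as the handler argument of
parser.parse(...). Tika also ships org.apache.tika.sax.ToXMLContentHandler,
which may cover option a without the JDK transformer.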
I'm thinking I'm missing something obvious.
thanks,
-Paul