On Mon, 16 Apr 2012, babug wrote:
I have attached an Outlook template file (Ticket_Diary.oft). I need to parse files of this type and get the actual HTML content. I have tested with the following code, but the parser returns <p> tags instead of <table> or <div> tags. How do I exclude them from the SAFE_ELEMENTS map?

It might not be stored as HTML at all: Outlook often stores the "HTML" content of emails as RTF.
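
As a rough illustration of the RTF point: when Outlook wraps HTML inside RTF, the RTF header carries a \fromhtml1 control word, so a quick check on the RTF body can tell you whether there is any HTML to recover. The sketch below is plain Java and makes some assumptions: the class and method names are made up for this example, the 512-character header window is an arbitrary heuristic, and it expects the RTF as an already-decompressed string (in .msg/.oft files the RTF body is stored compressed, so you would need to obtain and decompress it first).

```java
final class RtfHtmlCheck {

    // Hypothetical heuristic: only look near the start of the document,
    // since \fromhtml1 belongs in the RTF header, not in the body text.
    private static final int HEADER_WINDOW = 512;

    /**
     * Returns true if the given RTF looks like HTML encapsulated in RTF,
     * i.e. it starts with an RTF group and has a \fromhtml control word
     * near the beginning.
     */
    static boolean isEncapsulatedHtml(String rtf) {
        if (rtf == null || !rtf.startsWith("{\\rtf")) {
            return false; // not RTF at all
        }
        String header = rtf.substring(0, Math.min(rtf.length(), HEADER_WINDOW));
        return header.contains("\\fromhtml");
    }

    public static void main(String[] args) {
        // Shortened, hand-written sample headers, not taken from a real message.
        String wrappedHtml = "{\\rtf1\\ansi\\ansicpg1252\\fromhtml1 \\deff0";
        String plainRtf = "{\\rtf1\\ansi\\deff0 Just some text}";

        System.out.println(isEncapsulatedHtml(wrappedHtml));
        System.out.println(isEncapsulatedHtml(plainRtf));
    }
}
```

If the check comes back true, the original <table> and <div> markup is inside the RTF and has to be de-encapsulated from there, which an HTML parser on its own will not do.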

Also...

    String msgfile = "/home/test/Desktop/EmailParse/Ticket Diary.oft";
    InputStream stream = new FileInputStream(msgfile);
    StringWriter sw = new StringWriter();
    Parser parser = new OfficeParser();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

This looks like you're using Tika. If you want to use Tika for this, you should probably ask on the Tika list. Alternatively, you can use HSMF from Apache POI to access the file directly and get at exactly the bits of it you need. I'd suggest looking at the HSMF text extractor in POI, and at OutlookExtractor in Apache Tika, as good examples of how to go about using HSMF.

Nick
