Hello Tika Team,

We need to parse files with Tika for text extraction.

When parsing an HTML file (e.g. eml_parser.html), Tika does not preserve the line breaks properly: it emits \n instead of <br> tags. It also produces far more newline characters (\n) than the source file contains. These newline characters are ignored when the output is rendered in an HTML iframe, so the text is displayed on a single line, which does not look good.

We are using Tika version 3.2.2. I have attached the Tika code, the input file (eml_parser.html) and the output HTML (tika_processor.html).

How can we handle this in Tika?

Best Regards,
sthadhani
Sent for test purposes only.
Ticket link for reference:
https://abc.atlassian.net/browse/SPE-6986
Ticket content:
Cf. attached documents
Ambroise LAURENT
Product manager - Spectra
26 rue de Montholon • 75009
Paris, France
// Parse the message body and serialize the resulting SAX events as XHTML.
AutoDetectParser parser = new AutoDetectParser();
ByteArrayOutputStream htmlOS = new ByteArrayOutputStream();
ContentHandler handler = new ToXMLContentHandler(htmlOS, "UTF-8");
ParseContext context = new ParseContext();
Metadata metadata = new Metadata();
parser.parse(metaAndBodyAndAttachments.body.getInputStream(), handler, metadata, context);
// UTF_8 is assumed to be a static import of java.nio.charset.StandardCharsets.UTF_8.
String html = htmlOS.toString(UTF_8);

Resulting html string (excerpt, line breaks shown as escape sequences):
\n\n\n\r\n\r\nSent for test purposes only.\n\r\n\r\n\n\r\n\n\r\n\r\nTicket link for reference: \r\nhttps://abc.atlassian.net/browse/SPE-6986\n\r\n\r\n\n\r\n\n\r\n\r\nTicket content:\n\r\n\r\n\n\r\n\n\r\n\r\n
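
One possible workaround, independent of Tika itself, is to post-process the extracted text before it is placed in the iframe: unify the line terminators, collapse the long runs of blank lines, and turn the remaining newlines into <br/> tags. The sketch below uses only the JDK. The class LineBreakNormalizer and its methods are hypothetical names invented for this illustration, not a Tika API, and it assumes the input is plain extracted text (not the full XHTML markup, where newlines between tags should not become <br/>).

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Tika. */
public final class LineBreakNormalizer {

    // Any single line terminator: CRLF, a lone CR, or a lone LF.
    private static final Pattern ANY_NEWLINE = Pattern.compile("\\r\\n|\\r|\\n");

    // Three or more newlines in a row, allowing spaces or tabs on the empty lines.
    private static final Pattern EXTRA_BLANK_LINES = Pattern.compile("\\n(?:[ \\t]*\\n){2,}");

    private LineBreakNormalizer() {
    }

    /** Unifies terminators to \n, caps blank-line runs at one blank line, trims the ends. */
    public static String collapse(String text) {
        String unified = ANY_NEWLINE.matcher(text).replaceAll("\n");
        String collapsed = EXTRA_BLANK_LINES.matcher(unified).replaceAll("\n\n");
        return collapsed.strip(); // String.strip() requires Java 11 or later
    }

    /** Escapes HTML special characters, then renders each remaining newline as a br tag. */
    public static String toHtmlBreaks(String text) {
        String escaped = collapse(text)
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        return escaped.replace("\n", "<br/>");
    }

    public static void main(String[] args) {
        // Shortened version of the output excerpt shown above.
        String raw = "\n\n\n\r\n\r\nSent for test purposes only.\n\r\n\r\n\n\r\n\n\r\n\r\n"
                + "Ticket link for reference: \r\nhttps://abc.atlassian.net/browse/SPE-6986\n";
        System.out.println(toHtmlBreaks(raw));
    }
}
```

If the text is displayed without being converted to markup, an alternative that needs no Java changes is to style the container in the iframe with the CSS rule white-space: pre-wrap, which makes the browser honor \n characters; the blank-line collapsing step above would still be useful in that case.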