Hello Tika Team,

We need to parse files with Tika for text extraction.

When parsing an HTML file (e.g. eml_parser.html), Tika does not handle line 
breaks properly: it emits \n instead of <br> tags, and it produces far more 
newline characters (\n) than the source file contains.

These newline characters (\n) are ignored when the output is rendered in an 
HTML iframe, so the text runs together on one line, which doesn't look good.

We are using Tika version 3.2.2. I have attached the Tika code, the input 
file (eml_parser.html) and the output HTML (tika_processor.html).

How can we handle this in Tika?


Best Regards,
sthadhani

This e-mail and its attachments contain confidential information from 
oppscience, which is intended only for the person or entity whose address is 
listed above. Any use of the information contained herein in any way 
(including, but not limited to, total or partial disclosure, reproduction, or 
dissemination) by persons other than the intended recipient(s) is prohibited. 
If you received this e-mail in error, please notify the sender by phone or 
email immediately and delete it.
Sent for test purposes only.


Ticket content:


Cf. attached documents


Ambroise LAURENT
Product manager - Spectra
26 rue de Montholon • 75009
Paris, France


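    import static java.nio.charset.StandardCharsets.UTF_8;
    import java.io.ByteArrayOutputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;

    // Serialize the parsed document as XHTML into an in-memory buffer
    AutoDetectParser parser = new AutoDetectParser();
    ByteArrayOutputStream htmlOS = new ByteArrayOutputStream();
    ContentHandler handler = new ToXMLContentHandler(htmlOS, "UTF-8");

    ParseContext context = new ParseContext();
    Metadata metadata = new Metadata();
    parser.parse(metaAndBodyAndAttachments.body.getInputStream(), handler,
            metadata, context);

    String html = htmlOS.toString(UTF_8);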
\n\n\n\r\n\r\nSent for test purposes only.\n\r\n\r\n\n\r\n\n\r\n\r\nTicket link for reference: \r\nhttps://abc.atlassian.net/browse/SPE-6986\n\r\n\r\n\n\r\n\n\r\n\r\nTicket content:\n\r\n\r\n\n\r\n\n\r\n\r\n\"\"\n\r\n\r\n\n\r\n\n\r\n\r\nCf. attached documents\n\r\n\r\n\n\r\n\n\r\n\r\n\r\n\n\r\n\n\r\n\r\nAmbroise LAURENT\n\r\n\r\nProduct manager - Spectra\n\r\n\r\[email protected]\n\r\n\r\n26 rue de Montholon • 75009\n\r\n\r\nParis, France\n\r\n\r\n\n\r\n\n\r\n

\r\n\"\"

\n\r\n\r\n\n\r\n\n\r\n\n\r\n
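
One possible way to deal with the runs of \r\n seen in the output above is to
put a small SAX filter in front of the serializing handler, so that each run of
line breaks inside a text node becomes a single <br> element. The following is
only a sketch under assumptions, not a built-in Tika feature: the class name
BrLineBreakFilter and its demo helper are hypothetical, and it uses only the
JDK's org.xml.sax classes so that it runs without Tika on the classpath. It
also assumes that line breaks directly at the start or end of an element can be
dropped, since the surrounding block element already starts a new line.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter: collapses each run of line breaks in text into one br element. */
class BrLineBreakFilter extends XMLFilterImpl {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private boolean textSeen;      // non-blank text was already forwarded in this element
    private boolean breakPending;  // a line break was seen after that text

    BrLineBreakFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        textSeen = false;          // element boundary: forget leading/trailing breaks
        breakPending = false;
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        textSeen = false;
        breakPending = false;
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int end = start + length;
        int runStart = start;
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n' || ch[i] == '\r') {
                flush(ch, runStart, i - runStart);
                breakPending = textSeen;   // breaks before any text are ignored
                runStart = i + 1;
            }
        }
        flush(ch, runStart, end - runStart);
    }

    /** Forwards one break-free run of text, emitting a pending br before non-blank text. */
    private void flush(char[] ch, int start, int length) throws SAXException {
        if (length == 0) {
            return;
        }
        boolean blank = new String(ch, start, length).trim().isEmpty();
        if (!blank) {
            if (breakPending) {
                super.startElement(XHTML, "br", "br", new AttributesImpl());
                super.endElement(XHTML, "br", "br");
                breakPending = false;
            }
            textSeen = true;
        }
        super.characters(ch, start, length);
    }

    /** Demo helper: sends text chunks through the filter inside a p element. */
    static String demo(String... chunks) {
        StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int s, int len) {
                out.append(ch, s, len);
            }
        };
        try {
            BrLineBreakFilter filter = new BrLineBreakFilter(sink);
            filter.startElement(XHTML, "p", "p", new AttributesImpl());
            for (String chunk : chunks) {
                filter.characters(chunk.toCharArray(), 0, chunk.length());
            }
            filter.endElement(XHTML, "p", "p");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

If this approach suits the use case, the filter could presumably be wired into
the snippet above as new BrLineBreakFilter(new ToXMLContentHandler(htmlOS,
"UTF-8")) and passed to parser.parse(...) in place of handler. That wiring is
an assumption and has not been checked against Tika 3.2.2.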
