Tika 3.2.2 - Html file parsing doesn't process line breaks properly

Sunny Thadhani Fri, 12 Dec 2025 02:57:49 -0800

Hello Tika Team,

We need to parse files with Tika for text extraction.


When parsing an HTML file (e.g. eml_parser.html), Tika does not handle line 
breaks properly: it emits \n instead of <br> tags. It also emits far more 
newline characters (\n) than the source file contains.

These newline characters (\n) are ignored when the output is rendered in an 
HTML iframe, so all the text appears on a single line, which is hard to read.

We are using Tika version 3.2.2. I have attached the Tika code, the input 
file (eml_parser.html), and the output HTML (tika_processor.html).

How can we handle this in Tika?


Best Regards,
sthadhani

Sent for test purposes only.

Ticket link for reference: https://abc.atlassian.net/browse/SPE-6986

Ticket content:

Cf. attached documents

Ambroise LAURENT

Product manager - Spectra

[email protected]

26 rue de Montholon • 75009

Paris, France

    AutoDetectParser parser = new AutoDetectParser();
    ByteArrayOutputStream htmlOS = new ByteArrayOutputStream();
    ContentHandler handler = new ToXMLContentHandler(htmlOS, "UTF-8");

    ParseContext context = new ParseContext();
    Metadata metadata = new Metadata();
    parser.parse(metaAndBodyAndAttachments.body.getInputStream(), handler,
            metadata, context);

    String html = htmlOS.toString(UTF_8);

\n\n\n\r\n\r\nSent for test purposes only.\n\r\n\r\n\n\r\n\n\r\n\r\nTicket link for reference: \r\nhttps://abc.atlassian.net/browse/SPE-6986\n\r\n\r\n\n\r\n\n\r\n\r\nTicket content:\n\r\n\r\n\n\r\n\n\r\n\r\n $\"\"$ \n\r\n\r\n\n\r\n\n\r\n\r\nCf. attached documents\n\r\n\r\n\n\r\n\n\r\n\r\n\r\n\n\r\n\n\r\n\r\nAmbroise LAURENT\n\r\n\r\nProduct manager - Spectra\n\r\n\r\[email protected]\n\r\n\r\n26 rue de Montholon â¢ 75009\n\r\n\r\nParis, France\n\r\n\r\n\n\r\n\n\r\n

\r\n $\"\"$

\n\r\n\r\n\n\r\n\n\r\n\n\r\n
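
One possible workaround, sketched here under assumptions rather than as a Tika 
feature, is to post-process the extracted text before placing it in the iframe. 
The class and method names (LineBreaks, toHtmlBreaks) are hypothetical and not 
part of Tika. The sketch assumes the input is plain extracted text, not the 
XHTML written by ToXMLContentHandler, because it escapes markup characters. It 
uses only the Java standard library (Java 11 or later for String.strip).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Tika API: turns newline-separated plain text
// into HTML that keeps its line breaks when rendered.
final class LineBreaks {
    // Matches any single line terminator: CRLF, lone CR, or lone LF.
    private static final Pattern ANY_NEWLINE = Pattern.compile("\\r\\n|\\r|\\n");
    // Matches three or more consecutive line ends, allowing trailing spaces/tabs.
    private static final Pattern BLANK_RUN = Pattern.compile("(?:[ \\t]*\\n){3,}");

    private LineBreaks() {}

    static String toHtmlBreaks(String text) {
        // 1. Normalize mixed \r\n, \r and \n to plain \n.
        String s = ANY_NEWLINE.matcher(text).replaceAll("\n");
        // 2. Collapse long runs of line breaks to a single blank line.
        s = BLANK_RUN.matcher(s).replaceAll("\n\n");
        // 3. Drop leading and trailing whitespace, including newlines.
        s = s.strip();
        // 4. Escape markup characters so the text is safe to embed in HTML.
        s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        // 5. Make each remaining line break visible to an HTML renderer.
        return s.replace("\n", "<br>");
    }
}
```

For example, LineBreaks.toHtmlBreaks("a\r\n\r\n\n\r\nb") gives "a<br><br>b". 
Whether to collapse blank lines at all, and to how many, is a design choice 
that depends on how closely the rendered text should follow the source.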
