Hello,

We want <br> tags in the output instead of \n. The output also contains many
unnecessary \n characters. It should follow the layout of eml_parser.html.

We also tried ToHTMLContentHandler, but it didn't resolve the issue.
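
To make the desired result concrete, the sketch below post-processes already extracted text: it trims each line, collapses runs of blank lines into a single blank line, and joins the remaining lines with <br>. It is plain Java and not part of the Tika API. The names LineBreakNormalizer, toHtmlBreaks and escapeHtml are hypothetical, and it assumes the extracted text is available as a String. Whether this matches the layout of eml_parser.html is untested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: normalizes line breaks in extracted text.
final class LineBreakNormalizer {

    private LineBreakNormalizer() {}

    // Escapes the characters that are significant in HTML text content.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Trims every line, keeps at most one blank line between text lines,
    // drops leading and trailing blank lines, and joins the rest with "<br>\n".
    static String toHtmlBreaks(String text) {
        String[] lines = text.split("\\R", -1); // \R matches \n, \r\n, \r, etc.
        List<String> kept = new ArrayList<>();
        boolean previousBlank = true; // start as "blank" so leading blank lines are skipped
        for (String raw : lines) {
            String line = raw.trim();
            boolean blank = line.isEmpty();
            if (blank && previousBlank) {
                continue; // collapse consecutive blank lines
            }
            kept.add(escapeHtml(line));
            previousBlank = blank;
        }
        // A single trailing blank line may remain; remove it.
        if (!kept.isEmpty() && kept.get(kept.size() - 1).isEmpty()) {
            kept.remove(kept.size() - 1);
        }
        return String.join("<br>\n", kept);
    }

    public static void main(String[] args) {
        String extracted = "Hello\n\n\n\nWorld\n  a < b \n\n";
        System.out.println(toHtmlBreaks(extracted));
    }
}
```

The <br> is followed by a real newline only to keep the generated markup readable; a browser or iframe ignores that newline and breaks on the <br> alone.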

________________________________
From: Tim Allison <[email protected]>
Sent: 12 December 2025 20:04
To: [email protected] <[email protected]>
Subject: Re: Tika 3.2.2 - Html file parsing doesn't process line breaks properly

Sorry. That was the triggering file. Got it.

What exactly do you expect/want? You want <br> for new lines? Or do you want a 
single \n for the new line and you're getting too many?

Have you tried the ToHTMLContentHandler?

On Fri, Dec 12, 2025 at 9:31 AM Tim Allison 
<[email protected]<mailto:[email protected]>> wrote:
Can you share a triggering file and the exact output you'd expect/want? Thank 
you.

On Fri, Dec 12, 2025 at 5:58 AM Sunny Thadhani 
<[email protected]<mailto:[email protected]>> wrote:
Hello Tika Team,

We need to parse files with Tika for text extraction.

When parsing an HTML file (e.g. eml_parser.html), Tika does not preserve the
line breaks properly: it emits \n instead of <br> tags. The output also
contains many more newline characters (\n) than the source file.

These newline characters (\n) are ignored when the output is shown in an HTML
iframe, so the text is rendered on a single line, which doesn't look good.

We are using Tika version 3.2.2. I have attached our Tika code, the input
file (eml_parser.html), and the output HTML (tika_processor.html).

How can we handle this in Tika?


Best Regards,
sthadhani

