Hi Tim,

Thank you so much for clarifying that part for me. THAT is really useful.
Kindest regards,

christian

From: Tim Allison <[email protected]>
Date: Tuesday, 1 November 2022 at 17:09
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged

Sorry. Took a while to make time to look in detail.

Yes, Tika adds "ignorable whitespace". Specifically in the case mentioned, the PDFParser writes a line separator, which has our XHTMLContentHandler in turn call ignorableWhitespace:

    @Override
    protected void writeLineSeparator() throws IOException {
        try {
            xhtml.newline();
        } catch (SAXException e) {
            throw new IOException("Unable to write a newline character", e);
        }
    }

    public void newline() throws SAXException {
        ignorableWhitespace(NL, 0, NL.length);
    }

On Tue, Nov 1, 2022 at 11:55 AM Christian Ribeaud <[email protected]> wrote:

Tim, what exactly do you mean by "Tika appears to add a new line in the correct spot at least for IDEC-102..."? This is correct, but it is an ignorable whitespace, right?

Best,

christian

From: Tim Allison <[email protected]>
Date: Monday, 31 October 2022 at 16:22
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged

Yes, I agree with Nick. Tika appears to add a new line in the correct spot at least for IDEC-102...

On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <[email protected]> wrote:

On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think, we could reduce my
> problem to following code snippet:

Is there a reason that you aren't using one of the built-in Tika content handlers? Generally they should be taking care of everything for you with paragraphs, plain text vs html, etc.

Nick
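For readers of the archive, the mechanism Tim describes (the newline arriving as ignorable whitespace rather than as regular characters) can be illustrated with a minimal sketch. It uses only the JDK's SAX classes and does not call Tika itself. The class names `WhitespaceDemo`, `DroppingHandler` and `KeepingHandler`, the `replay` helper and the sample strings are hypothetical, and the replayed event sequence is an assumption about what a handler receives for two PDF lines, not output captured from Tika. It also assumes the merged words come from a custom handler that overrides only `characters()`, which the thread does not confirm.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WhitespaceDemo {

    /** Hypothetical handler that collects only characters() events. */
    static class DroppingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        // ignorableWhitespace() is inherited from DefaultHandler, which does nothing.

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Hypothetical handler that also keeps ignorable whitespace. */
    static class KeepingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Replays an assumed event sequence: text, newline as ignorable whitespace, text. */
    static String replay(DefaultHandler handler) throws SAXException {
        char[] first = "first line".toCharArray();
        char[] nl = {'\n'};
        char[] second = "second line".toCharArray();
        handler.characters(first, 0, first.length);
        handler.ignorableWhitespace(nl, 0, nl.length);
        handler.characters(second, 0, second.length);
        return handler.toString();
    }

    public static void main(String[] args) throws SAXException {
        // The newline is lost, so "line" and "second" run together.
        System.out.println(replay(new DroppingHandler()).replace("\n", "\\n"));
        // The newline survives between the two lines.
        System.out.println(replay(new KeepingHandler()).replace("\n", "\\n"));
    }
}
```

If a custom handler is really needed, forwarding `ignorableWhitespace` to the same buffer as `characters` is one way to keep the separators; otherwise the built-in handlers Nick mentions are likely the simpler route.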
