Hi Tim,

Thank you so much for explaining that part to me. That is really useful.

Kindest regards,

christian

From: Tim Allison <[email protected]>
Date: Tuesday, 1 November 2022 at 17:09
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged
Sorry, it took a while to make time to look in detail. Yes, Tika adds "ignorable 
whitespace". Specifically, in the case mentioned, the PDFParser writes a line 
separator, which in turn has our XHTMLContentHandler call ignorableWhitespace:


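// PDF parser side: invoked when a line separator is written.
@Override
protected void writeLineSeparator() throws IOException {
    try {
        xhtml.newline();
    } catch (SAXException e) {
        throw new IOException("Unable to write a newline character", e);
    }
}

// XHTMLContentHandler side: the newline is emitted as ignorable whitespace.
public void newline() throws SAXException {
    ignorableWhitespace(NL, 0, NL.length);
}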
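
For a custom handler, the practical consequence could look roughly like the sketch below. It is only an illustration under assumptions: the class name WhitespaceAwareHandler is hypothetical and not part of Tika, it uses only org.xml.sax from the JDK, and the events are simulated by hand rather than produced by a real parse. The point is that a handler which collects text only in characters() never sees the newline, so the words on either side of it run together; also collecting ignorableWhitespace() keeps them apart.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects text from SAX events.
class WhitespaceAwareHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // DefaultHandler's own implementation does nothing, so without this
    // override a newline sent as ignorable whitespace would be dropped
    // and the surrounding words would be merged.
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) {
        // Simulate by hand the kind of event sequence described above:
        // text, a newline as ignorable whitespace, then more text.
        WhitespaceAwareHandler handler = new WhitespaceAwareHandler();
        char[] nl = {'\n'};
        handler.characters("first".toCharArray(), 0, 5);
        handler.ignorableWhitespace(nl, 0, nl.length);
        handler.characters("second".toCharArray(), 0, 6);
        System.out.println(handler);
    }
}
```

Whether to keep the newline as-is or turn it into a space is a choice for the handler; the sketch simply passes it through.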




On Tue, Nov 1, 2022 at 11:55 AM Christian Ribeaud 
<[email protected]> wrote:
Tim,

What exactly do you mean by "Tika appears to add a new line in the correct spot 
at least for IDEC-102..."?
That is correct, but it is an ignorable whitespace, right?

Best,

christian

From: Tim Allison <[email protected]>
Date: Monday, 31 October 2022 at 16:22
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged
Yes, I agree with Nick. Tika appears to add a new line in the correct spot, at 
least for IDEC-102...

On Mon, Oct 31, 2022 at 9:22 AM Nick Burch 
<[email protected]> wrote:
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think we could reduce my
> problem to the following code snippet:

Is there a reason that you aren't using one of the built-in Tika content
handlers? Generally they should take care of everything for you with
paragraphs, plain text vs. HTML, etc.

Nick
