Re: Extracting PDF text without soft line breaks

Clément Doumouro via user Tue, 13 Feb 2024 05:52:03 -0800

Hi Tilman,

Thank you for your prompt reply.
I think the expected behavior is well explained by this unit test that I
added to PDFParserTest:

@Test
public void testLineBreaks() throws Exception {
    // Given
    PDFParser parser = new PDFParser();
    InputStream stream = getResourceAsStream("/test-documents/testLineBreaks.pdf");

    // When
    String content = getText(stream, parser);

    // Then: soft (visual) line breaks are dropped, hard breaks are kept
    String expected = " This sentence is expected to be extracted as a single line because the"
        + " user hasn’t hit any line return on the keyboard, keeping the line return in added"
        + " by the editor for visualization will make NER mode complicated\n"
        + "In contrast this once should appear on a new line\n\n"
        + "And same for this one which should ideally be separated from the previous one by a"
        + " blank line\n\n"
        + "Note: by copy pasting the content into a text editor, it will appear as described"
        + " upper, whereas the output of Tika contains actual line break at the end of PDF"
        + " lines\n";

    assertEquals(expected, content);
}

I agree this behavior is not always the one users expect, since some may
want to keep line breaks at the end of PDF lines. In my opinion, however,
since PDFTextStripper knows whether a line break is hard or soft, we should
keep that information available to the end user.

By adding breakpoints while debugging some PDFs, I could see that inside
PDFTextStripper.handleLineSeparation, current.isParagraphStart() == true
indicates a hard break, and anything else is a soft break.
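
To make the rule concrete, here is a minimal, self-contained sketch of the
joining logic I have in mind. It is not PDFBox or Tika code: the
SoftBreakSketch class, its Line type and the join method are hypothetical
names used only for illustration, and the paragraphStart flag stands in for
what current.isParagraphStart() reported in the debugger.

```java
import java.util.ArrayList;
import java.util.List;

final class SoftBreakSketch {

    /** One extracted PDF line (hypothetical type, not a PDFBox class). */
    static final class Line {
        final String text;
        final boolean paragraphStart; // true = hard break before this line

        Line(String text, boolean paragraphStart) {
            this.text = text;
            this.paragraphStart = paragraphStart;
        }
    }

    /** Joins lines: a soft break becomes a space, a hard break becomes "\n". */
    static String join(List<Line> lines) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line line = lines.get(i);
            if (i > 0) {
                // Only a line that starts a paragraph gets a real line break.
                out.append(line.paragraphStart ? "\n" : " ");
            }
            out.append(line.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("This sentence wraps", true));
        lines.add(new Line("over two PDF lines", false));
        lines.add(new Line("This one starts a new paragraph", true));
        System.out.println(join(lines));
    }
}
```

In this sketch the first two lines are merged with a single space and the
third one starts on a new line, which is the behavior the unit test above
asks for.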

Moreover, I get the feeling that handling PDFs this way would be more
consistent with the handling of doc/docx, where the content is extracted as
described in the unit test (soft breaks don't appear in the extracted
content, only hard breaks added by the user when typing).

Ideally I would like to be able to configure this behavior in
PDFParserConfig. Another solution would be to give developers more control
over the methods of the objects underlying PDFParser. I tried overriding the
AbstractPDF2XHTML/PDFTextStripper.handleLineSeparation method and using my
new class in the PDFParser, but I stopped because I ended up having to copy
a lot of Tika/PDFBox source code, since many methods/objects were not
accessible.

Please let me know if you need any more details to understand my situation
and whether the behavior I describe makes sense to you.
Thank you for your help!

Regards,

<https://www.icij.org/>
Clément Doumouro
Machine Learning Engineer

+1 301-244-8803    ICIJ.org  <https://icij.org/>
cdoumo...@icij.org   PGP key
<https://keys.openpgp.org/vks/v1/by-fingerprint/DFA5082713A7D671F384489886AB2EB14650FD50>

 1730 Rhode Island Ave NW, Suite 317 | Washington D.C. 20036 | United States
<https://maps.google.com/?q=1800+M+Street+NW,+Front+1+#33019+%7C+Washington,+DC+20033+%7C+United+States>
<https://www.facebook.com/ICIJ.org> <https://twitter.com/icijorg>
<https://www.linkedin.com/company/international-consortium-of-investigative-journalists/mycompany/>
<https://www.instagram.com/icijorg/> <https://www.youtube.com/c/IcijOrg>

Subscribe:  Get our stories in your inbox <https://www.icij.org/newsletter>
<https://www.icij.org/donate>

On Tue, Feb 13, 2024 at 1:57 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Hi,
>
> Can you share a non-confidential file and explain what you did, what you
> got, and what you want instead? I also fail to grasp the first sentence. Do
> you want soft line breaks, or no line breaks at all?
>
> Tilman
>
> On 12.02.2024 10:46, Clément Doumouro via user wrote:
>
> Hi all,
>
> I need to extract PDF text without soft line breaks in order to process
> PDF content as part of an NLP pipeline (NER). Soft line breaks appearing as
> hard line breaks in the text content are responsible for most of my NER
> model errors.
>
> When PDF text is extracted with line breaks, it's impossible for me to
> post-process the content and distinguish soft line breaks from hard/real
> line breaks, so I would like to avoid post-processing extracted text and
> rather handle line breaks differently when extracting.
>
> Adding a few breakpoints in the source code, I have the feeling that my
> problem would be solved by overriding
> AbstractPDF2XHTML/PDFTextStripper.handleLineSeparation in a subclass (let's
> call it PDF2XHTMLWithoutSoftBreaks) so that it does not write the
> lineSeparator when a new line is detected and we're not starting a new
> paragraph. Then I would have to use my new PDF2XHTMLWithoutSoftBreaks
> inside a new PDFParserWithoutSoftBreaks and then configure Tika to use that
> parser for PDFs.
>
> Doing so sounds very heavy and will require rewriting a lot more code than
> just the handleLineSeparation method since it's actually private.
>
> I wanted to ask whether there are any alternative approaches that do not
> involve post-processing (since, as mentioned above, we lose the soft line
> break information during extraction and can't recover it precisely
> afterwards)?
>
> Thank you for your help!
> Best,
>
> Clément Doumouro

testLineBreaks.pdf
Description: Adobe PDF document
