Hi Tilman, Thank you for your prompt reply. I think the expected behavior is well explained by this unit test that I added to PDFParserTest:

    @Test
    public void testLineBreaks() throws Exception {
        // Given
        PDFParser parser = new PDFParser();
        InputStream stream = getResourceAsStream("/test-documents/testLineBreaks.pdf");
        // When
        String content = getText(stream, parser);
        // Then
        String expected = " This sentence is expected to be extracted as a single line because the"
                + " user hasn’t hit any line return on the keyboard, keeping the line return in added"
                + " by the editor for visualization will make NER mode complicated\n"
                + "In contrast this once should appear on a new line\n\n"
                + "And same for this one which should ideally be separated from the previous one by a"
                + " blank line\n\n"
                + "Note: by copy pasting the content into a text editor, it will appear as described"
                + " upper, whereas the output of Tika contains actual line break at the end of PDF"
                + " lines\n";
        assertEquals(expected, content);
    }

I agree this is not always what users expect: some will want to keep the line breaks at the end of PDF lines. But in my opinion, since PDFTextStripper knows whether a line break is hard or soft, we should keep that information available to the end user. By setting breakpoints while debugging a few PDFs, I could see that inside PDFTextStripper.handleLineSeparation, current.isParagraphStart() == true corresponds to a hard break, and anything else to a soft break.

Moreover, I get the feeling that handling PDFs this way would be more consistent with the handling of doc/docx, where the content is extracted as described in the unit test (soft breaks don't appear in the extracted content, only hard breaks added by the user when typing).

Ideally I would like to be able to configure this behavior in PDFParserConfig. Another solution would be to give developers more control over the methods of the objects underlying PDFParser.
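
To make the hard/soft distinction above concrete, here is a minimal, self-contained sketch. It does not use PDFBox or Tika: ExtractedLine and SoftBreakJoiner are hypothetical names, and the paragraphStart flag only stands in for what current.isParagraphStart() reports inside handleLineSeparation. It also ignores the blank lines between paragraphs that the unit test expects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one extracted PDF line; not a PDFBox or Tika type. */
final class ExtractedLine {
    final String text;
    // Assumption: mirrors what current.isParagraphStart() reports for this line.
    final boolean paragraphStart;

    ExtractedLine(String text, boolean paragraphStart) {
        this.text = text;
        this.paragraphStart = paragraphStart;
    }
}

/** Hypothetical helper: joins soft breaks with a space, keeps hard breaks. */
final class SoftBreakJoiner {
    private SoftBreakJoiner() {}

    static String join(List<ExtractedLine> lines) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            ExtractedLine line = lines.get(i);
            if (i > 0) {
                // Hard break: the line starts a paragraph, so emit a newline.
                // Soft break: the line was only wrapped, so emit a space.
                out.append(line.paragraphStart ? "\n" : " ");
            }
            out.append(line.text.trim());
        }
        if (out.length() > 0) {
            out.append("\n"); // terminate the last line
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<ExtractedLine> lines = new ArrayList<>();
        lines.add(new ExtractedLine("This sentence wraps", true));
        lines.add(new ExtractedLine("across two PDF lines", false));
        lines.add(new ExtractedLine("This one starts a new paragraph", true));
        System.out.print(join(lines));
    }
}
```

In this sketch the wrapped sentence comes out as one line and the paragraph start begins a new one, which is the behavior I would like to be able to switch on for PDFs.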
I tried overriding the AbstractPDF2XHTML/PDFTextStripper.handleLineSeparation method and using my new class in the PDFParser, but I stopped because I ended up having to copy a lot of Tika/PDFBox source code, since many methods and objects were not accessible.

Please let me know if you need any more details to understand my situation, and whether the behavior I describe makes sense to you.

Thank you for your help!

Regards,

Clément Doumouro
Machine Learning Engineer
+1 301-244-8803 | ICIJ.org <https://icij.org/> | cdoumo...@icij.org
PGP key <https://keys.openpgp.org/vks/v1/by-fingerprint/DFA5082713A7D671F384489886AB2EB14650FD50>
1730 Rhode Island Ave NW, Suite 317 | Washington D.C. 20036 | United States

On Tue, Feb 13, 2024 at 1:57 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> Can you share a non-confidential file and explain what you did, what you got and what you want instead? I also fail to grasp the first sentence. Do you want soft line breaks, or no line breaks at all?
>
> Tilman
>
> On 12.02.2024 10:46, Clément Doumouro via user wrote:
>
> Hi all,
>
> I need to extract PDF text without soft line breaks in order to process PDF content as part of an NLP pipeline (NER). Soft line breaks appearing as hard line breaks in the text content are responsible for most of my NER model errors.
>
> When PDF text is extracted with line breaks, it's impossible for me to post-process the content and distinguish soft line breaks from hard/real line breaks, so I would like to avoid post-processing extracted text and instead handle line breaks differently when extracting.
>
> After adding a few breakpoints in the source code, I have the feeling that my problem would be solved by overriding AbstractPDF2XHTML/PDFTextStripper.handleLineSeparation (let's call the subclass PDF2XHTMLWithoutSoftBreaks) so that it does not write the lineSeparator when a new line is detected and we're not starting a new paragraph. Then I would have to use my new PDF2XHTMLWithoutSoftBreaks inside a new PDFParserWithoutSoftBreaks and then configure Tika to use that parser for PDFs.
>
> Doing so sounds very heavy and would require rewriting a lot more code than just the handleLineSeparation method, since it's actually private.
>
> I wanted to ask if there are any alternative approaches that do not involve post-processing (since, as mentioned above, we lose the soft line break information during processing and can't retrieve it precisely afterwards)?
>
> Thank you for your help!
> Best,
>
> Clément Doumouro

Attachment: testLineBreaks.pdf (Adobe PDF document)