arjunce created PDFBOX-5828:
-------------------------------

Summary: PDFTextStripper created garbled text
Key: PDFBOX-5828
URL: https://issues.apache.org/jira/browse/PDFBOX-5828
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 3.0.1 PDFBox
Reporter: arjunce
Attachments: output_text_stripper.txt, test.pdf

Hello folks,

I am using PDFBox to extract and manipulate the text content of a PDF, and I use PDFTextStripper for the extraction. I also set the options below:

{code:java}
PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
PDFTextStripper textStripper = new PDFTextStripper();
textStripper.setSortByPosition(true);
textStripper.setWordSeparator(" ");
{code}

When sorting the positions there is a "{*}Comparison method violates its general contract!{*}" exception, possibly due to the float values. The custom merge sort that is implemented scrambles the text at the individual character level, and I can't repair the text afterwards. I have attached the sample PDF and its text output.

The merge sort does not seem to consider the y and x coordinates when sorting the letters. Taking them into account while sorting should fix this issue.
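
To illustrate the suggestion, here is a minimal sketch of a y-then-x ordering that satisfies the Comparator contract. It is not PDFBox code: the `Glyph` class, the `ReadingOrderSketch` class, and the fixed `lineHeight` bucket size are all hypothetical names and assumptions made for illustration. The idea is that comparing floats with a tolerance ("equal if within some epsilon") is not transitive, which is one common way to trigger the contract exception, whereas mapping each y value to an integer line index first gives a consistent total order. The sketch assumes y grows downward on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {

    /** Hypothetical stand-in for a positioned character; not a PDFBox class. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Maps a y coordinate to an integer line index (assumes a fixed line height). */
    static long lineIndex(float y, float lineHeight) {
        return Math.round((double) y / lineHeight);
    }

    /**
     * Orders glyphs by line index first, then by x within a line.
     * Both keys are compared exactly, so the ordering is transitive.
     */
    static Comparator<Glyph> readingOrder(float lineHeight) {
        return Comparator
                .comparingLong((Glyph g) -> lineIndex(g.y, lineHeight))
                .thenComparingDouble(g -> g.x);
    }

    /** Sorts a copy of the glyphs into reading order and joins their text. */
    static String sortAndJoin(List<Glyph> glyphs, float lineHeight) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(readingOrder(lineHeight));
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }
}
```

With a `lineHeight` of 12, glyphs at y = 100.02 and y = 99.98 fall on the same line and are ordered by x, while a glyph at y = 112.0 lands on the next line. The trade-off of this approach is that glyphs sitting very close to a bucket boundary can be split across two lines, so the bucket size would need to be chosen from the actual font metrics.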