Hi all, I want to extract the text from the page of this PDF file <https://drive.google.com/file/d/1RMBmU2XTaSgQVDkU2eYECP8fe2SjVqFp/view?usp=sharing>. I am using the following code to achieve it:
    try (PDDocument document = PDDocument.load(new File(fileName))) {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition( false );
        stripper.setStartPage( 0 );
        stripper.setEndPage( document.getNumberOfPages() );
        System.out.println(stripper.getText(document));
    }

The result I get (part of it) is:

----------------
A S am pl e P os te r La nd sc ap e La yo ut
----------------

If I use stripper.setSortByPosition( true ) I get the following (part of it):

----------------
A Sample Poster Landscape Layout - Title Name of Researcher(s) Name of Department Introduction Measurable Outcomes The Mechanical Engineering Department at WPI was established in 1868 and the first undergraduate degrees were awarded in 1871. The Department *currently has about 450 Graduating students* should demonstrate the following at a level equivalent to an entry- undergraduate students and 100 graduate students. Housed in the Higgins Laboratory and the level engineer or first year graduate student: Washburn shops the faculty consists of 29 tenured and tenure track professors, and several non-tenure track teaching staff. The Department offers undergraduate and graduate degrees in a. An understanding of the fundamental principles of conservation laws,
----------------

The text I get is better than the first one, but it mixes the text from the left and right "columns" (please see the bold text).

My question is: is it possible to get the text as one would naturally read it, i.e. the text of the left column and then the text of the right column?

I attached the file, just in case the link cannot be opened.

Thanks in advance.

Best Regards.
Jorge Eduardo Flórez
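
To make the desired reading order concrete, here is a minimal sketch in plain Java, not PDFBox code. It assumes each piece of text comes with an x/y position, that y is measured from the top of the page, and that the page has exactly two columns split at the horizontal midpoint. The Fragment class, the readByColumns method and all coordinates are hypothetical names and values made up for illustration. PDFBox does ship a PDFTextStripperByArea class that extracts text from named rectangular regions, so defining one region per column may be a way to apply the same idea to the real file; that is untested here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text (not a PDFBox type). */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page (assumption)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the left column top to bottom, followed by the right column top to bottom. */
    static List<String> readByColumns(List<Fragment> fragments, double pageWidth) {
        double middle = pageWidth / 2.0; // assumed column boundary
        List<Fragment> left = new ArrayList<>();
        List<Fragment> right = new ArrayList<>();
        for (Fragment f : fragments) {
            if (f.x < middle) {
                left.add(f);
            } else {
                right.add(f);
            }
        }
        // Within a column, order by vertical position, then by horizontal position.
        Comparator<Fragment> topToBottom = Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x);
        left.sort(topToBottom);
        right.sort(topToBottom);

        List<String> ordered = new ArrayList<>();
        for (Fragment f : left) {
            ordered.add(f.text);
        }
        for (Fragment f : right) {
            ordered.add(f.text);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Made-up coordinates on a 792-unit-wide page; the text is taken from the poster.
        List<Fragment> fragments = List.of(
                new Fragment("The Department currently has about 450", 36, 100),
                new Fragment("Graduating students should demonstrate the following", 420, 100),
                new Fragment("undergraduate students and 100 graduate students.", 36, 112),
                new Fragment("level engineer or first year graduate student:", 420, 112));
        for (String line : readByColumns(fragments, 792)) {
            System.out.println(line);
        }
    }
}
```

Sorting purely by position interleaves lines that share the same height, which is what setSortByPosition( true ) appears to do in the output above; splitting by column first and sorting afterwards keeps each column together. A fixed midpoint is the simplest possible boundary and would not handle full-width elements such as the poster title.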