Hesham, I faced a similar problem recently: text in a different font was offset vertically from the other text in the line. I solved it by assigning words to the same line based on their vertical coordinates (in my case, a word joins a line if its bottom coordinate is within the text height of the prior words in that line). I then sorted the words in each line by their x coordinates.
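
The grouping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the project mentioned: it assumes each word has already been reduced to a box with a left x, a bottom y, and a height, and that y grows downward on the page. The names `LineGrouper`, `Word`, `groupIntoLines`, and `toText` are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class LineGrouper {

    /** Hypothetical word box; in practice it would be built from the extracted text positions. */
    static final class Word {
        final String text;
        final double x;       // left edge
        final double bottom;  // bottom edge, assuming y grows downward
        final double height;  // text height

        Word(String text, double x, double bottom, double height) {
            this.text = text;
            this.x = x;
            this.bottom = bottom;
            this.height = height;
        }
    }

    /** Groups words into lines by bottom coordinate, then orders each line by x. */
    static List<List<Word>> groupIntoLines(List<Word> words) {
        List<List<Word>> lines = new ArrayList<>();
        for (Word w : words) {
            List<Word> target = null;
            for (List<Word> line : lines) {
                // Same line if the bottoms differ by less than the height of a prior word.
                boolean sameLine = line.stream()
                        .anyMatch(p -> Math.abs(w.bottom - p.bottom) < p.height);
                if (sameLine) {
                    target = line;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                lines.add(target);
            }
            target.add(w);
        }
        // Left-to-right within each line.
        for (List<Word> line : lines) {
            line.sort(Comparator.comparingDouble(word -> word.x));
        }
        // Top-to-bottom across lines, keyed on the leftmost word of each line.
        lines.sort(Comparator.comparingDouble(line -> line.get(0).bottom));
        return lines;
    }

    /** Joins words with spaces and lines with newlines. */
    static String toText(List<List<Word>> lines) {
        return lines.stream()
                .map(line -> line.stream().map(word -> word.text).collect(Collectors.joining(" ")))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Extraction order puts the smaller, slightly lower text before "131".
        List<Word> words = List.of(
                new Word("Comments", 70, 103, 7),
                new Word("are", 110, 103, 7),
                new Word("131", 50, 100, 10),
                new Word("Other", 50, 115, 10));
        System.out.println(toText(groupIntoLines(words)));
    }
}
```

The tolerance here is the height of a word already in the line, so a smaller run of text whose baseline is shifted by a few units still lands on the same line as the normal-size text. A tighter or looser tolerance may suit other documents.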
I'm not sure if my boss will allow me to share some code snippets, but I'll ask.

-----Original Message-----
From: Hesham Gneady <heshamgne...@gmail.com>
Sent: Saturday, November 21, 2020 11:11 PM
To: users@pdfbox.apache.org
Subject: RE: Reading page using PDFTextStripper

I've tried it now, but it made no difference. I explained the problem wrongly; here's what actually happens:

The 1st line in the PDF file is:

131 Comments are made from 1905, / See: Certain Neurotic Mechanisms in

where "131" is normal text, while the rest of the line has "Subscript" formatting. If I copy/paste the line from the PDF manually, it is copied in the right order, but when I extract the text using PDFBox it comes out like this:

Comments are made from 1905, / See: Certain Neurotic Mechanisms in 131

The text is being read before the "131" number.

Best regards,
Hesham

----------------------------------------------------------------------------
Included Message:

On 17.11.20 at 07:54, Hesham Gneady wrote:
> Hi,
>
> I am trying to read this PDF file using
> PDFTextStripper.processTextPosition():
>
> https://dl.dropboxusercontent.com/s/o660xrp4sgp9tbv/PDFTextStripper%20reading%20sample.pdf?dl=0
>
> But when I do that it reads it in the wrong order. It reads the 2nd line
> before the 1st line because the 1st line has the Subscript effect. Is
> there a way to read it in the right order?

In a PDF the text doesn't necessarily appear in the rendering order.
You should give the sort option a try:

org.apache.pdfbox.text.PDFTextStripper.setSortByPosition(boolean)

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org