Hesham,
I recently faced a similar problem: text set in a different font was offset 
vertically from the other text in the line.  I solved it by assigning text to 
a line based on its vertical coordinates (in my case, a word joins a line if 
its bottom coordinate is within the text height of the words already in that 
line).
I then sorted the words in each line by their x coordinates.
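
To make the idea concrete, here is a minimal sketch of that grouping in plain 
Java. It is not the code I used at work; the Word class, its fields and the 
method names are hypothetical, y is assumed to grow downward on the page, and 
the "within text height" test is just one way to read the rule above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;

final class LineGrouper {

    /** Hypothetical word box; y is assumed to grow downward on the page. */
    static final class Word {
        final String text;
        final float x;       // left edge
        final float bottom;  // bottom edge
        final float height;  // text height

        Word(String text, float x, float bottom, float height) {
            this.text = text;
            this.x = x;
            this.bottom = bottom;
            this.height = height;
        }
    }

    /** True if w's bottom lies within the text height of a word already in the line. */
    static boolean belongsTo(List<Word> line, Word w) {
        for (Word prior : line) {
            if (Math.abs(w.bottom - prior.bottom) < prior.height) {
                return true;
            }
        }
        return false;
    }

    /** Groups words into lines by vertical position, then sorts each line by x. */
    static List<List<Word>> groupIntoLines(List<Word> words) {
        List<List<Word>> lines = new ArrayList<>();
        for (Word w : words) {
            List<Word> target = null;
            for (List<Word> line : lines) {
                if (belongsTo(line, w)) {
                    target = line;
                    break;
                }
            }
            if (target == null) {
                // No existing line is close enough vertically: start a new one.
                target = new ArrayList<>();
                lines.add(target);
            }
            target.add(w);
        }
        // Left-to-right order inside each line.
        for (List<Word> line : lines) {
            line.sort(Comparator.comparingDouble(word -> word.x));
        }
        // Top-to-bottom order of the lines, using each line's leftmost word.
        lines.sort(Comparator.comparingDouble(grouped -> grouped.get(0).bottom));
        return lines;
    }

    /** Joins the words of one line with single spaces. */
    static String lineText(List<Word> line) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Word w : line) {
            joiner.add(w.text);
        }
        return joiner.toString();
    }
}
```

With PDFBox, the Word values could probably be built from the TextPosition 
objects the stripper hands you (getXDirAdj(), getYDirAdj(), getHeightDir()), 
but that mapping is an assumption on my part and untested here. The point of 
the tolerance is that a word shifted up or down by a few units still lands 
in the same line instead of being sorted ahead of it.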

I'm not sure whether my boss will allow me to share some code snippets, but 
I'll ask.

-----Original Message-----
From: Hesham Gneady <heshamgne...@gmail.com> 
Sent: Saturday, November 21, 2020 11:11 PM
To: users@pdfbox.apache.org
Subject: RE: Reading page using PDFTextStripper


I've tried it now, but it made no difference. I actually explained the 
problem incorrectly; here's what really happens:

The 1st line in the PDF file is:

131 Comments are made from 1905, / See: Certain Neurotic Mechanisms in

Here "131" is normal text, while the rest of the line has "Subscript"
formatting. If I copy and paste the line from the PDF manually, it is copied 
in the right order, but when I extract the text using PDFBox it comes out 
like this:

Comments are made from 1905, / See: Certain Neurotic Mechanisms in 131

The rest of the line is read before the number "131".





Best regards,

Hesham



----------------------------------------------------------------------------

Included Message:



On 17.11.20 at 07:54, Hesham Gneady wrote:

> Hi,
>
> I am trying to read this PDF file using
> PDFTextStripper.processTextPosition():
>
> https://dl.dropboxusercontent.com/s/o660xrp4sgp9tbv/PDFTextStripper%20reading%20sample.pdf?dl=0
>
> But when I do that, it reads it in the wrong order. It reads the 2nd line
> before the 1st line because the 1st line has a Subscript effect. Is
> there a way to read it in the right order?

In a PDF, the text doesn't necessarily appear in the rendering order. You 
should give the sort option a try:



org.apache.pdfbox.text.PDFTextStripper.setSortByPosition(boolean)





Andreas



---------------------------------------------------------------------

To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



