RE: Reading page using PDFTextStripper

Hesham Gneady Sat, 21 Nov 2020 21:11:05 -0800

I've tried it now, but it made no difference. I explained the problem
incorrectly before; here's what actually happens:


The 1st line in the PDF file is:

131 Comments are made from 1905, / See: Certain Neurotic Mechanisms in

Here "131" is normal text, while the rest of the line has "Subscript"
formatting. If I copy and paste the line from the PDF manually, it comes out
in the correct order, but when I extract the text using PDFBox, I get this:

Comments are made from 1905, / See: Certain Neurotic Mechanisms in 131

The rest of the line is extracted before the number "131".

 

 

Best regards,

Hesham

 

----------------------------------------------------------------------------

Included Message:

 

On 17.11.20 at 07:54, Hesham Gneady wrote:

> Hi,
>
> I am trying to read this PDF file using
> PDFTextStripper.processTextPosition():
>
> https://dl.dropboxusercontent.com/s/o660xrp4sgp9tbv/PDFTextStripper%20reading%20sample.pdf?dl=0
>
> But when I do that, it reads the text in the wrong order. It reads the
> 2nd line before the 1st line because the 1st line has a Subscript
> effect. Is there a way to read it in the right order?

In a PDF, the text doesn't necessarily appear in rendering order. You
should give the sort option a try:

 

org.apache.pdfbox.text.PDFTextStripper.setSortByPosition(boolean)
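
To make the idea of sorting by position concrete, here is a minimal toy
sketch in plain Java. It is not PDFBox's implementation and does not use
the PDFBox API. The Fragment class, the coordinates, and the tolerance
value are all assumptions made up for illustration. It takes fragments in
an arbitrary "stream" order, groups those whose vertical positions lie
within a tolerance into one line, and orders each line from left to right:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical text fragment: x from the left edge, y from the top edge of the page. */
    public static final class Fragment {
        public final double x;
        public final double y;
        public final String text;

        public Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Returns the fragments as text, top to bottom and left to right.
     * Fragments whose y differs from the first fragment of the current
     * line by at most `tolerance` are treated as being on the same line.
     */
    public static String inReadingOrder(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        StringBuilder out = new StringBuilder();
        List<Fragment> line = new ArrayList<>();
        for (Fragment f : sorted) {
            // Start a new line once the vertical gap exceeds the tolerance.
            if (!line.isEmpty() && f.y - line.get(0).y > tolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(f);
        }
        appendLine(out, line);
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Fragment> line) {
        if (line.isEmpty()) {
            return;
        }
        // Within one line, order the fragments from left to right.
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
    }

    public static void main(String[] args) {
        // Made-up coordinates: the second fragment sits slightly higher and
        // further left, but comes later in the list.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment(60, 102, "Comments are made from 1905,"),
                new Fragment(40, 100, "131"));
        System.out.println(inReadingOrder(streamOrder, 5.0));
    }
}
```

With the made-up coordinates above, the two fragments fall within the
tolerance, so they are joined into one line with "131" first. In PDFBox
itself the switch is simply setSortByPosition(true), called on the
stripper before extracting the text; whether it helps depends on how the
particular file positions its text.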

 

 

Andreas

 
