Re: Spaces are ignored when reading a PDF file

Hesham G. Sat, 19 Mar 2016 04:52:17 -0700

Andreas,

You're absolutely right. I am testing it now, but it seems very complicated. I hope there might be another, easier solution.



Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

"Hesham G." <[email protected]> wrote on 17 March 2016 at 11:20:


Andreas,

That is very helpful.

I can get the x location of each character using TextPosition.getX(), ex:
W: 102.88399
i: 114.18165
t: 117.660614
h: 121.55801
d: 133.09477
u: 140.3994
e: 147.60838

So to detect the space between the two words "With" and "due", should I
subtract the X of the last letter (h) from the X of the first letter (d),
and if the result is larger than normal then this is a space? I think this
way might be risky in the detection, or what?

That's the short story. Deciding what is normal can be quite tricky. You
have to take the following facts into account:

- different fonts have different widths (important if the font before the
  space isn't the same as the font after the space)
- keep in mind that you have to take scaling and sometimes rotation into
  account
- the "space" between characters may vary if the text is justified

There are certainly some other details which may be important as well, so
that you end up with a more or less heuristic approach.
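
As a rough illustration of the gap test discussed above, here is a minimal
sketch in plain Java (no PDFBox dependency, so it runs on its own). The x
values are the ones reported earlier in this thread; the glyph widths, the
space width, and the 0.5 factor are made-up assumptions for illustration,
and the Glyph class is a hypothetical stand-in for the data a TextPosition
reports. It only handles unrotated text on a single line, so it sidesteps
the font, scaling, rotation, and justification issues listed above rather
than solving them.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of a gap-based word-break heuristic; this is not PDFBox code. */
final class GapSpaceDetector {

    /** Hypothetical stand-in for the per-glyph data a TextPosition provides. */
    static final class Glyph {
        final String text;
        final float x;          // left edge of the glyph
        final float width;      // advance width, same units as x (assumed known)
        final float spaceWidth; // width of a space in this glyph's font (assumed known)

        Glyph(String text, float x, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /** A gap wider than this fraction of a space counts as a word break (assumed value). */
    static final float GAP_FACTOR = 0.5f;

    /** Concatenates glyphs left to right, inserting a space at each wide gap. */
    static String join(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                // If the font changes across the gap, compare against the narrower space.
                float space = Math.min(prev.spaceWidth, g.spaceWidth);
                if (gap > GAP_FACTOR * space) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    /** X positions from this thread; widths and space width are invented for the demo. */
    static List<Glyph> sample() {
        float space = 3.1f; // assumed
        return Arrays.asList(
            new Glyph("W", 102.88399f, 11.3f, space),
            new Glyph("i", 114.18165f, 3.5f, space),
            new Glyph("t", 117.660614f, 3.9f, space),
            new Glyph("h", 121.55801f, 7.0f, space),
            new Glyph("d", 133.09477f, 7.3f, space),
            new Glyph("u", 140.3994f, 7.2f, space),
            new Glyph("e", 147.60838f, 5.5f, space));
    }

    public static void main(String[] args) {
        System.out.println(join(sample()));
    }
}
```

With these assumed widths, only the gap between "h" and "d" exceeds half a
space width, so the sample joins to "With due". In real code the width and
space width would presumably come from TextPosition itself (getWidth() and
getWidthOfSpace() alongside getX()), and the factor would need tuning per
document.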

BR
Andreas

Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

Hi,

> Frank van der Hulst <[email protected]> wrote on 17 March 2016 at 08:34:
>
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.

That's not correct: spaces exist, but in most cases PDF engines omit them
and replace them with split text and appropriate positioning.

BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:

   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
     (Article) -384 (\(219\),) -416 (the) -384 (competent) -383
     (authority) -383 (has) -384 (the) -383 (right) ] TJ

The text is in between the parentheses, and the numbers are used for
horizontal positioning (the large negative values, around -383, are the
gaps between words).
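
To make this concrete, here is a small sketch in plain Java (no PDFBox
involved) that walks a TJ operand array like the excerpt above and rebuilds
the text. In a TJ array each number is subtracted from the current
horizontal position, in thousandths of a text-space unit, so a large
negative number moves the next string to the right. The class name and the
-200 cutoff are assumptions chosen for this excerpt, not values taken from
the PDF specification or from PDFBox, and the parser only understands
literal strings with backslash-escaped characters (no octal escapes, no hex
strings).

```java
/** Sketch: rebuild readable text from the operand array of a TJ operator. */
final class TjArrayReader {

    /**
     * Adjustments below this value (thousandths of a text-space unit) are
     * treated as word gaps. Assumed cutoff, picked to fit the excerpt.
     */
    static final double WORD_GAP_THRESHOLD = -200.0;

    static String toText(String tjArray) {
        StringBuilder out = new StringBuilder();
        int n = tjArray.length();
        int i = 0;
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy characters up to the matching ')'.
                i++;
                int depth = 1;
                while (i < n) {
                    char s = tjArray.charAt(i);
                    if (s == '\\' && i + 1 < n) {
                        // Backslash escape: keep the escaped character as-is.
                        out.append(tjArray.charAt(i + 1));
                        i += 2;
                        continue;
                    }
                    if (s == '(') {
                        depth++;
                    } else if (s == ')' && --depth == 0) {
                        i++;
                        break;
                    }
                    out.append(s);
                    i++;
                }
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Number: a positioning adjustment between two strings.
                int start = i;
                i++;
                while (i < n && (tjArray.charAt(i) == '.'
                        || Character.isDigit(tjArray.charAt(i)))) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment < WORD_GAP_THRESHOLD) {
                    out.append(' ');
                }
            } else {
                i++; // skip brackets, whitespace, and the trailing "TJ"
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String excerpt = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
            + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 "
            + "(authority) -383 (has) -384 (the) -383 (right) ] TJ";
        System.out.println(toText(excerpt));
    }
}
```

The small positive numbers (55, 18) are kerning that pulls letters of the
same word closer together, so they stay below the cutoff and add no space.
A fixed cutoff like this would likely fail on justified text or other
fonts, which is why a real solution ends up comparing against the font's
space width.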

BR
Andreas

> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <[email protected]> wrote:
>
> > Hello,
> >
> > I have a PDF file created using LaTeX. I am trying to read and print all
> > letters in that file using PDFBox, but when doing this all spaces in that
> > file are ignored. Here is the code I am using:
> > PDPage page = (PDPage)allPages.get( 0 );
> > PDStream contents = page.getContents();
> > if ( contents != null ) {
> >     PDFTextStripperProcessor pdfTextStripperProcessor =
> >         new PDFTextStripperProcessor();
> >     pdfTextStripperProcessor.processStream( page, page.findResources(),
> >         contents.getStream() );
> > }
> >
> > public class PDFTextStripperProcessor extends PDFTextStripper {
> >     @Override
> >     public void processTextPosition( TextPosition text )  {
> >         System.out.println( text.getCharacter() );
> >     }
> > }
> >
> > And you can check a one page file sample here to test it:
> >
> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> >
> > What is the cause of this issue please?
> >
> >
> > Best regards ,
> > Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

