Re: Spaces are ignored when reading a PDF file

Hesham G. Sat, 19 Mar 2016 01:06:47 -0700

Clovis,

Thanks a lot :)

I will have to follow this solution if there is no alternative. The problem is that if I am extracting text from a 500- or 600-page PDF, that will consume much additional memory and time. In addition, I guess it's a special case for LaTeX books only.


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :


Just an idea from someone who is fluent in neither PDFBox nor PDF:
if you only want to know that there is a space between the letters, and not
the number of spaces, you can use your code to get the character details and
then use extractText to get the words.

2016-03-17 7:20 GMT-03:00 Hesham G. <[email protected]>:

Andreas,

That is very helpful.

I can get the x location of each character using TextPosition.getX(), ex:
W: 102.88399
i: 114.18165
t: 117.660614
h: 121.55801
d: 133.09477
u: 140.3994
e: 147.60838

So to detect the space between the two words "With" and "due", should I
subtract the X of the last letter (h) from the X of the first letter (d),
and if the result is larger than normal then treat it as a space? I think
this way might be risky for the detection, or what?
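
A minimal sketch of why the plain subtraction is risky and what comparing
against widths would look like. With the X values listed above, the
difference W -> i (about 11.30) is nearly as large as h -> d (about 11.54),
simply because "W" is a wide glyph, so X differences alone cannot separate
the two cases. The glyph widths, the space width of 3.0 and the tolerance of
0.5 below are assumptions made for illustration (they are not values from
this thread); in PDFBox they would presumably come from
TextPosition.getWidth() and TextPosition.getWidthOfSpace(). The class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class GapSpaceDetector {

    /** One glyph: its text, left X coordinate and advance width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds the text, inserting a space whenever the gap between the
     * right edge of the previous glyph and the left edge of the next one
     * exceeds tolerance * spaceWidth.
     */
    static String rebuild(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    /** X values are from the message above; the widths are ASSUMED. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("W", 102.88399, 11.955));
        glyphs.add(new Glyph("i", 114.18165, 3.479));
        glyphs.add(new Glyph("t", 117.660614, 3.897));
        glyphs.add(new Glyph("h", 121.55801, 6.958));
        glyphs.add(new Glyph("d", 133.09477, 7.305));
        glyphs.add(new Glyph("u", 140.3994, 7.209));
        glyphs.add(new Glyph("e", 147.60838, 5.727));
        return glyphs;
    }

    public static void main(String[] args) {
        // Assumed space width 3.0 and tolerance 0.5 (hypothetical values).
        System.out.println(rebuild(sample(), 3.0, 0.5)); // prints "With due"
    }
}
```

With these assumed widths, the only gap that clears the threshold is the
one between "h" and "d" (about 4.58); the W -> i gap is slightly negative
because of kerning. The threshold would still need tuning per document.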


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

Hi,

Frank van der Hulst <[email protected]> wrote on 17 March 2016 at 08:34:


> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.

That's not correct. Spaces do exist, but in most cases PDF engines omit them
and replace them with split text that is positioned appropriately.

BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:

  [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
    (Article) -384 (\(219\),) -416 (the) -384 (competent) -383
    (authority) -383 (has) -384 (the) -383 (right) ] TJ

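The text is between the parentheses, and the numbers are used for horizontal
positioning. They are given in thousandths of a text space unit and are
subtracted from the current position, so a large negative value such as -383
opens a gap where a space would be.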
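
As a rough sketch of how such an array could be turned back into words: walk
through the TJ operands, copy the string contents, and emit a space whenever
a numeric adjustment is more negative than some threshold. The class name,
the method name and the threshold of 200 are assumptions chosen for this
example. The parser is deliberately simplified: it treats the character
after a backslash as a literal (so octal escapes and \n are not handled) and
assumes the string bytes are already readable text, which is not guaranteed
for every font encoding.

```java
class TjArraySpaces {

    /** The excerpt quoted above, as a Java string literal. */
    static final String SAMPLE =
        "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
        + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 "
        + "(authority) -383 (has) -384 (the) -383 (right) ] TJ";

    /**
     * Concatenates the strings of a TJ array and inserts a space wherever
     * an adjustment is below -gapThreshold (i.e. a wide gap to the right).
     */
    static String decode(String tj, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = tj.length();
        while (i < n) {
            char c = tj.charAt(i);
            if (c == '(') {
                // Literal string: copy until the matching ')'.
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char s = tj.charAt(i++);
                    if (s == '\\' && i < n) {
                        out.append(tj.charAt(i++)); // escaped character
                    } else if (s == '(') {
                        depth++;
                        out.append(s);
                    } else if (s == ')') {
                        depth--;
                        if (depth > 0) {
                            out.append(s);
                        }
                    } else {
                        out.append(s);
                    }
                }
            } else if (c == '-' || c == '+' || c == '.' || Character.isDigit(c)) {
                // Numeric adjustment in thousandths of a text space unit.
                int start = i;
                i++;
                while (i < n && (Character.isDigit(tj.charAt(i)) || tj.charAt(i) == '.')) {
                    i++;
                }
                double adjustment = Double.parseDouble(tj.substring(start, i));
                if (adjustment < -gapThreshold) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace, operator name
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(SAMPLE, 200.0));
    }
}
```

Small positive or negative values such as 55 and 18 are kerning and stay
below the threshold, so "r", "egar" and "d" are joined into one word.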

BR
Andreas

On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <[email protected]> wrote:

> Hello ,
>
> I have a PDF file created using LaTeX. I am trying to read and print all
> letters in that file using PDFBox, but when doing this all spaces in that
> file are ignored. Here is the code I am using:
> PDPage page = (PDPage)allPages.get( 0 );
> PDStream contents = page.getContents();
> if ( contents != null ) {
>     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
>     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
> }
>
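> public class PDFTextStripperProcessor extends PDFTextStripper {
>     // The PDFTextStripper constructor declares IOException, so the
>     // subclass needs an explicit constructor that declares it too.
>     public PDFTextStripperProcessor() throws IOException {
>         super();
>     }
>
>     @Override
>     public void processTextPosition( TextPosition text )  {
>         // Prints each glyph as it is encountered in the content stream.
>         System.out.println( text.getCharacter() );
>     }
> }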
>
> And you can check a one page file sample here to test it:
>
> https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>
> What is the cause of this issue please?
>
>
> Best regards ,
> Hesham


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


