Just an idea, from someone who is fluent in neither PDFBox nor PDF: if you only need to know whether there is a space between two letters, and not how many spaces, you can use your code to get the character details and then use extractText to get the words.
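A rough sketch of what I mean, in plain Java with no PDFBox calls. It assumes you have already collected the glyphs from your processTextPosition override into one string, and the extracted text (with spaces) into another, and that both come out in the same reading order. The class and method names are made up for illustration, and I have not checked this against ligatures or reordered text.

```java
// Hypothetical helper, not part of PDFBox: marks which glyphs are preceded
// by whitespace in the extracted text.
class SpaceAligner {

    /**
     * glyphs:    characters collected one by one, without spaces (e.g. "Withdue")
     * extracted: the same text as returned by text extraction (e.g. "With due")
     * Returns a flag per glyph that is true when whitespace precedes it.
     */
    static boolean[] spaceBefore(String glyphs, String extracted) {
        boolean[] result = new boolean[glyphs.length()];
        int j = 0; // read position in the extracted text
        for (int i = 0; i < glyphs.length(); i++) {
            boolean sawSpace = false;
            // Skip any whitespace in the extracted text and remember that we saw it.
            while (j < extracted.length() && Character.isWhitespace(extracted.charAt(j))) {
                sawSpace = true;
                j++;
            }
            // Assumption: both inputs contain the same characters in the same order.
            if (j >= extracted.length() || extracted.charAt(j) != glyphs.charAt(i)) {
                throw new IllegalArgumentException("Texts diverge at glyph index " + i);
            }
            result[i] = sawSpace;
            j++;
        }
        return result;
    }

    public static void main(String[] args) {
        String glyphs = "Withdue";
        boolean[] flags = spaceBefore(glyphs, "With due");
        for (int i = 0; i < glyphs.length(); i++) {
            System.out.println(glyphs.charAt(i) + (flags[i] ? "  <- space before" : ""));
        }
    }
}
```

With the flags in hand you can keep using the per-character details (x position and so on) and still know where the word boundaries are, without picking a gap threshold yourself.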
2016-03-17 7:20 GMT-03:00 Hesham G. <[email protected]>:

> Andreas,
>
> That is very helpful.
>
> I can get the x location of each character using TextPosition.getX(), e.g.:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
>
> So to detect the space between the 2 words "With" & "due", should I
> subtract the X of the last letter (h) from the X of the first letter (d),
> and if the number is larger than normal then this is a space? I think this
> way might be risky in the detection, or what?
>
>
> Best regards ,
> Hesham
>
> ------------------------------------------------------------------------
> Included message :
>
> Hi,
>
> Frank van der Hulst <[email protected]> wrote on 17 March 2016 at 08:34:
>>
>>
>> Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> compare the X coordinates of adjacent characters against their widths.
>>
> That's not correct, spaces exist, but in most cases PDF engines omit them
> and replace spaces by split text with appropriate positioning.
>
> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>
> [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
> -384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
> (the) -383 (right) ] TJ
>
> The text is in between the parentheses and the numbers are used for
> horizontal positioning.
>
> BR
> Andreas
>
>
>> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <[email protected]>
>> wrote:
>>
>> > Hello ,
>> >
>> > I have a PDF file created using LaTeX. I am trying to read and print all
>> > letters in that file using PDFBox, but when doing this all spaces in that
>> > file are ignored.
>> > Here is the code I am using:
>> >
>> > PDPage page = (PDPage)allPages.get( 0 );
>> > PDStream contents = page.getContents();
>> > if ( contents != null ) {
>> >     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
>> >     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
>> > }
>> >
>> > public class PDFTextStripperProcessor extends PDFTextStripper {
>> >     @Override
>> >     public void processTextPosition( TextPosition text ) {
>> >         System.out.println( text.getCharacter() );
>> >     }
>> > }
>> >
>> > And you can check a one page file sample here to test it:
>> >
>> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>> >
>> > What is the cause of this issue please?
>> >
>> >
>> > Best regards ,
>> > Hesham
>> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
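P.S. On the gap question in the quoted thread: the TJ excerpt Andreas posted already shows the word gaps as its large negative numbers (around -383), while the small values such as 55 and 18 look like ordinary kerning. TJ adjustments are in thousandths of a text space unit and are subtracted from the current position, so a negative number moves the next glyph to the right. Below is a minimal sketch that rebuilds the text from such an array and inserts a space at each large gap. The class name and the -200 cut-off are my own assumptions for this one sample, not anything PDFBox defines, and the tiny parser only handles backslash-escaped characters, not octal escapes or nested parentheses.

```java
// Hypothetical sketch, not PDFBox code: rebuilds readable text from the
// operand array of a TJ operator, treating large negative adjustments as spaces.
class TjSpaceSketch {

    // Assumed cut-off in thousandths of a text space unit; chosen by eye for
    // the sample in this thread, where word gaps are about -383 and kerning is 18..55.
    static final double WORD_GAP = -200.0;

    static String rebuild(String tjArray) {
        StringBuilder out = new StringBuilder();
        int n = tjArray.length();
        int i = 0;
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // String operand: copy characters up to the closing parenthesis.
                i++;
                while (i < n && tjArray.charAt(i) != ')') {
                    if (tjArray.charAt(i) == '\\' && i + 1 < n) {
                        i++; // drop the backslash, keep the escaped character
                    }
                    out.append(tjArray.charAt(i));
                    i++;
                }
                i++; // skip the closing parenthesis
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Numeric operand: a horizontal adjustment between two strings.
                int start = i;
                i++;
                while (i < n && (Character.isDigit(tjArray.charAt(i)) || tjArray.charAt(i) == '.')) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment <= WORD_GAP) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace and the TJ operator itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        System.out.println(rebuild(sample));
    }
}
```

A fixed cut-off like this is fragile across fonts and sizes, which is probably the risk Hesham had in mind. Comparing the gap against the width of the font's space glyph would be more robust.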

