Re: Problem reading a PDF file
Dear Glen,

PDFStreamParser is only for parsing PDF content streams (specific parts of a PDF), not the complete PDF file.

As a starting point, take a look at CustomGraphicsStreamEngine and/or CustomPageDrawer in the examples package. PDFTextStripper will also give you some ideas about how to process a PDF.

BR
Maruan

> I'm trying to examine an existing PDF file. Initially to extract text and
> maybe images, but ultimately to apply some logic to the formatting of the
> text to turn it into valid HTML with H1, H2, ul, li, etc. I thought I
> would start like this:
>
>     PDFStreamParser sParse = new PDFStreamParser(fileItem.get());
>     Object token = sParse.parseNextToken();
>     while (token != null) {
>         logger.info("token: " + token);
>         token = sParse.parseNextToken();
>     }
>
> That yields:
>
> file size: 5289793
> token: COSInt{6066}
> token: COSInt{0}
> token: PDFOperator{obj}
> token: COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{First}:COSInt{1193};COSName{Length}:COSInt{12594};COSName{N}:COSInt{98};COSName{Type}:COSName{ObjStm};}
> token: PDFOperator{stream}
> token: PDFOperator{hÞìÛ}
> token: COSNull{}
> token: PDFOperator{ ·½'à¯R—» '"Y¬}
> token: COSInt{7}
> token: PDFOperator{àà}
> Error trying to process request
> java.io.IOException: Error: Expected operator 'ID' actual='I6' at stream offset 125
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:311)
>
> I'm using PDFBox 2.0.19.
>
> I'm probably doing this wrong at many levels. When I went to look at the
> samples on the web site, the classes in the 1.8 samples don't exist any
> more. The link to the sources for 2.0 samples actually has 3.0 samples,
> whose classes don't exist yet. So I just kind of bumbled along looking at
> the source code and guessing.
>
> If I had to guess what I'm seeing, everything looks good up
> until PDFOperator{stream}, after which, it looks like all garbage until it
> blows up. What do I do now?
>
> Is there an example somewhere of how I should be doing this that you could
> just point me to? My sample file opens well in the Ubuntu 18.04 PDF viewer.

--
Maruan Sahyoun
FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen
Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de
Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827

To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
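The dictionary in the log above carries /Filter /FlateDecode and /Type /ObjStm, which suggests the bytes after the `stream` keyword are zlib-compressed object-stream data rather than content-stream operators. That would explain why every token after PDFOperator{stream} looks like garbage. The sketch below only illustrates that effect with the JDK's java.util.zip classes; the class name FlateSketch and the sample string are made up for illustration and are not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox: shows what a FlateDecode stream body looks like.
final class FlateSketch {

    /** Compresses bytes into zlib format, the format a FlateDecode filter stores. */
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverses deflate(); throws IllegalArgumentException on corrupt or truncated input. */
    static byte[] inflate(byte[] stored) {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up content-stream fragment, used only as sample input.
        byte[] plain = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] stored = deflate(plain);
        // The stored bytes are unreadable until they are inflated again.
        System.out.println("stored:   " + new String(stored, StandardCharsets.ISO_8859_1));
        System.out.println("inflated: " + new String(inflate(stored), StandardCharsets.ISO_8859_1));
    }
}
```

In practice none of this needs to be done by hand. In PDFBox 2.0.x, loading the file with PDDocument.load(...) and passing the document to PDFTextStripper.getText(...) lets the library handle the file structure, decompression, and content-stream parsing, which is the direction the reply points to.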
Problem reading a PDF file
I'm trying to examine an existing PDF file. Initially to extract text and maybe images, but ultimately to apply some logic to the formatting of the text to turn it into valid HTML with H1, H2, ul, li, etc. I thought I would start like this:

    PDFStreamParser sParse = new PDFStreamParser(fileItem.get());
    Object token = sParse.parseNextToken();
    while (token != null) {
        logger.info("token: " + token);
        token = sParse.parseNextToken();
    }

That yields:

    file size: 5289793
    token: COSInt{6066}
    token: COSInt{0}
    token: PDFOperator{obj}
    token: COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{First}:COSInt{1193};COSName{Length}:COSInt{12594};COSName{N}:COSInt{98};COSName{Type}:COSName{ObjStm};}
    token: PDFOperator{stream}
    token: PDFOperator{hÞìÛ}
    token: COSNull{}
    token: PDFOperator{ ·½'à¯R—» '"Y¬}
    token: COSInt{7}
    token: PDFOperator{àà}
    Error trying to process request
    java.io.IOException: Error: Expected operator 'ID' actual='I6' at stream offset 125
        at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:311)

I'm using PDFBox 2.0.19.

I'm probably doing this wrong at many levels. When I went to look at the samples on the web site, the classes in the 1.8 samples don't exist any more. The link to the sources for 2.0 samples actually has 3.0 samples, whose classes don't exist yet. So I just kind of bumbled along looking at the source code and guessing.

If I had to guess what I'm seeing, everything looks good up until PDFOperator{stream}, after which, it looks like all garbage until it blows up. What do I do now?

Is there an example somewhere of how I should be doing this that you could just point me to? My sample file opens well in the Ubuntu 18.04 PDF viewer.

--
Glen K. Peterson
(828) 393-0081
Re: Reading a PDF
> On 20 Sep 2016, at 06:13, Clark, Raymond C wrote:
>
> We have a need to read a PDF and create a PostScript file from it. This is
> working pretty good but I have a question.
>
> I am currently extending PDFStreamEngine for image information, extending
> PDFGraphicsStreamEngine for information on lines, and extending
> PDFTextStripper for information on text. Is this the recommended approach
> for pulling out information on images, lines and text?

Yes, definitely.

-- John

> Thank you,
> Ray
RE: Reading a PDF
Thank you, that is what I am doing. I didn't know if there was some other PDF reader class that I should be calling instead.

Ray

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, September 20, 2016 12:12 PM
To: users@pdfbox.apache.org
Subject: Re: Reading a PDF

On 20.09.2016 at 15:13, Clark, Raymond C wrote:
> We have a need to read a PDF and create a PostScript file from it. This is
> working pretty good but I have a question.
>
> I am currently extending PDFStreamEngine for image information, extending
> PDFGraphicsStreamEngine for information on lines, and extending
> PDFTextStripper for information on text. Is this the recommended approach
> for pulling out information on images, lines and text?

Yes... in theory, it should be possible to "join" it all together, but this would be much more work.

Tilman
Re: Reading a PDF
On 20.09.2016 at 15:13, Clark, Raymond C wrote:
> We have a need to read a PDF and create a PostScript file from it. This is
> working pretty good but I have a question.
>
> I am currently extending PDFStreamEngine for image information, extending
> PDFGraphicsStreamEngine for information on lines, and extending
> PDFTextStripper for information on text. Is this the recommended approach
> for pulling out information on images, lines and text?

Yes... in theory, it should be possible to "join" it all together, but this would be much more work.

Tilman
Reading a PDF
We have a need to read a PDF and create a PostScript file from it. This is working pretty good but I have a question.

I am currently extending PDFStreamEngine for image information, extending PDFGraphicsStreamEngine for information on lines, and extending PDFTextStripper for information on text. Is this the recommended approach for pulling out information on images, lines and text?

Thank you,
Ray

CONFIDENTIALITY NOTICE: This e-mail and any files transmitted with it are intended solely for the use of the individual or entity to whom they are addressed and may contain confidential and privileged information protected by law. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of the e-mail is strictly prohibited. Please notify the sender immediately by return e-mail and delete all copies from your system.
Re: Spaces are ignored when reading a PDF file
The subject of this thread is "Spaces are ignored when reading a PDF file". Please post new questions to a new thread.

-- John

> On 18 Mar 2016, at 04:02, 风云天空 <1010800...@qq.com> wrote:
>
> who can help me
> i get this error in multithreading
> java.lang.NullPointerException
> [...]
Re: Spaces are ignored when reading a PDF file
who can help me
i get this error in multithreading

java.lang.NullPointerException
    at java.awt.color.ICC_Profile.activateDeferredProfile(ICC_Profile.java:1086)
    at java.awt.color.ICC_Profile$1.activate(ICC_Profile.java:742)
    at sun.java2d.cmm.ProfileDeferralMgr.activateProfiles(ProfileDeferralMgr.java:95)
    at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:775)
    at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1013)
    at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.loadICCProfile(PDICCBased.java:119)
    at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.<init>(PDICCBased.java:89)
    at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:182)
    at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:172)
    at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:142)
    at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:41)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:814)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:471)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
    at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:80)
    at com.liaoyoujin.pdfbox.doc.PdfExtractor.getFirstImage(PdfExtractor.java:109)
    at com.liaoyoujin.pdfbox.doc.PdfExtractor$Job.run(PdfExtractor.java:178)
    at com.liaoyoujin.thread.pool.BlockThreadPool$Worker.run(BlockThreadPool.java:53)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
java.util.ConcurrentModificationException
    at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
    at java.util.Vector$Itr.next(Vector.java:1133)

-- Original Message --
From: "Hesham G."
Sent: Friday, 18 March 2016, 4:44 PM
To: "users"
Subject: Re: Spaces are ignored when reading a PDF file

John,

I think I have got the idea ... Thumbs up

Best regards,
Hesham
Re: Spaces are ignored when reading a PDF file
Just an idea from someone who is not fluent in PDFBox or PDF: if you only want to know whether there is a space between the letters, and not how many spaces, you can use your code to get the character details and then use extractText to get the words.

2016-03-17 7:20 GMT-03:00 Hesham G.:

> Andreas,
>
> That is very helpful.
>
> I can get the x location of each character using TextPosition.getX(), ex:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
>
> So to detect the space between the 2 words "With" & "due" should I make
> subtraction calculations between X of the last letter (h) and the X of the
> first letter (d) and if the number is larger than normal then this is a
> space? I think this way might be risky in the detection, or what?
>
> Best regards,
> Hesham
Re: Spaces are ignored when reading a PDF file
On 17.03.2016 at 11:20, Hesham G. wrote:
> So to detect the space between the 2 words "With" & "due" should I make
> subtraction calculations between X of the last letter (h) and the X of the
> first letter (d) and if the number is larger than normal then this is a
> space? I think this way might be risky in the detection, or what?

What you're doing is reinventing the PDFTextStripper code, which has some strategies to decide where there are spaces. That's not a bad idea (there are some weaknesses), however it is indeed... "tricky".

https://www.youtube.com/watch?v=cjEdxO91RWQ&feature=youtu.be&t=3m33s
Re: Spaces are ignored when reading a PDF file
Hi,

> Frank van der Hulst wrote on 17 March 2016 at 08:34:
>
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.

That's not correct. Spaces exist, but in most cases PDF engines omit them and replace them with split text and appropriate positioning.

BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:

    [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384 (the) -383 (right) ] TJ

The text is between the parentheses, and the numbers are used for horizontal positioning.

BR
Andreas
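To make the TJ excerpt concrete: in a TJ array the numbers are expressed in thousandths of a text-space unit and are subtracted from the current position, so a large negative number such as -383 widens the gap (a word break here), while small values such as 55 or 18 are kerning. The sketch below walks the array and inserts a space whenever the adjustment is at or below a threshold. The class name TjSpaceSketch and the threshold of -200 are assumptions chosen for this sample, not anything PDFBox defines, and the string bytes are treated as plain ASCII, which happens to hold for this excerpt but not for PDFs in general.

```java
// Hypothetical helper, not part of PDFBox: infers spaces from a TJ array literal.
final class TjSpaceSketch {

    // Assumed cut-off: adjustments this negative are treated as word gaps.
    static final double SPACE_THRESHOLD = -200;

    static String decode(String tjArray) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = tjArray.length();
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy until the matching unescaped ')'.
                i++;
                int depth = 1;
                while (i < n) {
                    char s = tjArray.charAt(i);
                    if (s == '\\' && i + 1 < n) {
                        // Simplified escape handling: keep the escaped character as-is.
                        out.append(tjArray.charAt(i + 1));
                        i += 2;
                        continue;
                    }
                    if (s == '(') {
                        depth++;
                    }
                    if (s == ')' && --depth == 0) {
                        i++;
                        break;
                    }
                    out.append(s);
                    i++;
                }
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Positioning number between two strings.
                int start = i;
                i++;
                while (i < n && (Character.isDigit(tjArray.charAt(i)) || tjArray.charAt(i) == '.')) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment <= SPACE_THRESHOLD) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace, and the TJ operator itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String excerpt = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        // prints: With due regard to Article (219), the competent authority has the right
        System.out.println(decode(excerpt));
    }
}
```

A fixed threshold only works because this sample uses one font and one size; a general solution has to account for the font's actual space width, which is the kind of logic a text stripper carries.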
Re: Spaces are ignored when reading a PDF file
Andreas,

That is very helpful.

I can get the x location of each character using TextPosition.getX(), ex:

    W: 102.88399
    i: 114.18165
    t: 117.660614
    h: 121.55801
    d: 133.09477
    u: 140.3994
    e: 147.60838

So to detect the space between the 2 words "With" & "due" should I make subtraction calculations between X of the last letter (h) and the X of the first letter (d) and if the number is larger than normal then this is a space? I think this way might be risky in the detection, or what?

Best regards,
Hesham
Re: Spaces are ignored when reading a PDF file
Andreas,

You're absolutely right. I am testing it now, but it seems very complicated. I hope there might be another easier solution.

Best regards,
Hesham

Included message:

> "Hesham G." wrote on 17 March 2016 at 11:20:
>
> Andreas,
>
> That is very helpful.
>
> I can get the x location of each character using TextPosition.getX(), ex:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
>
> So to detect the space between the 2 words "With" & "due" should I make
> subtraction calculations between X of the last letter (h) and the X of the
> first letter (d) and if the number is larger than normal then this is a
> space? I think this way might be risky in the detection, or what?

That's the short story. To decide what is normal could be quite tricky. You have to take the following facts into account:

- different fonts have different widths (important if the font before the space isn't the same as the font after the space)
- keep in mind that you have to take scaling and sometimes rotation into account
- the "space" between characters may vary if the text is justified

There are certainly some other details which may be important as well, so you end up with a more or less heuristic approach.

BR
Andreas
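The subtraction idea discussed above can be sketched as follows: compute the gap between the end of one glyph (x plus width) and the start of the next, and call it a space when the gap exceeds some fraction of the average glyph width on the line. The x values are the ones posted in the thread. The widths are NOT from the thread; they are hypothetical values picked so the glyphs roughly abut, and the 0.3 factor is likewise an assumption. The class name GapSpaceSketch is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: gap-based space detection on one line.
final class GapSpaceSketch {

    static final class Glyph {
        final char ch;
        final double x;      // left edge, as reported by TextPosition.getX() in the thread
        final double width;  // assumed advance width, not taken from the thread

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed tolerance: a gap above 30% of the mean glyph width counts as a space.
    static final double GAP_FACTOR = 0.3;

    static String join(List<Glyph> line) {
        if (line.isEmpty()) {
            return "";
        }
        double totalWidth = 0;
        for (Glyph g : line) {
            totalWidth += g.width;
        }
        double threshold = GAP_FACTOR * totalWidth / line.size();

        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            // Gap between the right edge of the previous glyph and this glyph's left edge.
            if (prev != null && g.x - (prev.x + prev.width) > threshold) {
                out.append(' ');
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph('W', 102.88399, 11.30),
                new Glyph('i', 114.18165, 3.48),
                new Glyph('t', 117.660614, 3.90),
                new Glyph('h', 121.55801, 6.96),
                new Glyph('d', 133.09477, 7.30),
                new Glyph('u', 140.3994, 7.21),
                new Glyph('e', 147.60838, 5.57));
    }

    public static void main(String[] args) {
        System.out.println(join(sample())); // prints: With due
    }
}
```

This ignores every complication listed above: mixed fonts, scaling, rotation, and justified text all change what a "normal" gap is. PDFTextStripper exposes setSpacingTolerance and setAverageCharTolerance to tune its own version of this decision, which is usually a better starting point than a hand-rolled threshold.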
Re: Spaces are ignored when reading a PDF file
John,

I think I have got the idea ... Thumbs up.

Best regards,
Hesham

Included message:

I’m rather confused by this thread: inferring spaces is one of the main features of PDFTextStripper. I’m not sure why anyone is suggesting processing the text manually; there’s no need to do that. We do that already!

Looking at the original code, the problem is right here:

> public class PDFTextStripperProcessor extends PDFTextStripper {
>     @Override
>     public void processTextPosition( TextPosition text ) {
>         System.out.println( text.getCharacter() );
>     }
> }

The processTextPosition method is used to pass an unprocessed TextPosition *in* to PDFTextStripper, but this override prevents that from happening, and just prints the unprocessed token before PDFTextStripper has had a chance to do its job, such as inferring the missing spaces.

You should follow our PrintTextLocations.java example, which shows you how to get the processed TextPositions from PDFTextStripper. It’s really easy to do.

-- John

> On 17 Mar 2016, at 04:44, Hesham G. wrote:
>
> Andreas,
>
> You're absolutely right. I am testing it now, but it seems very complicated.
> I hope there might be another, easier solution.
>
> Best regards,
> Hesham
>
> Included message:
>
>> "Hesham G." wrote on 17 March 2016 at 11:20:
>>
>> Andreas,
>>
>> That is very helpful.
>>
>> I can get the x location of each character using TextPosition.getX(), e.g.:
>> W: 102.88399
>> i: 114.18165
>> t: 117.660614
>> h: 121.55801
>> d: 133.09477
>> u: 140.3994
>> e: 147.60838
>>
>> So to detect the space between the two words "With" and "due", should I
>> compute the difference between the X of the last letter (h) and the X of the
>> first letter (d), and if the number is larger than normal then this is a
>> space? I think this way might be risky in the detection, or what?
>
> That's the short story. Deciding what is normal can be quite tricky. You have
> to take the following facts into account:
>
> - different fonts have different widths (important if the font before the
>   space isn't the same as the font after the space)
> - you have to take scaling and sometimes rotation into account
> - the "space" between characters may vary if the text is justified
>
> There are certainly other details which may be important as well, so you end
> up with a more or less heuristic approach.
>
> BR
> Andreas
>
>> Best regards,
>> Hesham
>>
>> Included message:
>>
>> Hi,
>>
>> > Frank van der Hulst wrote on 17 March 2016 at 08:34:
>> >
>> > Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> > compare the X coordinates of adjacent characters against their widths.
>>
>> That's not correct: spaces exist, but in most cases PDF engines omit them and
>> replace them with split text and appropriate positioning.
>>
>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>>
>> [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
>> -384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
>> (the) -383 (right) ] TJ
>>
>> The text is in between the parentheses and the numbers are used for
>> horizontal positioning.
>>
>> BR
>> Andreas
>>
>> > On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. wrote:
>> >
>> > > Hello,
>> > >
>> > > I have a PDF file created using LaTeX. I am trying to read and print all
>> > > letters in that file using PDFBox, but when doing this all spaces in that
>> > > file are ignored. Here is the code I am using:
>> > >
>> > > PDPage page = (PDPage)allPages.get( 0 );
>> > > PDStream contents = page.getContents();
>> > > if ( contents != null ) {
>> > >     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
>> > >     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
>> > > }
>> > >
>> > > public class PDFTextStripperProcessor extends PDFTextStripper {
>> > >     @Override
>> > >     public void processTextPosition( TextPosition text ) {
>> > >         System.out.println( text.getCharacter() );
>> > >     }
>> > > }
>> > >
>> > > And you can check a one-page sample file here to test it:
>> > >
>> > > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>> > >
>> > > What is the cause of this issue, please?
>> > >
>> > > Best regards,
>> > > Hesham
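
For readers wondering how the numbers in the TJ array quoted above turn into positions: in the PDF text model each number is given in thousandths of a unit of text space and is subtracted from the current horizontal position, so a negative number pushes the next string to the right. The following is only a sketch of that arithmetic, not PDFBox code. The names TjAdjustment and displacement are made up for illustration, and it assumes character spacing and word spacing are zero and that no further scaling comes from the text matrix. The font size 11.9552 is the value PDFBox reports for this text elsewhere in the thread.

```java
import java.util.Locale;

// Illustrative only: TjAdjustment and displacement() are hypothetical names,
// not part of PDFBox.
final class TjAdjustment {

    /**
     * Horizontal shift caused by one number inside a TJ array.
     * Assumes zero character/word spacing and no extra text-matrix scaling.
     *
     * @param tjNumber        the number from the array, in thousandths of text space
     * @param fontSize        the current font size
     * @param horizontalScale horizontal scaling as a factor (1.0 means 100%)
     */
    static double displacement(double tjNumber, double fontSize, double horizontalScale) {
        // The number is subtracted from the position, hence the sign flip.
        return -(tjNumber / 1000.0) * fontSize * horizontalScale;
    }

    public static void main(String[] args) {
        double fontSize = 11.9552;
        // Kerning between "W" and "ith": a small shift to the left.
        System.out.println(String.format(Locale.ROOT, "after (W):   %.4f",
                displacement(55, fontSize, 1.0)));
        // Gap between "ith" and "due": a word-sized shift to the right.
        System.out.println(String.format(Locale.ROOT, "after (ith): %.4f",
                displacement(-383, fontSize, 1.0)));
    }
}
```

Under these assumptions the -383 entry works out to roughly 4.58 units, which appears to agree with the distance between the end of "h" and the start of "d" in the coordinates reported in this thread.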
Re: Spaces are ignored when reading a PDF file
Spaces don't exist as characters in PDFs. To identify spaces, you have to compare the X coordinates of adjacent characters against their widths.
Spaces are ignored when reading a PDF file
Hello,

I have a PDF file created using LaTeX. I am trying to read and print all letters in that file using PDFBox, but when doing this all spaces in that file are ignored. Here is the code I am using:

    PDPage page = (PDPage)allPages.get( 0 );
    PDStream contents = page.getContents();
    if ( contents != null ) {
        PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
        pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
    }

    public class PDFTextStripperProcessor extends PDFTextStripper {
        @Override
        public void processTextPosition( TextPosition text ) {
            System.out.println( text.getCharacter() );
        }
    }

And you can check a one-page sample file here to test it:

https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf

What is the cause of this issue, please?

Best regards,
Hesham
Re: Spaces are ignored when reading a PDF file
Clovis,

Thanks a lot :)

I will have to follow this solution if there is no alternative. The problem is that if I am extracting the text of a 500- or 600-page PDF, that will consume a lot of additional memory and time. In addition, I guess it's a special case for LaTeX books only.

Best regards,
Hesham

Included message:

Just an idea from someone who is fluent in neither PDFBox nor PDF: if you just want to know whether there is a space between the letters, and not the amount of space, you can use your code to get the character details and then use extractText to get the words.
Re: Spaces are ignored when reading a PDF file
On 17.03.2016 at 07:12, Hesham G. wrote:
> Hello,
>
> I have a PDF file created using LaTeX. I am trying to read and print all
> letters in that file using PDFBox, but when doing this all spaces in that
> file are ignored.

Here's what I get with ExtractText (your code is unusual); this looks excellent to me:

    article titles c©by Michael O’Kane are not part of the law mu7ami.com Article [220] Right to Regulate With due regard to Article (219), the competent authority has the right of monitoring the companies with regard to application of the provisions set forth in the law and the company’s articles of association and bylaw including the authority to inspect the company and check its account and ask for data from the board of directors or the company managers through a representative or more of its personnel or experts it chooses for this pur- pose. Article [221] Access to Records All the company officials shall acquaint the Ministry representatives and the Authority, fi the company is listed in the financial market or seeking to be listed, with regard to the works stated in Article (220), all that they ask of company books and records and documents and provide them with all related information or clarification. 94 version 0.2 provided by mu7ami.com
Re: Spaces are ignored when reading a PDF file
Tilman,

I am using this code to extract the text from the PDF because I need font information about the extracted characters, such as the name of the font used. The normal extraction code will not work in my case.

Best regards,
Hesham
Re: Spaces are ignored when reading a PDF file
John,

I have checked the PrintTextLocations.java example. I have tested it on the "With due" term in my book sample, using this code:

    System.out.println( "String[" + text.getCharacter() + ": " + text.getXDirAdj() + ","
            + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" + text.getXScale()
            + " height=" + text.getHeightDir() + " space=" + text.getWidthOfSpace()
            + " width=" + text.getWidthDirAdj() + "]" );

And here are the results:

    String[W: 102.88399,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=11.9552]
    String[i: 114.18165,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.4789658]
    String[t: 117.660614,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.8973923]
    String[h: 121.55801,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=6.957924]
    String[d: 133.09477,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.3046265]
    String[u: 140.3994,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.2089844]
    String[e: 147.60838,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=5.7265472]

So which method do you mean? The getXDirAdj()?

Best regards,
Hesham
Re: Spaces are ignored when reading a PDF file
> "Hesham G." wrote on 17 March 2016 at 11:20:
>
> Andreas,
>
> That is very helpful.
>
> I can get the x location of each character using TextPosition.getX(), e.g.:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
>
> So to detect the space between the two words "With" and "due", should I
> compute the difference between the X of the last letter (h) and the X of the
> first letter (d), and if the number is larger than normal then this is a
> space? I think this way might be risky in the detection, or what?

That's the short story. Deciding what is normal can be quite tricky. You have to take the following facts into account:

- different fonts have different widths (important if the font before the space isn't the same as the font after the space)
- you have to take scaling and sometimes rotation into account
- the "space" between characters may vary if the text is justified

There are certainly other details which may be important as well, so you end up with a more or less heuristic approach.

BR
Andreas
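
To make the gap comparison discussed above concrete, here is a rough sketch in plain Java using the positions and widths that PrintTextLocations reported for "With due" in this thread. It is not how PDFTextStripper is implemented. SpaceGapSketch, Glyph and the 0.5 tolerance are assumptions chosen for illustration, and the sketch ignores every complication listed above (mixed fonts, rotation, justified text).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these names exist in PDFBox.
final class SpaceGapSketch {

    /** Minimal stand-in for the few TextPosition values used here. */
    static final class Glyph {
        final String text;
        final double x;      // left edge, as printed by getXDirAdj()
        final double width;  // advance width, as printed by getWidthDirAdj()

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Values copied from the output posted in this thread. */
    static List<Glyph> sampleGlyphs() {
        return Arrays.asList(
                new Glyph("W", 102.88399, 11.9552),
                new Glyph("i", 114.18165, 3.4789658),
                new Glyph("t", 117.660614, 3.8973923),
                new Glyph("h", 121.55801, 6.957924),
                new Glyph("d", 133.09477, 7.3046265),
                new Glyph("u", 140.3994, 7.2089844),
                new Glyph("e", 147.60838, 5.7265472));
    }

    /**
     * Inserts a space wherever the gap between the end of one glyph and the
     * start of the next exceeds tolerance * spaceWidth.
     */
    static String joinWithInferredSpaces(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph current : glyphs) {
            if (previous != null) {
                double gap = current.x - (previous.x + previous.width);
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 2.9888 is the "space=" value from the same output; 0.5 is an assumed tolerance.
        System.out.println(joinWithInferredSpaces(sampleGlyphs(), 2.9888, 0.5));
    }
}
```

For this one sample the only gap that exceeds the threshold is the one between "h" and "d", so the sketch prints "With due". A fixed tolerance like this is exactly the kind of heuristic that may break on other documents.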
Re: Spaces are ignored when reading a PDF file
I’m rather confused by this thread: inferring spaces is one of the main features of PDFTextStripper. I’m not sure why anyone is suggesting processing the text manually; there’s no need to do that. We do that already!

Looking at the original code, the problem is right here:

> public class PDFTextStripperProcessor extends PDFTextStripper {
>     @Override
>     public void processTextPosition( TextPosition text ) {
>         System.out.println( text.getCharacter() );
>     }
> }

The processTextPosition method is used to pass an unprocessed TextPosition *in* to PDFTextStripper, but this override prevents that from happening, and just prints the unprocessed token before PDFTextStripper has had a chance to do its job, such as inferring the missing spaces.

You should follow our PrintTextLocations.java example, which shows you how to get the processed TextPositions from PDFTextStripper. It’s really easy to do.

-- John
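
The difference between the input hook and the output hook described above can be shown with a toy model. None of the classes below are PDFBox classes: ToyStripper, processGlyph and this one-argument writeString are hypothetical stand-ins, and the word grouping is faked with a boolean flag. In PDFBox 2.x the output hook that PrintTextLocations overrides is writeString(String, List<TextPosition>); please check the example that ships with the version you are using.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes imitate the shape of the problem, not PDFBox.
final class HookOverrideSketch {

    static class ToyStripper {
        private final StringBuilder pending = new StringBuilder();
        private final List<String> words = new ArrayList<>();

        /** Input hook: receives raw glyphs and groups them into words. */
        protected void processGlyph(String glyph, boolean gapBefore) {
            if (gapBefore && pending.length() > 0) {
                flush();
            }
            pending.append(glyph);
        }

        /** Output hook: receives text after the grouping has been done. */
        protected void writeString(String word) {
            words.add(word);
        }

        private void flush() {
            writeString(pending.toString());
            pending.setLength(0);
        }

        String getText() {
            if (pending.length() > 0) {
                flush();
            }
            return String.join(" ", words);
        }
    }

    /** Same mistake as in the thread: the input hook never reaches the base class. */
    static class RawPrinter extends ToyStripper {
        @Override
        protected void processGlyph(String glyph, boolean gapBefore) {
            System.out.println(glyph);
        }
    }

    /** Overrides the output hook instead, so it sees already-grouped words. */
    static class WordCollector extends ToyStripper {
        final List<String> seen = new ArrayList<>();

        @Override
        protected void writeString(String word) {
            seen.add(word);
            super.writeString(word);
        }
    }

    /** Feeds "With due" glyph by glyph, flagging the gap before "d". */
    static String feed(ToyStripper stripper) {
        String[] glyphs = {"W", "i", "t", "h", "d", "u", "e"};
        for (int i = 0; i < glyphs.length; i++) {
            stripper.processGlyph(glyphs[i], i == 4);
        }
        return stripper.getText();
    }

    public static void main(String[] args) {
        System.out.println("[" + feed(new RawPrinter()) + "]");
        System.out.println("[" + feed(new WordCollector()) + "]");
    }
}
```

In this model RawPrinter ends up with empty text because the base class never receives a glyph, while WordCollector receives the two grouped words.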