The subject of this thread is "Spaces are ignored when reading a PDF file". Please post new questions to a new thread.
-- John

> On 18 Mar 2016, at 04:02, 风云天空 <[email protected]> wrote:
>
> Who can help me?
> I get this error in multithreading:
>
> java.lang.NullPointerException
>     at java.awt.color.ICC_Profile.activateDeferredProfile(ICC_Profile.java:1086)
>     at java.awt.color.ICC_Profile$1.activate(ICC_Profile.java:742)
>     at sun.java2d.cmm.ProfileDeferralMgr.activateProfiles(ProfileDeferralMgr.java:95)
>     at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:775)
>     at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1013)
>     at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.loadICCProfile(PDICCBased.java:119)
>     at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.<init>(PDICCBased.java:89)
>     at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:182)
>     at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:172)
>     at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:142)
>     at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:41)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:814)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:471)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:80)
>     at com.liaoyoujin.pdfbox.doc.PdfExtractor.getFirstImage(PdfExtractor.java:109)
>     at com.liaoyoujin.pdfbox.doc.PdfExtractor$Job.run(PdfExtractor.java:178)
>     at com.liaoyoujin.thread.pool.BlockThreadPool$Worker.run(BlockThreadPool.java:53)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> java.util.ConcurrentModificationException
>     at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
>     at java.util.Vector$Itr.next(Vector.java:1133)
>
> ------------------ Original Message ------------------
> From: "Hesham G.";<[email protected]>;
> Sent: Friday, 18 March 2016, 4:44 PM
> To: "users"<[email protected]>;
> Subject: Re: Spaces are ignored when reading a PDF file
>
> John,
>
> I think I have got the idea ... Thumbs up
>
> Best regards,
> Hesham
>
> ------------------------------------------------------------------------
> Included message :
>
> I’m rather confused by this thread, inferring spaces is one of the main
> features of PDFTextStripper. I’m not sure why anyone is suggesting to process
> the text manually - there’s no need to do that. We do that already!
>
> Looking at the original code the problem is right here:
>
>> public class PDFTextStripperProcessor extends PDFTextStripper {
>>     @Override
>>     public void processTextPosition( TextPosition text ) {
>>         System.out.println( text.getCharacter() );
>>     }
>> }
>
> The processTextPosition method is used to pass an unprocessed TextPosition
> *in* to PDFTextStripper, but this override prevents that from happening, and
> is just printing the unprocessed token before PDFTextStripper has had a
> chance to do its job, such as inferring the missing spaces.
>
> You should follow our PrintTextLocations.java example which shows you how to
> get the processed TextPositions from PDFTextStripper. It’s really easy to do.
>
> -- John
>
>> On 17 Mar 2016, at 04:44, Hesham G. <[email protected]> wrote:
>>
>> Andreas,
>>
>> You're absolutely right. I am testing it now, but it seems very
>> complicated.
>> I hope there might be another, easier solution.
>>
>> Best regards,
>> Hesham
>>
>> ------------------------------------------------------------------------
>> Included message :
>>
>>> "Hesham G." <[email protected]> wrote on 17 March 2016 at 11:20:
>>>
>>> Andreas,
>>>
>>> That is very helpful.
>>>
>>> I can get the x location of each character using TextPosition.getX(), e.g.:
>>> W: 102.88399
>>> i: 114.18165
>>> t: 117.660614
>>> h: 121.55801
>>> d: 133.09477
>>> u: 140.3994
>>> e: 147.60838
>>>
>>> So to detect the space between the 2 words "With" & "due", should I
>>> subtract the X of the last letter (h) from the X of the first letter (d),
>>> and if the result is larger than normal then this is a space? I think
>>> this way of detecting might be risky, or what?
>> That's the short story. To decide what is normal could be quite tricky. You
>> have to take the following facts into account:
>>
>> - different fonts have different widths (important if the font before the
>>   space isn't the same as the font after the space)
>> - keep in mind that you have to take a scaling and sometimes a rotation into
>>   account
>> - the "space" between characters may vary if the text is justified
>>
>> There are certainly some other details which may be important as well, so
>> that you end up with a more or less heuristic approach.
>>
>> BR
>> Andreas
>>
>>> Best regards,
>>> Hesham
>>>
>>> ------------------------------------------------------------------------
>>> Included message :
>>>
>>> Hi,
>>>
>>>> Frank van der Hulst <[email protected]> wrote on 17 March 2016 at 08:34:
>>>>
>>>> Spaces don't exist as characters in PDFs. To identify spaces, you have
>>>> to compare the X coordinates of adjacent characters against their widths.
>>> That's not correct, spaces exist but in most cases PDF engines omit them
>>> and replace spaces by split text with appropriate positioning.
>>>
>>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>>>
>>> [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
>>> (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority)
>>> -383 (has) -384 (the) -383 (right) ] TJ
>>>
>>> The text is in between the parentheses and the numbers are used for
>>> horizontal positioning.
>>>
>>> BR
>>> Andreas
>>>
>>>>
>>>> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a PDF file created using LaTeX. I am trying to read and print
>>>>> all letters in that file using PDFBox, but when doing this all spaces
>>>>> in that file are ignored. Here is the code I am using:
>>>>>
>>>>> PDPage page = (PDPage)allPages.get( 0 );
>>>>> PDStream contents = page.getContents();
>>>>> if ( contents != null ) {
>>>>>     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
>>>>>     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
>>>>> }
>>>>>
>>>>> public class PDFTextStripperProcessor extends PDFTextStripper {
>>>>>     @Override
>>>>>     public void processTextPosition( TextPosition text ) {
>>>>>         System.out.println( text.getCharacter() );
>>>>>     }
>>>>> }
>>>>>
>>>>> And you can check a one-page sample file here to test it:
>>>>>
>>>>> https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>>>>>
>>>>> What is the cause of this issue please?
>>>>>
>>>>> Best regards,
>>>>> Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
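
To make Andreas's point about the TJ array concrete, here is a minimal, self-contained Java sketch. It is not PDFBox code and does not reproduce what PDFTextStripper does internally. It only shows the idea that a strongly negative adjustment in a TJ array (a large move to the right, in thousandths of a text space unit) can be read as a word gap. The class name TjSpaceInference, the method inferText and the threshold of -200 are assumptions chosen for this example, and only the escapes \(, \) and \\ are handled.

```java
class TjSpaceInference {
    // Assumed cut-off: adjustments at or below this value are treated as a word gap.
    // The value -200 is a guess that happens to suit the excerpt quoted in this thread.
    static final double SPACE_THRESHOLD = -200.0;

    /** Concatenates the strings of a TJ array, adding a space after each large negative adjustment. */
    static String inferText(String tjArray) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = tjArray.length();
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy until the matching ')', honouring backslash escapes.
                i++;
                int depth = 1;
                while (i < n) {
                    char d = tjArray.charAt(i);
                    if (d == '\\' && i + 1 < n) {
                        out.append(tjArray.charAt(i + 1)); // simplification: \( \) \\ only
                        i += 2;
                        continue;
                    }
                    if (d == '(') {
                        depth++;
                    } else if (d == ')') {
                        depth--;
                        if (depth == 0) {
                            i++;
                            break;
                        }
                    }
                    out.append(d);
                    i++;
                }
            } else if (c == '-' || c == '+' || c == '.' || Character.isDigit(c)) {
                // Numeric adjustment between two strings.
                int start = i;
                i++;
                while (i < n && (Character.isDigit(tjArray.charAt(i)) || tjArray.charAt(i) == '.')) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                boolean endsWithSpace = out.length() > 0 && out.charAt(out.length() - 1) == ' ';
                if (adjustment <= SPACE_THRESHOLD && out.length() > 0 && !endsWithSpace) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace and the trailing operator name are skipped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String excerpt = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        System.out.println(inferText(excerpt));
    }
}
```

Run on the excerpt quoted in the thread, this prints "With due regard to Article (219), the competent authority has the right". The small positive values (55, 18) are kerning and produce no space. A fixed threshold like this would not hold up for other fonts or for justified text, which is exactly the caveat Andreas raised.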

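
The coordinate comparison that Hesham and Andreas discuss can be sketched in the same spirit. This is a rough illustration only, assuming a single font, a single unrotated line and no justification, so it ignores every complication Andreas listed. The x values are the ones posted in the thread; the widths are made-up placeholders, since the thread never gives them, and the Glyph class, the join method and the gap factor of 0.3 are likewise hypothetical. In real code the position and width would come from TextPosition, and PDFTextStripper already performs this kind of inference for you.

```java
import java.util.Arrays;
import java.util.List;

class GapSpaceHeuristic {
    /** Hypothetical stand-in for the x position and width a text extractor would report. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs left to right and inserts a space wherever the gap between the
     * end of one glyph and the start of the next exceeds gapFactor times the
     * average width of the two glyphs.
     */
    static String join(List<Glyph> glyphs, double gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                double reference = (prev.width + g.width) / 2.0;
                if (gap > gapFactor * reference) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // x positions from the thread; widths are illustrative placeholders.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("W", 102.88399, 11.3),
                new Glyph("i", 114.18165, 3.4),
                new Glyph("t", 117.660614, 3.9),
                new Glyph("h", 121.55801, 6.6),
                new Glyph("d", 133.09477, 7.3),
                new Glyph("u", 140.3994, 7.2),
                new Glyph("e", 147.60838, 5.8));
        System.out.println(join(glyphs, 0.3));
    }
}
```

With these placeholder widths the only gap that clears the threshold is the one between "h" and "d", so the sketch prints "With due". Choosing the gap factor is the tricky "what is normal" decision Andreas mentions, which is a good reason to rely on PDFTextStripper rather than hand-rolling this.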
