Re: Spaces are ignored when reading a PDF file

风云天空 Sat, 19 Mar 2016 20:38:50 -0700

Can anyone help me?
I get this error when rendering from multiple threads:
java.lang.NullPointerException
        at java.awt.color.ICC_Profile.activateDeferredProfile(ICC_Profile.java:1086)
        at java.awt.color.ICC_Profile$1.activate(ICC_Profile.java:742)
        at sun.java2d.cmm.ProfileDeferralMgr.activateProfiles(ProfileDeferralMgr.java:95)
        at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:775)
        at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1013)
        at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.loadICCProfile(PDICCBased.java:119)
        at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.<init>(PDICCBased.java:89)
        at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:182)
        at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:172)
        at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:142)
        at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:41)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:814)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:471)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
        at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
        at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
        at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
        at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:80)
        at com.liaoyoujin.pdfbox.doc.PdfExtractor.getFirstImage(PdfExtractor.java:109)
        at com.liaoyoujin.pdfbox.doc.PdfExtractor$Job.run(PdfExtractor.java:178)
        at com.liaoyoujin.thread.pool.BlockThreadPool$Worker.run(BlockThreadPool.java:53)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
java.util.ConcurrentModificationException
        at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
        at java.util.Vector$Itr.next(Vector.java:1133)




------------------ Original message ------------------
From: "Hesham G.";<[email protected]>;
Sent: Friday, 18 March 2016, 4:44 PM
To: "users"<[email protected]>;

Subject: Re: Spaces are ignored when reading a PDF file



   John,
  
 I think I have got the idea ... Thumbs up
  
  
 Best regards ,
 Hesham 
  
 ------------------------------------------------------------------------
 Included message :
  
 I’m rather confused by this thread; inferring spaces is one of the main
features of PDFTextStripper. I’m not sure why anyone is suggesting to process
the text manually - there’s no need to do that. We do that already!
  
 Looking at the original code the problem is right here:
  
 > public class PDFTextStripperProcessor extends PDFTextStripper {
 >    @Override
 >    public void processTextPosition( TextPosition text  )  {
 >        System.out.println(  text.getCharacter() );
 >    }
 > }
  
 The processTextPosition method is used to pass an unprocessed TextPosition  
*in* to PDFTextStripper, but this override prevents that from happening, and is 
 just printing the unprocessed token before PDFTextStripper has had a chance to 
 do its job, such as inferring the missing spaces.
  
 You should follow our PrintTextLocations.java example, which shows you how to
get the processed TextPositions from PDFTextStripper. It’s really easy to do.
  
 — John
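
 To illustrate the point above, here is a minimal, self-contained Java sketch
 of why an override that never forwards to the base class leaves the base class
 with nothing to post-process. MiniStripper, SwallowingStripper and
 ForwardingStripper are hypothetical stand-ins written for this note; they are
 NOT PDFBox classes, and only the hook name is borrowed from the thread. The
 sketch assumes, as described above, that the base class can only work on
 glyphs that reach it through the hook.

```java
import java.util.ArrayList;
import java.util.List;

class HookOverrideDemo {

    // Hypothetical stand-in for a text stripper; NOT the PDFBox class.
    static class MiniStripper {
        private final List<String> buffered = new ArrayList<>();

        // Inlet hook: the engine hands every raw glyph to this method.
        protected void processTextPosition(String glyph) {
            buffered.add(glyph);
        }

        // Later stages (such as space inference) could only use these glyphs.
        int bufferedCount() {
            return buffered.size();
        }

        // Simulates the content-stream engine feeding glyphs to the hook.
        void feed(String... glyphs) {
            for (String glyph : glyphs) {
                processTextPosition(glyph);
            }
        }
    }

    // Mirrors the override from the thread: prints, but never calls super.
    static class SwallowingStripper extends MiniStripper {
        @Override
        protected void processTextPosition(String glyph) {
            System.out.println(glyph);
        }
    }

    // Prints the raw glyph and still lets the base class see it.
    static class ForwardingStripper extends MiniStripper {
        @Override
        protected void processTextPosition(String glyph) {
            System.out.println(glyph);
            super.processTextPosition(glyph);
        }
    }

    public static void main(String[] args) {
        MiniStripper swallowing = new SwallowingStripper();
        swallowing.feed("W", "i", "t", "h");
        System.out.println("swallowing buffered: " + swallowing.bufferedCount());

        MiniStripper forwarding = new ForwardingStripper();
        forwarding.feed("W", "i", "t", "h");
        System.out.println("forwarding buffered: " + forwarding.bufferedCount());
    }
}
```

 In real PDFBox code the analogous change would presumably be to forward to
 the superclass from the override, or to follow PrintTextLocations.java as
 suggested above.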
  
 > On 17 Mar 2016, at 04:44, Hesham G. <[email protected]>  wrote:
 > 
 > Andreas,
 > 
 > You're absolutely right. I am testing it now, but it seems very  
 > complicated. I hope there might be another easier solution.
 > 
 > 
 > Best regards ,
 > Hesham
 > 
 >  ------------------------------------------------------------------------
 > Included message :
 > 
 >> On 17 March 2016 at 11:20, "Hesham G." <[email protected]> wrote:
 >> 
 >> 
 >> Andreas,
 >> 
 >> That is very helpful.
 >> 
 >> I can get the x location of each character using TextPosition.getX(), e.g.:
 >> W: 102.88399
 >> i: 114.18165
 >> t: 117.660614
 >> h: 121.55801
 >> d: 133.09477
 >> u: 140.3994
 >> e: 147.60838
 >> 
 >> So to detect the space between the two words "With" and "due", should I
 >> subtract the X of the last letter (h) from the X of the first letter (d),
 >> and treat the gap as a space if it is larger than normal? I think this
 >> approach might be risky for detection, or what?
 > That's the short story. Deciding what is normal can be quite tricky. You
 > have to take the following facts into account:
 > 
 > - different fonts have different widths (important if the font before the
 >   space isn't the same as the font after the space)
 > - keep in mind that you have to take scaling and sometimes rotation into
 >   account
 > - the "space" between characters may vary if the text is justified
 > 
 > There are certainly other details which may be important as well, so
 > you end up with a more or less heuristic approach.
 > 
 > BR
 > Andreas
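
 The gap idea discussed above can be sketched in a few lines of plain Java.
 This is only an illustration, not what PDFTextStripper does internally. The X
 values are the ones quoted in the thread; the glyph widths, the space width
 of 3.0 and the 0.5 fraction are all made-up assumptions, and GapSpaceDemo and
 Glyph are hypothetical names. It ignores fonts, scaling, rotation and
 justification, which is exactly why the real heuristic is harder.

```java
class GapSpaceDemo {

    /** One glyph: its text, left edge X and advance width, in the same units. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** X values are from the thread; every width here is a hypothetical value. */
    static Glyph[] sample() {
        return new Glyph[] {
            new Glyph("W", 102.88399, 11.0),
            new Glyph("i", 114.18165, 3.3),
            new Glyph("t", 117.660614, 3.7),
            new Glyph("h", 121.55801, 6.5),
            new Glyph("d", 133.09477, 6.5),
            new Glyph("u", 140.3994, 6.5),
            new Glyph("e", 147.60838, 5.3),
        };
    }

    /**
     * Joins glyphs, inserting a space whenever the gap between the end of the
     * previous glyph and the start of the next exceeds fraction * spaceWidth.
     */
    static String join(Glyph[] glyphs, double spaceWidth, double fraction) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0) {
                Glyph prev = glyphs[i - 1];
                double gap = glyphs[i].x - (prev.x + prev.width);
                if (gap > spaceWidth * fraction) {
                    out.append(' ');
                }
            }
            out.append(glyphs[i].text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed space width 3.0 and threshold fraction 0.5.
        System.out.println(join(sample(), 3.0, 0.5)); // prints "With due"
    }
}
```

 Note that the raw start-to-start distance from W to i (about 11.3) is almost
 as large as the one from h to d (about 11.5), so comparing X values alone,
 without widths, would not separate the two cases.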
 > 
 >> Best regards ,
 >> Hesham
 >> 
 >>  ------------------------------------------------------------------------
 >> Included message :
 >> 
 >> Hi,
 >> 
 >> > On 17 March 2016 at 08:34, Frank van der Hulst <[email protected]> wrote:
 >> >
 >> >
 >> > Spaces don't exist as characters in PDFs. To identify spaces, you have to
 >> > compare the X coordinates of adjacent characters against their widths.
 >> That's not correct: spaces exist, but in most cases PDF engines omit them
 >> and replace them with split text and appropriate positioning.
 >> 
 >> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
 >> 
 >>   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
 >>     (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority)
 >>     -383 (has) -384 (the) -383 (right) ] TJ
 >> 
 >> The text is in between the parentheses and the numbers are used for
 >> horizontal positioning.
 >> 
 >> BR
 >> Andreas
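
 As a rough illustration of the TJ array quoted above, the following
 self-contained Java sketch walks such an array and inserts a space wherever
 the positioning number is strongly negative. In a TJ array the numbers are
 expressed in thousandths of a text space unit and are subtracted from the
 current position, so a large negative value widens the gap. The class name
 TjArrayDemo and the -200 threshold are assumptions chosen for this excerpt,
 and the parser is deliberately simplified: it handles only backslash-escaped
 characters, not octal escapes, nested parentheses or hex strings.

```java
class TjArrayDemo {

    /**
     * Concatenates the string operands of a TJ array, adding a space whenever
     * a numeric adjustment is at or below spaceThreshold (a negative number).
     */
    static String decode(String tj, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tj.length()) {
            char c = tj.charAt(i);
            if (c == '(') {
                // Literal string: copy characters up to the closing parenthesis.
                i++;
                while (i < tj.length() && tj.charAt(i) != ')') {
                    if (tj.charAt(i) == '\\' && i + 1 < tj.length()) {
                        i++; // simplified escape handling: take next char as-is
                    }
                    out.append(tj.charAt(i));
                    i++;
                }
                i++; // skip the closing ')'
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Numeric adjustment between two strings.
                int start = i;
                i++;
                while (i < tj.length()
                        && (tj.charAt(i) == '.' || Character.isDigit(tj.charAt(i)))) {
                    i++;
                }
                double adjustment = Double.parseDouble(tj.substring(start, i));
                if (adjustment <= spaceThreshold) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace and the TJ operator itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tj = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        // Assumed threshold: adjustments of -200 or less count as a word gap.
        System.out.println(decode(tj, -200.0));
    }
}
```

 With this threshold the small positive values (55, 18) are treated as
 kerning inside a word, while values around -383 are treated as word gaps.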
 >> 
 >> >
 >> > On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <[email protected]> wrote:
 >> >
 >> > > Hello ,
 >> > >
 >> > > I have a PDF file created using LaTeX. I am trying to read and print all
 >> > > letters in that file using PDFBox, but when doing this all spaces in
 >> > > that file are ignored. Here is the code I am using:
 >> > > PDPage page = (PDPage)allPages.get( 0 );
 >> > > PDStream contents = page.getContents();
 >> > > if ( contents != null ) {
 >> > >     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
 >> > >     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
 >> > > }
 >> > >
 >> > > public class PDFTextStripperProcessor extends PDFTextStripper {
 >> > >     @Override
 >> > >     public void processTextPosition( TextPosition text ) {
 >> > >         System.out.println( text.getCharacter() );
 >> > >     }
 >> > > }
 >> > >
 >> > > And you can check a one page file sample here to test  it:
 >> > >
 >> > >  
 >> > > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
 >> > >
 >> > > What is the cause of this issue please?
 >> > >
 >> > >
 >> > > Best regards ,
 >> > > Hesham
 >> 
 >>  ---------------------------------------------------------------------
 >> To unsubscribe, e-mail: [email protected]
 >> For additional commands, e-mail:  [email protected]
 >> 
 >> 
