Re: Spaces are ignored when reading a PDF file

John Hewson Sun, 20 Mar 2016 01:28:20 -0700

The subject of this thread is "Spaces are ignored when reading a PDF file".
Please post new questions to a new thread.


— John

> On 18 Mar 2016, at 04:02, 风云天空 <[email protected]> wrote:
> 
> Can anyone help me?
> I get this error when running multithreaded:
> java.lang.NullPointerException
>       at java.awt.color.ICC_Profile.activateDeferredProfile(ICC_Profile.java:1086)
>       at java.awt.color.ICC_Profile$1.activate(ICC_Profile.java:742)
>       at sun.java2d.cmm.ProfileDeferralMgr.activateProfiles(ProfileDeferralMgr.java:95)
>       at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:775)
>       at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1013)
>       at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.loadICCProfile(PDICCBased.java:119)
>       at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.<init>(PDICCBased.java:89)
>       at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:182)
>       at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:172)
>       at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:142)
>       at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:41)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:814)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:471)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>       at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
>       at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
>       at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
>       at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:80)
>       at com.liaoyoujin.pdfbox.doc.PdfExtractor.getFirstImage(PdfExtractor.java:109)
>       at com.liaoyoujin.pdfbox.doc.PdfExtractor$Job.run(PdfExtractor.java:178)
>       at com.liaoyoujin.thread.pool.BlockThreadPool$Worker.run(BlockThreadPool.java:53)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> java.util.ConcurrentModificationException
>       at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
>       at java.util.Vector$Itr.next(Vector.java:1133)
> 
> 
> 
> ------------------ Original Message ------------------
> From: "Hesham G." <[email protected]>
> Sent: Friday, 18 March 2016, 4:44 PM
> To: "users" <[email protected]>
> 
> Subject: Re: Spaces are ignored when reading a PDF file
> 
> 
> 
>   John,
> 
> I think I have got the idea ... Thumbs up
> 
> 
> Best regards ,
> Hesham 
> 
> ------------------------------------------------------------------------
> Included message :
> 
> I’m rather confused by this thread: inferring spaces is one of the main
> features of PDFTextStripper. I’m not sure why anyone is suggesting processing
> the text manually. There’s no need to do that; we do it already!
> 
> Looking at the original code the problem is right here:
> 
>> public class PDFTextStripperProcessor extends PDFTextStripper {
>>     @Override
>>     public void processTextPosition(TextPosition text) {
>>         System.out.println(text.getCharacter());
>>     }
>> }
> 
> The processTextPosition method is used to pass an unprocessed TextPosition
> *in* to PDFTextStripper, but this override prevents that from happening. It
> just prints the unprocessed token before PDFTextStripper has had a chance to
> do its job, such as inferring the missing spaces.
> 
> You should follow our PrintTextLocations.java example, which shows how to
> get the processed TextPositions from PDFTextStripper. It’s really easy to do.
> 
> — John
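
The difference between the input hook and the output hook can be shown with a
toy model. The sketch below does not use PDFBox at all: ToyStripper,
processGlyph, writeString, RawPrinter and ProcessedPrinter are hypothetical
names, and the fixed gap threshold of 2.0 is an arbitrary stand-in for the much
more careful processing a real text stripper does. It only illustrates why
overriding the method that feeds glyphs *in* bypasses the processing, while
overriding the method that hands text *out* does not.

```java
/** Toy stand-in for a text stripper. NOT the PDFBox class; all names are hypothetical. */
class ToyStripper {
    private final StringBuilder line = new StringBuilder();
    private double lastEnd = Double.NaN;

    /** Input hook: buffers each raw glyph, adding a space after a wide gap. */
    protected void processGlyph(char c, double x, double width) {
        if (!Double.isNaN(lastEnd) && x - lastEnd > 2.0) { // 2.0 is an arbitrary toy threshold
            line.append(' ');
        }
        line.append(c);
        lastEnd = x + width;
    }

    /** Output hook: receives the processed text. Meant to be overridden. */
    protected void writeString(String text) {
    }

    /** Feeds all glyphs through the input hook, then emits the result once. */
    final void run(char[] chars, double[] xs, double[] widths) {
        for (int i = 0; i < chars.length; i++) {
            processGlyph(chars[i], xs[i], widths[i]);
        }
        writeString(line.toString());
    }
}

/** Mirrors the override from the original post: the base class never sees the glyphs. */
class RawPrinter extends ToyStripper {
    final StringBuilder seen = new StringBuilder();

    @Override
    protected void processGlyph(char c, double x, double width) {
        seen.append(c); // no call to super, so no space inference happens
    }
}

/** Overrides the output hook instead, so the inferred spaces are present. */
class ProcessedPrinter extends ToyStripper {
    String seen = "";

    @Override
    protected void writeString(String text) {
        seen = text;
    }
}

class HookDemo {
    public static void main(String[] args) {
        char[] chars = {'a', 'b', 'c'};
        double[] xs = {0.0, 5.0, 13.0};    // made-up positions: 'c' starts 3 units after 'b' ends
        double[] widths = {5.0, 5.0, 5.0}; // made-up widths

        RawPrinter raw = new RawPrinter();
        raw.run(chars, xs, widths);
        System.out.println(raw.seen);       // prints "abc": no spaces were inferred

        ProcessedPrinter processed = new ProcessedPrinter();
        processed.run(chars, xs, widths);
        System.out.println(processed.seen); // prints "ab c"
    }
}
```

In the toy model, the fix is to leave the input hook alone and read the result
from the output hook, which is the same idea as following the
PrintTextLocations.java example.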
> 
>> On 17 Mar 2016, at 04:44, Hesham G. <[email protected]>  wrote:
>> 
>> Andreas,
>> 
>> You're absolutely right. I am testing it now, but it seems very
>> complicated. I hope there is an easier solution.
>> 
>> 
>> Best regards ,
>> Hesham
>> 
>> ------------------------------------------------------------------------
>> Included message :
>> 
>>> "Hesham G." <[email protected]> wrote on 17 March 2016 at 11:20:
>>> 
>>> 
>>> Andreas,
>>> 
>>> That is very helpful.
>>> 
>>> I can get the x location of each character using TextPosition.getX(), e.g.:
>>> W: 102.88399
>>> i: 114.18165
>>> t: 117.660614
>>> h: 121.55801
>>> d: 133.09477
>>> u: 140.3994
>>> e: 147.60838
>>> 
>>> So to detect the space between the two words "With" and "due", should I
>>> subtract the X of the last letter (h) from the X of the first letter (d),
>>> and treat the result as a space if it is larger than normal? I think
>>> detecting spaces this way might be unreliable. What do you think?
>> That's the short story. Deciding what is normal can be quite tricky. You
>> have to take the following into account:
>> 
>> - different fonts have different widths (important if the font before the
>> space isn't the same as the font after the space)
>> - you have to account for scaling and sometimes rotation
>> - the "space" between characters may vary if the text is justified
>> 
>> There are certainly other details that may be important as well, so you
>> end up with a more or less heuristic approach.
>> 
>> BR
>> Andreas
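
The X values quoted in this thread show why the glyph widths matter. The step
from "W" to "i" inside a word is about 11.30 units, and the step from "h" to "d"
across the space is about 11.54 units, so comparing start positions alone
cannot separate the two cases. A minimal sketch of a gap-based check follows.
The X coordinates are the ones posted above; the glyph widths, the space width
of 4.3 and the 0.5 factor are assumptions made up for illustration, and
GapSpaceSketch and inferSpaces are hypothetical names, not PDFBox API. It
ignores rotation, font changes and justification.

```java
/** Sketch of gap-based space detection; class and method names are hypothetical. */
class GapSpaceSketch {

    /**
     * Rebuilds a string from glyphs, inserting a space wherever the gap between
     * the end of one glyph (x + width) and the start of the next exceeds the threshold.
     */
    static String inferSpaces(char[] chars, double[] xs, double[] widths, double threshold) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            if (i > 0) {
                double gap = xs[i] - (xs[i - 1] + widths[i - 1]);
                if (gap > threshold) {
                    out.append(' ');
                }
            }
            out.append(chars[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] chars = {'W', 'i', 't', 'h', 'd', 'u', 'e'};
        // X coordinates as posted in the thread.
        double[] xs = {102.88399, 114.18165, 117.660614, 121.55801,
                       133.09477, 140.3994, 147.60838};
        // ASSUMED glyph widths, invented for this example.
        double[] widths = {11.0, 3.5, 3.9, 7.2, 7.3, 7.2, 5.8};
        // ASSUMED width of a space in the same units.
        double spaceWidth = 4.3;

        // With widths: only the gap after 'h' exceeds half a space width.
        System.out.println(inferSpaces(chars, xs, widths, 0.5 * spaceWidth)); // prints "With due"

        // Without widths (all zero), start-to-start distances are compared instead,
        // and the wide 'W' is mistaken for a word boundary.
        System.out.println(inferSpaces(chars, xs, new double[7], 10.0)); // prints "W ith due"
    }
}
```

Scaling the threshold by the space width of the current font, rather than
using one fixed number, is one way to cope with the font and scaling
differences listed above, but it remains a heuristic.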
>> 
>>> Best regards ,
>>> Hesham
>>> 
>>> ------------------------------------------------------------------------
>>> Included message :
>>> 
>>> Hi,
>>> 
>>>> Frank van der Hulst <[email protected]> wrote on 17 March 2016 at 08:34:
>>>> 
>>>> 
>>>> Spaces don't exist as characters in PDFs. To identify spaces, you have to
>>>> compare the X coordinates of adjacent characters against their widths.
>>> That's not correct. Spaces exist, but in most cases PDF engines omit them
>>> and replace them with split text that is positioned appropriately.
>>> 
>>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>>> 
>>>  [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
>>>  (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority)
>>>  -383 (has) -384 (the) -383 (right) ] TJ
>>> 
>>> The text is between the parentheses, and the numbers are used for
>>> horizontal positioning.
>>> 
>>> BR
>>> Andreas
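
To make the excerpt concrete: in a TJ array the numbers are expressed in
thousandths of a text-space unit and are subtracted from the current position,
so a large negative number such as -383 moves the next string to the right and
acts as a word gap, while small values such as 55 or 18 are kerning. The sketch
below walks the array and emits a space for any adjustment below a cutoff. The
cutoff of -200 is an assumption chosen for this excerpt, TjArraySketch and
decode are hypothetical names, and the parser assumes the string bytes are
directly readable, which holds for the excerpt as quoted but not for PDFs in
general. Hex strings and octal escapes are not handled.

```java
/** Sketch of reading a TJ array as text; names and the cutoff are hypothetical. */
class TjArraySketch {

    /**
     * Concatenates the literal strings of a TJ array and inserts a space
     * whenever a numeric adjustment is below spaceCutoff (i.e. strongly negative).
     */
    static String decode(String tjArray, double spaceCutoff) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = tjArray.length();
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy until the matching ')', honoring backslash escapes.
                i++;
                int depth = 1;
                while (i < n) {
                    char d = tjArray.charAt(i);
                    if (d == '\\' && i + 1 < n) {
                        out.append(tjArray.charAt(i + 1)); // \( \) \\ become ( ) \
                        i += 2;
                        continue;
                    }
                    if (d == '(') {
                        depth++;
                    } else if (d == ')') {
                        depth--;
                        if (depth == 0) {
                            i++;
                            break;
                        }
                    }
                    out.append(d);
                    i++;
                }
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Number: a positioning adjustment in thousandths of a text-space unit.
                int start = i;
                i++;
                while (i < n && (Character.isDigit(tjArray.charAt(i)) || tjArray.charAt(i) == '.')) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment < spaceCutoff) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace and the TJ operator itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tj = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        // prints "With due regard to Article (219), the competent authority has the right"
        System.out.println(decode(tj, -200));
    }
}
```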
>>> 
>>>> 
>>>> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <[email protected]> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> I have a PDF file created using LaTeX. I am trying to read and print all
>>>>> letters in that file using PDFBox, but when doing this all spaces in that
>>>>> file are ignored. Here is the code I am using:
>>>>> PDPage page = (PDPage) allPages.get(0);
>>>>> PDStream contents = page.getContents();
>>>>> if (contents != null) {
>>>>>     PDFTextStripperProcessor pdfTextStripperProcessor =
>>>>>             new PDFTextStripperProcessor();
>>>>>     pdfTextStripperProcessor.processStream(page, page.findResources(),
>>>>>             contents.getStream());
>>>>> }
>>>>> 
>>>>> public class PDFTextStripperProcessor extends PDFTextStripper {
>>>>>     @Override
>>>>>     public void processTextPosition(TextPosition text) {
>>>>>         System.out.println(text.getCharacter());
>>>>>     }
>>>>> }
>>>>> 
>>>>> You can test it with this one-page sample file:
>>>>> 
>>>>> https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>>>>> 
>>>>> What is the cause of this issue please?
>>>>> 
>>>>> 
>>>>> Best regards ,
>>>>> Hesham
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail:  [email protected]
