RE: getText() performance in PDFBox 1.5 release

Zhang, Lisheng Fri, 04 Nov 2011 12:34:48 -0700

Hi Mike,

Thanks very much. I tested it and the result is the same. From the source
code, it seems that the suppressDuplicateOverlappingText parameter has no
effect when I call PDFTextStripper.getText(..) directly. I will look further
to see whether I can use the processEncodedText(..) method.


Which version of PDFBox did you use (Tika has not used PDFBox 1.5 yet)?

Best regards, Lisheng

-----Original Message-----
From: Michael McCandless [mailto:[email protected]]
Sent: Friday, November 04, 2011 10:39 AM
To: [email protected]
Subject: Re: getText() performance in PDFBox 1.5 release


Is it possible you're hitting this issue?

    https://issues.apache.org/jira/browse/PDFBOX-956

Try setting suppressDuplicateOverlappingText to false and see whether it
changes the extraction time.
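
One way to run that comparison is to time the same extraction twice, once per flag value. The following is only a minimal sketch: the names ExtractionTimer, Timing and time are hypothetical, and the stand-in callable exists only so the example runs without PDFBox on the classpath. The PDFBox calls shown in the comment (PDDocument.load, setSuppressDuplicateOverlappingText, getText) are assumed to match the 1.x API and should be checked against the release in use.

```java
import java.util.concurrent.Callable;

public class ExtractionTimer {

    /** Result of one timed run: elapsed wall-clock time and extracted text length. */
    public static final class Timing {
        public final long millis;
        public final int chars;

        Timing(long millis, int chars) {
            this.millis = millis;
            this.chars = chars;
        }
    }

    /** Runs the given extraction once and measures how long it takes. */
    public static Timing time(Callable<String> extraction) throws Exception {
        long start = System.nanoTime();
        String text = extraction.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timing(elapsedMillis, text == null ? 0 : text.length());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in extraction (an assumption, not real PDF processing).
        // With PDFBox 1.x the callable body would look roughly like:
        //   PDDocument doc = PDDocument.load(new File(path));
        //   try {
        //       PDFTextStripper stripper = new PDFTextStripper();
        //       stripper.setSuppressDuplicateOverlappingText(suppress);
        //       return stripper.getText(doc);
        //   } finally {
        //       doc.close();
        //   }
        // Run it once with suppress = true and once with suppress = false.
        Callable<String> standIn = () -> "extracted text";

        Timing t = time(standIn);
        System.out.println(t.chars + " chars in " + t.millis + " ms");
    }
}
```

If the two timings are close, the duplicate-suppression pass is probably not the bottleneck and the cause lies elsewhere.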

Mike McCandless

http://blog.mikemccandless.com

On Fri, Nov 4, 2011 at 12:23 PM, Zhang, Lisheng
<[email protected]> wrote:
> Hi,
>
> I have been using PDFBox to extract text from PDF files for full-text
> search for a few years and have found it to be a great product. Recently I
> downloaded PDFBox 1.5 and found that it can extract text from many PDF
> files that could not be processed previously, thanks!!
>
> The problem I have is that PDFTextStripper.getText(..) takes a long time
> to finish. For example, our client has a 27MB PDF file that contains some
> graphics; getText(..) took 50m to finish, even though it eventually
> extracted only 100K of text.
>
> I tried changing the input parameters and the results are essentially the
> same. I would like to know whether this speed is expected and whether it
> can be improved.
>
> Thanks very much for your help, Lisheng
>
