Hi Mike,

Thanks very much. I tested it and the result is the same. From the source code, it seems that the suppressDuplicateOverlappingText parameter has no effect if I call PDFTextStripper.getText(..) directly. I will look further to see whether I can use the method processEncodedText(..).
Which version of PDFBox did you use? (Tika has not used PDFBox 1.5 yet.)

Best regards, Lisheng

-----Original Message-----
From: Michael McCandless [mailto:[email protected]]
Sent: Friday, November 04, 2011 10:39 AM
To: [email protected]
Subject: Re: getText() performance in PDFBox 1.5 release

Is it possible you're hitting this issue?

https://issues.apache.org/jira/browse/PDFBOX-956

Try setting suppressDuplicateOverlappingText to false and see if it changes the extraction time?

Mike McCandless

http://blog.mikemccandless.com

On Fri, Nov 4, 2011 at 12:23 PM, Zhang, Lisheng <[email protected]> wrote:
> Hi,
>
> I have been using PDFBox to extract text from PDF files for full text search
> for a few years, and have found it to be a great product. Recently I downloaded
> PDFBox 1.5 and found that it can extract text from many PDF files which could
> not be processed previously, thanks!!
>
> The problem I have is that it takes a long time for PDFTextStripper.getText(..)
> to finish. For example, our client has a 27MB PDF file which contains some
> graphics; it took getText(..) 50m to finish even though it only extracts 100K
> of text eventually.
>
> I tried changing the input parameters and the results are essentially the same.
> I would like to know if this speed is expected and whether it can be improved.
>
> Thanks very much for your help, Lisheng
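
For anyone wanting to repeat the comparison suggested above (extraction time with suppressDuplicateOverlappingText on versus off), here is a minimal timing sketch. The class `ExtractionTimer`, its `Timing` holder, and the file name `sample.pdf` are hypothetical names made up for illustration. The harness itself uses only the JDK; the PDFBox calls appear only in a comment and assume the PDFBox 1.x API (`PDDocument.load(File)`, `PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)`, `PDFTextStripper.getText(PDDocument)`).

```java
import java.util.concurrent.Callable;

// Hypothetical helper: times one text-extraction call and records how much text came back.
class ExtractionTimer {

    /** Outcome of one timed extraction run. */
    static final class Timing {
        final long elapsedMillis;
        final int textLength;

        Timing(long elapsedMillis, int textLength) {
            this.elapsedMillis = elapsedMillis;
            this.textLength = textLength;
        }
    }

    /** Runs the given extraction once and measures wall-clock time with System.nanoTime(). */
    static Timing time(Callable<String> extraction) throws Exception {
        long start = System.nanoTime();
        String text = extraction.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        // A null result is counted as zero characters.
        return new Timing(elapsedMillis, text == null ? 0 : text.length());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in extraction so the sketch runs without PDFBox on the classpath.
        // With PDFBox 1.x available, one would presumably time something like:
        //
        //   PDDocument doc = PDDocument.load(new File("sample.pdf")); // hypothetical file
        //   PDFTextStripper stripper = new PDFTextStripper();
        //   stripper.setSuppressDuplicateOverlappingText(false);
        //   Timing t = time(() -> stripper.getText(doc));
        //   doc.close();
        Timing t = time(() -> "placeholder text");
        System.out.println(t.textLength + " chars in " + t.elapsedMillis + " ms");
    }
}
```

Running the same document once with the flag set to true and once with it set to false, and comparing the two `elapsedMillis` values, would show whether the setting affects extraction time for that file.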

