Hi,

Using PDFBox 1.6 solved my problem: the extraction time dropped to 33s (with 1.5 it was 50m). The suppressDuplicateOverlappingText parameter did not make much difference, I guess because my PDF does not contain much overlapping text (the resulting TXT is slightly different, but not by much).
Thanks very much for the help!!!

Lisheng

-----Original Message-----
From: Zhang, Lisheng [mailto:[email protected]]
Sent: Friday, November 04, 2011 4:02 PM
To: [email protected]
Subject: RE: getText() performance in PDFBox 1.5 release

Thanks very much for pointing that out!!!

I downloaded Tika 0.10 a few days ago, and the CHANGES.txt that came with it did not mention PDFBox 1.6; based on that CHANGES.txt I thought Tika used 1.4. I will download PDFBox 1.6 and retest.

Best regards, Lisheng

-----Original Message-----
From: Andreas Lehmkuehler [mailto:[email protected]]
Sent: Friday, November 04, 2011 3:40 PM
To: [email protected]
Subject: Re: getText() performance in PDFBox 1.5 release

Hi,

On 04.11.2011 20:34, Zhang, Lisheng wrote:
> Hi Mike,
>
> Thanks very much, I tested and result is the same, from source code
> it seems that suppressDuplicateOverlappingText parameter does not
> have effect if I call PDFTextStripper.getText(..) directly. I will
> check more to see if I can use method processEncodedText(..).
>
> Which version of PDFBox did you use (Tika has not used PDFBox 1.5 yet)?

According to [1], Tika 0.10 uses PDFBox 1.6, which includes some improvements related to performance.

> Best regards, Lisheng
>
<SNIP>

BR
Andreas Lehmkühler

[1] http://www.apache.org/dist/tika/CHANGES-0.10.txt
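
For anyone who wants to repeat the comparison between PDFBox versions or between suppressDuplicateOverlappingText settings, here is a minimal timing sketch. The class name ExtractionTimer and the stand-in task are made up for illustration, so the code runs without PDFBox on the classpath; the PDFBox calls are only indicated in a comment and assume the 1.x API (PDFTextStripper.getText(PDDocument) and setSuppressDuplicateOverlappingText(boolean)).

```java
import java.util.concurrent.Callable;

public class ExtractionTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(Callable<String> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in task (hypothetical) so the sketch runs on its own.
        // With PDFBox 1.x on the classpath, the task body would be roughly:
        //   PDFTextStripper stripper = new PDFTextStripper();
        //   stripper.setSuppressDuplicateOverlappingText(false);
        //   return stripper.getText(document);  // document: an already loaded PDDocument
        Callable<String> task = () -> {
            Thread.sleep(50);
            return "extracted text";
        };
        System.out.println("getText() took " + timeMillis(task) + " ms");
    }
}
```

Running the same task once per PDFBox version (or once per flag value) on the same PDF should give numbers comparable to the ones reported in this thread. A single run includes class loading and JIT warm-up, so for short documents it may be worth repeating the measurement a few times.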

