Re: Fwd: Very slow PDF parsing.

Tim Allison Wed, 27 Feb 2019 04:47:42 -0800

Any chance you could try with the 2.0.14 release candidate...unless you
have already?


https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/


On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:

> Well, I ran the PDFBox app to extract text (as was suggested); 2 hours so far
> and still counting...
> It seems to be a PDFBox issue.
>
> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
>
>> Why don't you do a basic test with tika-server running separately and a
>> *wget* or *curl* bash client to parse your 65 MB PDF?
>> That may make the problem easier to investigate.
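
A rough sketch of the same check done from Java instead of curl. It assumes a tika-server instance is already listening on its default port 9998 and that its `/tika` endpoint accepts a PUT of the raw document and returns plain text when sent `Accept: text/plain`. The class name `TikaServerProbe`, the 30-minute timeout, and the command-line argument are illustrative choices, not anything taken from this thread.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

public class TikaServerProbe {

    /** Builds a PUT request against <baseUrl>/tika that asks for plain text back. */
    static HttpRequest buildRequest(String baseUrl, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .timeout(Duration.ofMinutes(30)) // assumed limit; tune for your file
                .PUT(body)
                .build();
    }

    public static void main(String[] args) throws Exception {
        // usage: java TikaServerProbe /path/to/file.pdf
        Path pdf = Path.of(args[0]);
        HttpRequest request = buildRequest("http://localhost:9998",
                HttpRequest.BodyPublishers.ofFile(pdf));

        long start = System.nanoTime();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("HTTP " + response.statusCode() + ", "
                + response.body().length() + " chars in " + elapsedMs + " ms");
    }
}
```

If the request times out here as well, the slowdown is independent of how the calling application sets up the parser.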
>>
>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>
>>
>>
>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
>>
>>> Just looking at the stack trace: it won't be the same anymore because of
>>> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and it
>>> changes how decryption is handled. Not sure if it's related, though.
>>>
>>> Can you reproduce the problem without Tika, using just the PDFBox
>>> command-line ExtractText tool (
>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
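
A minimal sketch of timing that command with a hard timeout, so that a run which hangs for hours gets cut off instead of being watched by hand. It assumes a PDFBox app jar is available locally and is invoked as `java -jar <jar> ExtractText <input> <output>`, following the command-line page linked above. The class name `ExtractTextTimer`, the jar path argument, and the 30-minute limit are made up for illustration.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ExtractTextTimer {

    /** Builds: java -jar <pdfboxAppJar> ExtractText <inputPdf> <outputTxt> */
    static List<String> buildCommand(String pdfboxAppJar, String inputPdf, String outputTxt) {
        return List.of("java", "-jar", pdfboxAppJar, "ExtractText", inputPdf, outputTxt);
    }

    /** Runs the command; returns elapsed milliseconds, or -1 if it was killed on timeout. */
    static long runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (!process.waitFor(timeout, unit)) {
            process.destroyForcibly(); // still running after the limit: stop it
            return -1;
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // usage: java ExtractTextTimer <pdfbox-app jar> <input.pdf> <output.txt>
        List<String> command = buildCommand(args[0], args[1], args[2]);
        long elapsedMs = runWithTimeout(command, 30, TimeUnit.MINUTES);
        System.out.println(elapsedMs < 0
                ? "timed out after 30 minutes"
                : "finished in " + elapsedMs + " ms");
    }
}
```

Running the same file against two different PDFBox app jars this way gives a direct before/after comparison without Tika in the picture.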
>>>
>>>
>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>>
>>>> This is the code:
>>>> // inputFile and textHandler are defined elsewhere
>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>>>> PDFParser tmpPdf = new PDFParser();
>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> config.setMaxMainMemoryBytes(31457280); // 30 MB
>>>> config.setExtractAcroFormContent(false);
>>>> config.setExtractBookmarksText(false);
>>>> config.setCatchIntermediateIOExceptions(true);
>>>> Metadata metadata = new Metadata();
>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>>>
>>>>
>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> The code below is the default path in Tika, where maxMainMemoryBytes
>>>>> defaults to 500MB.
>>>>>
>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>>> tika-app or tika-server or something else?
>>>>>
>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>     memoryUsageSetting =
>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>> }
>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>     // File based -- send file directly to PDFBox
>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>             memoryUsageSetting);
>>>>> } else {
>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>             password, memoryUsageSetting);
>>>>> }
>>>>>
>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>> profiler.
>>>>>>
>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>
>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>> password.
>>>>>>
>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>> MemoryUsageSetting when load() is called.
>>>>>>
>>>>>> Tilman
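
The repeated `HashMap$TreeNode.find` frames under `HashSet.contains` in the stack trace are what a lookup looks like when many keys share one hash bucket. The sketch below shows that general mechanism with plain JDK classes only; it is not a confirmed diagnosis of this PDF and it does not reproduce `COSString` itself. The key classes `CollidingKey` and `SpreadKey` are hypothetical stand-ins, and the deliberately constant `hashCode` is an exaggeration chosen to make the effect obvious.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {

    /** Counts every equals() call made by either key type. */
    static long equalsCalls = 0;

    /** Hypothetical key with a constant hash: every instance lands in one bucket. */
    static final class CollidingKey {
        final int id;
        CollidingKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof CollidingKey && ((CollidingKey) o).id == id;
        }
    }

    /** Same key, but with a hash that spreads instances across buckets. */
    static final class SpreadKey {
        final int id;
        SpreadKey(int id) { this.id = id; }
        @Override public int hashCode() { return Integer.hashCode(id); }
        @Override public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof SpreadKey && ((SpreadKey) o).id == id;
        }
    }

    /** Fills a HashSet with n keys, then counts equals() calls for one failed lookup. */
    static long missCost(boolean colliding, int n) {
        Set<Object> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(colliding ? new CollidingKey(i) : new SpreadKey(i));
        }
        Object probe = colliding ? new CollidingKey(-1) : new SpreadKey(-1);
        equalsCalls = 0; // ignore the calls made while filling the set
        seen.contains(probe);
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("equals() calls per miss, colliding hash: " + missCost(true, n));
        System.out.println("equals() calls per miss, spread hash:    " + missCost(false, n));
    }
}
```

With colliding, non-Comparable keys the lookup has to walk the whole bucket tree and call `equals` on each entry, so a set that is consulted once per decrypted object turns load time roughly quadratic in the number of objects.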
>>>>>>
>>>>>>
>>>>>>
>>>>>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
>>>>>> > PDFBox Colleagues,
>>>>>> >    Any ideas?
>>>>>> >
>>>>>> > ---------- Forwarded message ---------
>>>>>> > From: Tim Allison <[email protected]>
>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>> > To: <[email protected]>
>>>>>> >
>>>>>> >
>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>> > isn't your problem, though.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>>>> >
>>>>>> >> Thanks Tim,
>>>>>> >> But frankly speaking, and I'm ashamed to admit it, I don't know what
>>>>>> >> tesseract is in this context 🙂
>>>>>> >>
>>>>>> >> Thanks
>>>>>> >>
>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>>>> >>
>>>>>> >>> Thank you, Slava!
>>>>>> >>>
>>>>>> >>> Do you have tesseract installed?
>>>>>> >>>
>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>> >>>
>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
>>>>>> >>>> images.
>>>>>> >>>>
>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM, and an SSD,
>>>>>> >>>> running CentOS Linux).
>>>>>> >>>> Please advise if there is anything I can do to speed this up. Or maybe
>>>>>> >>>> it's a bug in PDFBox?
>>>>>> >>>> Whenever I print the Java stack, I see this stack:
>>>>>> >>>>
>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>> >>>>
>>>>>> >>>> Thanks
>>>>>>
>>>>>>
>>>>>>
