Re: Fwd: Very slow PDF parsing.

Slava G Wed, 27 Feb 2019 05:30:12 -0800

With 2.0.14 it has been running for 40 minutes with no result and is still working...
It seems the issue is still there.
Thanks


On Wed, Feb 27, 2019 at 2:52 PM Slava G <[email protected]> wrote:

> Checking with 2.0.14. Started as an app. Will update soon.
>
> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:
>
>> Any chance you could try with the 2.0.14 release candidate...unless you
>> have already?
>>
>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>
>>
>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:
>>
>>> Well, I ran the PDFBox app to extract text (as was suggested); so far 2
>>> hours and still counting...
>>> It seems to be a PDFBox issue.
>>>
>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
>>>
>>>> Why don't you do a basic test with tika-server in a thread and a *wget*
>>>> or *curl* bash client to parse your 65 MB PDF?
>>>> It may make the problem easier to investigate.
>>>>
>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>
>>>>
>>>>
>>>> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <[email protected]> wrote:
>>>>
>>>>> Just looking at the stack trace, it won't be the same anymore due to
>>>>> PDFBOX-4453.
>>>>> That change is in the not-yet-released PDFBox 2.0.14 and alters
>>>>> how decryption is handled. Not sure if it's related, though.
>>>>>
>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>> command-line ExtractText command (
>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>>>>
>>>>>> This is the code:
>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 * 1024 * 1024 bytes
>>>>>> config.setExtractAcroFormContent(false);
>>>>>> config.setExtractBookmarksText(false);
>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>> Metadata metadata = new Metadata();
>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>> // pass the same stream and metadata objects that were set up above
>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> This is what Tika does by default (see the code below), where
>>>>>>> maxMainMemoryBytes defaults to 500MB.
>>>>>>>
>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>>>>> tika-app or tika-server or something else?
>>>>>>>
>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>     memoryUsageSetting =
>>>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>> }
>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>>             memoryUsageSetting);
>>>>>>> } else {
>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>             password, memoryUsageSetting);
>>>>>>> }
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>>>> profiler.
>>>>>>>>
>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>
>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>> password.
>>>>>>>>
>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>> > PDFBox Colleagues,
>>>>>>>> >    Any ideas?
>>>>>>>> >
>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>> > From: Tim Allison <[email protected]>
>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>> > To: <[email protected]>
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>>>> > isn't your problem, though.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> >> Thanks Tim,
>>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what tesseract is in
>>>>>>>> >> this context 🙂
>>>>>>>> >>
>>>>>>>> >> Thanks
>>>>>>>> >>
>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >>> Thank you, Slava!
>>>>>>>> >>>
>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>> >>>
>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>> >>>
>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>>>>>> >>>> Hi,
>>>>>>>> >>>>
>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
>>>>>>>> >>>>
>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
>>>>>>>> >>>> running CentOS Linux).
>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe it's
>>>>>>>> >>>> a bug in PDFBox?
>>>>>>>> >>>> Whenever I print the Java stack, I see it in this stack:
>>>>>>>> >>>>
>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks

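Note on the stack trace in the original report: the time is spent in
HashMap$TreeNode.find calling COSString.equals underneath HashSet.contains.
The sketch below illustrates one general way that pattern can arise. It is
not a confirmed diagnosis of this PDF and it does not use PDFBox. Everything
in it is hypothetical: the Key class, the constantHash flag and the
equalsCallsForMiss helper exist only for illustration. When every key in a
HashSet has the same hash code and the key type is not Comparable, a failed
lookup has to call equals() on every stored entry. For contrast, an
identity-based set (Collections.newSetFromMap over an IdentityHashMap) never
calls equals(), but it treats distinct objects with equal values as
different, which may or may not be acceptable for an "already processed"
check.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

class CollidingKeysDemo {
    /** Global counter of equals() invocations on Key instances. */
    static long equalsCalls = 0;

    /** Hypothetical key: value-based equals(), deliberately not Comparable. */
    static final class Key {
        final String value;
        final boolean constantHash;

        Key(String value, boolean constantHash) {
            this.value = value;
            this.constantHash = constantHash;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).value.equals(value);
        }

        @Override
        public int hashCode() {
            // constantHash forces every key into the same hash bucket
            return constantHash ? 42 : value.hashCode();
        }
    }

    /**
     * Fills the given set with n distinct keys, then counts how many
     * equals() calls a single lookup of an absent key triggers.
     */
    static long equalsCallsForMiss(Set<Key> set, int n, boolean constantHash) {
        for (int i = 0; i < n; i++) {
            set.add(new Key("k" + i, constantHash));
        }
        equalsCalls = 0; // ignore the calls made while filling the set
        set.contains(new Key("missing", constantHash));
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        long spread = equalsCallsForMiss(new HashSet<>(), n, false);
        long colliding = equalsCallsForMiss(new HashSet<>(), n, true);
        Set<Key> identitySet =
                Collections.newSetFromMap(new IdentityHashMap<Key, Boolean>());
        long identity = equalsCallsForMiss(identitySet, n, true);

        System.out.println("well-spread hashes: " + spread + " equals() calls");
        System.out.println("colliding hashes:   " + colliding + " equals() calls");
        System.out.println("identity set:       " + identity + " equals() calls");
    }
}
```

Running main prints the three counts. With colliding hashes the count for a
single failed lookup grows with the number of stored entries, and filling
such a set is slower still because each add() performs a similar search
first.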