Re: Fwd: Very slow PDF parsing.

JB Data31 Tue, 26 Feb 2019 23:51:38 -0800

Why don't you run a basic test with tika-server in a separate thread, using a
*wget* or *curl* bash client to parse your 65 MB PDF?
That could make the problem easier to investigate.
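
For anyone who would rather drive the same test from Java than from bash,
here is a minimal sketch. It assumes tika-server is already running on its
default port 9998 and that the /tika endpoint accepts a PUT with
"Accept: text/plain"; the class name TikaServerTimer and the helpers
buildRequest and open are made up for illustration. It needs Java 11+ for
java.net.http.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerTimer {

    /** Builds a PUT request that streams the PDF to the given endpoint. */
    static HttpRequest buildRequest(URI endpoint, Path pdf) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                // The file is opened lazily, only when the request is sent.
                .PUT(HttpRequest.BodyPublishers.ofInputStream(() -> open(pdf)))
                .build();
    }

    /** Opens the file, converting the checked exception for use in a lambda. */
    static InputStream open(Path pdf) {
        try {
            return Files.newInputStream(pdf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Path pdf = Path.of(args[0]);
        // Assumed default tika-server address; pass another URL as the second argument.
        URI endpoint = URI.create(args.length > 1 ? args[1] : "http://localhost:9998/tika");

        HttpClient client = HttpClient.newHttpClient();
        long start = System.nanoTime();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint, pdf), HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("HTTP " + response.statusCode() + ", "
                + response.body().length() + " chars of text in " + elapsedMs + " ms");
    }
}
```

Run it with the path to the PDF as the first argument. If the request takes
just as long here, the slowdown is unlikely to be specific to the calling
application.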


@*JB*Δ <http://jbigdata.fr/jbigdata/index.html>



On Tue, Feb 26, 2019 at 23:05, Cristian Vat <[email protected]> wrote:

> Just looking at the stack trace: it won't be the same anymore because of
> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
> alters how decryption is handled. Not sure if it's related, though.
>
> Can you reproduce the problem without Tika, using just PDFBox's command-line
> ExtractText tool ( https://pdfbox.apache.org/2.0/commandline.html ) on
> that file?
>
>
> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>
>> This is the code:
>> InputStream in = TikaInputStream.get(inputFile.toPath());
>> PDFParser tmpPdf = new PDFParser();
>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> config.setMaxMainMemoryBytes(31457280); // 30 MiB (30 * 1024 * 1024)
>> config.setExtractAcroFormContent(false);
>> config.setExtractBookmarksText(false);
>> config.setCatchIntermediateIOExceptions(true);
>> Metadata metadata = new Metadata();
>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> // textHandler is the ContentHandler that receives the extracted text
>> tmpPdf.parse(in, textHandler, metadata, new ParseContext());
>>
>>
>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>
>>>
>>> This is the default in Tika, where the default for
>>> maxMainMemoryBytes=500MB.
>>>
>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>> tika-app or tika-server or something else?
>>>
>>> MemoryUsageSetting memoryUsageSetting =
>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>     memoryUsageSetting =
>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> }
>>> if (tstream != null && tstream.hasFile()) {
>>>     // File based -- send file directly to PDFBox
>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>             memoryUsageSetting);
>>> } else {
>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>             password, memoryUsageSetting);
>>> }
>>>
>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>> profiler.
>>>>
>>>> The HashSet is used to avoid decrypting objects twice.
>>>>
>>>> The "not encrypted" file is likely encrypted with an empty user
>>>> password.
>>>>
>>>> It would also be interesting to hear what parameter is passed to
>>>> MemoryUsageSetting when load() is called.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>
>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>> > PDFBox Colleagues,
>>>> >    Any ideas?
>>>> >
>>>> > ---------- Forwarded message ---------
>>>> > From: Tim Allison <[email protected]>
>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> > Subject: Re: Very slow PDF parsing.
>>>> > To: <[email protected]>
>>>> >
>>>> >
>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>> > dramatically is if you have tesseract installed (try typing
>>>> > 'tesseract' on your commandline) and if you've turned it on for PDFs.
>>>> > I suspect this isn't your problem, though.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>> >
>>>> >> Thanks Tim,
>>>> >> Frankly, I'm ashamed to admit it, but I don't know what tesseract is
>>>> >> in this context 🙂
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>> >>
>>>> >>> Thank you, Slava!
>>>> >>>
>>>> >>> Do you have tesseract installed?
>>>> >>>
>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>
>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
>>>> >>>> images.
>>>> >>>>
>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD,
>>>> >>>> running CentOS Linux).
>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>>>> >>>> it's a bug in PDFBox?
>>>> >>>> When I print the Java stack, I always see it in this stack:
>>>> >>>>
>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>> >>>>
>>>> >>>>
>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>> >>>>
>>>> >>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>
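
A note on reading the quoted stack trace: a run of recursive
HashMap$TreeNode.find frames ending in equals() is typically what a HashSet
lookup looks like when many keys share the same hash code and the key class
is not Comparable, because the tree bin then has to be searched on both
sides. The sketch below reproduces that pattern with the JDK alone. The Key
class, its constant hash code and countLookupEquals are hypothetical names
for illustration; this is not PDFBox code, and it does not establish that
this is the cause of the slowdown reported in this thread.

```java
import java.util.HashSet;
import java.util.Set;

public class HashCollisionDemo {

    /** Counts equals() calls so the effect is visible without timing. */
    static long equalsCalls = 0;

    /** Hypothetical key type; deliberately not Comparable. */
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // When collide is true, every key lands in the same bucket.
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /** Fills a HashSet with n keys, then counts equals() calls made by n lookups. */
    static long countLookupEquals(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0; // ignore calls made while inserting
        for (int i = 0; i < n; i++) {
            // Look up a fresh but equal instance so equals() is actually needed.
            if (!seen.contains(new Key(i, collide))) {
                throw new IllegalStateException("missing key " + i);
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("distinct hashes:  " + countLookupEquals(n, false) + " equals() calls");
        System.out.println("colliding hashes: " + countLookupEquals(n, true) + " equals() calls");
    }
}
```

Running main prints both counts; the colliding case needs far more equals()
calls for the same number of lookups, and the gap widens as n grows.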
