Re: Fwd: Very slow PDF parsing.

Slava G Wed, 27 Feb 2019 00:04:49 -0800

Well, I ran the PDFBox app to extract the text, as suggested. It has been
running for 2 hours so far and is still going...
It seems to be a PDFBox issue.


On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:

> Why don't you run a basic test with tika server in a 3rd and a *wget* or
> *curl* bash client to parse your 65 MB PDF?
> That may make the problem easier to investigate.
>
> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>
>
>
> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
>
>> Just looking at the stack trace: it won't look the same anymore because of
>> PDFBOX-4453.
>> That change is in the not-yet-released PDFBox 2.0.14 and alters how
>> decryption is handled. Not sure whether it is related, though.
>>
>> Can you reproduce the problem without Tika, using just the PDFBox
>> command-line ExtractText command
>> ( https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>
>>
>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>
>>> This is the code:
>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>>> PDFParser tmpPdf = new PDFParser();
>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>> // 31457280 bytes = 30 MB of main memory
>>> config.setMaxMainMemoryBytes(31457280);
>>> config.setExtractAcroFormContent(false);
>>> config.setExtractBookmarksText(false);
>>> config.setCatchIntermediateIOExceptions(true);
>>> Metadata metadata = new Metadata();
>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>> // textHandler is a ContentHandler created elsewhere
>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>>
>>>
>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>>
>>>>
>>>> This is the default in Tika, where the default for
>>>> maxMainMemoryBytes=500MB.
>>>>
>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>> tika-app or tika-server or something else?
>>>>
>>>> MemoryUsageSetting memoryUsageSetting =
>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>     memoryUsageSetting =
>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> }
>>>> if (tstream != null && tstream.hasFile()) {
>>>>     // File based -- send file directly to PDFBox
>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>             memoryUsageSetting);
>>>> } else {
>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>             password, memoryUsageSetting);
>>>> }
>>>>
>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>> profiler.
>>>>>
>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>
>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>> password.
>>>>>
>>>>> It would also be interesting to hear what parameter is passed to
>>>>> MemoryUsageSetting when load() is called.
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>>
>>>>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
>>>>> > PDFBox Colleagues,
>>>>> >    Any ideas?
>>>>> >
>>>>> > ---------- Forwarded message ---------
>>>>> > From: Tim Allison <[email protected]>
>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>> > Subject: Re: Very slow PDF parsing.
>>>>> > To: <[email protected]>
>>>>> >
>>>>> >
>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract'
>>>>> > on your command line) and if you've turned it on for PDFs.  I suspect
>>>>> > this isn't your problem, though.
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>>> >
>>>>> >> Thanks Tim,
>>>>> >> But frankly speaking, I'm ashamed to say I don't know what tesseract
>>>>> >> is in this context 🙂
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>>> >>
>>>>> >>> Thank you, Slava!
>>>>> >>>
>>>>> >>> Do you have tesseract installed?
>>>>> >>>
>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>> >>>
>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
>>>>> >>>> images.
>>>>> >>>>
>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD
>>>>> >>>> disk, running CentOS Linux).
>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>>>>> >>>> it's a bug in PDFBox?
>>>>> >>>> Whenever I print the Java stack, I see it in this stack:
>>>>> >>>>
>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>> >>>>
>>>>> >>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
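
The stack trace quoted above shows HashSet.contains descending through many
HashMap$TreeNode.find frames and ending in COSString.equals. One general way a
hash lookup can become that expensive is when many keys share a hash code and
the key class is not Comparable: the bucket then has to be scanned key by key,
calling equals() each time. Whether that is what happened with this particular
PDF is not confirmed anywhere in the thread. The sketch below only illustrates
the mechanism with the JDK; HashCollisionDemo and Key are hypothetical names
and have nothing to do with PDFBox's actual classes.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: Key is a hypothetical class, not PDFBox's COSString.
class HashCollisionDemo {

    // Counts Key.equals() invocations so the cost can be measured without timing.
    static long equalsCalls = 0;

    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // Colliding keys all share one hash code; the others get distinct ones.
            return collide ? 42 : id;
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Key && ((Key) other).id == id;
        }
    }

    // Fills a HashSet with n keys, then returns how many equals() calls
    // a single contains() for an absent key needs.
    static long equalsCallsForMissingKey(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0; // ignore the calls made while filling the set
        if (seen.contains(new Key(n, collide))) {
            throw new IllegalStateException("key " + n + " was never added");
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("distinct hashes:  "
                + equalsCallsForMissingKey(n, false) + " equals() calls");
        System.out.println("identical hashes: "
                + equalsCallsForMissingKey(n, true) + " equals() calls");
    }
}
```

With identical hash codes, a single failed lookup has to compare against every
stored key, so a loop that performs one lookup per object grows quadratically
with the number of objects. A profiler run on the real file, as Tilman asked
for, would be needed to tell whether this is the actual cause.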
