Re: Fwd: Very slow PDF parsing.

Slava G Thu, 28 Feb 2019 06:04:59 -0800

I will, once I get the customer's approval.
Meanwhile I can do any checks, if you can specify what to check.
Thanks


On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <[email protected]> wrote:

> Any chance you can share the file directly with me or someone else on the
> PDFBox team?
>
> On Wed, Feb 27, 2019 at 11:24 AM Slava G <[email protected]> wrote:
>
> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> > Thanks
> >
> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <[email protected]> wrote:
> >
> >> With 2.0.14 it has been running for 40 minutes with no result; still working...
> >> It seems the issue is still there.
> >> Thanks
> >>
> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <[email protected]> wrote:
> >>
> >>> Checking with 2.0.14. Started as an app. Will update soon.
> >>>
> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:
> >>>
> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
> >>>> have already?
> >>>>
> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
> >>>>
> >>>>
> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:
> >>>>
> >>>>> Well, I ran the PDFBox app to extract text (as was suggested); so far 2
> >>>>> hours and still counting...
> >>>>> It seems to be a PDFBox issue.
> >>>>>
> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
> >>>>>
> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF?
> >>>>>> That may make the problem easier to investigate.
> >>>>>>
> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
> >>>>>>
> >>>>>>> Just looking at the stack trace: it won't be the same anymore due to
> >>>>>>> PDFBOX-4453.
> >>>>>>> Those changes are in the not-yet-released PDFBox 2.0.14 and alter
> >>>>>>> how decryption is handled. Not sure if it's related, though.
> >>>>>>>
> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
> >>>>>>> command-line ExtractText command (
> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> This is the code:
> >>>>>>>> // textHandler is a ContentHandler defined elsewhere (not shown)
> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
> >>>>>>>> PDFParser tmpPdf = new PDFParser();
> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
> >>>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 MB
> >>>>>>>> config.setExtractAcroFormContent(false);
> >>>>>>>> config.setExtractBookmarksText(false);
> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
> >>>>>>>> Metadata metadata = new Metadata();
> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
> >>>>>>>> tmpPdf.parse(in, textHandler, metadata, new ParseContext());
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This is the default in Tika, where the default for
> >>>>>>>>> maxMainMemoryBytes=500MB.
> >>>>>>>>>
> >>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
> >>>>>>>>> via tika-app or tika-server or something else?
> >>>>>>>>>
> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
> >>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
> >>>>>>>>>     memoryUsageSetting =
> >>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> >>>>>>>>> }
> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
> >>>>>>>>>     // File based -- send file directly to PDFBox
> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
> >>>>>>>>>         password, memoryUsageSetting);
> >>>>>>>>> } else {
> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
> >>>>>>>>>         password, memoryUsageSetting);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
> >>>>>>>>> [email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
> >>>>>>>>>> profiler.
> >>>>>>>>>>
> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
> >>>>>>>>>>
> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
> >>>>>>>>>> password.
> >>>>>>>>>>
> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
> >>>>>>>>>> MemoryUsageSetting when load() is called.
> >>>>>>>>>>
> >>>>>>>>>> Tilman
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
> >>>>>>>>>> > PDFBox Colleagues,
> >>>>>>>>>> >    Any ideas?
> >>>>>>>>>> >
> >>>>>>>>>> > ---------- Forwarded message ---------
> >>>>>>>>>> > From: Tim Allison <[email protected]>
> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
> >>>>>>>>>> > To: <[email protected]>
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract'
> >>>>>>>>>> > on your command line) and if you've turned it on for PDFs.  I suspect
> >>>>>>>>>> > this isn't your problem, though.
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
> >>>>>>>>>> >
> >>>>>>>>>> >> Thanks Tim,
> >>>>>>>>>> >> But frankly speaking, it's a shame, but I don't know what tesseract is
> >>>>>>>>>> >> in this context 🙂
> >>>>>>>>>> >>
> >>>>>>>>>> >> Thanks
> >>>>>>>>>> >>
> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
> >>>>>>>>>> >>
> >>>>>>>>>> >>> Thank you, Slava!
> >>>>>>>>>> >>>
> >>>>>>>>>> >>> Do you have tesseract installed?
> >>>>>>>>>> >>>
> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
> >>>>>>>>>> >>>
> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
> >>>>>>>>>> >>>> Hi,
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
> >>>>>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
> >>>>>>>>>> >>>> running CentOS Linux).
> >>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
> >>>>>>>>>> >>>> it's a bug in PDFBox?
> >>>>>>>>>> >>>> When I print the Java stack, it is always in this stack:
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> Thanks
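
The stack trace above sits in repeated HashMap$TreeNode.find calls underneath HashSet.contains, ending in an equals() call. As a general illustration of how that pattern can become expensive, here is a minimal sketch using only java.util. It does not use PDFBox, and the thread does not establish that COSString keys actually collide in this file. The class names CollidingKeysDemo and Key and the method equalsCallsForMiss are hypothetical, made up for this example. The sketch counts how many equals() calls one failed contains() lookup costs when every key shares a single hash value and the keys are not Comparable, compared with keys whose hashes are distinct.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {

    // Total equals() calls made on Key instances since the last reset.
    static long equalsCalls = 0;

    // Hypothetical key type: not Comparable, with an optionally constant hash.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // A constant hash forces every key into the same bucket.
            return collide ? 42 : id;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Builds a set of n keys, then counts equals() calls for one failed lookup.
    static long equalsCallsForMiss(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0;
        seen.contains(new Key(n, collide)); // key n was never added
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("distinct hashes:  " + equalsCallsForMiss(n, false) + " equals() calls");
        System.out.println("colliding hashes: " + equalsCallsForMiss(n, true) + " equals() calls");
    }
}
```

With colliding, non-Comparable keys, a lookup that misses has to visit every entry in the bucket, so the cost of each contains() grows with the size of the set. Whether that is what happens inside SecurityHandler.decrypt for this PDF would still need the file and a profiler run to confirm.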
