With 2.0.14 it has been running for 40 minutes with no result, and it is still working... It seems the issue is still there. Thanks
On Wed, Feb 27, 2019 at 2:52 PM Slava G <[email protected]> wrote:

> Checking with 2.0.14. Started as an app. Will update soon.
>
> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:
>
>> Any chance you could try with the 2.0.14 release candidate...unless you
>> have already?
>>
>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>
>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:
>>
>>> Well, I ran the PDFBox app (as was suggested) to extract text; so far 2
>>> hours and still counting...
>>> It seems to be a PDFBox issue.
>>>
>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
>>>
>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>>> or *curl* bash client to parse your 65 MB PDF?
>>>> It can be easier to investigate the problem.
>>>>
>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>
>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
>>>>
>>>>> Just looking at the stack trace, it won't be the same anymore due to
>>>>> PDFBOX-4453.
>>>>> That change is present in the not-yet-released PDFBox 2.0.14 and it
>>>>> changes how decryption is handled. Not sure if it is related, though.
>>>>>
>>>>> Can you duplicate the problem without Tika using just the PDFBox
>>>>> command-line ExtractText command (
>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>
>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>>>>
>>>>>> This is the code :
>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>> config.setExtractAcroFormContent(false);
>>>>>> config.setExtractBookmarksText(false);
>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>> Metadata metadata = new Metadata();
>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>>>>>
>>>>>>> This is the default in Tika, where the default for
>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>
>>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream via
>>>>>>> tika-app or tika-server or something else?
>>>>>>>
>>>>>>> MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>     memoryUsageSetting = MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>> }
>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
>>>>>>> } else {
>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
>>>>>>> }
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>>>> profiler.
>>>>>>>>
>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>
>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>> password.
>>>>>>>>
>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>> > PDFBox Colleagues,
>>>>>>>> > Any ideas?
>>>>>>>> >
>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>> > From: Tim Allison <[email protected]>
>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>> > To: <[email protected]>
>>>>>>>> >
>>>>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>> > your commandline) and if you've turned it on for PDFs. I suspect this
>>>>>>>> > isn't your problem, though.
>>>>>>>> >
>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> >> Thanks Tim,
>>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what tesseract is in
>>>>>>>> >> this context 🙂
>>>>>>>> >>
>>>>>>>> >> Thanks
>>>>>>>> >>
>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >>> Thank you, Slava!
>>>>>>>> >>>
>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>> >>>
>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>> >>>
>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>>>>>> >>>> Hi,
>>>>>>>> >>>>
>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and
>>>>>>>> >>>> some images.
>>>>>>>> >>>>
>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30GB RAM and an SSD disk,
>>>>>>>> >>>> running CentOS Linux).
>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>>>>>>>> >>>> it's a bug in PDFBox?
>>>>>>>> >>>> When I print the Java stack, I always see it in this stack:
>>>>>>>> >>>>
>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>> >>>>
>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
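
For reference, the stack trace quoted above spends its time in HashSet.contains -> HashMap$TreeNode.find -> COSString.equals. One possible reading, which nobody in this thread has confirmed, is that many keys end up in the same hash bucket, so each lookup turns into a long walk with repeated equals() calls. The sketch below only illustrates that general effect with plain JDK classes. The Key class and its constantHash flag are hypothetical and have nothing to do with PDFBox internals.

```java
import java.util.HashSet;
import java.util.Set;

final class CollidingKeyDemo {

    /** Hypothetical key type (not a PDFBox class): equal by id, hash optionally constant. */
    static final class Key {
        static long equalsCalls = 0; // counts every equals() invocation
        final int id;
        final boolean constantHash;

        Key(int id, boolean constantHash) {
            this.id = id;
            this.constantHash = constantHash;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }

        @Override
        public int hashCode() {
            // A constant hash forces every key into the same bucket.
            return constantHash ? 42 : Integer.hashCode(id);
        }
    }

    /** Fills a HashSet with n keys, then counts equals() calls made by n contains() lookups. */
    static long equalsCallsForLookups(int n, boolean constantHash) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, constantHash));
        }
        Key.equalsCalls = 0; // ignore the calls made while inserting
        for (int i = 0; i < n; i++) {
            if (!seen.contains(new Key(i, constantHash))) {
                throw new IllegalStateException("key " + i + " should be present");
            }
        }
        return Key.equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("distinct hashes: " + equalsCallsForLookups(n, false) + " equals() calls");
        System.out.println("constant hash:   " + equalsCallsForLookups(n, true) + " equals() calls");
    }
}
```

If the second count is far larger than the first, that matches the shape of the trace (deep TreeNode.find recursion ending in equals), but only a profiler run on the actual PDF can tell whether this is what happens here.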
