Checking with 2.0.14. Started as an app. Will update soon.

On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:
> Any chance you could try with the 2.0.14 release candidate...unless you
> have already?
>
> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>
> On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:
>
>> Well, I ran the PDFBox app (as was suggested) to extract text; so far 2
>> hours and still counting...
>> It seems to be a PDFBox issue.
>>
>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
>>
>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>> or *curl* bash client to parse your 65 MB PDF?
>>> It can make the problem easier to investigate.
>>>
>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>
>>> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <[email protected]> wrote:
>>>
>>>> Just looking at the stack trace: it won't be the same anymore due to
>>>> PDFBOX-4453.
>>>> Those changes are in the not-yet-released PDFBox 2.0.14, and they change
>>>> how decryption is handled. Not sure if it's related, though.
>>>>
>>>> Can you duplicate the problem without Tika, using just the PDFBox
>>>> command-line ExtractText command
>>>> ( https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>
>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>>>
>>>>> This is the code:
>>>>>
>>>>>     InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>     PDFParser tmpPdf = new PDFParser();
>>>>>     PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>     config.setMaxMainMemoryBytes(31457280);
>>>>>     config.setExtractAcroFormContent(false);
>>>>>     config.setExtractBookmarksText(false);
>>>>>     config.setCatchIntermediateIOExceptions(true);
>>>>>     Metadata metadata = new Metadata();
>>>>>     metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>     tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>>>>
>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>>>>
>>>>>> This is the default in Tika, where the default for
>>>>>> maxMainMemoryBytes=500MB.
>>>>>>
>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream via
>>>>>> tika-app or tika-server or something else?
>>>>>>
>>>>>>     MemoryUsageSetting memoryUsageSetting =
>>>>>>             MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>     if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>         memoryUsageSetting =
>>>>>>                 MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>     }
>>>>>>     if (tstream != null && tstream.hasFile()) {
>>>>>>         // File based -- send file directly to PDFBox
>>>>>>         pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>                 memoryUsageSetting);
>>>>>>     } else {
>>>>>>         pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>                 password, memoryUsageSetting);
>>>>>>     }
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>>> profiler.
>>>>>>>
>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>
>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>> password.
>>>>>>>
>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>> > PDFBox Colleagues,
>>>>>>> > Any ideas?
>>>>>>> >
>>>>>>> > ---------- Forwarded message ---------
>>>>>>> > From: Tim Allison <[email protected]>
>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>> > To: <[email protected]>
>>>>>>> >
>>>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>>>>> > 'tesseract' on your command line) and if you've turned it on for
>>>>>>> > PDFs. I suspect this isn't your problem, though.
>>>>>>> >
>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>>>>> >
>>>>>>> >> Thanks Tim,
>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>>>>>> >> tesseract is in this context 🙂
>>>>>>> >>
>>>>>>> >> Thanks
>>>>>>> >>
>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >>> Thank you, Slava!
>>>>>>> >>>
>>>>>>> >>> Do you have tesseract installed?
>>>>>>> >>>
>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>> >>>
>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>>>>> >>>> Hi,
>>>>>>> >>>>
>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and
>>>>>>> >>>> some images.
>>>>>>> >>>>
>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika
>>>>>>> >>>> 1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM and
>>>>>>> >>>> an SSD disk, running CentOS Linux).
>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>>>>>>> >>>> maybe it's a bug in PDFBox?
>>>>>>> >>>> Whenever I print the Java stack, I see this stack every time:
>>>>>>> >>>>
>>>>>>> >>>>     at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap.getNode(Unknown Source)
>>>>>>> >>>>     at java.util.HashMap.containsKey(Unknown Source)
>>>>>>> >>>>     at java.util.HashSet.contains(Unknown Source)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>>     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>> >>>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>> >>>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>> >>>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>> >>>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>> >>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>> >>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>> >>>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>> >>>>
>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>> >>>>
>>>>>>> >>>> Thanks
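
For readers following the stack trace in the original report: the run of HashMap$TreeNode.find frames ending in an equals() call is what a HashSet lookup looks like when many keys land in the same hash bucket and the keys are not Comparable. The sketch below reproduces that pattern with the JDK alone. It does not use PDFBox, and BytesKey is a hypothetical stand-in whose constant hashCode() is an assumed worst case, not a claim about how COSString actually hashes the objects in Slava's file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {

    /** Hypothetical key type for illustration only; this is not PDFBox's COSString. */
    public static final class BytesKey {
        private final byte[] bytes;
        private final boolean constantHash;

        public BytesKey(byte[] bytes, boolean constantHash) {
            this.bytes = bytes;
            this.constantHash = constantHash;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof BytesKey)) {
                return false;
            }
            BytesKey other = (BytesKey) o;
            return constantHash == other.constantHash && Arrays.equals(bytes, other.bytes);
        }

        @Override
        public int hashCode() {
            // With constantHash set, every key falls into one bucket (assumed worst case).
            return constantHash ? 42 : Arrays.hashCode(bytes);
        }
    }

    /** Builds n distinct keys whose content is the decimal text of 0..n-1. */
    public static BytesKey[] makeKeys(int n, boolean constantHash) {
        BytesKey[] keys = new BytesKey[n];
        for (int i = 0; i < n; i++) {
            keys[i] = new BytesKey(Integer.toString(i).getBytes(StandardCharsets.UTF_8), constantHash);
        }
        return keys;
    }

    /** Adds every key to a HashSet, then counts how many contains() calls succeed. */
    public static int addAndCount(BytesKey[] keys) {
        Set<BytesKey> seen = new HashSet<>();
        for (BytesKey key : keys) {
            seen.add(key);
        }
        int hits = 0;
        for (BytesKey key : keys) {
            if (seen.contains(key)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Same workload twice: once with a content-based hash, once with the constant hash.
        for (boolean constantHash : new boolean[] {false, true}) {
            BytesKey[] keys = makeKeys(5_000, constantHash);
            long start = System.nanoTime();
            int hits = addAndCount(keys);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("constantHash=" + constantHash + " hits=" + hits + " ms=" + millis);
        }
    }
}
```

Both runs find every key; only the elapsed time differs. A thread dump taken during the constantHash=true run would be expected to show nested TreeNode.find frames calling BytesKey.equals, similar in shape to the trace in the report. Whether that is what happens for this particular PDF can only be confirmed by profiling the file itself, as Tilman suggests.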
