Any chance you could try with the 2.0.14 release candidate...unless you have already?
https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/

On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:

> Well, I ran (as was suggested) the PDFBox app to extract text; so far 2 hours
> and still counting...
> It seems to be a PDFBox issue.
>
> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
>
>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>> or *curl* bash client to parse your 65 MB PDF?
>> It can be easier to investigate the problem.
>>
>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>
>>
>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
>>
>>> Just looking at the stack trace: it won't be the same anymore due to
>>> PDFBOX-4453.
>>> That change is in the not-yet-released PDFBox 2.0.14 and changes
>>> how decryption is handled. Not sure if it's related, though.
>>>
>>> Can you duplicate the problem without Tika, using just the PDFBox
>>> command-line ExtractText command (
>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>
>>>
>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>>
>>>> This is the code:
>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>> PDFParser tmpPdf = new PDFParser();
>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> config.setMaxMainMemoryBytes(31457280);
>>>> config.setExtractAcroFormContent(false);
>>>> config.setExtractBookmarksText(false);
>>>> config.setCatchIntermediateIOExceptions(true);
>>>> Metadata metadata = new Metadata();
>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>>>
>>>>
>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>>>
>>>>>
>>>>> This is the default in Tika, where the default for
>>>>> maxMainMemoryBytes=500MB.
>>>>>
>>>>> Slava, how are you calling this in Tika?
>>>>> With a TikaInputStream via tika-app or tika-server or something else?
>>>>>
>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>     memoryUsageSetting =
>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>> }
>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>     // File based -- send file directly to PDFBox
>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>         memoryUsageSetting);
>>>>> } else {
>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>         password, memoryUsageSetting);
>>>>> }
>>>>>
>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>> profiler.
>>>>>>
>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>
>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>> password.
>>>>>>
>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>> MemoryUsageSetting when load() is called.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>
>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>> > PDFBox Colleagues,
>>>>>> > Any ideas?
>>>>>> >
>>>>>> > ---------- Forwarded message ---------
>>>>>> > From: Tim Allison <[email protected]>
>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>> > To: <[email protected]>
>>>>>> >
>>>>>> >
>>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>> > your command line) and if you've turned it on for PDFs. I suspect this
>>>>>> > isn't your problem, though.
>>>>>> >
>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>>>> >
>>>>>> >> Thanks Tim,
>>>>>> >> But frankly speaking, it's a shame, but I don't know what tesseract is in
>>>>>> >> this context 🙂
>>>>>> >>
>>>>>> >> Thanks
>>>>>> >>
>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>>>> >>
>>>>>> >>> Thank you, Slava!
>>>>>> >>>
>>>>>> >>> Do you have tesseract installed?
>>>>>> >>>
>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>> >>>
>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
>>>>>> >>>> images.
>>>>>> >>>>
>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
>>>>>> >>>> running CentOS Linux).
>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>>>>>> >>>> it's a bug in PDFBox?
>>>>>> >>>> When I print the Java stack, I always see this stack:
>>>>>> >>>>
>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>> >>>>
>>>>>> >>>> Thanks
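
The repeated `HashMap$TreeNode.find` frames ending in `COSString.equals` in the stack trace are what a `HashSet.contains` call tends to look like when many keys share the same hash code and are not `Comparable`: the lookup can then end up comparing the probe against every colliding entry. The sketch below reproduces that pattern with the plain JDK only. It is not PDFBox code; `Key`, `lookupCost` and the constant hash value are made up for illustration, and whether this PDF really triggers such collisions inside `SecurityHandler` is an assumption that only profiling the file could confirm.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {

    /** Counts every equals() call made on Key instances. */
    static long equalsCalls = 0;

    /**
     * Hypothetical key type. It is deliberately not Comparable, so HashMap
     * has no ordering to fall back on when two keys have the same hash.
     */
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // Either every key lands in the same bucket, or each gets its own hash.
            return collide ? 42 : id;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /**
     * Fills a set with n keys, then returns how many equals() calls a single
     * contains() lookup for an absent key needs.
     */
    static long lookupCost(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0; // ignore the calls made while filling the set
        if (seen.contains(new Key(n, collide))) {
            throw new IllegalStateException("key " + n + " should be absent");
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("distinct hashes:  " + lookupCost(n, false) + " equals() calls");
        System.out.println("colliding hashes: " + lookupCost(n, true) + " equals() calls");
    }
}
```

If the cost of one lookup grows with the number of stored objects like this, a loop that checks every object against the set becomes roughly quadratic overall, which would be consistent with a parse that runs for hours.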
