Why don't you run a basic test: start tika-server separately and use a *wget* or *curl* client from bash to parse your 65 MB PDF? That could make the problem easier to investigate.
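If curl or wget is not at hand, the same test can be sketched in plain Java (11+). This is only a rough sketch under assumptions: tika-server is already running locally on its default port 9998, text extraction is requested with a PUT to `/tika`, and the class name `TikaServerProbe` and its helper methods are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper, not part of Tika: times one PDF upload to tika-server.
class TikaServerProbe {

    // Assumption: the text-extraction resource lives at /tika.
    static URI endpoint(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/tika");
    }

    // Builds a PUT request that asks the server for plain-text output.
    static HttpRequest buildRequest(URI target, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(target)
                .header("Accept", "text/plain")
                .header("Content-Type", "application/pdf")
                .PUT(body)
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Path pdf = Path.of(args[0]); // path to the PDF under test
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildRequest(
                endpoint("localhost", 9998),
                HttpRequest.BodyPublishers.ofFile(pdf));

        long start = System.nanoTime();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        long millis = Duration.ofNanos(System.nanoTime() - start).toMillis();

        // Report status, size of the extracted text, and wall-clock time.
        System.out.println("HTTP " + response.statusCode() + ", "
                + response.body().length() + " chars in " + millis + " ms");
    }
}
```

Run it with the PDF path as the only argument. If the server shows the same multi-day parse, the problem is independent of the embedding application.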
@*JB*Δ <http://jbigdata.fr/jbigdata/index.html>

On Tue, Feb 26, 2019 at 23:05, Cristian Vat <[email protected]> wrote:

> Just looking at the stack trace it won't be the same anymore due to
> PDFBOX-4453
> Some changes present in not yet released pdfbox 2.0.14 and it changes how
> decryption is handled. Not sure if related though.
>
> Can you duplicate the problem without Tika using just PDFBox command-line
> ExtractText command ( https://pdfbox.apache.org/2.0/commandline.html ) on
> that file?
>
>
> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>
>> This is the code :
>> InputStream in = TikaInputStream.get(inputFile.toPath());
>> PDFParser tmpPdf = new PDFParser();
>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> config.setMaxMainMemoryBytes(31457280);
>> config.setExtractAcroFormContent(false);
>> config.setExtractBookmarksText(false);
>> config.setCatchIntermediateIOExceptions(true);
>> Metadata metadata = new Metadata();
>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>
>>
>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>
>>>
>>> This is the default in Tika, where the default for
>>> maxMainMemoryBytes=500MB.
>>>
>>> Slava, how are you calling this in Tika? With a TikaInputStream via
>>> tika-app or tika-server or something else?
>>>
>>> MemoryUsageSetting memoryUsageSetting =
>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>     memoryUsageSetting =
>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> }
>>> if (tstream != null && tstream.hasFile()) {
>>>     // File based -- send file directly to PDFBox
>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>         memoryUsageSetting);
>>> } else {
>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>         password, memoryUsageSetting);
>>> }
>>>
>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>> profiler.
>>>>
>>>> The HashSet is used to avoid decrypting objects twice.
>>>>
>>>> The "not encrypted" file is likely encrypted with an empty user
>>>> password.
>>>>
>>>> It would also be interesting to hear what parameter is passed to
>>>> MemoryUsageSetting when load() is called.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>
>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>> > PDFBox Colleagues,
>>>> > Any ideas?
>>>> >
>>>> > ---------- Forwarded message ---------
>>>> > From: Tim Allison <[email protected]>
>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> > Subject: Re: Very slow PDF parsing.
>>>> > To: <[email protected]>
>>>> >
>>>> >
>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>> > your commandline) and if you've turned it on for PDFs. I suspect this
>>>> > isn't your problem, though.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>> >
>>>> >> Thanks Tim,
>>>> >> But frankly speaking, it's a shame, but I don't know what tesseract is in
>>>> >> this context 🙂
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>> >>
>>>> >>> Thank you, Slava!
>>>> >>>
>>>> >>> Do you have tesseract installed?
>>>> >>>
>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>
>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
>>>> >>>>
>>>> >>>> Parsing such a PDF can take about 2 days or even more (TIKA 1.19.1
>>>> >>>> running on a XEON server with a 4-core CPU, 30GB RAM and an SSD disk, running
>>>> >>>> CentOS Linux).
>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe it's a bug
>>>> >>>> in PDFBox?
>>>> >>>> When I print the Java stack, I always see this trace:
>>>> >>>>
>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>> >>>>
>>>> >>>>
>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>> >>>>
>>>> >>>> Thanks
>>>> >>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
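For readers of the trace quoted above: the `HashSet.contains` call ends in `COSString.equals`, so the "already decrypted" lookup compares objects by content. The sketch below is not PDFBox code. It only contrasts, with a hypothetical `Token` class, a content-based visited set with an identity-based one, which remembers instances and never calls `equals`. Whether this is related to the slowdown reported in this thread is an open question.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

// Illustration only: two ways to remember which objects were already processed.
class VisitedSetSketch {

    // Hypothetical stand-in for a parsed object; equal when the text matches.
    static final class Token {
        final String text;

        Token(String text) {
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Token && ((Token) o).text.equals(text);
        }

        @Override
        public int hashCode() {
            return text.hashCode();
        }
    }

    // Lookups go through hashCode() and equals(), so content decides membership.
    static Set<Token> equalityBased() {
        return new HashSet<>();
    }

    // Lookups compare references (==), so only the same instance is a member.
    static Set<Token> identityBased() {
        return Collections.newSetFromMap(new IdentityHashMap<>());
    }

    public static void main(String[] args) {
        Token first = new Token("same content");
        Token second = new Token("same content"); // distinct instance, equal content

        Set<Token> byEquality = equalityBased();
        Set<Token> byIdentity = identityBased();
        byEquality.add(first);
        byIdentity.add(first);

        System.out.println("equality set sees second as visited: " + byEquality.contains(second));
        System.out.println("identity set sees second as visited: " + byIdentity.contains(second));
    }
}
```

The two sets answer different questions: the first asks "have I seen this content?", the second "have I seen this instance?". Which one fits depends on what must not be processed twice.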
