Well, I ran the PDFBox app (as was suggested) to extract the text; it has been running for 2 hours so far and is still going... It seems to be a PDFBox issue.
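
For what it's worth, here is a small plain-Java sketch of one possible reading of the stack trace quoted below (many HashMap$TreeNode.find frames ending in an equals() call). I have not verified that this is what actually happens inside PDFBox. The Key class is a hypothetical stand-in, not COSString, and the constant hash code is an assumption made only to force collisions. It shows that when many non-Comparable keys share one hash code, a single HashSet.contains() ends up calling equals() on roughly every element:

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionDemo {

    /** Hypothetical stand-in key (NOT PDFBox's COSString): not Comparable, counts equals() calls. */
    static final class Key {
        static long equalsCalls = 0;
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // Assumption for the demo: with collide=true every key lands in the same bucket.
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /** Fills a HashSet with n keys, then counts equals() calls for one lookup of an absent key. */
    public static long equalsCallsForMiss(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        Key.equalsCalls = 0;                 // ignore the calls made while filling the set
        seen.contains(new Key(-1, collide)); // id -1 is never in the set
        return Key.equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("equals() calls, colliding hashes: " + equalsCallsForMiss(n, true));
        System.out.println("equals() calls, spread hashes:    " + equalsCallsForMiss(n, false));
    }
}
```

If something similar applies to the set used in SecurityHandler.decrypt, every lookup would cost time proportional to the number of objects already seen, which could add up to hours on a large file. A profiler run on the real PDF would be needed to confirm or rule this out.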

On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:

> Why don't you do a basic test with tika server in a 3thrd and a *wget* or
> *curl* bash client to parse your 65 MB PDF?
> It can be easier to investigate the problem.
>
> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>
> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <[email protected]> wrote:
>
>> Just looking at the stack trace it won't be the same anymore due to
>> PDFBOX-4453
>> Some changes present in not yet released pdfbox 2.0.14 and it changes how
>> decryption is handled. Not sure if related though.
>>
>> Can you duplicate the problem without Tika using just PDFBox command-line
>> ExtractText command ( https://pdfbox.apache.org/2.0/commandline.html )
>> on that file?
>>
>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>
>>> This is the code:
>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>> PDFParser tmpPdf = new PDFParser();
>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>> config.setMaxMainMemoryBytes(31457280);
>>> config.setExtractAcroFormContent(false);
>>> config.setExtractBookmarksText(false);
>>> config.setCatchIntermediateIOExceptions(true);
>>> Metadata metadata = new Metadata();
>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>>
>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>>
>>>> This is the default in Tika, where the default for
>>>> maxMainMemoryBytes=500MB.
>>>>
>>>> Slava, how are you calling this in Tika? With a TikaInputStream via
>>>> tika-app or tika-server or something else?
>>>>
>>>> MemoryUsageSetting memoryUsageSetting =
>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>     memoryUsageSetting =
>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> }
>>>> if (tstream != null && tstream.hasFile()) {
>>>>     // File based -- send file directly to PDFBox
>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>         memoryUsageSetting);
>>>> } else {
>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>         password, memoryUsageSetting);
>>>> }
>>>>
>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>> profiler.
>>>>>
>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>
>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>> password.
>>>>>
>>>>> It would also be interesting to hear what parameter is passed to
>>>>> MemoryUsageSetting when load() is called.
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>> > PDFBox Colleagues,
>>>>> > Any ideas?
>>>>> >
>>>>> > ---------- Forwarded message ---------
>>>>> > From: Tim Allison <[email protected]>
>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>> > Subject: Re: Very slow PDF parsing.
>>>>> > To: <[email protected]>
>>>>> >
>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
>>>>> > dramatically is if you have tesseract installed (try typing
>>>>> > 'tesseract' on your commandline) and if you've turned it on for
>>>>> > PDFs. I suspect this isn't your problem, though.
>>>>> >
>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>>> >
>>>>> >> Thanks Tim,
>>>>> >> But frankly speaking, it's a shame, but I don't know what
>>>>> >> tesseract is in this context 🙂
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>>> >>
>>>>> >>> Thank you, Slava!
>>>>> >>>
>>>>> >>> Do you have tesseract installed?
>>>>> >>>
>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>> >>>
>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and
>>>>> >>>> some images.
>>>>> >>>>
>>>>> >>>> Parsing such a PDF can take about 2 days or even more (TIKA
>>>>> >>>> 1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM
>>>>> >>>> and an SSD disk, running CentOS Linux).
>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>>>>> >>>> maybe it's a bug in PDFBox?
>>>>> >>>> When I print the Java stack, I see this stack all the time:
>>>>> >>>>
>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>> >>>>
>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>> >>>>
>>>>> >>>> Thanks
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
