This is the code:

    InputStream in = TikaInputStream.get(inputFile.toPath());
    PDFParser tmpPdf = new PDFParser();
    PDFParserConfig config = tmpPdf.getPDFParserConfig();
    config.setMaxMainMemoryBytes(31457280);
    config.setExtractAcroFormContent(false);
    config.setExtractBookmarksText(false);
    config.setCatchIntermediateIOExceptions(true);
    Metadata metadata = new Metadata();
    metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
    tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());

On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:

> This is the default in Tika, where the default for
> maxMainMemoryBytes=500MB.
>
> Slava, how are you calling this in Tika? With a TikaInputStream via
> tika-app or tika-server or something else?
>
> MemoryUsageSetting memoryUsageSetting =
>         MemoryUsageSetting.setupMainMemoryOnly();
> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>     memoryUsageSetting =
>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> }
> if (tstream != null && tstream.hasFile()) {
>     // File based -- send file directly to PDFBox
>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>             memoryUsageSetting);
> } else {
>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>             password, memoryUsageSetting);
> }
>
> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]>
> wrote:
>
>> Hi,
>>
>> As usual, it would be nice to have the PDF, so that we could run the
>> profiler.
>>
>> The HashSet is used to avoid decrypting objects twice.
>>
>> The "not encrypted" file is likely encrypted with an empty user password.
>>
>> It would also be interesting to hear what parameter is passed to
>> MemoryUsageSetting when load() is called.
>>
>> Tilman
>>
>>
>> On 26.02.2019 at 18:14, Tim Allison wrote:
>> > PDFBox Colleagues,
>> > Any ideas?
>> >
>> > ---------- Forwarded message ---------
>> > From: Tim Allison <[email protected]>
>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>> > Subject: Re: Very slow PDF parsing.
>> > To: <[email protected]>
>> >
>> >
>> > Sorry...that's an OCR tool. One thing that can slow down processing
>> > dramatically is if you have tesseract installed (try typing 'tesseract'
>> > on your commandline) and if you've turned it on for PDFs. I suspect this
>> > isn't your problem, though.
>> >
>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>> >
>> >> Thanks Tim,
>> >> But frankly speaking, it's a shame, but I don't know what tesseract is
>> >> in this context 🙂
>> >>
>> >> Thanks
>> >>
>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>> >>
>> >>> Thank you, Slava!
>> >>>
>> >>> Do you have tesseract installed?
>> >>>
>> >>> Colleagues on PDFBox, any recommendations?
>> >>>
>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
>> >>>> images.
>> >>>>
>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
>> >>>> running CentOS Linux).
>> >>>>
>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>> >>>> it's a bug in PDFBox?
>> >>>>
>> >>>> When I print the Java stack, I always see this stack:
>> >>>>
>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>> >>>> at java.util.HashMap.getNode(Unknown Source)
>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>> >>>> at java.util.HashSet.contains(Unknown Source)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>> >>>>
>> >>>>
>> >>>> P.S. Btw, the PDF is not encrypted at all.
>> >>>>
>> >>>> Thanks
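
The trace above sits in COSString.equals, reached through HashSet.contains and a deep chain of HashMap$TreeNode.find calls, and Tilman notes that the HashSet exists only to avoid decrypting objects twice. One plausible reading (not a confirmed diagnosis) is that many keys land in the same hash bucket, so each lookup has to call equals() on a large share of the set. The plain-JDK sketch below illustrates that effect and one possible way around it. It does not use PDFBox: the Token class, its deliberately colliding hashCode(), and all method names are hypothetical stand-ins, and none of this is PDFBox's actual code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical demo (not PDFBox code): counts equals() calls made by a "visited" set. */
final class VisitedSetDemo {

    /** Number of Token.equals() invocations since the last reset. */
    static long equalsCalls = 0;

    /** Stand-in for a string-like object; every instance has the same hash code. */
    static final class Token {
        final String value;

        Token(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Token && ((Token) other).value.equals(value);
        }

        @Override
        public int hashCode() {
            return 42; // deliberately colliding, to force all keys into one bucket
        }
    }

    /** Creates n distinct tokens. */
    static Token[] makeTokens(int n) {
        Token[] tokens = new Token[n];
        for (int i = 0; i < n; i++) {
            tokens[i] = new Token("t" + i);
        }
        return tokens;
    }

    /** A set that compares elements by reference (==) instead of equals(). */
    static Set<Token> identitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<Token, Boolean>());
    }

    /** Marks every token as visited and returns how many equals() calls that took. */
    static long visitAll(Set<Token> visited, Token[] tokens) {
        equalsCalls = 0;
        for (Token token : tokens) {
            if (!visited.contains(token)) {
                visited.add(token);
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        Token[] tokens = makeTokens(2000);
        System.out.println("HashSet equals() calls:      " + visitAll(new HashSet<>(), tokens));
        System.out.println("identity set equals() calls: " + visitAll(identitySet(), tokens));
    }
}
```

Note the trade-off: an identity-based set treats two equal but distinct objects as different entries. That is acceptable only when "already visited" is meant per object instance, which is an assumption of this sketch rather than something stated in the thread.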
