I checked on this specific Linux server, and no, tesseract is not installed. I'm configuring the PDF parser with these parameters:

PDFParser tmpPdf = new PDFParser();
PDFParserConfig config = tmpPdf.getPDFParserConfig();
config.setMaxMainMemoryBytes(31457280); // 30 MiB
config.setExtractAcroFormContent(false);
config.setExtractBookmarksText(false);
config.setCatchIntermediateIOExceptions(true);

On Tue, Feb 26, 2019 at 7:13 PM Tim Allison <[email protected]> wrote:

> Sorry...that's an OCR tool. One thing that can slow down processing
> dramatically is if you have tesseract installed (try typing 'tesseract' on
> your commandline) and if you've turned it on for PDFs. I suspect this
> isn't your problem, though.
>
> On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>
>> Thanks Tim,
>> But frankly speaking, it's a shame, but I don't know what tesseract is
>> in this context 🙂
>>
>> Thanks
>>
>> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>
>>> Thank you, Slava!
>>>
>>> Do you have tesseract installed?
>>>
>>> Colleagues on PDFBox, any recommendations?
>>>
>>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I have a large PDF (about 65 MB) that contains mainly text and some
>>> > images.
>>> >
>>> > Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>> > running on a Xeon server with a 4-core CPU, 30 GB RAM, and an SSD,
>>> > running CentOS Linux).
>>> >
>>> > Please advise if there is anything I can do to speed it up. Or maybe
>>> > it's a bug in PDFBox?
>>> >
>>> > When I print the Java stack, I always see this stack:
>>> >
>>> > at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>>> > at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>> > at java.util.HashMap.getNode(Unknown Source)
>>> > at java.util.HashMap.containsKey(Unknown Source)
>>> > at java.util.HashSet.contains(Unknown Source)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>> > at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>> > at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>> > at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>> >
>>> > P.S. Btw, the PDF is not encrypted at all.
>>> >
>>> > Thanks
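
One possible reading of the stack trace above, offered as an assumption and not confirmed anywhere in this thread: the repeated HashMap$TreeNode.find frames under HashSet.contains are what the JDK does when many keys share a hash bucket and the keys are not Comparable, in which case a lookup can end up calling equals() on a large share of the entries. The sketch below reproduces only that general effect in plain Java. The Key class and the equalsCallsForMiss helper are hypothetical names made up for illustration; none of this is PDFBox or Tika code, and it does not show that COSString actually behaves this way.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {
    // Counts every equals() call made on Key instances.
    static long equalsCalls = 0;

    // Hypothetical key type (NOT PDFBox code). It is deliberately not
    // Comparable, and can be told to return one hash code for every instance.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Key && ((Key) other).id == id;
        }
    }

    // Fills a HashSet with n keys, then returns how many equals() calls a
    // single contains() for an absent key triggers.
    static long equalsCallsForMiss(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0;
        boolean found = seen.contains(new Key(-1, collide));
        if (found) {
            throw new IllegalStateException("absent key reported as present");
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("colliding hashes: " + equalsCallsForMiss(n, true) + " equals() calls");
        System.out.println("spread hashes:    " + equalsCallsForMiss(n, false) + " equals() calls");
    }
}
```

With colliding hashes, a single failed lookup touches every stored key, so a set that is consulted once per parsed object would cost roughly quadratic time overall. Whether that is what happens inside SecurityHandler for this particular PDF would need to be checked against the PDFBox source.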
