Thanks, Tim. Frankly, it's a shame, but I don't know what tesseract is in this context 🙂
Thanks

On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:

> Thank you, Slava!
>
> Do you have tesseract installed?
>
> Colleagues on PDFBox, any recommendations?
>
> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
> >
> > Hi,
> >
> > I have large PDF (about 65mb) that contains mainly text and some images.
> >
> > Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1 running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running CentOS Linux).
> >
> > Please advise if there anything I can do to speedup. Or maybe it's a bug in PDFBox?
> >
> > When I'm printing java stack, I see all the time in this stack:
> >
> > at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.find(Unknown Source)
> > at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> > at java.util.HashMap.getNode(Unknown Source)
> > at java.util.HashMap.containsKey(Unknown Source)
> > at java.util.HashSet.contains(Unknown Source)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> > at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> > at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> > at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >
> > P.S. Btw, the PDF is not encrypted at all.
> >
> > Thanks
