Thank you, Slava!

Do you have tesseract installed?

Colleagues on PDFBox, any recommendations?

On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>
> Hi,
>
> I have a large PDF (about 65 MB) that contains mainly text and some images.
>
> Parsing this PDF can take about 2 days or even more (Tika 1.19.1 running 
> on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD, running CentOS 
> Linux).
>
> Please advise if there is anything I can do to speed it up. Or maybe it's a 
> bug in PDFBox?
>
> Whenever I print the Java stack, I see the same trace:
>
> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.find(Unknown Source)
> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> at java.util.HashMap.getNode(Unknown Source)
> at java.util.HashMap.containsKey(Unknown Source)
> at java.util.HashSet.contains(Unknown Source)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>
>
> P.S. Btw, the PDF is not encrypted at all.
>
> Thanks
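
A note on reading the trace above: the nested HashMap$TreeNode.find frames are what the JDK's HashMap tends to produce when many keys in one bucket share the same hash code and are not Comparable, because a lookup then has to walk the whole bucket and call equals() on each entry. Whether that is really what happens with COSString in this file is only a guess from the trace. The sketch below is a standalone illustration that uses only the JDK; CollidingKeysDemo, Key and equalsCallsForMiss are hypothetical names made up for the example, not PDFBox or Tika code.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {

    /** Hypothetical key type, not Comparable, with an optionally constant hash code. */
    static final class Key {
        // Counts every equals() call so the cost of one lookup is visible.
        static long equalsCalls = 0;

        final String value;
        final boolean collide;

        Key(String value, boolean collide) {
            this.value = value;
            this.collide = collide;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).value.equals(value);
        }

        @Override
        public int hashCode() {
            // With collide == true every key lands in the same bucket.
            return collide ? 42 : value.hashCode();
        }
    }

    /**
     * Fills a HashSet with n keys, then looks up a key that is not present
     * and returns how many equals() calls that single lookup needed.
     */
    static long equalsCallsForMiss(int n, boolean collide) {
        Set<Key> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new Key("k" + i, collide));
        }
        Key.equalsCalls = 0; // ignore the calls made while filling the set
        set.contains(new Key("absent", collide));
        return Key.equalsCalls;
    }

    public static void main(String[] args) {
        System.out.println("colliding hashes:   " + equalsCallsForMiss(1000, true));
        System.out.println("distributed hashes: " + equalsCallsForMiss(1000, false));
    }
}
```

With colliding hashes a single failed lookup compares against every stored key, so the cost of each contains() grows with the size of the set. If something similar is going on during PDDocument.load, a profile of COSString.hashCode() values for this file would be a more reliable way to confirm it than this sketch.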
