That is likely too small. It should be retested with a higher value or with memory only.

Tilman

On 26.02.2019 at 19:02, Tim Allison wrote:
This is the default in Tika, where maxMainMemoryBytes defaults to 500MB.

Slava, how are you calling this in Tika?  With a TikaInputStream via
tika-app or tika-server or something else?

MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
if (localConfig.getMaxMainMemoryBytes() >= 0) {
    memoryUsageSetting =
            MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
}
if (tstream != null && tstream.hasFile()) {
    // File based -- send file directly to PDFBox
    pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
            memoryUsageSetting);
} else {
    pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password,
            memoryUsageSetting);
}

On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]> wrote:

Hi,

As usual, it would be nice to have the PDF, so that we could run the
profiler.

The HashSet is used to avoid decrypting objects twice.
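The idea of a "visited" set that guards a recursive traversal can be sketched with the JDK alone. This is not PDFBox code: the class name, the use of nested lists as stand-ins for dictionaries and arrays, and the choice of an identity-based set are all assumptions made for illustration. An identity-based set decides membership by reference, so it never calls equals() on the stored objects, which is where the stack trace in this thread spends its time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the PDFBox implementation.
public class VisitedSetSketch {
    // Identity-based set: contains/add compare references (==), not equals().
    private final Set<Object> visited =
            Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
    private int processed = 0;

    // Processes each distinct object once, recursing into nested lists.
    public void process(Object node) {
        if (!visited.add(node)) {
            return; // this exact object was already handled, skip it
        }
        processed++;
        if (node instanceof List<?>) {
            for (Object child : (List<?>) node) {
                process(child);
            }
        }
    }

    public int processedCount() {
        return processed;
    }

    public static void main(String[] args) {
        List<Object> shared = new ArrayList<>();
        shared.add("leaf");
        List<Object> root = new ArrayList<>();
        root.add(shared);
        root.add(shared); // same object referenced twice

        VisitedSetSketch sketch = new VisitedSetSketch();
        sketch.process(root);
        System.out.println(sketch.processedCount()); // root, shared, leaf
    }
}
```

With a plain HashSet the same guard works, but each lookup may call hashCode() and equals() on the stored objects, so its cost depends on how those methods are implemented.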

The "not encrypted" file is likely encrypted with an empty user password.

It would also be interesting to hear what parameter is passed to
MemoryUsageSetting when load() is called.

Tilman



On 26.02.2019 at 18:14, Tim Allison wrote:
PDFBox Colleagues,
    Any ideas?

---------- Forwarded message ---------
From: Tim Allison <[email protected]>
Date: Tue, Feb 26, 2019 at 12:13 PM
Subject: Re: Very slow PDF parsing.
To: <[email protected]>


Sorry...that's an OCR tool.  One thing that can slow down processing
dramatically is if you have tesseract installed (try typing 'tesseract'
on your command line) and if you've turned it on for PDFs.  I suspect
this isn't your problem, though.



On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:

Thanks Tim,
Frankly speaking, I'm ashamed to admit it, but I don't know what
tesseract is in this context 🙂

Thanks

On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:

Thank you, Slava!

Do you have tesseract installed?

Colleagues on PDFBox, any recommendations?

On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
Hi,

I have a large PDF (about 65 MB) that contains mainly text and some
images. Parsing such a PDF can take about 2 days or even more (Tika
1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD
disk, running CentOS Linux).
Please advise if there is anything I can do to speed it up. Or maybe
it's a bug in PDFBox?
Whenever I print the Java stack, I always see this stack:

    at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
    at java.util.HashMap.getNode(Unknown Source)
    at java.util.HashMap.containsKey(Unknown Source)
    at java.util.HashSet.contains(Unknown Source)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)


P.S. Btw, the PDF is not encrypted at all.

Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



