Re: Fwd: Very slow PDF parsing.

Tilman Hausherr Wed, 27 Feb 2019 09:06:11 -0800

Yes, will do. Use a sharehoster (e.g. filedropper.com) and put the file into an encrypted ZIP. Please send the link and the password to tilman at snafu dot de. Make sure you're not breaking any laws by sending the file.


Tilman



On 27.02.2019 at 17:33, Slava G wrote:

As this is a customer file, I can only share it privately, and I'll ask you to
dispose of it after the investigation is done.
So, how can I share it with you?
Checking now with the 2.0.6 app. Will update...


On Wed, Feb 27, 2019, 18:28 Tilman Hausherr <[email protected]> wrote:

We really need the file to find out what's going on.

If you can't share it, you'll have to investigate yourself by using the
profiler. Before that, try with old 2.0.* versions to see if these are
faster.
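
If no full profiler is at hand, repeated stack dumps give a rough picture of where the time goes. The following is only a minimal sketch of that idea using the standard Thread.getStackTrace() call; the class name StackSampler, the method names and the busyWork() stand-in for the slow parse are all hypothetical and are not part of PDFBox or Tika.

```java
import java.util.HashMap;
import java.util.Map;

/** Poor man's sampling profiler: counts which frame sits on top of a thread's stack. */
final class StackSampler {

    /**
     * Takes up to `samples` stack snapshots of `target`, pausing `intervalMillis`
     * between them, and returns how often each top frame was seen.
     */
    static Map<String, Integer> sampleTopFrames(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the flag and stop sampling early.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return counts;
    }

    /** Hypothetical stand-in for the slow parse: spins until interrupted. */
    static void busyWork() {
        long x = 0;
        while (!Thread.currentThread().isInterrupted()) {
            x += System.nanoTime() % 7;
        }
    }

    public static void main(String[] args) {
        Thread worker = new Thread(StackSampler::busyWork, "parser-thread");
        worker.setDaemon(true);
        worker.start();
        Map<String, Integer> counts = sampleTopFrames(worker, 20, 10);
        worker.interrupt();
        // Print one line per observed top frame: sample count, then frame name.
        counts.forEach((frame, n) -> System.out.println(n + "\t" + frame));
    }
}
```

In a real investigation the sampled thread would be the one running the parse, and a frame that dominates the counts is the first place to look.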

Tilman

On 27.02.2019 at 17:23, Slava G wrote:

After 3h 40m it's still parsing using PDFBox 2.0.14 app...
Thanks

On Wed, Feb 27, 2019 at 3:29 PM Slava G <[email protected]> wrote:

With 2.0.14 it's 40 minutes running, no result, still working...
Seems that issue is still there.
Thanks

On Wed, Feb 27, 2019 at 2:52 PM Slava G <[email protected]> wrote:

Checking with 2.0.14. Started as an app. Will update soon.

On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:

Any chance you could try with the 2.0.14 release candidate...unless you
have already?

https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/


On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:

Well, I ran (as was suggested) the PDFBox app to extract text; so far 2
hours and still counting...
It seems to be a PDFBox issue.

On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:

Why don't you do a basic test with tika server in a 3thrd and a *wget*
or *curl* bash client to parse your 65 MB PDF?
It can be easier to investigate the problem.

@*JB*Δ <http://jbigdata.fr/jbigdata/index.html>



On Tue, Feb 26, 2019 at 23:05, Cristian Vat <[email protected]> wrote:

Just looking at the stack trace, it won't be the same anymore due to
PDFBOX-4453.
Some changes are present in the not-yet-released PDFBox 2.0.14, and they
change how decryption is handled. Not sure if related though.

Can you duplicate the problem without Tika using just PDFBox
command-line ExtractText command (
https://pdfbox.apache.org/2.0/commandline.html ) on that file?


On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:

This is the code :
InputStream in = TikaInputStream.get(inputFile.toPath());
PDFParser tmpPdf = new PDFParser();
PDFParserConfig config = tmpPdf.getPDFParserConfig();
config.setMaxMainMemoryBytes(31457280);
config.setExtractAcroFormContent(false);
config.setExtractBookmarksText(false);
config.setCatchIntermediateIOExceptions(true);
Metadata metadata = new Metadata();
metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());


On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]>
wrote:

This is the default in Tika, where the default for
maxMainMemoryBytes=500MB.

Slava, how are you calling this in Tika?  With a TikaInputStream
via tika-app or tika-server or something else?

MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
if (localConfig.getMaxMainMemoryBytes() >= 0) {
    memoryUsageSetting = MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
}
if (tstream != null && tstream.hasFile()) {
    // File based -- send file directly to PDFBox
    pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
} else {
    pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
}

On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
[email protected]> wrote:

Hi,

As usual, it would be nice to have the PDF, so that we could run the
profiler.

The HashSet is used to avoid decrypting objects twice.
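
To make the "avoid doing the work twice" guard concrete, here is a small sketch of how such a set behaves. It is not PDFBox's actual code: the Node class and the processOnce() and identitySet() helpers are hypothetical. It contrasts a set keyed by equals()/hashCode(), where every lookup may compare object contents, with one keyed by object identity.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class VisitedSetSketch {

    /** Hypothetical stand-in for a parsed string object; equality is by content. */
    static final class Node {
        final String value;

        Node(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Node && ((Node) o).value.equals(value);
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }
    }

    /** Returns true if the node was not in `seen` yet, i.e. the work should be done now. */
    static boolean processOnce(Node node, Set<Node> seen) {
        // Set.add() returns false when the element is already present.
        return seen.add(node);
    }

    /** A set that compares elements with == instead of equals(). */
    static Set<Node> identitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<>());
    }

    public static void main(String[] args) {
        Node a = new Node("same");
        Node b = new Node("same"); // distinct object, equal content

        Set<Node> byEquals = new HashSet<>();
        Set<Node> byIdentity = identitySet();

        // prints "true false": b is treated as already processed
        System.out.println(processOnce(a, byEquals) + " " + processOnce(b, byEquals));
        // prints "true true": a and b are tracked separately
        System.out.println(processOnce(a, byIdentity) + " " + processOnce(b, byIdentity));
    }
}
```

With an equals-based set, objects that share a hash code end up in the same bucket and lookups fall back to content comparisons, which is consistent with the HashMap$TreeNode.find and COSString.equals frames in the reported stack. Whether that is the actual cause for this file can only be confirmed with the PDF.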

The "not encrypted" file is likely encrypted with an empty user
password.

It would also be interesting to hear what parameter is passed to
MemoryUsageSetting when load() is called.

Tilman



On 26.02.2019 at 18:14, Tim Allison wrote:

PDFBox Colleagues,
     Any ideas?

---------- Forwarded message ---------
From: Tim Allison <[email protected]>
Date: Tue, Feb 26, 2019 at 12:13 PM
Subject: Re: Very slow PDF parsing.
To: <[email protected]>


Sorry...that's an OCR tool.  One thing that can slow down processing
dramatically is if you have tesseract installed (try typing 'tesseract' on
your commandline) and if you've turned it on for PDFs.  I suspect this
isn't your problem, though.



On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:

Thanks Tim,
But frankly speaking, it's a shame, but I don't know what tesseract is in
this context 🙂

Thanks

On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:

Thank you, Slava!

Do you have tesseract installed?

Colleagues on PDFBox, any recommendations?

On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:

Hi,

I have a large PDF (about 65 MB) that contains mainly text and some images.

Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
running CentOS Linux).

Please advise if there is anything I can do to speed it up. Or maybe it's
a bug in PDFBox?

When I print the Java stack, I always see this stack:

at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.find(Unknown Source)
at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
at java.util.HashMap.getNode(Unknown Source)
at java.util.HashMap.containsKey(Unknown Source)
at java.util.HashSet.contains(Unknown Source)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)

P.S. Btw, the PDF is not encrypted at all.

Thanks

---------------------------------------------------------------------

To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

