Re: Fwd: Very slow PDF parsing.

Slava G Tue, 26 Feb 2019 10:24:23 -0800

This is the code:
InputStream inputStream = TikaInputStream.get(inputFile.toPath());
PDFParser tmpPdf = new PDFParser();
PDFParserConfig config = tmpPdf.getPDFParserConfig();
config.setMaxMainMemoryBytes(31457280);
config.setExtractAcroFormContent(false);
config.setExtractBookmarksText(false);
config.setCatchIntermediateIOExceptions(true);
Metadata metadata = new Metadata();
metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
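
For reference, the value passed to setMaxMainMemoryBytes above works out to 30 MiB. The sketch below only does that arithmetic and mirrors the ">= 0" check in the Tika snippet quoted further down. It is stdlib-only and does not call Tika or PDFBox; the class and method names are made up for illustration.

```java
class MemoryCapSketch {
    static final long MIB = 1024L * 1024L;

    // Hypothetical helper: describes which branch of the quoted Tika snippet
    // a given maxMainMemoryBytes value would take.
    static String describe(long maxMainMemoryBytes) {
        if (maxMainMemoryBytes < 0) {
            // Negative value: the quoted snippet keeps setupMainMemoryOnly().
            return "main memory only";
        }
        // Non-negative value: the quoted snippet switches to setupMixed(limit).
        return "mixed: up to " + (maxMainMemoryBytes / MIB) + " MiB in main memory";
    }

    public static void main(String[] args) {
        System.out.println(describe(31457280L)); // the value from the code above
        System.out.println(describe(-1L));
    }
}
```

So with this setting the parse would take the setupMixed branch with a 30 MiB limit, well below the 500MB figure mentioned in the reply below.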



On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:

>
> This is the default in Tika, where maxMainMemoryBytes defaults to 500MB.
>
> Slava, how are you calling this in Tika?  With a TikaInputStream via
> tika-app or tika-server or something else?
>
> MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>     memoryUsageSetting = MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> }
> if (tstream != null && tstream.hasFile()) {
>     // File based -- send file directly to PDFBox
>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
> } else {
>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
> }
>
> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]>
> wrote:
>
>> Hi,
>>
>> As usual, it would be nice to have the PDF, so that we could run the
>> profiler.
>>
>> The HashSet is used to avoid decrypting objects twice.
>>
>> The "not encrypted" file is likely encrypted with an empty user password.
>>
>> It would also be interesting to hear what parameter is passed to
>> MemoryUsageSetting when load() is called.
>>
>> Tilman
>>
>>
>>
>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
>> > PDFBox Colleagues,
>> >    Any ideas?
>> >
>> > ---------- Forwarded message ---------
>> > From: Tim Allison <[email protected]>
>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>> > Subject: Re: Very slow PDF parsing.
>> > To: <[email protected]>
>> >
>> >
>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>> > your command line) and if you've turned it on for PDFs.  I suspect this
>> > isn't your problem, though.
>> >
>> >
>> >
>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>> >
>> >> Thanks Tim,
>> >> But frankly speaking, I'm ashamed to admit that I don't know what
>> >> tesseract is in this context 🙂
>> >>
>> >> Thanks
>> >>
>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>> >>
>> >>> Thank you, Slava!
>> >>>
>> >>> Do you have tesseract installed?
>> >>>
>> >>> Colleagues on PDFBox, any recommendations?
>> >>>
>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
>> >>>>
>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD,
>> >>>> running CentOS Linux).
>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>> >>>> it's a bug in PDFBox?
>> >>>> Whenever I print the Java stack, the thread is always in this stack trace:
>> >>>>
>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>> >>>> at java.util.HashMap.getNode(Unknown Source)
>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>> >>>> at java.util.HashSet.contains(Unknown Source)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>> >>>>
>> >>>>
>> >>>> P.S. Btw, the PDF is not encrypted at all.
>> >>>>
>> >>>> Thanks
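
The repeated HashMap$TreeNode.find frames in the trace above are what a HashSet lookup can look like when many keys share one hash value and the bucket has been turned into a tree. Whether that is really what happens with this PDF cannot be told from the trace alone. As a rough illustration only (this is not PDFBox code; the Key class is hypothetical and its constant hashCode is deliberately bad), the stdlib-only sketch below counts how many equals() calls a single failed contains() needs with colliding versus distinct hash codes.

```java
import java.util.HashSet;
import java.util.Set;

class HashCollisionDemo {
    // Counts every equals() call made on any Key.
    static long equalsCalls = 0;

    // Hypothetical key type. It is intentionally not Comparable, so a
    // treeified bucket cannot order keys that share a hash value.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // Deliberately bad: every colliding key reports the same hash.
            return collide ? 42 : id;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Fills a HashSet with n keys, then counts the equals() calls made
    // while looking up one key that is not in the set.
    static long equalsCallsForMiss(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0; // ignore the calls made while filling the set
        seen.contains(new Key(-1, collide));
        return equalsCalls;
    }

    public static void main(String[] args) {
        System.out.println("colliding hashes: " + equalsCallsForMiss(1000, true) + " equals() calls");
        System.out.println("distinct hashes:  " + equalsCallsForMiss(1000, false) + " equals() calls");
    }
}
```

With colliding hashes the lookup has to compare against a large share of the stored keys, while with distinct hashes it needs few or no comparisons. If each equals() call is itself expensive, as it could be for large strings, the cost per lookup grows accordingly.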
