Slava,

Could you please forward this PDF to [email protected] (the private, Tika PMC-only list)? I had similar issues with some PDFs but was unable to get them from the client to look into the problem with a profiler.
--
Best regards,
Konstantin Gribov.

On Thu, Feb 28, 2019 at 7:27 PM Slava G <[email protected]> wrote:

> Tim, which email address should I send the PDF to?
> Thanks
>
> On Thu, Feb 28, 2019 at 3:57 PM Slava G <[email protected]> wrote:
>
>> I will, once I get the customer's approval.
>> Meanwhile I can do any checks, if you can specify what to check.
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <[email protected]> wrote:
>>
>>> Any chance you can share the file directly with me or someone else on the
>>> PDFBox team?
>>>
>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <[email protected]> wrote:
>>>
>>> > After 3h 40m it's still parsing using the PDFBox 2.0.14 app...
>>> > Thanks
>>> >
>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <[email protected]> wrote:
>>> >
>>> >> With 2.0.14 it has been running for 40 minutes, no result, still working...
>>> >> It seems the issue is still there.
>>> >> Thanks
>>> >>
>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <[email protected]> wrote:
>>> >>
>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>> >>>
>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:
>>> >>>
>>> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>> >>>> have already?
>>> >>>>
>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>> >>>>
>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:
>>> >>>>
>>> >>>>> Well, I ran (as was suggested) the PDFBox app to extract text; so far 2
>>> >>>>> hours and still counting...
>>> >>>>> It seems to be a PDFBox issue.
>>> >>>>>
>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF?
>>> >>>>>> It can be easier to investigate the problem.
>>> >>>>>>
>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>> >>>>>>
>>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Just looking at the stack trace, it won't be the same anymore due to
>>> >>>>>>> PDFBOX-4453.
>>> >>>>>>> That change is present in the not-yet-released PDFBox 2.0.14 and it
>>> >>>>>>> changes how decryption is handled. Not sure if it is related, though.
>>> >>>>>>>
>>> >>>>>>> Can you duplicate the problem without Tika, using just the PDFBox
>>> >>>>>>> command-line ExtractText command (
>>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>> >>>>>>>
>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>>> This is the code:
>>> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>> >>>>>>>> config.setExtractBookmarksText(false);
>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>> >>>>>>>> Metadata metadata = new Metadata();
>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> This is the default in Tika, where the default for
>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>> >>>>>>>>>
>>> >>>>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream
>>> >>>>>>>>> via tika-app or tika-server or something else?
>>> >>>>>>>>>
>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>> >>>>>>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>> >>>>>>>>>     memoryUsageSetting =
>>> >>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> >>>>>>>>> }
>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>> >>>>>>>>>     // File based -- send file directly to PDFBox
>>> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>> >>>>>>>>>         password, memoryUsageSetting);
>>> >>>>>>>>> } else {
>>> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>> >>>>>>>>>         password, memoryUsageSetting);
>>> >>>>>>>>> }
>>> >>>>>>>>>
>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Hi,
>>> >>>>>>>>>>
>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>> >>>>>>>>>> the profiler.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>> >>>>>>>>>> password.
>>> >>>>>>>>>>
>>> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Tilman
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>> >>>>>>>>>> > PDFBox Colleagues,
>>> >>>>>>>>>> > Any ideas?
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>> >>>>>>>>>> > From: Tim Allison <[email protected]>
>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>> >>>>>>>>>> > To: <[email protected]>
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > Sorry...that's an OCR tool.
>>> >>>>>>>>>> > One thing that can slow down processing
>>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing
>>> >>>>>>>>>> > 'tesseract' on your command line) and if you've turned it on for
>>> >>>>>>>>>> > PDFs. I suspect this isn't your problem, though.
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >> Thanks Tim,
>>> >>>>>>>>>> >> But frankly speaking, it's a shame, but I don't know what
>>> >>>>>>>>>> >> tesseract is in this context 🙂
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >> Thanks
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >>> Thank you, Slava!
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>> >>>>>>>>>> >>>> Hi,
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and
>>> >>>>>>>>>> >>>> some images.
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika
>>> >>>>>>>>>> >>>> 1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM,
>>> >>>>>>>>>> >>>> and an SSD, running CentOS Linux).
>>> >>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up.
>>> >>>>>>>>>> >>>> Or maybe it's a bug in PDFBox?
>>> >>>>>>>>>> >>>> When I print the Java stack, I always see this stack:
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> Thanks
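
Note on the trace: Tilman's remark that "the HashSet is used to avoid decrypting objects twice" describes a visited-set pattern, and the frames under HashSet.contains show that each lookup ends up in COSString.equals. The following is only a minimal sketch of that pattern, not PDFBox's code: the class VisitedSetSketch, the Leaf type, and the process method are hypothetical names made up for illustration, and nested java.util.List objects stand in for PDF arrays and dictionaries. It uses an identity-based set (an assumption of this sketch, not something the thread confirms for PDFBox), so membership is decided by object reference and the elements' equals()/hashCode() are never called during the lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class VisitedSetSketch {
    /** Hypothetical leaf object standing in for something that must be processed exactly once. */
    static final class Leaf {
        int timesProcessed = 0;
    }

    // Identity-based set: membership is decided by reference (==), so the
    // elements' equals() and hashCode() are not invoked by add().
    private final Set<Object> visited =
            Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());

    /** Recursively visits nested lists and processes each Leaf instance once. */
    void process(Object node) {
        if (!visited.add(node)) {
            return; // this exact object was handled before (this also breaks cycles)
        }
        if (node instanceof List) {
            for (Object child : (List<?>) node) {
                process(child);
            }
        } else if (node instanceof Leaf) {
            ((Leaf) node).timesProcessed++;
        }
    }

    public static void main(String[] args) {
        Leaf shared = new Leaf();
        List<Object> inner = new ArrayList<>();
        inner.add(shared);

        List<Object> root = new ArrayList<>();
        root.add(shared);
        root.add(inner);
        root.add(inner); // the same container referenced twice

        new VisitedSetSketch().process(root);
        System.out.println(shared.timesProcessed); // prints 1
    }
}
```

With an equals-based HashSet, every contains() call hashes the key and may compare it against other entries in the same bucket, which is the work visible at the top of the trace. Whether that is the actual cause of the slowdown in this PDF cannot be told from the thread alone; a profiler run on the file would be needed.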
