Follow-up: it seems to be fixed, so this is no longer relevant for me. Sorry for the bit of noise on the lists.
--
Best regards,
Konstantin Gribov.

On Thu, Mar 21, 2019 at 7:56 PM Konstantin Gribov <[email protected]> wrote:

> Slava,
>
> Could you please forward this pdf to [email protected] (Tika PMC
> only private list)? I had similar issues with some pdf but were unable to
> get them from client to look into it with profiler.
>
> --
> Best regards,
> Konstantin Gribov.
>
> On Thu, Feb 28, 2019 at 7:27 PM Slava G <[email protected]> wrote:
>
>> Tim, to what email to send you the PDF ?
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:57 PM Slava G <[email protected]> wrote:
>>
>>> I'll once I'll get customer's approval.
>>> Meanwhile I can do any checks, if you can specify what to check.
>>> Thanks
>>>
>>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <[email protected]> wrote:
>>>
>>>> Any chance you can share the file directly w me or someone else on the
>>>> PDFBox team?
>>>>
>>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <[email protected]> wrote:
>>>>
>>>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>>> > Thanks
>>>> >
>>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <[email protected]> wrote:
>>>> >
>>>> >> With 2.0.14 it's 40 minutes running, no result, still working...
>>>> >> Seems that issue is still there.
>>>> >> Thanks
>>>> >>
>>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <[email protected]> wrote:
>>>> >>
>>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>>> >>>
>>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:
>>>> >>>
>>>> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>> >>>> have already?
>>>> >>>>
>>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>> >>>>
>>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:
>>>> >>>>
>>>> >>>>> Well, I ran (as was suggested) PDFBox app to extract text , so far 2
>>>> >>>>> hours and still counting...
>>>> >>>>> It's seems to be a PDFBox issue.
>>>> >>>>>
>>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>>> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF.
>>>> >>>>>> It can be easier to investigate the problem.
>>>> >>>>>>
>>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>> >>>>>>
>>>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
>>>> >>>>>>
>>>> >>>>>>> Just looking at the stack trace it won't be the same anymore due to
>>>> >>>>>>> PDFBOX-4453
>>>> >>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>>> >>>>>>> changes how decryption is handled. Not sure if related though.
>>>> >>>>>>>
>>>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>> >>>>>>> command-line ExtractText command (
>>>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>> >>>>>>>
>>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> This is the code :
>>>> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>>> >>>>>>>> config.setExtractBookmarksText(false);
>>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>> >>>>>>>> Metadata metadata = new Metadata();
>>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>> >>>>>>>> ParseContext());
>>>> >>>>>>>>
>>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> This is the default in Tika, where the default for
>>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream
>>>> >>>>>>>>> via tika-app or tika-server or something else?
>>>> >>>>>>>>>
>>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>> >>>>>>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>> >>>>>>>>>     memoryUsageSetting =
>>>> >>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> >>>>>>>>> }
>>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>> >>>>>>>>>     // File based -- send file directly to PDFBox
>>>> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>>> >>>>>>>>>         password, memoryUsageSetting);
>>>> >>>>>>>>> } else {
>>>> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>> >>>>>>>>>         password, memoryUsageSetting);
>>>> >>>>>>>>> }
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> Hi,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>> >>>>>>>>>> the profiler.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>> >>>>>>>>>> password.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Tilman
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>> >>>>>>>>>> > PDFBox Colleagues,
>>>> >>>>>>>>>> > Any ideas?
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>>> >>>>>>>>>> > From: Tim Allison <[email protected]>
>>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>> >>>>>>>>>> > To: <[email protected]>
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
>>>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>> >>>>>>>>>> > your commandline) and if you've turned it on for PDFs. I suspect this
>>>> >>>>>>>>>> > isn't your problem, though.
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >> Thanks Tim,
>>>> >>>>>>>>>> >> But frankly speaking, it's a shame, but don't know what is tessercat is in
>>>> >>>>>>>>>> >> this context 🙂
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> Thanks
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >>> Thank you, Slava!
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
>>>> >>>>>>>>>> >>>> Hi,
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and some images.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>>>> >>>>>>>>>> >>>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>>>> >>>>>>>>>> >>>> CentOS Linux).
>>>> >>>>>>>>>> >>>> Please advise if there anything I can do to speedup. Or maybe it's a bug
>>>> >>>>>>>>>> >>>> in PDFBox ?
>>>> >>>>>>>>>> >>>> When I'm printing java stack , I see all the time in this stack :
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Thanks
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>> >>>>>>>>>> For additional commands, e-mail: [email protected]
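
A note for anyone reading the stack trace in the original report: the repeated HashMap$TreeNode.find frames ending in an equals() call are what a HashSet lookup looks like when many keys land in the same bucket and the keys cannot be ordered by compareTo. Whether that is what actually happened with this PDF is not confirmed anywhere in this thread, so treat the following only as a minimal sketch of the general effect. It uses plain JDK classes; CollisionDemo, Key and countEqualsCalls are made-up names for illustration and have nothing to do with PDFBox's own code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo, not PDFBox code: shows how HashSet.contains degrades
// when many non-Comparable keys share one hash code.
public class CollisionDemo {

    // Counts every call to Key.equals so the two scenarios can be compared.
    static long equalsCalls = 0;

    // Stand-in key; deliberately does NOT implement Comparable.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // Colliding keys all report the same hash; others spread out by id.
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Key && ((Key) other).id == id;
        }
    }

    // Fills a HashSet with n keys, looks each one up again, and returns
    // how many equals() calls the lookups needed in total.
    public static long countEqualsCalls(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0; // ignore the calls made while filling the set
        for (int i = 0; i < n; i++) {
            if (!seen.contains(new Key(i, collide))) {
                throw new IllegalStateException("key " + i + " should be present");
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("equals() calls, spread hashes:    " + countEqualsCalls(n, false));
        System.out.println("equals() calls, colliding hashes: " + countEqualsCalls(n, true));
    }
}
```

Running main prints the two totals side by side; the colliding case needs far more equals() calls for the same number of lookups, which is the kind of cost that would show up as time spent under HashSet.contains in a thread dump.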
