I will, once I get the customer's approval. Meanwhile, I can run any checks if you can specify what to check.
Thanks

On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <[email protected]> wrote:

> Any chance you can share the file directly w me or someone else on the
> PDFBox team?
>
> On Wed, Feb 27, 2019 at 11:24 AM Slava G <[email protected]> wrote:
>
> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> > Thanks
> >
> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <[email protected]> wrote:
> >
> >> With 2.0.14 it's 40 minutes running, no result, still working...
> >> Seems that issue is still there.
> >> Thanks
> >>
> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <[email protected]> wrote:
> >>
> >>> Checking with 2.0.14. Started as an app. Will update soon.
> >>>
> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <[email protected]> wrote:
> >>>
> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
> >>>> have already?
> >>>>
> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
> >>>>
> >>>>
> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <[email protected]> wrote:
> >>>>
> >>>>> Well, I ran (as was suggested) the PDFBox app to extract text, so far 2
> >>>>> hours and still counting...
> >>>>> It seems to be a PDFBox issue.
> >>>>>
> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <[email protected]> wrote:
> >>>>>
> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF.
> >>>>>> It can be easier to investigate the problem.
> >>>>>>
> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <[email protected]> wrote:
> >>>>>>
> >>>>>>> Just looking at the stack trace, it won't be the same anymore due to
> >>>>>>> PDFBOX-4453.
> >>>>>>> That change is present in the not yet released PDFBox 2.0.14 and it
> >>>>>>> changes how decryption is handled. Not sure if related though.
> >>>>>>>
> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
> >>>>>>> command-line ExtractText command (
> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> This is the code :
> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
> >>>>>>>> PDFParser tmpPdf = new PDFParser();
> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
> >>>>>>>> config.setExtractAcroFormContent(false);
> >>>>>>>> config.setExtractBookmarksText(false);
> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
> >>>>>>>> Metadata metadata = new Metadata();
> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
> >>>>>>>> ParseContext());
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This is the default in Tika, where the default for
> >>>>>>>>> maxMainMemoryBytes=500MB.
> >>>>>>>>>
> >>>>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream
> >>>>>>>>> via tika-app or tika-server or something else?
> >>>>>>>>>
> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
> >>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
> >>>>>>>>>     memoryUsageSetting =
> >>>>>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> >>>>>>>>> }
> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
> >>>>>>>>>     // File based -- send file directly to PDFBox
> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
> >>>>>>>>>             password, memoryUsageSetting);
> >>>>>>>>> } else {
> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
> >>>>>>>>>             password, memoryUsageSetting);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
> >>>>>>>>>> the profiler.
> >>>>>>>>>>
> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
> >>>>>>>>>>
> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
> >>>>>>>>>> password.
> >>>>>>>>>>
> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
> >>>>>>>>>> MemoryUsageSetting when load() is called.
> >>>>>>>>>>
> >>>>>>>>>> Tilman
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
> >>>>>>>>>> > PDFBox Colleagues,
> >>>>>>>>>> > Any ideas?
> >>>>>>>>>> >
> >>>>>>>>>> > ---------- Forwarded message ---------
> >>>>>>>>>> > From: Tim Allison <[email protected]>
> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
> >>>>>>>>>> > To: <[email protected]>
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
> >>>>>>>>>> > your commandline) and if you've turned it on for PDFs. I suspect this
> >>>>>>>>>> > isn't your problem, though.
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <[email protected]> wrote:
> >>>>>>>>>> >
> >>>>>>>>>> >> Thanks Tim,
> >>>>>>>>>> >> But frankly speaking, and it's a shame, I don't know what tesseract is in
> >>>>>>>>>> >> this context 🙂
> >>>>>>>>>> >>
> >>>>>>>>>> >> Thanks
> >>>>>>>>>> >>
> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <[email protected]> wrote:
> >>>>>>>>>> >>
> >>>>>>>>>> >>> Thank you, Slava!
> >>>>>>>>>> >>>
> >>>>>>>>>> >>> Do you have tesseract installed?
> >>>>>>>>>> >>>
> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
> >>>>>>>>>> >>>
> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <[email protected]> wrote:
> >>>>>>>>>> >>>> Hi,
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> Parsing of such a PDF can take about 2 days or even more (TIKA 1.19.1
> >>>>>>>>>> >>>> running on a XEON server with a 4-core CPU and 30GB RAM with an SSD disk,
> >>>>>>>>>> >>>> running CentOS Linux).
> >>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
> >>>>>>>>>> >>>> it's a bug in PDFBox?
> >>>>>>>>>> >>>> Whenever I print the Java stack, I see this stack:
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
> >>>>>>>>>> >>>>
> >>>>>>>>>> >>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>>> For additional commands, e-mail: [email protected]

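The repeated HashMap$TreeNode.find frames ending in COSString.equals in the quoted stack trace are consistent with many entries of the HashSet sharing one hash code, although nothing in the thread confirms that this is the cause. Below is a minimal, self-contained sketch of that effect using only the JDK. CollidingKey and CollisionDemo are hypothetical names made up for illustration; they are not PDFBox or Tika classes, and the sketch does not reproduce the actual PDF.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical key type (not a PDFBox class) whose hash code can be forced to collide. */
final class CollidingKey {
    /** Counts every equals() call so the lookup cost can be observed. */
    static long equalsCalls = 0;

    private final int id;
    private final int hash;

    CollidingKey(int id, int hash) {
        this.id = id;
        this.hash = hash;
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        equalsCalls++;
        return o instanceof CollidingKey && ((CollidingKey) o).id == id;
    }
}

final class CollisionDemo {
    /**
     * Fills a HashSet with n keys, then performs n lookups of keys that are
     * not in the set and returns how many equals() calls those lookups made.
     * With collide == true every key reports the same hash code.
     */
    static long equalsCallsForMisses(int n, boolean collide) {
        Set<CollidingKey> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new CollidingKey(i, collide ? 42 : i));
        }
        CollidingKey.equalsCalls = 0; // ignore the calls made while inserting
        for (int i = 0; i < n; i++) {
            set.contains(new CollidingKey(n + i, collide ? 42 : n + i));
        }
        return CollidingKey.equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2_000;
        System.out.println("distinct hashes:  "
                + equalsCallsForMisses(n, false) + " equals() calls");
        System.out.println("colliding hashes: "
                + equalsCallsForMisses(n, true) + " equals() calls");
    }
}
```

When the keys are not Comparable and share a hash code, each contains() call has to compare against many stored entries instead of a handful, so the total work grows much faster than the number of objects. Running a profiler on the real file, as suggested in the thread, would be needed to tell whether this is what happens there.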