This is indeed a Tika bug, or a bug in the underlying PDFBox code it uses. In order to make progress, we need a sample document that demonstrates the problem. Once we have that, I can open a Tika ticket.
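
To narrow down which of the blocked files reproduce the overflow, a loop along the following lines could help. It is only a sketch under stated assumptions: `IsolateFailingDocuments`, `findStackOverflows`, `fakeParse`, and `recurse` are hypothetical names, and `fakeParse` merely simulates unbounded recursion. It is not a Tika or PDFBox API. In a real test, the parser argument would be whatever call actually parses the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class IsolateFailingDocuments {

    /**
     * Runs the given parser on each document in turn and returns the
     * documents for which it died with a StackOverflowError.
     * The parser is a caller-supplied stand-in (an assumption of this sketch).
     */
    public static <T> List<T> findStackOverflows(List<T> documents, Consumer<T> parser) {
        List<T> failing = new ArrayList<>();
        for (T doc : documents) {
            try {
                parser.accept(doc);
            } catch (StackOverflowError e) {
                // The stack has unwound by the time we get here, so it is
                // safe to record the document and move on to the next one.
                failing.add(doc);
            }
        }
        return failing;
    }

    /** Hypothetical parser: recurses without bound for names containing "recursive". */
    public static void fakeParse(String name) {
        if (name.contains("recursive")) {
            recurse(0);
        }
    }

    /** Unbounded recursion, used only to provoke a StackOverflowError. */
    private static int recurse(int depth) {
        return recurse(depth + 1) + 1;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("ok.pdf", "recursive.pdf", "fine.pdf");
        System.out.println(findStackOverflows(docs, IsolateFailingDocuments::fakeParse));
    }
}
```

Any file such a loop reports would be a candidate for the sample document.
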
Thanks,
Karl

On Tue, May 29, 2018 at 12:06 PM msaunier <[email protected]> wrote:

> Hello Karl,
>
> PS: at the moment, I have 24 documents blocked: 20 with status
> "Processing" and 4 with status "About to Process".
>
> I have tested them and they are the same. I imported the files and
> used tika-app.jar to test them locally, and I get this error for
> these files:
>
> WARN Invalid XObject Subtype: null
> WARN Invalid XObject Subtype: null
> WARN Invalid XObject Subtype: null
> …
> WARN Invalid XObject Subtype: null
> WARN Invalid XObject Subtype: null
> WARN Invalid XObject Subtype: null
> WARN Invalid XObject Subtype: null
> Exception in thread "main" java.lang.StackOverflowError
>     at java.util.zip.Inflater.<init>(Inflater.java:102)
>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:99)
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>     at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
>     at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
>     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
>     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     at org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject.getContents(PDFormXObject.java:144)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:493)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>     …
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>
> If I open the file with Edge, it displays fine.
>
> Any idea?
>
> Thanks,
> Maxence
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Monday, May 28, 2018 18:47
> *To:* [email protected]
> *Subject:* Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour
>
> This sounds potentially like a problem in Tika, but in order to be
> sure I would need a complete stack trace, not just a piece of one.
>
> If it is a Tika issue, it should appear reliably on the same document,
> again and again.
>
> Is there any way you can crawl ONLY one of the documents that got
> blocked? I suspect that when you paused and restarted, you just
> postponed the problem and it will happen again.
>
> Karl
>
> On Mon, May 28, 2018 at 9:50 AM msaunier <[email protected]> wrote:
>
> > Hello Karl,
> >
> > In ManifoldCF 2.9, at the end of every job, several documents
> > (around twenty) remain blocked.
> >
> > A single error appears and spams the logs with several gigabytes in
> > a short time, which fills up the servers:
> >
> > [?:?]
> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) ~[?:?]
> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495) ~[?:?]
> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:231) ~[?:?]
> >
> > If I pause the job and restart it, the documents are sent and it
> > works. But if I am not there, we have problems.
> >
> > Do you know this problem, and do you have a solution? Is it a bad
> > configuration?
> >
> > Thank you.
