Hello Karl,
PS: at the moment I have 24 blocked documents: 20 with status "Processing" and 4
with status "About to Process".
I tested, and they are the same ones. I imported the files and tested them
locally with tika-app.jar, and I get this error for these files:
WARN Invalid XObject Subtype: null
WARN Invalid XObject Subtype: null
WARN Invalid XObject Subtype: null
…
WARN Invalid XObject Subtype: null
WARN Invalid XObject Subtype: null
WARN Invalid XObject Subtype: null
WARN Invalid XObject Subtype: null
Exception in thread "main" java.lang.StackOverflowError
at java.util.zip.Inflater.<init>(Inflater.java:102)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:99)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
at org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject.getContents(PDFormXObject.java:144)
at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:493)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
…
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
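The repeating group of five frames in the trace above (processOperator, processStreamOperators, processTransparencyGroup, showTransparencyGroup, DrawObject.process) is consistent with a form XObject that ends up drawing itself, directly or through other forms, although very deep nesting would look the same. The following is only a minimal sketch of that failure mode and of a cycle guard, in plain Java with no PDFBox dependency. `Form`, `naiveDepth` and `guardedCount` are hypothetical names used for illustration; they are not PDFBox's API and not its actual fix.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class FormRecursionSketch {

    /** Hypothetical stand-in for a PDF form XObject: a name plus the forms it draws. */
    static final class Form {
        final String name;
        final List<Form> draws = new ArrayList<>();

        Form(String name) {
            this.name = name;
        }
    }

    /** Naive walk: recurses once per nested draw, so a cycle never terminates. */
    static int naiveDepth(Form form) {
        int deepest = 0;
        for (Form child : form.draws) {
            deepest = Math.max(deepest, naiveDepth(child));
        }
        return deepest + 1;
    }

    /** Guarded walk: counts distinct forms reached, starting with an empty guard set. */
    static int guardedCount(Form form) {
        Set<Form> inProgress =
                Collections.newSetFromMap(new IdentityHashMap<Form, Boolean>());
        return guardedCount(form, inProgress);
    }

    /** Skips any form that is already being processed further up the call stack. */
    static int guardedCount(Form form, Set<Form> inProgress) {
        if (!inProgress.add(form)) {
            return 0; // cycle detected: do not descend again
        }
        int count = 1;
        for (Form child : form.draws) {
            count += guardedCount(child, inProgress);
        }
        inProgress.remove(form);
        return count;
    }

    public static void main(String[] args) {
        // Two forms that draw each other: A -> B -> A -> ...
        Form a = new Form("A");
        Form b = new Form("B");
        a.draws.add(b);
        b.draws.add(a);

        try {
            naiveDepth(a);
        } catch (StackOverflowError e) {
            System.out.println("naive walk overflowed the stack");
        }
        System.out.println("guarded walk visited " + guardedCount(a) + " forms");
    }
}
```

Whether the blocked PDFs really contain such a cycle is an assumption; inspecting one of the files would be needed to confirm it.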
If I open the file in Edge, it displays fine.
Any ideas?
Thanks,
Maxence
From: Karl Wright [mailto:[email protected]]
Sent: Monday, May 28, 2018 18:47
To: [email protected]
Subject: Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10 GB/hour
This sounds potentially like a problem in Tika, but in order to be sure I would
need a complete stack trace, not just a piece of one.
If it is a Tika issue, it should appear reliably on the same document, again
and again.
Is there any way you can crawl ONLY one of the documents that got blocked? I
suspect that when you paused and restarted, you just postponed the problem and
it will happen again.
Karl
On Mon, May 28, 2018 at 9:50 AM msaunier <[email protected]> wrote:
Hello Karl,
In ManifoldCF 2.9, at the end of every job, several documents (around twenty)
remain blocked.
A single error appears, and it floods the logs with several gigabytes in a
short time, which filled up the servers:
[?:?]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) ~[?:?]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495) ~[?:?]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:231) ~[?:?]
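Because a runaway recursion can produce a very long stack trace every time it is logged, one way to keep log volume bounded is to log only the first few frames. The following is a minimal sketch in plain Java, assuming you control the code that calls the parser (which may not be the case inside ManifoldCF). `parseHypothetical`, `summarize` and `MAX_FRAMES` are hypothetical names, not Tika, PDFBox or ManifoldCF API. Note that StackOverflowError is an Error, so a `catch (Exception e)` block does not catch it.

```java
public class TruncatedTraceSketch {

    /** Hypothetical limit on how many frames to write per error. */
    static final int MAX_FRAMES = 10;

    /** Hypothetical stand-in for a parser call that recurses without bound. */
    static void parseHypothetical(int depth) {
        parseHypothetical(depth + 1);
    }

    /** Formats the throwable with at most maxFrames stack frames. */
    static String summarize(Throwable t, int maxFrames) {
        StackTraceElement[] frames = t.getStackTrace();
        int shown = Math.min(frames.length, maxFrames);
        StringBuilder sb = new StringBuilder(t.toString());
        for (int i = 0; i < shown; i++) {
            sb.append('\n').append("\tat ").append(frames[i]);
        }
        if (frames.length > shown) {
            sb.append('\n').append("\t... ")
              .append(frames.length - shown).append(" more frames omitted");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        try {
            parseHypothetical(0);
        } catch (StackOverflowError e) {
            // Caught explicitly: StackOverflowError is not a subclass of Exception.
            System.err.println(summarize(e, MAX_FRAMES));
        }
    }
}
```

This only limits what a single error writes; it does not stop the same document from being retried and failing again.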
If I pause the job and restart it, the documents are sent and it works. But if
I'm not there, we have problems.
Do you know this problem, and do you have a solution? Is it a bad configuration?
Thank you.