Re: [jira] [Commented] (TIKA-3040) PDF inline OCR: Exception while processing certain image (others in same PDF work)

2020-02-12 Thread Tim Allison
Eric,
  Are you talking about the different OCR strategies for PDFs?  The
challenge is that it really isn't simple.

I've tried to explain it:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

and better, here, with the 2 primary strategies:
https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)
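
As a rough illustration of switching between those two strategies when
talking to tika-server, here is a minimal sketch in Python. It assumes the
Tika 1.x server convention of passing PDF parser settings as X-Tika-PDF*
request headers; the header names and values, the default port 9998, and
the helper names (ocr_headers, build_request) are assumptions to verify
against the wiki pages above and your Tika version, not a confirmed API.

```python
import urllib.request

# Assumed tika-server 1.x header names; verify them against your version.
# Strategy 1: extract each inline image from the PDF and OCR it separately.
INLINE_IMAGES = {"X-Tika-PDFextractInlineImages": "true"}
# Strategy 2: render each page to a single image and OCR that image.
RENDER_PAGES = {"X-Tika-PDFOcrStrategy": "ocr_only"}


def ocr_headers(strategy):
    """Return the request headers for one of the two PDF OCR strategies."""
    if strategy == "inline_images":
        return dict(INLINE_IMAGES)
    if strategy == "render_pages":
        return dict(RENDER_PAGES)
    raise ValueError("unknown OCR strategy: %r" % (strategy,))


def build_request(pdf_bytes, strategy, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT request to the /rmeta endpoint."""
    headers = {"Content-Type": "application/pdf"}
    headers.update(ocr_headers(strategy))
    return urllib.request.Request(
        base_url + "/rmeta", data=pdf_bytes, method="PUT", headers=headers
    )
```

To actually send it, pass the result to urllib.request.urlopen() against a
running server. Keeping the strategy choice in one small function makes it
easy to retry a failing document with the other strategy.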


Then there's this on the horizon (??):
https://issues.apache.org/jira/browse/TIKA-2749

On Wed, Feb 12, 2020 at 8:53 AM Eric Pugh 
wrote:

> Is there a way that a mere mortal could understand to make that change?
>  Tika at times can be rather opaque in how the parameters all interact with
> each other.
>
>
> > On Feb 12, 2020, at 7:38 AM, Tim Allison (Jira)  wrote:
> >
> >
> >[
> https://issues.apache.org/jira/browse/TIKA-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035359#comment-17035359
> ]
> >
> > Tim Allison commented on TIKA-3040:
> > ---
> >
> > Great!  Thank you [~Mandalka]!
> >
> >> PDF inline OCR: Exception while processing certain image (others in
> >> same PDF work)
> >> --
> >>
> >>Key: TIKA-3040
> >>URL: https://issues.apache.org/jira/browse/TIKA-3040
> >>Project: Tika
> >> Issue Type: Bug
> >> Components: ocr
> >>   Affects Versions: 1.23
> >>Environment: Debian 10
> >> Tesseract
> >>   Reporter: Markus Mandalka
> >>   Priority: Minor
> >>
> >> There is a PDF document (without plain text content) whose text
> >> content consists of scans of multiple pages.
> >> OCR of one of the images (the text of one page) fails in tika-server
> >> with inline OCR for PDF enabled.
> >> My fallback in Open Semantic ETL / Open Semantic Search, which uses
> >> pdfimages from the Debian package poppler-utils to extract the images,
> >> works for all images in that PDF document.
> >> I cannot attach/upload this document here publicly because of
> >> copyright/classification issues, but if there is interest, I could
> >> send it to specific developer(s).
> >> The following tika-server exception appears in the result field
> >> X-TIKA:EXCEPTION:embedded_stream_exception:
> >> javax.imageio.IIOException: Bogus input colorspace
> >>   at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
> >>   at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1007)
> >>   at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:371)
> >>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:316)
> >>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:189)
> >>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:166)
> >>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:148)
> >>   at org.apache.tika.parser.pdf.PDF2XHTML.writeToBuffer(PDF2XHTML.java:304)
> >>   at org.apache.tika.parser.pdf.PDF2XHTML.processImageObject(PDF2XHTML.java:268)
> >>   at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:194)
> >>   at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:165)
> >>   at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
> >>   at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
> >>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
> >>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> >>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
> >>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:162)
> >>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> >>   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:233)
> >>   at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
> >>   at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:147)
> >>   at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:123)
> >>   at jdk.internal.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> >>   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> >>   at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
> >>   at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
> >>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
> >>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
> >>   at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> >>   at org.apache.cxf.intercep

Re: [jira] [Commented] (TIKA-3040) PDF inline OCR: Exception while processing certain image (others in same PDF work)

2020-02-12 Thread Eric Pugh
Is there a way that a mere mortal could understand to make that change?   Tika 
at times can be rather opaque in how the parameters all interact with each 
other.


> On Feb 12, 2020, at 7:38 AM, Tim Allison (Jira)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/TIKA-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035359#comment-17035359
>  ] 
> 
> Tim Allison commented on TIKA-3040:
> ---
> 
> Great!  Thank you [~Mandalka]!
> 
>> PDF inline OCR: Exception while processing certain image (others in same PDF 
>> work)
>> --
>> 
>>Key: TIKA-3040
>>URL: https://issues.apache.org/jira/browse/TIKA-3040
>>Project: Tika
>> Issue Type: Bug
>> Components: ocr
>>   Affects Versions: 1.23
>>Environment: Debian 10
>> Tesseract
>>   Reporter: Markus Mandalka
>>   Priority: Minor
>> 
>> There is a PDF document (without plain text content) whose text content 
>> consists of scans of multiple pages.
>> OCR of one of the images (the text of one page) fails in tika-server with 
>> inline OCR for PDF enabled.
>> My fallback in Open Semantic ETL / Open Semantic Search, which uses 
>> pdfimages from the Debian package poppler-utils to extract the images, 
>> works for all images in that PDF document.
>> I cannot attach/upload this document here publicly because of 
>> copyright/classification issues, but if there is interest, I could send it 
>> to specific developer(s).
>> The following tika-server exception appears in the result field 
>> X-TIKA:EXCEPTION:embedded_stream_exception:
>> javax.imageio.IIOException: Bogus input colorspace
>>   at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
>>   at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1007)
>>   at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:371)
>>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:316)
>>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:189)
>>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:166)
>>   at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:148)
>>   at org.apache.tika.parser.pdf.PDF2XHTML.writeToBuffer(PDF2XHTML.java:304)
>>   at org.apache.tika.parser.pdf.PDF2XHTML.processImageObject(PDF2XHTML.java:268)
>>   at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:194)
>>   at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:165)
>>   at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
>>   at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
>>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:162)
>>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>>   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:233)
>>   at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
>>   at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:147)
>>   at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:123)
>>   at jdk.internal.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>>   at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>>   at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
>>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
>>   at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>>   at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>>   at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>>   at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>>   at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>>   at org.apache.cxf.transport.h