Re: [jira] [Commented] (TIKA-3040) PDF inline OCR: Exception while processing certain image (others in same PDF work)
Eric,

Are you talking about the different OCR strategies for PDFs? The challenge is that it really isn't simple.

I've tried to explain it:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

and better, here, with the 2 primary strategies:
https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)

Then there's this on the horizon (??):
https://issues.apache.org/jira/browse/TIKA-2749

On Wed, Feb 12, 2020 at 8:53 AM Eric Pugh wrote:

> Is there a way that a mere mortal could understand to make that change?
> Tika at times can be rather opaque in how the parameters all interact with
> each other.
>
> > On Feb 12, 2020, at 7:38 AM, Tim Allison (Jira) wrote:
> >
> > [ https://issues.apache.org/jira/browse/TIKA-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035359#comment-17035359 ]
> >
> > Tim Allison commented on TIKA-3040:
> > ---
> >
> > Great! Thank you [~Mandalka]!
> >
> >> PDF inline OCR: Exception while processing certain image (others in same PDF work)
> >> --
> >>
> >> Key: TIKA-3040
> >> URL: https://issues.apache.org/jira/browse/TIKA-3040
> >> Project: Tika
> >> Issue Type: Bug
> >> Components: ocr
> >> Affects Versions: 1.23
> >> Environment: Debian 10
> >> Tesseract
> >> Reporter: Markus Mandalka
> >> Priority: Minor
> >>
> >> There is a PDF document (without plain text content) in which text content are scans of multiple pages.
> >> OCR for one of the images (text of a page) fails by tika-server with activated inline OCR for PDF.
> >> My fallback/alternate in Open Semantic ETL / Open Semantic Search using pdfimages of Debian package poppler-utils to extract the images works for all images in that PDF document).
> >> I can not attach/upload this document here to the public because of Copyright/Classified issues, but if interested, i could send it to certain developer(s).
> >> Following tika-server exception in result field X-TIKA:EXCEPTION:embedded_stream_exception:
> >> javax.imageio.IIOException: Bogus input colorspace
> >> at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
> >> at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1007)
> >> at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:371)
> >> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:316)
> >> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:189)
> >> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:166)
> >> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:148)
> >> at org.apache.tika.parser.pdf.PDF2XHTML.writeToBuffer(PDF2XHTML.java:304)
> >> at org.apache.tika.parser.pdf.PDF2XHTML.processImageObject(PDF2XHTML.java:268)
> >> at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:194)
> >> at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:165)
> >> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
> >> at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
> >> at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
> >> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> >> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
> >> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:162)
> >> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> >> at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:233)
> >> at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
> >> at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:147)
> >> at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:123)
> >> at jdk.internal.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> >> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> >> at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
> >> at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
> >> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
> >> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
> >> at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> >> at org.apache.cxf.intercep
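
Editorial illustration, not part of the original message: the question in this thread is how to switch between the PDF OCR strategies on tika-server, so here is a minimal sketch of how that might look from a client. It assumes a tika-server 1.x instance at `http://localhost:9998` and the request headers `X-Tika-PDFextractInlineImages` and `X-Tika-PDFOcrStrategy` as described on the TikaOCR wiki page linked above; check both against your server version. The function names and the `"inline"` / `"render"` labels are made up for this example. Nothing in the thread confirms that switching strategy avoids the "Bogus input colorspace" exception.

```python
import urllib.request

# Header names as documented for tika-server 1.x (assumption: verify for your version).
INLINE_IMAGES_HEADER = "X-Tika-PDFextractInlineImages"
OCR_STRATEGY_HEADER = "X-Tika-PDFOcrStrategy"


def ocr_headers(strategy):
    """Return the request headers for one of the two PDF OCR approaches.

    "inline": extract each embedded image and OCR it separately.
    "render": render each page to one image and OCR that instead.
    The labels "inline" and "render" are hypothetical names for this sketch.
    """
    if strategy == "inline":
        return {INLINE_IMAGES_HEADER: "true"}
    if strategy == "render":
        return {OCR_STRATEGY_HEADER: "ocr_only"}
    raise ValueError("strategy must be 'inline' or 'render'")


def parse_pdf(path, strategy, server="http://localhost:9998"):
    """PUT a PDF to /rmeta on a running tika-server and return the raw JSON bytes."""
    with open(path, "rb") as f:
        data = f.read()
    headers = {"Content-Type": "application/pdf"}
    headers.update(ocr_headers(strategy))
    req = urllib.request.Request(
        server + "/rmeta", data=data, headers=headers, method="PUT"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

With a server running, `parse_pdf("scan.pdf", "render")` would request page-level OCR rather than per-image OCR; `scan.pdf` is a placeholder file name.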
Re: [jira] [Commented] (TIKA-3040) PDF inline OCR: Exception while processing certain image (others in same PDF work)
Is there a way that a mere mortal could understand to make that change? Tika at times can be rather opaque in how the parameters all interact with each other.

> On Feb 12, 2020, at 7:38 AM, Tim Allison (Jira) wrote:
>
> [ https://issues.apache.org/jira/browse/TIKA-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035359#comment-17035359 ]
>
> Tim Allison commented on TIKA-3040:
> ---
>
> Great! Thank you [~Mandalka]!
>
>> PDF inline OCR: Exception while processing certain image (others in same PDF work)
>> --
>>
>> Key: TIKA-3040
>> URL: https://issues.apache.org/jira/browse/TIKA-3040
>> Project: Tika
>> Issue Type: Bug
>> Components: ocr
>> Affects Versions: 1.23
>> Environment: Debian 10
>> Tesseract
>> Reporter: Markus Mandalka
>> Priority: Minor
>>
>> There is a PDF document (without plain text content) in which text content are scans of multiple pages.
>> OCR for one of the images (text of a page) fails by tika-server with activated inline OCR for PDF.
>> My fallback/alternate in Open Semantic ETL / Open Semantic Search using pdfimages of Debian package poppler-utils to extract the images works for all images in that PDF document).
>> I can not attach/upload this document here to the public because of Copyright/Classified issues, but if interested, i could send it to certain developer(s).
>> Following tika-server exception in result field X-TIKA:EXCEPTION:embedded_stream_exception:
>> javax.imageio.IIOException: Bogus input colorspace
>> at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
>> at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1007)
>> at java.desktop/com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:371)
>> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:316)
>> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:189)
>> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:166)
>> at org.apache.pdfbox.tools.imageio.ImageIOUtil.writeImage(ImageIOUtil.java:148)
>> at org.apache.tika.parser.pdf.PDF2XHTML.writeToBuffer(PDF2XHTML.java:304)
>> at org.apache.tika.parser.pdf.PDF2XHTML.processImageObject(PDF2XHTML.java:268)
>> at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:194)
>> at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:165)
>> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
>> at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>> at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
>> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:162)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>> at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:233)
>> at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
>> at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:147)
>> at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:123)
>> at jdk.internal.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>> at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>> at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
>> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
>> at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>> at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>> at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>> at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>> at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>> at org.apache.cxf.transport.h