Sorry for not responding sooner. The file that you attached helps me understand this question quite a bit.
The basic answer is: no, not yet, not generally. The correct way to do OCR on PDFs might be to render the page without rendering the stored text and then run OCR on the page (minus text). We're not yet doing this. As mentioned, see our wiki (https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)) for the two main options for running OCR on PDFs.

On your specific test file (see code and output below) you can use option 1 for PDFs (i.e. extract inline images), and you get what you want, but this will not generalize because some PDFs can use thousands of images per page.

    // Option 1: extract inline images so that each one is handed to the OCR parser
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    ParseContext parseContext = new ParseContext();
    parseContext.set(PDFParserConfig.class, config);

    Parser p = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    Path path = Paths.get("/..../Simple-text-image.pdf");
    try (InputStream tis = TikaInputStream.get(path, metadata)) {
        p.parse(tis, handler, metadata, parseContext);
    }
    // Print the XHTML that Tika produced
    System.out.println(handler.toString());

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Peter Kronenberg" />
<meta name="language" content="en-US" />
<meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
<meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
<meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
<meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
<meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
<meta name="modified" content="2020-12-31T20:08:09Z" />
<meta name="Content-Length" content="47113" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
<meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="creator" content="Peter Kronenberg" />
<meta name="dc:language" content="en-US" />
<meta name="meta:author" content="Peter Kronenberg" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
<meta name="created" content="2020-12-31T20:08:09Z" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
<meta name="resourceName" content="Simple-text-image.pdf" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="Author" content="Peter Kronenberg" />
<meta name="producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
<title></title>
</head>
<body><div class="page"><p />
<p>Start of text </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut </p>
<p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc sed augue lacus. Et netus </p>
<p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida. Scelerisque fermentum dui </p>
<p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra massa massa ultricies </p>
<p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit adipiscing. Auctor </p>
<p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed vulputate mi sit amet mauris </p>
<p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum quisque non tellus </p>
<p>orci ac auctor augue. </p>
<p>Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec sagittis. Vestibulum </p>
<p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Diam ut venenatis </p>
<p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada pellentesque elit eget gravida cum </p>
<p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Dictum varius duis at </p>
<p>consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel risus. Sit amet </p>
<p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non blandit massa enim nec dui </p>
<p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at tellus at urna </p>
<p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus pellentesque eu tincidunt tortor </p>
<p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna. </p>
<p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae ultricies leo. Gravida neque </p>
<p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at quis risus sed. </p>
<p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper risus in hendrerit </p>
<p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in. Pharetra sit amet </p>
<p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida rutrum quisque non </p>
<p>tellus orci. </p>
<p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio euismod. Mollis </p>
<p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh. Tristique senectus et </p>
<p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet consectetur adipiscing elit ut </p>
<p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in fermentum et sollicitudin </p>
<p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat maecenas volutpat blandit </p>
<p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui. Interdum posuere lorem </p>
<p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus non. Ac turpis egestas </p>
<p>sed tempus urna. Nam aliquam sem et tortor consequat id porta. </p>
<p>End of text </p>
<p> </p>
<p> </p>
<p />
</div>
<div class="page"><p />
<p>Start of image </p>
<p>End of image </p>
<p> </p>
<p />
<img src="embedded:image0.png" alt="image0.png" /><div class="ocr">Pellentesque adipiscing commodo elit at imperdiet dui. Consectetur purus ut faucibus pulvinar.
Tincidunt praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada bibendum arcu vitae
elementum curabitur vitae. Velit euismod in pellentesque massa placerat duis. Fermentum et
sollicitudin ac orci phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie.
Commodo quis imperdiet massa fincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut
porttitor leo a diam sollicitudin tempor. Amet aliquam id diam maecenas ultricies mi eget mauris
pharetra. Ullamcorper dignissim cras tincidunt lobortis feugiat vivamus at augue eget.
Orci eu lobortis elementum nibh tellus. Id aliquet lectus proin nibh nis! condimentum. Vitae
elementum curabitur vitae nunc sed velit. Rhoncus dolor purus non enim praesent elementum
facilisis leo vel. Velit egestas dui id ornare arcu odio ut sem nulla. Purus sit amet luctus
venenatis lectus magna fringilla. Maecenas sed enim ut sem.
</div>
</div>
</body></html>

On Thu, Dec 31, 2020 at 9:58 AM Peter Kronenberg <[email protected]> wrote:

> I’ve got Tika working with Tesseract on PDF files, but it seems that if I
> give it a PDF file that has both searchable text and images, the text is
> OCRed twice. Is there a way to avoid this? Even if it has to make two
> passes, one for the straight text and then another for just the images
>
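
If you need to tell the OCRed text apart from the stored text in output like the above, one possibility is to post-process the XHTML and collect only the contents of the <div class="ocr"> elements. Here is a minimal sketch that uses just the JDK's DOM parser. OcrDivExtractor and ocrText are hypothetical names for illustration, not part of Tika, and the sketch assumes the handler output is well-formed XHTML as in the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Tika): collects OCR text from XHTML output.
class OcrDivExtractor {

    // Returns the trimmed text of every <div class="ocr">, one per line.
    static String ocrText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The output declares the XHTML namespace, so parse namespace-aware.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            StringBuilder sb = new StringBuilder();
            // "*" matches div elements in any namespace.
            NodeList divs = doc.getElementsByTagNameNS("*", "div");
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("ocr".equals(div.getAttribute("class"))) {
                    if (sb.length() > 0) {
                        sb.append('\n');
                    }
                    sb.append(div.getTextContent().trim());
                }
            }
            return sb.toString();
        } catch (Exception e) {
            // Assumption: callers treat malformed XHTML as a bad argument.
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>Start of image </p>"
                + "<div class=\"ocr\">Pellentesque adipiscing commodo elit. </div>"
                + "</div></body></html>";
        System.out.println(ocrText(sample));
    }
}
```

In practice you would pass handler.toString() from the snippet above instead of the hard-coded sample. This only separates the two kinds of text after the fact; it does not stop the OCR from running.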
