Surfacing hOCR output from Tika Server

Eric Pugh Fri, 09 Aug 2019 13:53:35 -0700

I’m working with Tika Server directly rather than with the Tika code, and I 
have it set up so that when I post a PDF to the server I get back XML instead 
of the plain-text version, by specifying outputType=hocr in the 
TesseractOCRConfig.properties file.


However, I’m looking to get back all the hOCR metadata as well, i.e. the 
bounding boxes around each word.  Is it possible to return that up the chain?
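
To make concrete what I mean by the hOCR metadata, here is a minimal sketch in 
Python of the per-word data I'd like to end up with. It assumes the response 
carries word spans in the usual hOCR shape (class ocrx_word, with a title 
attribute holding "bbox x0 y0 x1 y1"). The SAMPLE string, the HocrWordParser 
class, and the extract_words helper are made up for illustration and are not 
actual Tika Server output or API.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 95".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []     # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Hypothetical helper: return a list of (word, bbox) pairs from hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in hOCR style, not real server output.
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 580 122'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 200 116; x_wconf 93'>world</span>"
    "</span>"
)

for word, bbox in extract_words(SAMPLE):
    print(word, bbox)
```

If the server response kept those title attributes intact, something like this 
on the client side would be all I need.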



Eric

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
