I’m working with Tika Server directly rather than with the Tika code, and I 
have it set up so that when I post a PDF to the server I get back XML instead 
of the plain-text version, by specifying outputType=hocr in the 
TesseractOCRConfig.properties file.

However, I’d also like to get back all of the hOCR metadata, i.e. the bounding 
boxes around each word. Is it possible to return that up the chain?
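
To make concrete what I'm hoping to get back: in raw Tesseract hOCR, each word is a span with class ocrx_word whose title attribute carries the bounding box as "bbox x0 y0 x1 y1". Below is a rough Python sketch of how I would read those boxes if they came through. The sample markup and the helper names (parse_bbox, extract_words, HocrWordParser) are made up for illustration, and it assumes the server would pass the spans along unchanged, which is exactly the part I'm unsure about.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # A title looks like "bbox 36 92 96 116; x_wconf 95".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) for every span with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Parse an hOCR string and return a list of (word, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, hand-written for this example.
SAMPLE = (
    '<div class="ocr_page" title="bbox 0 0 600 800">'
    '<span class="ocr_line" title="bbox 36 92 200 116">'
    '<span class="ocrx_word" title="bbox 36 92 96 116; x_wconf 95">Hello</span> '
    '<span class="ocrx_word" title="bbox 110 92 200 116; x_wconf 91">world</span>'
    '</span></div>'
)

if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE):
        print(word, bbox)
```

If the response from the server keeps the class and title attributes on each word, something along these lines would be all I need on the client side.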



Eric

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
