I’m working with Tika Server rather than calling the Tika code directly. I have it set up so that when I post a PDF to the server I get back XML instead of the plain-text version, by specifying outputType=hocr in the TesseractOCRConfig.properties file.
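A minimal sketch of the kind of request described above, not the exact call in use. It assumes Tika Server is listening on its default port (9998) on localhost, that the document is sent with PUT to the /tika endpoint, and that XHTML is requested through the Accept header. The function names and the file path passed to fetch_xhtml are hypothetical.

```python
import urllib.request

# Assumption: Tika Server on its default host/port; adjust as needed.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika Server for XHTML output."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            # Ask for the XHTML rendering rather than plain text.
            "Accept": "text/html",
        },
    )


def fetch_xhtml(pdf_path: str) -> str:
    """Send the PDF at pdf_path (a hypothetical path) and return the response body."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling fetch_xhtml("sample.pdf") against a running server would return the XHTML body, which can then be inspected to see whether any hOCR markup survives in it.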
However, I’d also like to get back all of the hOCR metadata, i.e. the bounding boxes around each word. Is it possible to return that up the chain?

Eric

Eric Pugh | Founder & CEO | OpenSource Connections, LLC
