I wanted to share the magic set of parameters that worked for me:

curl -T mypdf.pdf http://localhost:9998/rmeta \
  --header "X-Tika-OCRLanguage: eng" \
  --header "X-Tika-PDFOcrStrategy: ocr_only" \
  --header "X-Tika-OCRoutputType: hocr"
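
For anyone who would rather script this than shell out to curl, here is a rough Python equivalent of the command above. It is only a sketch: the function names (build_rmeta_request, fetch_rmeta) are my own, and it assumes the server is on localhost:9998 as in the curl call. Like curl -T, it sends the file with PUT.

```python
import json
import urllib.request

# Same endpoint as in the curl command; adjust if your server runs elsewhere.
TIKA_RMETA_URL = "http://localhost:9998/rmeta"


def build_rmeta_request(pdf_bytes, url=TIKA_RMETA_URL):
    """Build (but do not send) a PUT request carrying the three OCR headers."""
    headers = {
        "X-Tika-OCRLanguage": "eng",
        "X-Tika-PDFOcrStrategy": "ocr_only",
        "X-Tika-OCRoutputType": "hocr",
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="PUT")


def fetch_rmeta(pdf_path, url=TIKA_RMETA_URL):
    """Send the PDF to the server and return the parsed JSON response."""
    with open(pdf_path, "rb") as f:
        request = build_rmeta_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling fetch_rmeta("mypdf.pdf") needs a running Tika Server; build_rmeta_request can be inspected without one.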

This returns the output as JSON, and the hOCR output sits under the key 
X-TIKA:content in an awful escaped XML format:

\u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e
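
The escaping is less scary than it looks: the \u003c style sequences are ordinary JSON string escapes, so any JSON parser turns them back into plain markup. As a minimal sketch (the class and function names are mine, and it assumes each word is a span with class ocrx_word whose title carries "bbox x0 y0 x1 y1" and "x_wconf N", as in the sample above), the words and their bounding boxes can be pulled out like this:

```python
import json
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+(?:\.\d+)?)")


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word span being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = BBOX_RE.search(title)
            conf = CONF_RE.search(title)
            self._current = {
                "id": attrs.get("id"),
                "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    """Return a list of word dicts found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# The escaped fragment from above, wrapped in a one-key JSON object.
sample = (
    r'{"X-TIKA:content": "\u003cspan class\u003d\"ocrx_word\" '
    r'id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; '
    r'x_wconf 96\"\u003ePerspectives\u003c/span\u003e"}'
)
words = extract_words(json.loads(sample)["X-TIKA:content"])
```

Against a real /rmeta response, which is a JSON list of metadata objects, the same extract_words call would be applied to the X-TIKA:content value of each entry.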

I’m going to play around some more and see if maybe I can get a nicer structure 
to be returned!

Eric

> On Aug 9, 2019, at 4:52 PM, Eric Pugh <[email protected]> wrote:
> 
> I’m working with the Tika Server rather than with the Tika code directly, 
> and I have it set up so that when I post a PDF to the server I get back 
> the XML instead of the text version, by specifying outputType=hocr in the 
> TesseractOCRConfig.properties file.
> 
> However, I’m looking to get back all the hOCR metadata as well, i.e. the 
> bounding boxes around each word.  Is returning that up the chain possible?
> 
> 
> 
> Eric
> 
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
> http://www.opensourceconnections.com <http://www.opensourceconnections.com/> 
> | My Free/Busy <http://tinyurl.com/eric-cal>  
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>   
> This e-mail and all contents, including attachments, is considered to be 
> Company Confidential unless explicitly stated otherwise, regardless of 
> whether attachments are marked as such.
> 

