Self serve!  Perfect!  Yes, that's what I was going to recommend.

If you don't want the metadata (/rmeta), try the basic /tika handler
(if you haven't already!).
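
If you do stay with /rmeta, the "awful escaped" look is just JSON string escaping (\u003c is "<", \u003d is "="), so any JSON parser hands back plain hOCR markup. Here is a minimal Python sketch, assuming the response is a JSON list of metadata objects and the hOCR lives under X-TIKA:content as you describe. The sample string is your one-word snippet wrapped in that shape, and extract_words and WORD_RE are hypothetical names for illustration, not part of Tika. The regex is only good enough for spans that look like your example; a real HTML/XML parser would be safer for full pages.

```python
import json
import re

# Sample shaped like the /rmeta response quoted below: a JSON list with one
# metadata object per document. The content is abbreviated to a single word.
SAMPLE_RESPONSE = r'''[{"X-TIKA:content": "\u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e"}]'''

# Matches an hOCR word span and captures the four bbox numbers and the text.
WORD_RE = re.compile(
    r'<span[^>]*class="ocrx_word"[^>]*'
    r'title="bbox (\d+) (\d+) (\d+) (\d+)[^"]*"[^>]*>([^<]*)</span>'
)


def extract_words(rmeta_json):
    """Return a list of (text, (x0, y0, x1, y1)) for each ocrx_word span."""
    docs = json.loads(rmeta_json)  # json.loads undoes the \u003c-style escapes
    words = []
    for doc in docs:
        content = doc.get("X-TIKA:content", "")
        for match in WORD_RE.finditer(content):
            x0, y0, x1, y1 = (int(g) for g in match.groups()[:4])
            words.append((match.group(5), (x0, y0, x1, y1)))
    return words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_RESPONSE):
        print(text, bbox)
```

To try it against a live server, save the curl output to a file and pass the file contents to extract_words instead of SAMPLE_RESPONSE.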

On Mon, Aug 12, 2019 at 8:53 AM Eric Pugh
<[email protected]> wrote:
>
> I wanted to share the magic set of parameters that worked for me:
>
> curl -T mypdf.pdf http://localhost:9998/rmeta \
>   --header "X-Tika-OCRLanguage: eng" \
>   --header "X-Tika-PDFOcrStrategy: ocr_only" \
>   --header "X-Tika-OCRoutputType: hocr"
>
> This returns the output in JSON format, and under the key X-TIKA:content, in 
> an awful escaped XML format, is the hOCR output:
>
> \u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 
> 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e
>
> I’m going to play around some more and see if maybe I can get a nicer 
> structure to be returned!
>
> Eric
>
> On Aug 9, 2019, at 4:52 PM, Eric Pugh <[email protected]> wrote:
>
> I’m working with the Tika Server rather than with the Tika code directly, 
> and I have it set up so that when I post a PDF to the server I get back the 
> XML instead of the text version, by specifying outputType=hocr in the 
> TesseractOCRConfig.properties file.
>
> However, I’m looking to get back all the hOCR metadata as well, i.e., the 
> bounding boxes around each word.  Is returning that up the chain possible?
>
>
>
> Eric
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
> http://www.opensourceconnections.com | My Free/Busy
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> This e-mail and all contents, including attachments, is considered to be 
> Company Confidential unless explicitly stated otherwise, regardless of 
> whether attachments are marked as such.
>
