Self-serve! Perfect! Yes, that's what I was going to recommend. If you don't want the metadata (/rmeta), try the basic /tika handler (if you haven't already!).
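If you do stay with /rmeta, the "awful escaped XML" quoted below is just JSON string escaping (\u003c is "<", \u003d is "="), so any JSON parser should hand back plain hOCR markup. Here is a minimal sketch in Python, using only the standard library, that pulls the word text and bounding boxes out of such a response. The names HocrWordParser, extract_words and SAMPLE are made up for illustration, the sample is just the one span from your message wrapped in the JSON list that /rmeta returns, and I have not run this against a live server:

```python
import json
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute, e.g. "bbox 400 453 518 475; x_wconf 96".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects id, bbox and text for every <span class="ocrx_word">."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            bbox = tuple(int(n) for n in match.groups()) if match else None
            self._current = {"id": attrs.get("id"), "bbox": bbox, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def extract_words(rmeta_json):
    """Parse an /rmeta JSON response (a list of metadata objects) into word records."""
    parser = HocrWordParser()
    for doc in json.loads(rmeta_json):
        parser.feed(doc.get("X-TIKA:content") or "")
    parser.close()
    return parser.words


# Hypothetical sample: the single span from the thread, wrapped as /rmeta output.
SAMPLE = (
    r'[{"X-TIKA:content": "\u003cspan class\u003d\"ocrx_word\" '
    r'id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; '
    r'x_wconf 96\"\u003ePerspectives\u003c/span\u003e"}]'
)

if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word)
```

To try it on real output, save the curl response to a file and pass its contents to extract_words. Each record carries the word id, an (x0, y0, x1, y1) tuple and the text, which may be a nicer structure to work with than the raw string.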
On Mon, Aug 12, 2019 at 8:53 AM Eric Pugh <[email protected]> wrote:
>
> I wanted to share the magic set of parameters that worked for me:
>
> curl -T mypdf.pdf http://localhost:9998/rmeta --header "X-Tika-OCRLanguage: eng" --header "X-Tika-PDFOcrStrategy: ocr_only" --header "X-Tika-OCRoutputType: hocr"
>
> This returns the output in JSON format, and under the key X-TIKA:content,
> in an awful escaped XML format, is the hOCR output:
>
> \u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e
>
> I’m going to play around some more and see if maybe I can get a nicer
> structure to be returned!
>
> Eric
>
> On Aug 9, 2019, at 4:52 PM, Eric Pugh <[email protected]> wrote:
>
> I’m working with the Tika Server directly instead of working with the Tika
> code directly, and I have it set so that when I post a PDF to the server
> I get back the XML instead of the text version, by specifying in the
> TesseractOCRConfig.properties file that I want outputType=hocr.
>
> However, I’m looking to get back all the hOCR metadata as well, i.e. the
> bounding boxes around each word. Is returning that up the chain possible?
>
> Eric
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com | My Free/Busy
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
