I wanted to share the magic set of parameters that worked for me:

    curl -T mypdf.pdf http://localhost:9998/rmeta \
      --header "X-Tika-OCRLanguage: eng" \
      --header "X-Tika-PDFOcrStrategy: ocr_only" \
      --header "X-Tika-OCRoutputType: hocr"

This returns the output as JSON, and under the key X-TIKA:content, in an awful escaped XML format, is the hOCR output:

    \u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e

I’m going to play around some more and see if maybe I can get a nicer structure to be returned!

Eric

> On Aug 9, 2019, at 4:52 PM, Eric Pugh <[email protected]> wrote:
>
> I’m working with the Tika Server directly instead of working with the Tika
> code directly, and I have it set so that when I post a PDF to the server
> I get back the XML instead of the text version, by specifying in the
> TesseractOCRConfig.properties file that I want outputType=hocr.
>
> However, I’m looking to get back all the hOCR metadata as well, i.e. the
> bounding boxes around each word. Is returning that up the chain possible?
>
> Eric
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <http://www.opensourceconnections.com/>
> | My Free/Busy <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
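
For anyone who wants to pull the word boxes out of that /rmeta response, here is a minimal sketch in Python. The \u003c style escaping is ordinary JSON string escaping, so it goes away once the response is parsed. The names extract_words and WORD_RE are made up for illustration, and the regex assumes each word span looks like the one quoted above (class first, then a title with bbox and x_wconf, and no nested tags inside the word). A real XML/HTML parser would be more robust.

```python
import html
import json
import re

# Hypothetical pattern for one hOCR word span, based only on the sample span
# quoted in this message. Accepts single or double quotes around attributes.
WORD_RE = re.compile(
    r"""<span class=["']ocrx_word["'][^>]*"""
    r"""title=["']bbox (\d+) (\d+) (\d+) (\d+); x_wconf ([\d.]+)["'][^>]*>"""
    r"""([^<]*)</span>"""
)


def extract_words(rmeta_json):
    """Return text, bounding box and confidence for each OCR word.

    rmeta_json is the raw JSON text returned by the /rmeta endpoint,
    which is a list with one metadata object per (embedded) document.
    """
    words = []
    for doc in json.loads(rmeta_json):  # json.loads undoes the \u003c escapes
        content = doc.get("X-TIKA:content", "")
        for match in WORD_RE.finditer(content):
            x0, y0, x1, y1 = (int(v) for v in match.groups()[:4])
            words.append({
                "text": html.unescape(match.group(6)),  # decode &amp; etc.
                "bbox": (x0, y0, x1, y1),
                "conf": float(match.group(5)),
            })
    return words
```

Feeding it the saved curl output, for example extract_words(open("out.json").read()) where out.json is a hypothetical file name, should give one dict per recognized word.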
