Hi, Christian. I have the tesseract binary in /usr/bin (which is in PATH) and its data in /usr/share/tessdata, and OCR works fine in the current Tika release.
Does it fail even when the tessdata path was properly configured when building tesseract (the default is $PREFIX/share/tessdata)? Or does OCR fail only when tessdata is placed in a non-standard directory, which requires always starting tesseract with `/path/to/tesseract --tessdata-dir /path/to/tessdata ...`?

Providing an explicit property for tessdata-dir shouldn't break anything if its default is an empty string and the `--tessdata-dir` argument isn't added to the command line when it's empty.

On Thu, Jul 23, 2015 at 7:42, Mattmann, Chris A (3980) <[email protected]> wrote:

> this would be a very welcome change, Christian. Please create a JIRA
> issue at:
>
> http://issues.apache.org/jira/browse/TIKA
>
> And update the wiki page here http://wiki.apache.org/tika/TikaOCR
>
> Would be happy for you to contribute via SVN and Jira/patch and/or
> from Github per here:
>
> http://github.com/apache/tika/#contributing
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Christian Wolfe <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, July 22, 2015 at 7:11 PM
> To: "[email protected]" <[email protected]>
> Subject: TesseractOCRParser on Linux
>
> >Hi folks,
> >
> >It looks to me that TesseractOCRParser doesn't work on Linux unless the
> >Tesseract executable and the 'tessdata' folder are in the same location
> >on the filesystem. This makes sense in a Windows environment
> >(where everything is installed together by default), but in linux,
> >package managers (*and* source code installations) tend to split the
> >files up across the filesystem.
> >
> >I believe this could be alleviated by creating a second property in
> >TesseractOCRConfig that points to the 'tessdata' folder separately from
> >the Tesseract executable. That, or a bit of documentation
> >that clarifies that the files need to be together.
> >
> >I would be more than willing to work on either solution, but only if the
> >team considered it worthwhile.
> >
> >Anyway, thanks for making a great library, and for taking time to read
> >this.

--
Best regards,
Konstantin Gribov
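
For illustration, the optional-argument idea suggested above could be sketched roughly as follows. This is only a sketch under assumptions, not Tika's actual code: the class name `TesseractCommandSketch`, the method `buildCommand`, and all parameter names are hypothetical. The only thing taken from the discussion is that `--tessdata-dir` is passed right after the tesseract binary and is left out entirely when the property is empty.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: shows how an empty tessdata
// property can leave the command line unchanged.
final class TesseractCommandSketch {

    // Builds the argument list for a tesseract invocation.
    // tessdataDir may be null or empty, meaning "use tesseract's built-in default".
    static List<String> buildCommand(String tesseractPath, String tessdataDir,
                                     String inputImage, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath);
        // Only add the option when a directory was explicitly configured.
        if (tessdataDir != null && !tessdataDir.isEmpty()) {
            cmd.add("--tessdata-dir");
            cmd.add(tessdataDir);
        }
        cmd.add(inputImage);
        cmd.add(outputBase);
        return cmd;
    }

    public static void main(String[] args) {
        // Default (empty) property: same command line as before the change.
        System.out.println(buildCommand("/usr/bin/tesseract", "", "in.png", "out"));
        // Explicit property: tessdata lives in a non-standard directory.
        System.out.println(buildCommand("/path/to/tesseract", "/path/to/tessdata", "in.png", "out"));
    }
}
```

The resulting list could be handed to `new ProcessBuilder(cmd)`. With an empty default, existing setups where tesseract already knows its data directory would run exactly the same command as today.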
