This would be a very welcome change, Christian. Please create a JIRA issue at:

http://issues.apache.org/jira/browse/TIKA

And update the wiki page here

http://wiki.apache.org/tika/TikaOCR

Would be happy for you to contribute via SVN and Jira/patch and/or
from Github per here:

http://github.com/apache/tika/#contributing

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Christian Wolfe <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 22, 2015 at 7:11 PM
To: "[email protected]" <[email protected]>
Subject: TesseractOCRParser on Linux

>Hi folks,
>
>It looks to me that TesseractOCRParser doesn't work on Linux unless the
>Tesseract executable and the 'tessdata' folder are in the same location
>on the filesystem. This makes sense in a Windows environment (where
>everything is installed together by default), but in linux, package
>managers (*and* source code installations) tend to split the files up
>across the filesystem.
>
>I believe this could be alleviated by creating a second property in
>TesseractOCRConfig that points to the 'tessdata' folder separately from
>the Tesseract executable. That, or a bit of documentation that
>clarifies that the files need to be together.
>
>I would be more than willing to work on either solution, but only if the
>team considered it worthwhile.
>
>Anyway, thanks for making a great library, and for taking time to read
>this.
>
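
As a rough illustration of the second-property idea in the quoted message, here is a minimal sketch. It is not Tika's TesseractOCRConfig or TesseractOCRParser code: the class OcrPathsSketch, its setters, and buildCommand are all hypothetical names made up for this example. The only non-hypothetical piece is the TESSDATA_PREFIX environment variable, which Tesseract reads to locate its language data. Whether that variable should name the 'tessdata' folder itself or its parent depends on the Tesseract version, so treat the value used here as an assumption to check against your install.

```java
import java.nio.file.Paths;

// Hypothetical stand-in for a config object with two separate paths.
// This is NOT Tika's TesseractOCRConfig; all names here are illustrative.
class OcrPathsSketch {
    // Directory holding the tesseract executable; empty means "use PATH".
    private String tesseractPath = "";
    // Location of the language data, configured independently (assumed new property).
    private String tessdataPath = "";

    void setTesseractPath(String path) {
        this.tesseractPath = (path == null) ? "" : path;
    }

    void setTessdataPath(String path) {
        this.tessdataPath = (path == null) ? "" : path;
    }

    // Builds, but does not start, the process that would run Tesseract.
    ProcessBuilder buildCommand(String inputImage, String outputBase) {
        String exe = tesseractPath.isEmpty()
                ? "tesseract"
                : Paths.get(tesseractPath, "tesseract").toString();
        ProcessBuilder pb = new ProcessBuilder(exe, inputImage, outputBase);
        // Only override the environment when a data path was configured,
        // so installs that keep everything together are left untouched.
        if (!tessdataPath.isEmpty()) {
            pb.environment().put("TESSDATA_PREFIX", tessdataPath);
        }
        return pb;
    }

    public static void main(String[] args) {
        OcrPathsSketch cfg = new OcrPathsSketch();
        cfg.setTesseractPath("/usr/bin");                  // example path only
        cfg.setTessdataPath("/usr/share/tesseract-ocr");   // example path only
        ProcessBuilder pb = cfg.buildCommand("page.png", "out");
        System.out.println(pb.command());
        System.out.println(pb.environment().get("TESSDATA_PREFIX"));
    }
}
```

Keeping the two paths independent means a package-manager install, where the binary and the data live in different directories, only needs the second setter, while a bundled install can leave it empty.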
