Thank you for reporting this; it's definitely a bug. Feel free to contribute by creating an issue in JIRA and submitting a patch or a pull request on GitHub, as Chris suggested.
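
To make the direction discussed in the thread below concrete, here is a minimal sketch, not Tika's actual code. Everything named here is an assumption for illustration: `Config` is a hypothetical stand-in for TesseractOCRConfig, `getTessdataPath()` is a property that does not exist yet, and `buildCommand` is a made-up helper. It only shows the idea that `--tessdata-dir` is added to the command line when the tessdata path is non-empty, so an empty default leaves tesseract to use its own configured location.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class TessdataDirSketch {

    /** Hypothetical stand-in for TesseractOCRConfig; not the real Tika class. */
    public static final class Config {
        private final String tesseractPath; // directory holding the tesseract executable
        private final String tessdataPath;  // hypothetical new property; "" means "not set"

        public Config(String tesseractPath, String tessdataPath) {
            this.tesseractPath = tesseractPath;
            this.tessdataPath = tessdataPath;
        }

        public String getTesseractPath() { return tesseractPath; }

        public String getTessdataPath() { return tessdataPath; }
    }

    /**
     * Builds the tesseract command line. The --tessdata-dir argument is
     * appended only when a tessdata path was configured explicitly.
     */
    public static List<String> buildCommand(Config config, String input, String outputBase) {
        List<String> cmd = new ArrayList<>();
        String dir = config.getTesseractPath();
        // Fall back to whatever "tesseract" resolves to on PATH when no directory is set.
        cmd.add(dir.isEmpty() ? "tesseract" : new File(dir, "tesseract").getPath());
        if (!config.getTessdataPath().isEmpty()) {
            cmd.add("--tessdata-dir");
            cmd.add(config.getTessdataPath());
        }
        cmd.add(input);
        cmd.add(outputBase);
        return cmd;
    }

    public static void main(String[] args) {
        // Paths taken from the report in this thread: binaries and tessdata live apart.
        Config config = new Config("/usr/local/bin", "/usr/local/share/tessdata");
        System.out.println(String.join(" ", buildCommand(config, "page.png", "page")));
    }
}
```

Keeping the executable directory and the tessdata directory as two independent settings means neither one has to be guessed from the other, which is the situation reported for source builds and package-manager installs on Linux.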

On Thu, Jul 23, 2015 at 15:36, Christian Wolfe <[email protected]> wrote:

> Hello Gribov,
>
> I built tesseract from source, and it installed binaries into
> /usr/local/bin, and tessdata into /usr/local/share/tessdata. It works fine
> when I run Tesseract commands straight from the command line.
>
> When I run it through Tika, I create a TesseractOCRConfig object, and set
> the Tesseract Path property to '/usr/local/bin'. This seems to trip things
> up in the following method in TesseractOCRParser:
>
>     private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
>         if (!config.getTesseractPath().isEmpty()) {
>             Map<String, String> env = pb.environment();
>             env.put("TESSDATA_PREFIX", config.getTesseractPath());
>         }
>     }
>
> This indicates that if you manually set the Tesseract path, it assumes
> that the executable and the tessdata folder will be located together, and
> if you *don't* set it, it will use the system default values. Sure enough,
> I ran my code without explicitly setting the path, and it worked.
>
> That's good to know, but there could be situations in which these files
> are installed to nonstandard paths, or a situation in which a user wants to
> manually set the path based on what OS they're running in.
>
> On Thu, Jul 23, 2015 at 5:17 AM, Konstantin Gribov <[email protected]>
> wrote:
>
>> Hi, Christian.
>>
>> I have the tesseract binary in /usr/bin (which is in PATH), its data in
>> /usr/share/tessdata, and OCR works fine in the current Tika release.
>>
>> Does it fail if the tessdata path is properly configured when building
>> tesseract (default is $PREFIX/share/tessdata)? Or does OCR fail only when
>> tessdata is placed in a non-standard directory, which requires always
>> starting tesseract with
>> `/path/to/tesseract --tessdata-dir /path/to/tessdata ...`?
>>
>> Providing an explicit property for tessdata-dir shouldn't break anything
>> if its default is an empty string and the `--tessdata-dir` argument isn't
>> added to the command line when it's empty.
>>
>> On Thu, Jul 23, 2015 at 7:42, Mattmann, Chris A (3980)
>> <[email protected]> wrote:
>>
>>> This would be a very welcome change, Christian. Please create a JIRA
>>> issue at:
>>>
>>> http://issues.apache.org/jira/browse/TIKA
>>>
>>> And update the wiki page here: http://wiki.apache.org/tika/TikaOCR
>>>
>>> Would be happy for you to contribute via SVN and Jira/patch and/or
>>> from GitHub per here:
>>>
>>> http://github.com/apache/tika/#contributing
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: [email protected]
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>> -----Original Message-----
>>> From: Christian Wolfe <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Wednesday, July 22, 2015 at 7:11 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: TesseractOCRParser on Linux
>>>
>>> >Hi folks,
>>> >
>>> >It looks to me that TesseractOCRParser doesn't work on Linux unless the
>>> >Tesseract executable and the 'tessdata' folder are in the same location
>>> >on the filesystem. This makes sense in a Windows environment
>>> >(where everything is installed together by default), but on Linux,
>>> >package managers (*and* source code installations) tend to split the
>>> >files up across the filesystem.
>>> >
>>> >I believe this could be alleviated by creating a second property in
>>> >TesseractOCRConfig that points to the 'tessdata' folder separately from
>>> >the Tesseract executable. That, or a bit of documentation
>>> >that clarifies that the files need to be together.
>>> >
>>> >I would be more than willing to work on either solution, but only if the
>>> >team considered it worthwhile.
>>> >
>>> >Anyway, thanks for making a great library, and for taking time to read
>>> >this.
>>> >
>>>
>> --
>> Best regards,
>> Konstantin Gribov
>>
>

--
Best regards,
Konstantin Gribov
