Hello Gribov,
I built tesseract from source, and it installed binaries into
/usr/local/bin, and tessdata into /usr/local/share/tessdata. It works fine
when I run Tesseract commands straight from the command line.
When I run it through Tika, I create a TesseractOCRConfig object, and set
the Tesseract Path property to '/usr/local/bin'. This seems to trip things
up in the following method in TesseractOCRParser:
    private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
        if (!config.getTesseractPath().isEmpty()) {
            Map<String, String> env = pb.environment();
            env.put("TESSDATA_PREFIX", config.getTesseractPath());
        }
    }
This indicates that if you manually set the Tesseract path, the parser
assumes that the executable and the tessdata folder are located together,
and if you *don't* set it, Tesseract falls back to its system default
values. Sure enough, I ran my code without explicitly setting the path, and
it worked.
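
To make the mismatch concrete, here is a minimal standalone sketch that mirrors the logic of the method quoted above. It is not the Tika class itself: the class name SetEnvDemo is hypothetical, and the config value is passed in as a plain string so the example runs without Tika on the classpath. No process is started; it only inspects the environment map.

```java
import java.util.Map;

public class SetEnvDemo {
    // Same logic as the quoted setEnv, with the configured path passed
    // in directly instead of read from a TesseractOCRConfig.
    public static void setEnv(String tesseractPath, ProcessBuilder pb) {
        if (!tesseractPath.isEmpty()) {
            Map<String, String> env = pb.environment();
            env.put("TESSDATA_PREFIX", tesseractPath);
        }
    }

    public static void main(String[] args) {
        // Path set explicitly: TESSDATA_PREFIX ends up pointing at the
        // directory that holds the executable, not at tessdata.
        ProcessBuilder withPath = new ProcessBuilder("tesseract");
        setEnv("/usr/local/bin", withPath);
        System.out.println(withPath.environment().get("TESSDATA_PREFIX"));
        // prints /usr/local/bin

        // Path left empty: the map is untouched, so the child process
        // would see whatever TESSDATA_PREFIX the parent already had.
        ProcessBuilder noPath = new ProcessBuilder("tesseract");
        setEnv("", noPath);
        System.out.println(noPath.environment().get("TESSDATA_PREFIX"));
    }
}
```

With my install, the first case points Tesseract at /usr/local/bin while the data actually lives in /usr/local/share/tessdata.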
That's good to know, but there could be situations in which these files are
installed to nonstandard paths, or a situation in which a user wants to
manually set the path based on what OS they're running in.
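
One possible shape for handling those cases, as a rough sketch only: a second, separate tessdata setting that takes precedence when it is set, with the current behaviour kept as the fallback. The tessdataPath parameter and the class name TessdataEnvSketch are hypothetical and do not exist in TesseractOCRConfig today. Whether TESSDATA_PREFIX should name the tessdata folder itself or its parent depends on the Tesseract version, so the value here is simply passed through as configured.

```java
import java.util.Map;

public class TessdataEnvSketch {
    // tessdataPath is a proposed, hypothetical setting. When it is empty,
    // the method behaves like the existing code and reuses tesseractPath.
    public static void setEnv(String tesseractPath, String tessdataPath,
                              ProcessBuilder pb) {
        Map<String, String> env = pb.environment();
        if (!tessdataPath.isEmpty()) {
            env.put("TESSDATA_PREFIX", tessdataPath);
        } else if (!tesseractPath.isEmpty()) {
            env.put("TESSDATA_PREFIX", tesseractPath);
        }
        // Both empty: leave the environment alone so Tesseract uses its
        // own defaults.
    }

    public static void main(String[] args) {
        ProcessBuilder pb = new ProcessBuilder("tesseract");
        setEnv("/usr/local/bin", "/usr/local/share/tessdata", pb);
        System.out.println(pb.environment().get("TESSDATA_PREFIX"));
        // prints /usr/local/share/tessdata
    }
}
```

Passing the folder with the `--tessdata-dir` argument you mention below, and only when the setting is non-empty, would be another route to the same result.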
On Thu, Jul 23, 2015 at 5:17 AM, Konstantin Gribov <[email protected]>
wrote:
> Hi, Christian.
>
> I have the tesseract binary in /usr/bin (which is in PATH) and its data in
> /usr/share/tessdata, and OCR works fine in the current Tika release.
>
> Does it fail if the tessdata path is properly configured when building
> tesseract (the default is $PREFIX/share/tessdata)? Or does OCR fail only
> when tessdata is placed in a non-standard directory, which always requires
> starting tesseract with
> `/path/to/tesseract --tessdata-dir /path/to/tessdata ...`?
>
> Providing an explicit property for tessdata-dir shouldn't break anything if
> its default is an empty string and the `--tessdata-dir` argument isn't
> added to the command line when it's empty.
>
>
> On Thu, Jul 23, 2015 at 7:42, Mattmann, Chris A (3980) <
> [email protected]> wrote:
>
>> This would be a very welcome change, Christian. Please create a JIRA
>> issue at:
>>
>> http://issues.apache.org/jira/browse/TIKA
>>
>> And update the wiki page here http://wiki.apache.org/tika/TikaOCR
>>
>> I would be happy for you to contribute via SVN and Jira/patch and/or
>> from GitHub as described here:
>>
>> http://github.com/apache/tika/#contributing
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Christian Wolfe <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, July 22, 2015 at 7:11 PM
>> To: "[email protected]" <[email protected]>
>> Subject: TesseractOCRParser on Linux
>>
>> >Hi folks,
>> >
>> >It looks to me that TesseractOCRParser doesn't work on Linux unless the
>> >Tesseract executable and the 'tessdata' folder are in the same location
>> >on the filesystem. This makes sense in a Windows environment (where
>> >everything is installed together by default), but on Linux, package
>> >managers (*and* source code installations) tend to split the files up
>> >across the filesystem.
>> >
>> >
>> >I believe this could be alleviated by creating a second property in
>> >TesseractOCRConfig that points to the 'tessdata' folder separately from
>> >the Tesseract executable. That, or a bit of documentation that clarifies
>> >that the files need to be together.
>> >
>> >
>> >I would be more than willing to work on either solution, but only if the
>> >team considered it worthwhile.
>> >
>> >
>> >Anyway, thanks for making a great library, and for taking time to read
>> >this.
>> >
>>
>> --
> Best regards,
> Konstantin Gribov
>