Thank you for reporting this. It's definitely a bug.

Feel free to contribute by creating an issue in JIRA and attaching a patch, or
by opening a pull request on GitHub, as Chris suggested.
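
For anyone picking this up, here is a rough sketch of the idea discussed in the quoted messages: keep the executable path and the tessdata path as two separate settings, and pass `--tessdata-dir` only when the second one is non-empty. The class name, method name and parameters are all hypothetical and are not taken from Tika's code. The file names are placeholders, and it assumes an installed tesseract that supports `--tessdata-dir`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these names come from Tika's actual API.
class TessdataPathSketch {

    /**
     * Builds a tesseract command line from two independent settings.
     * An empty tesseractPath means "find the binary on PATH"; an empty
     * tessdataPath means "let tesseract use its built-in default".
     */
    static List<String> buildCommand(String tesseractPath, String tessdataPath,
                                     String input, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath.isEmpty() ? "tesseract" : tesseractPath + "/tesseract");
        // Add the option only when a data directory was configured explicitly,
        // so the default (empty) setting leaves the command line unchanged.
        if (!tessdataPath.isEmpty()) {
            cmd.add("--tessdata-dir");
            cmd.add(tessdataPath);
        }
        cmd.add(input);
        cmd.add(outputBase);
        return cmd;
    }

    public static void main(String[] args) {
        // Binary and data in different directories, as in Christian's setup.
        System.out.println(buildCommand("/usr/local/bin", "/usr/local/share/tessdata",
                "in.png", "out"));
        // Nothing configured: rely on PATH and the compiled-in tessdata location.
        System.out.println(buildCommand("", "", "in.png", "out"));
    }
}
```

The resulting list could be handed to a `ProcessBuilder`. Whether the real fix should use the command-line option or the `TESSDATA_PREFIX` environment variable is left open here.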

On Thu, Jul 23, 2015 at 15:36, Christian Wolfe <[email protected]> wrote:

> Hello Gribov,
>
> I built tesseract from source, and it installed binaries into
> /usr/local/bin, and tessdata into /usr/local/share/tessdata. It works fine
> when I run Tesseract commands straight from the command line.
>
> When I run it through Tika, I create a TesseractOCRConfig object, and set
> the Tesseract Path property to '/usr/local/bin'. This seems to trip things
> up in the following method in TesseractOCRParser:
>
> private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
>   if (!config.getTesseractPath().isEmpty()) {
>     Map<String, String> env = pb.environment();
>     env.put("TESSDATA_PREFIX", config.getTesseractPath());
>   }
> }
>
> Which indicates that if you manually set the Tesseract path, it assumes
> that the executable and the tessdata folder will be located together, and
> if you *don't* set it, it will use the system default values. Sure enough,
> I ran my code without explicitly setting the path, and it worked.
>
> That's good to know, but there could be situations in which these files
> are installed to nonstandard paths, or a situation in which a user wants to
> manually set the path based on what OS they're running in.
>
>
>
> On Thu, Jul 23, 2015 at 5:17 AM, Konstantin Gribov <[email protected]>
> wrote:
>
>> Hi, Christian.
>>
>> I have the tesseract binary in /usr/bin (which is in PATH) and its data in
>> /usr/share/tessdata, and OCR works fine in the current Tika release.
>>
>> Does it fail when the tessdata path is properly configured at tesseract
>> build time (default is $PREFIX/share/tessdata)? Or does OCR fail only when
>> tessdata is placed in a non-standard directory, which requires always
>> starting tesseract with `/path/to/tesseract --tessdata-dir /path/to/tessdata ...`?
>>
>> Providing an explicit property for tessdata-dir shouldn't break anything if
>> its default is an empty string and the `--tessdata-dir` argument isn't added
>> to the command line when the property is empty.
>>
>>
>> On Thu, Jul 23, 2015 at 7:42, Mattmann, Chris A (3980) <
>> [email protected]> wrote:
>>
>>> This would be a very welcome change, Christian. Please create a JIRA
>>> issue at:
>>>
>>> http://issues.apache.org/jira/browse/TIKA
>>>
>>> And update the wiki page here http://wiki.apache.org/tika/TikaOCR
>>>
>>> Would be happy for you to contribute via SVN and Jira/patch and/or
>>> from Github per here:
>>>
>>> http://github.com/apache/tika/#contributing
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: [email protected]
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian Wolfe <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Wednesday, July 22, 2015 at 7:11 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: TesseractOCRParser on Linux
>>>
>>> >Hi folks,
>>> >
>>> >It looks to me that TesseractOCRParser doesn't work on Linux unless the
>>> >Tesseract executable and the 'tessdata' folder are in the same location
>>> >on the filesystem. This makes sense in a Windows environment
>>> >(where everything is installed together by default), but on Linux,
>>> >package managers (*and* source code installations) tend to split the
>>> >files up across the filesystem.
>>> >
>>> >
>>> >I believe this could be alleviated by creating a second property in
>>> >TesseractOCRConfig that points to the 'tessdata' folder separately from
>>> >the Tesseract executable. That, or a bit of documentation
>>> >that clarifies that the files need to be together.
>>> >
>>> >
>>> >I would be more than willing to work on either solution, but only if the
>>> >team considered it worthwhile.
>>> >
>>> >
>>> >Anyway, thanks for making a great library, and for taking time to read
>>> >this.
>>> >
>>>
>> --
>> Best regards,
>> Konstantin Gribov
>>
>
--
Best regards,
Konstantin Gribov
