Hi, Christian.

I have the tesseract binary in /usr/bin (which is in PATH) and its data in
/usr/share/tessdata, and OCR works fine in the current Tika release.

Does it fail even when the tessdata path was properly configured when
building tesseract (the default is $PREFIX/share/tessdata)? Or does OCR fail
only when tessdata is placed in a non-standard directory, which requires
always starting tesseract with
`/path/to/tesseract --tessdata-dir /path/to/tessdata ...`?

Providing an explicit property for tessdata-dir shouldn't break anything, as
long as its default is an empty string and the `--tessdata-dir` argument is
not added to the command line when the property is empty.
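
To illustrate, here is a minimal sketch of that idea. The class and method
names (`TessCommand`, `build`) and the input/output arguments are made up for
the example and are not taken from Tika's code; only the `--tessdata-dir`
flag and its position right after the executable come from the command shown
above.

```java
import java.util.ArrayList;
import java.util.List;

public class TessCommand {

    // Hypothetical helper, not part of Tika: builds the tesseract command line.
    // tessdataDir is the proposed optional property; empty (or null) means
    // "let tesseract use its compiled-in default".
    static List<String> build(String tesseractBinary, String tessdataDir,
                              String inputImage, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractBinary);
        // Add the flag only when the property is actually set.
        if (tessdataDir != null && !tessdataDir.isEmpty()) {
            cmd.add("--tessdata-dir");
            cmd.add(tessdataDir);
        }
        cmd.add(inputImage);
        cmd.add(outputBase);
        return cmd;
    }

    public static void main(String[] args) {
        // Default (empty) property: the command line stays as it is today.
        System.out.println(build("tesseract", "", "page.png", "page"));
        // Explicit property: the flag and path are inserted.
        System.out.println(build("tesseract", "/path/to/tessdata", "page.png", "page"));
    }
}
```

With an empty default, existing setups like mine would run exactly the same
command as before.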


On Thu, 23 July 2015 at 7:42, Mattmann, Chris A (3980) <
[email protected]>:

> This would be a very welcome change, Christian. Please create a JIRA
> issue at:
>
> http://issues.apache.org/jira/browse/TIKA
>
> And update the wiki page here http://wiki.apache.org/tika/TikaOCR
>
> Would be happy for you to contribute via SVN and Jira/patch and/or
> from Github per here:
>
> http://github.com/apache/tika/#contributing
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: Christian Wolfe <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, July 22, 2015 at 7:11 PM
> To: "[email protected]" <[email protected]>
> Subject: TesseractOCRParser on Linux
>
> >Hi folks,
> >
> >It looks to me that TesseractOCRParser doesn't work on Linux unless the
> >Tesseract executable and the 'tessdata' folder are in the same location
> >on the filesystem. This makes sense in a Windows environment (where
> >everything is installed together by default), but on Linux, package
> >managers (*and* source code installations) tend to split the files up
> >across the filesystem.
> >
> >
> >I believe this could be alleviated by creating a second property in
> >TesseractOCRConfig that points to the 'tessdata' folder separately from
> >the Tesseract executable. That, or a bit of documentation that clarifies
> >that the files need to be together.
> >
> >
> >I would be more than willing to work on either solution, but only if the
> >team considered it worthwhile.
> >
> >
> >Anyway, thanks for making a great library, and for taking time to read
> >this.
> >
>
> --
Best regards,
Konstantin Gribov
