No, it doesn’t allow for the scripts.  And I’m still trying to confirm what the 
syntax is supposed to be.

This page,
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES,
implies that the -l option accepts the name of a language or script.  I
assumed it would look in tessdata first and, if the name was not found there,
would look in tessdata/script.  But it seems you have to enter the path.
For example:

[inline image not preserved]



From: Tim Allison <[email protected]>
Sent: Thursday, January 28, 2021 1:09 PM
To: [email protected]
Subject: Re: {EXTERNAL}Invalid language code

>Tika uses a regular expression to validate the language string, assuming it is
>a set of ISO 639-2 language codes separated by plus signs.

Are we allowing scripts in the language string?  If not, we need to fix the 
regex. Thank you, again!

On Wed, Jan 27, 2021 at 8:51 PM Peter Kronenberg <[email protected]> wrote:
Different, but related issue.  It seems that Tika doesn’t support Tesseract 
scripts.  Looks like this came out with version 4.0.0.  See 
https://github.com/manisandro/gImageReader/issues/323

In the Tessdata directory there is a directory called script.  These are 
pseudo-language files that define the script or alphabet of the language.  See 
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES
 and https://github.com/tesseract-ocr/tessdata/tree/master/script

Right now, Tika uses a regular expression to validate the language string,
assuming it is a set of ISO 639-2 language codes separated by plus signs.
In light of my previous comment about validating that the language (or script)
file exists, I suggest splitting the language string on the plus sign and not
doing any other validation of the string, but instead actually checking that
each file exists in either tessdata or tessdata/script.
If any of them doesn’t exist, a message would be put in the metadata.
(Which brings me to another issue: I think some of the warnings that Tika
puts out should go into the metadata, perhaps with a tag of x-message, to make
it easier to pass information back programmatically, since the warnings just go
to the console and aren’t passed back to the caller.  But that’s another issue.)
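
To make the suggestion concrete, here is a rough sketch of the check I have in mind. It is not Tika code: the class and method names are made up for illustration, and it assumes each language or script is stored as a file named <name>.traineddata, either directly in tessdata or in tessdata/script.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
final class TessLanguageCheck {

    /**
     * Splits a "+"-separated language string and returns the names that have
     * no matching file in tessdataDir or tessdataDir/script.
     * Assumes data files are named "<name>.traineddata".
     */
    static List<String> missingLanguages(String languageSpec, Path tessdataDir) {
        List<String> missing = new ArrayList<>();
        for (String name : languageSpec.split("\\+")) {
            if (name.isEmpty()) {
                continue; // ignore empty parts such as the one in "+eng"
            }
            String fileName = name + ".traineddata";
            Path direct = tessdataDir.resolve(fileName);
            Path inScript = tessdataDir.resolve("script").resolve(fileName);
            if (!Files.isRegularFile(direct) && !Files.isRegularFile(inScript)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

The caller could then put one message per missing name into the metadata. Since a name is resolved relative to tessdata, something like script/Latin is also found when it is given as a path. One caveat of dropping the regular expression entirely is that nothing here rejects names containing "..", so some minimal sanity check on each part may still be worth keeping.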

Thoughts?



From: Peter Kronenberg <[email protected]>
Sent: Wednesday, January 27, 2021 4:03 PM
To: [email protected]
Subject: {EXTERNAL}Invalid language code

If I pass in a nonexistent language code (i.e., the code matches the regular
expression, but there is no corresponding language file in Tessdata), I am not
getting any error message.  If I do it from the command line with Tesseract, I
get an error, but with Tika, I’m not seeing any error in the logs.  Not sure
why the error from Tesseract is not being displayed somewhere.  Tika just
blindly calls Tesseract but then doesn’t get any output back.  Is that the
expected behavior?
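
For what it’s worth, this is roughly the behavior I would have expected, as a sketch only. It is not how Tika actually invokes Tesseract; the class and method names are made up, and it assumes Tesseract signals the problem through a nonzero exit code and/or text on stderr.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only; not Tika's actual code.
final class OcrProcessRunner {

    /** Exit code and captured stderr of a finished process. */
    static final class Result {
        final int exitCode;
        final String stderr;

        Result(int exitCode, String stderr) {
            this.exitCode = exitCode;
            this.stderr = stderr;
        }
    }

    /** Runs the command, discards stdout, and captures stderr. */
    static Result run(List<String> command) {
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
            Process process = builder.start();
            // Read stderr to the end before waiting, so the child
            // cannot block on a full pipe.
            String stderr = new String(
                    process.getErrorStream().readAllBytes(), StandardCharsets.UTF_8);
            return new Result(process.waitFor(), stderr.trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for OCR process", e);
        }
    }

    /** Message for a failed run that the caller can log or return; empty on success. */
    static Optional<String> failureMessage(Result result) {
        if (result.exitCode == 0) {
            return Optional.empty();
        }
        String detail = result.stderr.isEmpty() ? "(no error output)" : result.stderr;
        return Optional.of("OCR process exited with code " + result.exitCode + ": " + detail);
    }
}
```

A call such as OcrProcessRunner.run(List.of("tesseract", "page.png", "out", "-l", "xyz")), with placeholder file names, would then give the caller something to log or pass back instead of nothing.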
