Hello,
I have a question concerning Apache Tika / Tesseract.
I need to OCR some poor-quality documents that mix several languages,
e.g. German, Polish and English.
At first I ran OCR with all of the languages enabled at once, because I thought
it would be faster (I set them via the HTTP header "X-Tika-OCRLanguage").
I noticed that some characters were misidentified, so I thought the result would
be better after reducing the set of languages to those that actually appear
in the document (I checked this manually).
But after reducing the languages, some characters are now replaced with wrong
ones even though they were identified correctly in the first run.
Example:
The correct spelling: DÖNER
1st try: HTTP header "X-Tika-OCRLanguage" includes English, Polish, German
the result: DÖNĘR
2nd try: HTTP header "X-Tika-OCRLanguage" includes only English and German
the result: DONER
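In case it helps to reproduce this, here is a minimal sketch of how the two
tries could be sent to a Tika server. It is not the exact client used above.
It assumes the server listens on http://localhost:9998/tika, that the languages
are given as the Tesseract codes eng, pol and deu joined with "+", and the
helper names and the file name are hypothetical.

```python
import urllib.request

# Assumption: a tika-server instance on its default local address.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(data, languages, url=TIKA_URL):
    """Build (but do not send) a PUT request that asks Tika to OCR `data`.

    `languages` is a list of Tesseract language codes, e.g. ["eng", "deu"].
    They are joined with "+" into the X-Tika-OCRLanguage header value.
    """
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("X-Tika-OCRLanguage", "+".join(languages))
    req.add_header("Accept", "text/plain")  # ask for plain-text output
    return req


def ocr_file(path, languages, url=TIKA_URL):
    """Send the file at `path` to Tika and return the extracted text."""
    with open(path, "rb") as f:
        req = build_ocr_request(f.read(), languages, url)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


# Hypothetical usage (needs a running server and a real file):
#   first = ocr_file("scan.png", ["eng", "pol", "deu"])   # 1st try
#   second = ocr_file("scan.png", ["eng", "deu"])         # 2nd try
```

Running both calls on the same file and comparing the two outputs side by side
should show exactly which characters change between the two language sets.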
I don't know what is going on. Does anyone know how to explain this
issue? Is there anything I could do to improve the outcome?
Best regards,
Kasia