question

katarzyna_malinowska1 Thu, 25 Aug 2022 04:17:03 -0700
Hello, 
 
I have a question concerning Apache Tika / Tesseract.
 
I need to OCR some poor-quality documents that contain text in different 
languages, e.g. German, Polish and English.
At first I ran OCR with all of the languages at once (via the HTTP header 
"X-Tika-OCRLanguage") because I thought it would be faster.
I noticed that some characters were misidentified, so I expected the result to 
improve after reducing the languages to those that actually appear in the 
document (I checked this manually). 
But after reducing the languages, some characters that were recognized 
correctly the first time are now replaced with other characters.
 
Example:
Correct spelling: DÖNER
1st try: HTTP header "X-Tika-OCRLanguage" includes English, Polish and German
Result: DÖNĘR 
2nd try: HTTP header "X-Tika-OCRLanguage" includes only English and German 
Result: DONER 
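
For reference, a minimal sketch of how the two tries could be sent to 
tika-server. The server URL (localhost, default port 9998), the file name 
scan.png and the helper names are assumptions for illustration only; eng, pol 
and deu are the Tesseract codes for English, Polish and German.

```python
import urllib.request

# Assumption: tika-server runs locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(data, languages):
    """Build a PUT request asking Tika to OCR `data` with the given languages."""
    headers = {
        # Tesseract takes several languages joined with '+', e.g. "eng+deu".
        "X-Tika-OCRLanguage": "+".join(languages),
        "Accept": "text/plain",
    }
    return urllib.request.Request(TIKA_URL, data=data, headers=headers, method="PUT")


def run_ocr(path, languages):
    """Send the file at `path` to tika-server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_ocr_request(f.read(), languages)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # "scan.png" is a hypothetical file name.
    print(run_ocr("scan.png", ["eng", "pol", "deu"]))  # 1st try
    print(run_ocr("scan.png", ["eng", "deu"]))         # 2nd try
```
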
I don't know what is going on. Does anyone know how to explain this 
issue? Is there anything I could do to improve the outcome? 
 
Best regards,
Kasia