Hi,
This sounds more like a question for the Tesseract people. Please try
running your document through Tesseract alone to see what kind of output
you get.
It is only a Tika problem if you get good output with Tesseract but not
with Tika while using the same options.
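
A minimal sketch of such a standalone check, assuming the `tesseract`
binary is on the PATH, the page has been saved as an image, and the
language codes are `eng`, `pol` and `deu` (the file name `scan.png` is
hypothetical). Tika may pass further settings to Tesseract besides the
language, so matching the language list alone may not reproduce Tika's
output exactly.

```python
import os
import shutil
import subprocess


def tesseract_command(image_path, languages):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    # Tesseract accepts several languages joined with '+', e.g. "eng+deu".
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_tesseract(image_path, languages):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    image = "scan.png"  # hypothetical file name, replace with your own page
    if shutil.which("tesseract") is None or not os.path.exists(image):
        print("tesseract or the sample image is missing, nothing to run")
    else:
        # Same page, two language sets, so the outputs can be compared.
        for langs in (["eng", "pol", "deu"], ["eng", "deu"]):
            print("+".join(langs))
            print(run_tesseract(image, langs))
```

If the two language sets already give different results here, the
difference comes from Tesseract and not from Tika.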
Tilman
On 25.08.2022 at 13:16, [email protected] wrote:
Hello,
I have a question concerning Apache Tika / Tesseract.
I need to OCR some poor-quality documents that contain different
alphabets, e.g. German/Polish/English.
At first I ran OCR with all of the alphabets because I thought it
would be faster (I mean: by using the HTTP header "X-Tika-OCRLanguage").
I noticed that some characters were misidentified, so I thought that
the result would be better after reducing the number of alphabets to
those that actually appear in the document (I checked it manually).
But after reducing the languages, some characters that were identified
correctly the first time are now replaced with other characters.
Example:
The correct spelling: DÖNER
1st try: HTTP header "X-Tika-OCRLanguage" includes English, Polish, German
the result: DÖNĘR
2nd try: HTTP header "X-Tika-OCRLanguage" includes only English and German
the result: DONER
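
For reference, the two tries can be sketched as requests to a Tika
server. This is only an illustration under assumptions not stated in
this message: a server listening on http://localhost:9998, the `/tika`
endpoint, and the language codes `eng`, `pol` and `deu`. The file name
in the usage note is hypothetical.

```python
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed local Tika server


def build_ocr_request(document_bytes, languages, url=TIKA_URL):
    """Build a PUT request that asks Tika to OCR with the given languages."""
    return urllib.request.Request(
        url,
        data=document_bytes,
        method="PUT",
        headers={
            # Languages are joined with '+', e.g. "eng+deu".
            "X-Tika-OCRLanguage": "+".join(languages),
            "Accept": "text/plain",
        },
    )


def extract_text(document_bytes, languages, url=TIKA_URL):
    """Send the document to the Tika server and return the extracted text."""
    request = build_ocr_request(document_bytes, languages, url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Usage would be something like `extract_text(open("scan.pdf", "rb").read(),
["eng", "deu"])`, called once per language set so that the two outputs
can be compared side by side.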
I don't know what is going on. Does anyone know how to explain this
issue? Is there anything I could do to improve the outcome?
Best regards,
Kasia