Hi,

This sounds more like a question for the Tesseract people. Please try running your document through Tesseract alone to see what kind of output you get. It's only a Tika problem if you can get good output with Tesseract but not with Tika while using the same options.
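
A rough sketch of how such a comparison could be set up, so that both runs use the same language list. The file name scan.png, the helper names, and the server URL (a Tika server on its default port 9998 on localhost) are assumptions for illustration; the language codes are the usual Tesseract traineddata names (eng, deu, pol).

```python
import urllib.request


def tesseract_command(image_path, languages):
    """Build the Tesseract CLI call for the given language codes.

    Using "stdout" as the output base makes Tesseract print the
    recognized text instead of writing it to a file.
    """
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def tika_request(image_bytes, languages, url="http://localhost:9998/tika"):
    """Build (but do not send) a PUT request to a Tika server.

    The OCR languages go into the X-Tika-OCRLanguage header, joined
    with "+" exactly as in the Tesseract call above.
    """
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",
            "X-Tika-OCRLanguage": "+".join(languages),
        },
    )


# Possible usage (needs Tesseract installed and a running Tika server):
#   import subprocess
#   langs = ["eng", "deu"]
#   direct = subprocess.run(tesseract_command("scan.png", langs),
#                           capture_output=True, text=True).stdout
#   with open("scan.png", "rb") as f:
#       req = tika_request(f.read(), langs)
#   via_tika = urllib.request.urlopen(req).read().decode("utf-8")
```

If the two outputs differ for the same language list, that points towards Tika; if both show the same wrong characters, the question belongs with Tesseract.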

Tilman

On 25.08.2022 at 13:16, [email protected] wrote:
Hello,
I have a question concerning Apache Tika / Tesseract.
I need to OCR some poor-quality documents that contain different alphabets, e.g. German/Polish/English. At first I ran OCR with all of the alphabets because I thought it would be faster (I mean: by using the HTTP header "X-Tika-OCRLanguage"). I noticed that some characters were misidentified, so I thought the result would be better after reducing the number of alphabets to those that actually appear in the document (I checked this manually). But after reducing the languages, some characters are incorrectly replaced with other characters even though they were identified correctly the first time.
Example:
The correct spelling: DÖNER
1st try: HTTP header "X-Tika-OCRLanguage" includes English, Polish, German
the result: DÖNĘR
2nd try: HTTP header "X-Tika-OCRLanguage" includes only English and German
the result: DONER
I don't know what is going on. Does anyone know how to explain this issue? Is there anything I could do to improve the outcome?
Best regards,
Kasia
