Re: question

Tilman Hausherr Thu, 25 Aug 2022 19:58:11 -0700

Hi,

This sounds more like a question for the tesseract people. Please try to run your document with tesseract alone to see what kind of output you get. It's only a tika problem if you can get good output with tesseract but not with tika while using the same options.
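A rough sketch of that comparison in Python, in case it helps. It assumes the `tesseract` binary is on your PATH and that the `eng`, `deu` and `pol` traineddata files are installed; the function names and the file name `scan.png` are made up for illustration. The idea is to pass tesseract the same `+`-joined language string that goes into the "X-Tika-OCRLanguage" header, so both runs use the same options.

```python
import subprocess


def tesseract_command(image_path, languages):
    """Build a tesseract CLI call; 'stdout' as the output base prints the text."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def tika_ocr_headers(languages):
    """Build the matching header for a tika-server request (same language string)."""
    return {"X-Tika-OCRLanguage": "+".join(languages)}


def run_tesseract(image_path, languages):
    """Run tesseract directly and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage: run both language sets on the same page and compare.
# for langs in (["eng", "pol", "deu"], ["eng", "deu"]):
#     print(langs, run_tesseract("scan.png", langs))
```

If tesseract alone also turns DÖNER into DONER with only `eng` and `deu`, the behaviour comes from tesseract and not from tika.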


Tilman

On 25.08.2022 at 13:16, [email protected] wrote:

Hello,
I have a question concerning Apache TIKA / TESSERACT.
I need to OCR some poor quality documents which contain different alphabets, e.g. German/Polish/English. At first I ran OCR with all of the alphabets because I thought it would be faster (I mean: by using the HTTP header "X-Tika-OCRLanguage").
I noticed that some characters were misidentified, so I thought the result would be better after reducing the number of alphabets to those that actually appear in the document (I checked it manually). But after reducing the languages, some characters are replaced incorrectly with other characters even though they were identified correctly the first time.
Example:
The correct spelling: DÖNER
1st try: HTTP header "X-Tika-OCRLanguage" includes English, Polish, German
the result: DÖNĘR
2nd try: HTTP header "X-Tika-OCRLanguage" includes only English and German
the result: DONER
I don't know what is going on. Does anyone know how to explain this issue? Is there anything I could do to improve the outcome?
Best regards,
Kasia
