In case anyone is interested: I resolved all the problems I had (with the 
"best" frk.traineddata and the script Fraktur.traineddata) by using the 
latest Fraktur traineddata file I found on the Uni-Mannheim server.
The results were unexpectedly very good!
Here it is: 
https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/
Just download the first file in the list (which is chronologically the 
latest one), called "frak2021-0.905.traineddata".

There is a more recent file here: 
https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021_09/tessdata_fast/
("frak2021-09.traineddata"). It gave very similar results for me (some 
errors were corrected and some new ones appeared), but it added a number of 
spaces where there should be none, so I prefer the first one. This could, 
however, depend on the images you have to work on.

So just check what works better for your case.
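
To make that comparison easier, here is a minimal Python sketch that runs the same image through both models and diffs the text output. It assumes both traineddata files have been downloaded into one local directory and that the tesseract command is on the PATH; the image name "page.png" and the directory "tessdata" are placeholders of mine, not something from this thread. If the dot in "frak2021-0.905" causes trouble as a language name, renaming the file may be simpler.

```python
import difflib
import subprocess
from pathlib import Path


def tesseract_command(image, model_file, tessdata_dir):
    """Build the tesseract CLI call for one model.

    The language name passed to -l is the file name without ".traineddata".
    """
    lang = Path(model_file).stem
    return ["tesseract", str(image), "stdout",
            "--tessdata-dir", str(tessdata_dir), "-l", lang]


def run_model(image, model_file, tessdata_dir):
    """Run tesseract once and return the recognised text."""
    cmd = tesseract_command(image, model_file, tessdata_dir)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def diff_outputs(text_a, text_b):
    """Return the unified diff between two OCR results, line by line."""
    return list(difflib.unified_diff(text_a.splitlines(),
                                     text_b.splitlines(), lineterm=""))


# Example use (placeholder paths, needs tesseract installed):
# a = run_model("page.png", "frak2021-0.905.traineddata", "tessdata")
# b = run_model("page.png", "frak2021-09.traineddata", "tessdata")
# print("\n".join(diff_outputs(a, b)))
# print("spaces:", a.count(" "), "vs", b.count(" "))
```

Comparing the number of spaces is only a rough indicator of the extra-space problem mentioned above; reading the diff on a few pages is more reliable.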

Best,
I.

On Sunday, 26 March 2023 at 23:35:06 UTC+2, Isidore Paris wrote:

> Hi,
>
> Thanks for sharing!
> I have the same problem with script/Fraktur.traineddata, which is far 
> better than the plain "frk.traineddata", but I found that the wordlist 
> and the unicharset contain all the European accented characters (French, 
> Italian and Spanish: âêîôû, æ, œ, àèìòù, áéíóúñ, ¡ ¿ [and the 
> corresponding capitals]) and other useless characters (€ Þ), which are 
> absolutely unknown in old German.
> Could it be that, for Tesseract, "Fraktur" is not only for the German 
> language?
>
> I solved my problem with ">" and "<" by modifying the unicharset file and 
> replacing these characters, *in the first column only*, with "ck" and "ch" 
> (I also tried modifying the 2 fields after the # ["# ck [63 6b"], but it 
> made no difference).
> I tried the same modification on "ô" and "ó" to get "o", but it doesn't 
> work, even with a modified word list from which I removed all words 
> containing these letters.
>
> I also noticed that the word list seems to have absolutely no effect: 
> changing the list (replacing the "best" list with the "lstm" list) doesn't 
> change anything in the result…
>
> Best regards,
> Isidore.
>
> On Monday, 20 March 2023 at 19:53:01 UTC+1, [email protected] 
> wrote:
>
>> Hi,
>>
>> no, unicharambigs is not used by LSTM files. It was used in the legacy 
>> mode.
>>
>> I'm having similar problems with the Ancient Greek best traineddata: 
>> unfortunately it has been trained with some non standard characters (ά έ ή 
>> ί ό ύ ώ, instead of  ά έ ή ί ό ύ ώ). I tried fine tuning the 
>> grc.traineddata, but without much success, so, for the time being, I'm 
>> producing hOCR files, post-processing them and then using hocr-pdf to 
>> create a searchable PDF.
>>
>>
>> best,
>> andrea
>> On Monday, March 13, 2023 at 5:13:33 PM UTC+1 Isidore Paris wrote:
>>
>>> Hi,
>>> I'm doing some frk text recognition, and in my results, I have a great 
>>> number of " > ". Each one should be replaced by " ck ".
>>> I updated my frk.traineddata file (from tessdata_best repository) with a 
>>> frk.unicharambigs file (I tried both formats v1 and v2) but absolutely 
>>> nothing changed.
>>> I also tried the parameter " -c use_ambigs_for_adaption=1 " to see if 
>>> maybe it was needed, but still nothing changed, not a single character (> 
>>> and = and / are all still there).
>>>
>>> Here is the content of my v2 frk.unicharambigs file:
>>> v2
>>> > ck 1
>>> = - 1
>>> / - 1
>>>
>>> Does unicharambigs not work with LSTM files? Or did I miss some 
>>> particular or special step?
>>>
>>
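
For reference, the unicharset edit described in the quoted message above (replacing ">" and "<" in the first column only) could be sketched roughly as follows. This is only an illustration under assumptions: it treats the unicharset as plain text whose first line is the entry count and whose remaining lines start with the character followed by a space, and the sample lines in the comment are simplified placeholders, not real unicharset entries. Unpacking the unicharset from the traineddata and repacking it afterwards (for example with combine_tessdata) is not shown.

```python
def replace_first_column(lines, mapping):
    """Return unicharset lines with only the first field replaced.

    lines   -- the lines of an unpacked unicharset file, without newlines
    mapping -- e.g. {">": "ck", "<": "ch"}
    Everything after the first space, including the part after '#',
    is left untouched.
    """
    result = []
    for index, line in enumerate(lines):
        if index == 0:
            # The first line holds the number of entries; keep it as is.
            result.append(line)
            continue
        first, sep, rest = line.partition(" ")
        result.append(mapping.get(first, first) + sep + rest)
    return result


# Simplified placeholder input (not a real unicharset):
# ["3", "NULL 0 Common 0", "> 10 Common 1 # > [3e ]", "< 10 Common 2 # < [3c ]"]
```

Reading the file with encoding="utf-8", calling the function, and writing the lines back is enough to apply it to an unpacked file.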
