Re: [tesseract-ocr] Combine traineddata

Simon Mon, 20 Nov 2023 04:03:52 -0800

OK, I tried it again and have to correct myself. When I use "gdt+eng", 
"eng" seems to be the dominant traineddata: no matter which order I put 
them in, the result is always the same as if I had used only "eng". "eng" 
on its own works fine. I downloaded the "eng" traineddata from the 
tessdata_best repository on GitHub. I am using Tesseract 4.1.1, so my 
generated "gdt" traineddata should be compatible with the traineddata from 
tessdata_best.
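
For anyone who wants to reproduce the comparison, here is a minimal sketch 
in Python. It only wraps the standard "tesseract IMAGE stdout -l LANGS" call 
and groups the language combinations that return identical text. It assumes 
tesseract is on the PATH and that both gdt.traineddata and eng.traineddata 
sit in the active tessdata directory; the helper names and the image file 
name are hypothetical.

```python
import subprocess


def build_command(image_path, languages):
    """Build the tesseract call; languages are joined with '+' for -l."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path, languages):
    """Run tesseract and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(
        build_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def summarize(results):
    """Group language combinations that produced identical text."""
    groups = {}
    for combo, text in results.items():
        groups.setdefault(text, []).append(combo)
    return sorted(sorted(combos) for combos in groups.values())


def compare(image_path,
            combos=(("eng",), ("gdt",), ("gdt", "eng"), ("eng", "gdt"))):
    """OCR one image with each combination and report which outputs match."""
    results = {"+".join(c): run_ocr(image_path, list(c)) for c in combos}
    return summarize(results)
```

Calling compare("sample.tif") returns a list of groups; if "eng", "gdt+eng" 
and "eng+gdt" all land in the same group, that matches the behaviour 
described above.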


[email protected] wrote on Monday, 20 November 2023 at 10:27:23 UTC+1:

> Going out on a limb here, but does '-l eng' on its own deliver any text 
> for you?
>
> The next thing I would look into, if I were you, is whether your 'eng' 
> traineddata has the same (LSTM aka v4, I suppose) support listed as your 
> gdt traineddata. I've seen cases where those do not align.
>
> There's a tesseract tool to list the traineddata engine features (forgot 
> the name/CLI args, sorry) and one to merge traineddata files 
> (combine_something, but I have to look it up, so you'll be as fast as me 
> with Google + doc search), but my *hunch* is that you won't need the 
> combine tool; what I've seen so far is that tesseract picks an engine (the 
> --oem setting drives this, IIRC) and then pumps the image through all 
> loaded languages on a segment-by-segment basis. (IIRC, so YMMV ;-) )
>
> (The bit I'm wondering about now myself is: there was some sort of 
> criterion in there, in the code, for when to decide to try? or use? 
> multiple lang results; it just /might/ be that's causing trouble, but I 
> would have to dig deep into the code for that and it doesn't rate above 
> "wild crazy guess" anyway, so better take the sane route and check that 
> your installed 'eng' database is doing what it's supposed to, on its own, 
> first.)
>
> The next sane thing to try is flipping them around, ie "eng+gdt" instead 
> of "gdt+eng", to see if results change and /how/, as that might give us all 
> a hint about what's going on in there.
>
> On Mon, 20 Nov 2023, 09:23 Simon, <[email protected]> wrote:
>
>> Hello everybody,
>>
>> right now I am working with tesseract to train it on new symbols. For 
>> that I used TIFF images, each containing only the desired symbol. I 
>> trained with the tesstrain repository and about 4000 training images. At 
>> the end of the procedure I got the traineddata file for my model 
>> Common_gdt. 
>> Besides the symbol(s) I trained in the Common_gdt model, numbers should 
>> also be recognized. Obviously, if I use only Common_gdt, Tesseract 
>> recognizes the symbols it was trained on but no numbers. 
>> To solve this problem I used -l Common_gdt+eng, which should use both 
>> traineddata files. But when I use the files like this, it is as if "eng" 
>> doesn't do anything. The results are the same as if I had used only 
>> Common_gdt. 
>>
>> Does anyone have an idea how traineddata files can be combined?
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/9ee1df96-eef7-4f93-b93a-2c7914ab52c9n%40googlegroups.com
>>
>

