Hello,
I tried the same thing today (reusing as much of my 2.04 training data
in 3.00 as possible) and I believe I ended up with a working file ;-).
Based on analysing the existing traineddata files and some tests with
'training/combine_tessdata', I found that I need the following
files:
xxx.config
xxx.unicharset
xxx.unicharambigs
xxx.inttemp
xxx.pffmtable
xxx.normproto
xxx.punc-dawg
xxx.word-dawg
xxx.number-dawg
xxx.freq-dawg
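As a quick sanity check before combining, the list above can be turned into a small script. This is only a sketch: the helper name `missing_components` and the layout (all files in one directory, prefixed with the language code) are my own assumptions, not something tesseract provides. It also treats a zero-byte file as missing.

```python
from pathlib import Path

# Component suffixes taken from the list above.
COMPONENTS = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]


def missing_components(directory, lang):
    """Return the names of expected component files that are absent or empty.

    Hypothetical helper: assumes every file is named <lang>.<suffix> and
    lives directly in `directory`.
    """
    base = Path(directory)
    missing = []
    for suffix in COMPONENTS:
        path = base / f"{lang}.{suffix}"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing
```

Calling `missing_components(".", "xxx")` in the training directory returns an empty list when all ten files are present and non-empty.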
When I looked at the existing traineddata files in svn, none of them
contained xxx.config, so I believe it can be skipped (an empty file did
not work - it must contain at least an empty line ;-) ).
xxx.unicharset, xxx.pffmtable, xxx.normproto and xxx.inttemp appear to
have the same format in 3.00 and 2.04.
xxx.unicharambigs looks like a new version of xxx.DangAmbigs. Its
structure is this:
v1
2 ' ' 1 " 1
2 ` ' 1 " 1
2 ' ` 1 " 1
I found unicharambigs in the files for these languages: deu, ell, eng,
fra, ita, nld, rus and spa. I think the first line is version info and
columns 1-4 have the same meaning as in DangAmbigs. Column 5 looks to me
like some kind of factor. Maybe somebody who understands the source code
can give a better explanation.
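To make the layout concrete, here is a minimal parsing sketch based only on my reading of the sample above: a count, that many source characters, a second count, that many replacement characters, and one final column. The function names are hypothetical, and the last column is kept as an opaque string because I do not know its meaning.

```python
def parse_ambig_line(line):
    """Split one entry into (source_chars, target_chars, last_column).

    Assumes whitespace-separated fields laid out as:
    <n_src> <src...> <n_dst> <dst...> <last_column>
    """
    tokens = line.split()
    try:
        n_src = int(tokens[0])
        src = tokens[1:1 + n_src]
        n_dst = int(tokens[1 + n_src])
    except (IndexError, ValueError):
        raise ValueError(f"malformed entry: {line!r}")
    dst_start = 2 + n_src
    dst = tokens[dst_start:dst_start + n_dst]
    rest = tokens[dst_start + n_dst:]
    # The counts must match the number of characters actually present.
    if len(src) != n_src or len(dst) != n_dst or len(rest) != 1:
        raise ValueError(f"malformed entry: {line!r}")
    return src, dst, rest[0]


def parse_unicharambigs(text):
    """Parse a whole file: a 'v1' header line followed by entries."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "v1":
        raise ValueError("expected 'v1' on the first line")
    return [parse_ambig_line(l) for l in lines[1:] if l.strip()]
```

Running this over a file before handing it to tesseract at least catches entries whose counts do not match the characters that follow them.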
The xxx.*-dawg files from 2.04 did not work for me, so I created them
with the tool from 3.00 (e.g. 'training/wordlist2dawg number_list
xxx.freq-dawg xxx.unicharset'). Please note that the new version of
wordlist2dawg also needs xxx.unicharset.
Then I combined all the files ('training/combine_tessdata xxx.') and put
the resulting xxx.traineddata file in the correct place.
Then I ran my test:
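The two steps above (rebuilding the dawgs, then combining) can be scripted. The sketch below only builds the argument lists in the same shape as the two commands I quoted and executes nothing. The function name and the word-list file names in the usage note are hypothetical.

```python
def build_commands(lang, wordlists):
    """Return argv lists for the dawg and combine steps; nothing is run.

    `wordlists` maps a dawg suffix (e.g. "freq-dawg") to the path of the
    word list it should be built from (hypothetical file names).
    """
    unicharset = f"{lang}.unicharset"
    cmds = [
        ["training/wordlist2dawg", wordlist, f"{lang}.{suffix}", unicharset]
        for suffix, wordlist in wordlists.items()
    ]
    # combine_tessdata is given the language prefix including the final dot.
    cmds.append(["training/combine_tessdata", f"{lang}."])
    return cmds
```

For example, `build_commands("xxx", {"freq-dawg": "freq_list", "word-dawg": "word_list"})` yields two wordlist2dawg invocations followed by the combine step. Each list could then be passed to `subprocess.run(cmd, check=True)`.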
tesseract font-0001-arial.tif output -l xxx
and it produced output without any error message. However, the result
was quite a bit worse than with 2.04 for me (I have not had a chance to
improve the result in 3.00 yet).
When I tried to play with xxx.unicharambigs (adding new lines), I got
the following error:
index >= 0 && index < size_used_:Error:Assert failed:in file ../ccutil/genericvector.h, line 215
At the moment I do not know the reason for this, so I just went back to
my first version of xxx.unicharambigs.
Also, when I removed the empty line from the end of xxx.unicharambigs, I
got this error (from 'training/combine_tessdata'):
Segmentation fault training/combine_tessdata xxx.
I was only able to solve these issues by accident, because tesseract
(3.00) did not give me any helpful error messages...
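Since the missing line at the end of the file was so hard to spot, a small guard can be run on each text component before combining. This is a sketch under the assumption that "empty line" here means the final newline character. If a truly blank last line turns out to be required, the check would have to look for two newlines instead. The helper name is hypothetical.

```python
from pathlib import Path


def ensure_trailing_newline(path):
    """Append a newline if the file is empty or does not end with one.

    Returns True if the file was modified, False if it was already fine.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.endswith(b"\n"):
        return False
    p.write_bytes(data + b"\n")
    return True
```

This covers both cases I ran into: a zero-byte xxx.config becomes a file with a single newline, and a xxx.unicharambigs whose last entry lacks a line ending gets one appended.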
One more piece of information: I did this on 64-bit Linux with the
leptonlib-1.65 library (http://www.leptonica.com/download.html).
Zdenko
On 12. Apr, 15:00 h., MARTIN Pierre <[email protected]> wrote:
> Hello list :)
>
> I'm trying to make tesseract 3 run with my custom training files, and I have a
> few questions, if some of you can answer them.
>
> 1) Can I combine the old 2.04 xxx.* files into the new xxx.traineddata directly,
> or does the new format also expect new things which weren't in the old 2.04
> files?
>
> 2) I'm not sure I understand xxx.punc-dawg and xxx.number-dawg. Can
> someone explain? As far as I understand, they are for disambiguation of
> punctuation and numbers, right? So what am I supposed to do, for example, if a D
> resembles a 0:
> 1 D 1 0 -> goes in freq-dawg,
> 1 0 1 D -> goes in number-dawg?
> Or should only numbers go in number-dawg, for example
> 1 8 1 0 -> only goes in number-dawg then?
>
> 3) I concatenated my old config files (just duplicating freq-dawg to punc /
> number), and I'm trying to run tesseract... I'm getting an assertion failure
> (probably on a map) saying that num <= (SIZE_MAX / elementSize); see the stack
> below.
>
> Has anyone successfully made svn319 run with custom traineddata?
>
> Thanks,
> Pierre.
>
> XXX.dll!_fread_nolock_s(void * buffer=0x10560040, unsigned int
> bufferSize=4294967295, unsigned int elementSize=8, unsigned int
> num=1130430464, _iobuf * stream=0x00750c60) Line 156 + 0x35 bytes C
> XXX.dll!fread_s(void * buffer=0x10560040, unsigned int
> bufferSize=4294967295, unsigned int elementSize=8, unsigned int
> count=1130430464, _iobuf * stream=0x00750c60) Line 109 + 0x19 bytes C
> XXX.dll!fread(void * buffer=0x10560040, unsigned int elementSize=8,
> unsigned int count=1130430464, _iobuf * stream=0x00750c60) Line 303 + 0x17
> bytes C
> XXX.dll!tesseract::SquishedDawg::read_squished_dawg(_iobuf *
> file=0x00750c60, tesseract::DawgType type=DAWG_TYPE_PUNCTUATION, const STRING
> & lang={...}, PermuterType perm=PUNC_PERM) Line 298 + 0x19 bytes C++
> XXX.dll!tesseract::SquishedDawg::SquishedDawg(_iobuf *
> file=0x00750c60, tesseract::DawgType type=DAWG_TYPE_PUNCTUATION, const STRING
> & lang={...}, PermuterType perm=PUNC_PERM) Line 350 C++
> XXX.dll!tesseract::Dict::init_permute() Line 276 + 0x30 bytes C++
> XXX.dll!tesseract::Wordrec::program_editup(const char *
> textbase=0x00000000, bool init_permute=true) Line 98 C++
> XXX.dll!tesseract::Wordrec::start_recog(const char *
> textbase=0x00000000) Line 75 C++
> XXX.dll!tesseract::Tesseract::init_tesseract(const char *
> arg0=0x006bd948, const char * textbase=0x00000000, const char *
> language=0x006bd944, char * * configs=0x00000000, int configs_size=0, bool
> configs_global_only=false) Line 186 C++
> XXX.dll!tesseract::TessBaseAPI::Init(const char *
> datapath=0x006bd948, const char * language=0x006bd944, char * *
> configs=0x00000000, int configs_size=0, bool configs_global_only=false) Line
> 154 + 0x44 bytes C++
>
> > XXX.dll!tesseract::TessBaseAPI::Init(const char * datapath=0x006bd948,
> > const char * language=0x006bd944) Line 141 C++