Re: Getting usable source files from traineddata files

zdenko podobny Mon, 16 Apr 2012 09:38:33 -0700

On Mon, Apr 16, 2012 at 4:17 PM, Nick White <[email protected]> wrote:


> Hi there,
>
> There are lots of situations where it would be really useful to be
> able to get some of the source files from a .traineddata file. For
> example I am working on improving training of Ancient Greek (grc) -
> which is basically the same as modern Greek (ell), but with some extra
> accents and similar additions - and it would be really useful to be
> able to reuse all of the perfectly valid ell.traineddata stuff, just
> adding training for the extra characters and symbols, rather than have
> to essentially redo the majority of the training for modern Greek as
> well as the Ancient Greek.
>
> As far as I'm aware this should be possible, but I don't know of any
> tools to do it.
>
> Creating a .tr file from the .inttemp file might be some work, but
> from scanning how it works it looks feasible, and creating a
> dawg2wordlist tool looks like it ought to be straightforward enough.
>
> Has anybody else attempted this? Am I going about things the wrong
> way? If I write code to do this in a sane manner, would it be
> suitable to be included in the Tesseract codebase?
>
>
I think 3.02 will provide a solution for cases like this: you can use more
than one language for OCR, e.g. you can run something like this:

tesseract image output -l grc+ell
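
If you want to script that call, here is a minimal sketch in Python. It only
assembles and runs the command line shown above. The helper name
build_tesseract_command is hypothetical (it is not part of Tesseract), and the
sketch assumes a tesseract binary with multi-language support on the PATH and
both grc.traineddata and ell.traineddata installed.

```python
import subprocess
import sys


def build_tesseract_command(image, output_base, languages):
    """Return the argv list for a multi-language tesseract run.

    Hypothetical helper for illustration; not part of Tesseract itself.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Several languages are passed as a single '+'-joined value for -l,
    # e.g. ["grc", "ell"] becomes "grc+ell".
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python run_ocr.py image output
    # Assumes the grc and ell language data are installed.
    cmd = build_tesseract_command(sys.argv[1], sys.argv[2], ["grc", "ell"])
    subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the subprocess call makes it
easy to check the language string without having tesseract installed.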


-- 
Zdenko

