Getting usable source files from traineddata files

Nick White Mon, 16 Apr 2012 09:15:42 -0700

Hi there,

It would often be really useful to be able to recover some of the
source files from a .traineddata file. For example, I am working on
improving training for Ancient Greek (grc), which is basically the
same as modern Greek (ell) but with some extra accents and similar
additions. It would be really useful to reuse all of the perfectly
valid ell.traineddata material and just add training for the extra
characters and symbols, rather than having to redo most of the
modern Greek training as well as the Ancient Greek.


As far as I'm aware this should be possible, but I don't know of any
tools to do it.

Creating a .tr file from the .inttemp file might be some work, but
from a quick look at how it works it seems feasible, and creating a
dawg2wordlist tool looks like it ought to be straightforward enough.
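
To illustrate why a dawg2wordlist tool seems straightforward, here is
a minimal Python sketch of the core idea: walk the graph depth-first
and emit a word each time an edge is marked as ending one. The node
layout (TOY_DAWG) and the function name dawg_to_wordlist are
hypothetical and only for illustration. This is not Tesseract's
actual DAWG file format, which a real tool would have to parse first.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical in-memory layout, NOT Tesseract's on-disk DAWG format:
# node id -> {character: (next node id, True if this edge ends a word)}
Dawg = Dict[int, Dict[str, Tuple[int, bool]]]

# Stores "tip", "to" and "top"; node 4 is shared by the "p" suffix of
# "tip" and "top", which is the kind of sharing a DAWG allows.
TOY_DAWG: Dawg = {
    0: {"t": (1, False)},
    1: {"i": (3, False), "o": (2, True)},
    2: {"p": (4, True)},
    3: {"p": (4, True)},
    4: {},
}


def dawg_to_wordlist(dawg: Dawg, node: int = 0, prefix: str = "") -> Iterator[str]:
    """Yield every word reachable from `node`, in code point order."""
    for char in sorted(dawg.get(node, {})):
        next_node, ends_word = dawg[node][char]
        word = prefix + char
        if ends_word:
            yield word
        # Keep following edges: longer words may share this prefix.
        yield from dawg_to_wordlist(dawg, next_node, word)


if __name__ == "__main__":
    for word in dawg_to_wordlist(TOY_DAWG):
        print(word)  # prints tip, to, top (one per line)
```

The recursion assumes the graph is acyclic, as a word list DAWG is.
The resulting list could then be extended with the extra Ancient
Greek words and rebuilt, rather than starting from nothing.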

Has anybody else attempted this? Am I going about things the wrong
way? If I write code to do this in a sane manner, would it be
suitable to be included in the Tesseract codebase?

Thanks folks,

Nick

