On Mon, Apr 16, 2012 at 4:17 PM, Nick White <[email protected]> wrote:
> Hi there,
>
> There are lots of situations where it would be really useful to be
> able to get some of the source files from a .traineddata file. For
> example I am working on improving training of Ancient Greek (grc) -
> which is basically the same as modern Greek (ell), but with some extra
> accents and similar additions - and it would be really useful to be
> able to reuse all of the perfectly valid ell.traineddata stuff, just
> adding training for the extra characters and symbols, rather than have
> to essentially redo the majority of the training for modern Greek as
> well as the Ancient Greek.
>
> As far as I'm aware this should be possible, but I don't know of any
> tools to do it.
>
> Creating a .tr file from the .inttemp file might be some work, but
> from scanning the way it works it looks feasible, and creating a
> dawg2wordlist tool looks like it ought to be straightforward enough.
>
> Has anybody else attempted this? Am I going about things the wrong
> way? If I write code to do this in a sane manner, would it be
> suitable to be included in the Tesseract codebase?
>

I think 3.02 will provide a solution for this case: you can use more
than one language for OCR. For example, you can run something like this:

tesseract image output -l grc+ell

--
Zdenko
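The multi-language invocation above can be wrapped in a small shell helper. This is only a sketch, assuming Tesseract 3.02 or later is on the PATH and that the grc and ell traineddata files are installed; join_langs and ocr_multi are hypothetical names for illustration, not part of Tesseract.

```shell
#!/bin/sh
# Sketch only: join_langs and ocr_multi are illustrative helpers,
# not tools shipped with Tesseract.

# Join language codes with '+', the separator the -l option uses for
# multi-language recognition (e.g. "grc ell" becomes "grc+ell").
# The body runs in a subshell so the caller's IFS is left untouched.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

# Run OCR on one image with several languages at once.
# Arguments: image path, output base name, then one or more language codes.
ocr_multi() {
    image=$1
    outbase=$2
    shift 2
    tesseract "$image" "$outbase" -l "$(join_langs "$@")"
}
```

With a hypothetical scan named page.png, `ocr_multi page.png page grc ell` would run the same command as in the reply, with `page` as the output base name.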

