On Tue, Apr 17, 2012 at 05:29:55PM +0200, zdenko podobny wrote:
> On Tue, Apr 17, 2012 at 4:26 PM, Nick White <[email protected]> wrote:
> > Ah, that's a very good idea, and will indeed be useful. However, for
> > my use case (a script which is mostly the same, but with additions,
> > and an older version of the language), it would be useful to only
> > use one set of dictionary files (rather than presumably the union of
> > grc & ell, in the above example).
> >
> > I wonder if there's any good way of integrating this functionality
> > into tesseract; I could imagine changing the dictionary files
> > wouldn't be a particularly unusual thing to want to do, as the mapping
> > between dictionaries and scripts is not going to be 1:1.
> >
> > As a workaround one could probably unpack the traineddata, remove
> > the dictionary files (and add different ones if appropriate), then
> > repack it. But ideally I think it would be good to be able to
> > specify different dictionary files on the command line (ideally
> > as UTF-8 word-per-line files, which would be converted into DAWGs
> > in memory if needed).
>
> Do you mean something like "CONFIG FILES AND AUGMENTING WITH USER DATA"
> [1] without user-patterns?
> 
> [1] http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

Thanks, I hadn't seen that functionality yet (I suppose it's slated
for 3.02). It does indeed look very useful. Thanks for the link.


It would be really nice if more Box/Tiff source tarballs were
released, so that extending existing training was easier. It would
also likely result in people doing more 'grunt work' to do extra
training, which would only be a good thing. I suppose the issue is
fear of sharing copyrighted training images, but I do hope that
can be sorted out sometime; they'd be very useful indeed.

Thanks for all your help, Zdenko.

Nick
