On Tue, Apr 17, 2012 at 05:29:55PM +0200, zdenko podobny wrote:
> On Tue, Apr 17, 2012 at 4:26 PM, Nick White <[email protected]> wrote:
> > Ah, that's a very good idea, and will indeed be useful. However, for
> > my use case (a script which is mostly the same, but with additions,
> > and an older version of the language), it would be useful to only
> > use one set of dictionary files (rather than presumably the union of
> > grc & ell, in the above example).
> >
> > I wonder if there's any good way of integrating this functionality
> > into tesseract; I could imagine changing the dictionary files
> > wouldn't be a particularly unusual thing to want to do, as the mapping
> > of dictionaries to scripts is not going to be 1:1.
> >
> > As a workaround one could probably unpack the traineddata, remove
> > the dictionary files (and add different ones if appropriate), then
> > repack it. But ideally I think it would be good to be able to
> > specify different dictionary files on the command line (and ideally
> > as UTF-8 word-per-line files, which were converted into DAWG in
> > memory if needed).
>
> Do you mean something like "CONFIG FILES AND AUGMENTING WITH USER DATA"
> [1] without user-patterns?
>
> [1] http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

Thanks, I hadn't seen that functionality yet (I suppose it's slated
for 3.02). It does indeed look very useful. Thanks for the link.

It would be really nice if more Box/Tiff source tarballs were
released, so that extending existing training was easier. It would
also likely result in people doing more 'grunt work' to do extra
training, which would only be a good thing. I suppose the issue is
fear of sharing copyrighted training images, but I do hope that can
be sorted out sometime; they'd be very useful indeed.

Thanks for all your help Zdenko.

Nick

