For a training tool, Python is fine, but C/C++ is essential for the runtime component.

Thanks,
Ray.

On Tue, Feb 3, 2009 at 5:52 PM, Michael Reimer <[email protected]> wrote:
>
> Hi Ray. I mean something along those lines, yes, though I'm not
> especially focused on part of speech tags (grammatical approaches to
> this problem don't generally yield much reward relative to the amount
> of processing power they require, unless the grammar is highly
> constrained). I'm not sure if you want to know about my plans in any
> great detail but I'll be happy to elaborate if so. In any case your
> reply is applicable and I'll look into method (b).
>
> Whatever I end up with will probably involve a training component that
> uses some corpus to generate a data file which can then be used in
> adjust_word/permute_words. I'm wondering, is there a strong
> preference as to what language that ends up being written in? I'm
> comfortable with C/C++, but much prefer python for string
> manipulation. Thanks for your time.
>
> On Feb 3, 5:58 pm, Ray Smith <[email protected]> wrote:
> > Hi Michael,
> > I assume by word order modelling you mean part of speech tagging and
> > modelling the order of the parts of speech, perhaps by some HMM-based
> > model?
> >
> > In any case, there are 2 ways to do this:
> >
> > (a) Easier to do, but lighter effect:
> > Modify the adjust_word function to make use of the local context to
> > promote words that fit the model.
> >
> > (b) More difficult to do, but possibly larger effect.
> > Split the dictionary by part of speech into multiple (possibly
> > overlapping) sub-dictionaries. Change permute_words to search each of
> > the sub-dictionaries and then adjust_word is more likely to have a
> > good range to choose from.
> >
> > We try to incorporate improvements into the mainline code. This has
> > been a bit slow so far, but the turnaround time is improving as I
> > catch up.
> >
> > Ray.
> >
> > On Mon, Feb 2, 2009 at 3:04 PM, Michael Reimer <[email protected]> wrote:
> >
> > > Hello all. I'm a computational linguistics graduate student, and I'd
> > > like to do some work on Tesseract for credit in a software engineering
> > > course. My area of interest is word order modelling and I believe
> > > this can and has been used to improve the accuracy of other OCR
> > > systems. As far as I can tell, Tesseract has nothing similar
> > > currently, so I'm interested in adding it. Any feedback on that idea
> > > would be appreciated.
> > >
> > > I'm also completely new to open source, so provided that my general
> > > goal appeals, I would appreciate any and all advice on how best to get
> > > involved with your community, get myself up to speed technically (I am
> > > reading the "Hacking Tesseract" manual currently), avoid stepping on
> > > toes, and so on. Thanks.
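
For anyone following the thread, here is a rough sketch of what the training side discussed above might look like: a Python script that reads a corpus, counts word bigrams, and writes a plain-text data file that a runtime component could load. Everything in it is an assumption made for illustration: the function names (train, write_model, read_model, score, pick_best), the tab-separated file layout, and the add-one smoothing are not taken from Tesseract, and the code does not call adjust_word or permute_words. The scoring functions are written in Python only to show how the file could be consumed; per Ray's reply, the real runtime part would need to be C/C++.

```python
import math
from collections import Counter


def train(sentences):
    """Count word bigrams over whitespace-tokenized, lowercased sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words              # "<s>" marks the sentence start
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return {"vocab_size": len(vocab), "contexts": contexts, "bigrams": bigrams}


def write_model(path, model):
    """Write the counts as tab-separated lines (hypothetical file layout)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"V\t{model['vocab_size']}\n")
        for word, count in sorted(model["contexts"].items()):
            out.write(f"C\t{word}\t{count}\n")
        for (prev, word), count in sorted(model["bigrams"].items()):
            out.write(f"B\t{prev}\t{word}\t{count}\n")


def read_model(path):
    """Load a file produced by write_model back into the same structure."""
    model = {"vocab_size": 0, "contexts": Counter(), "bigrams": Counter()}
    with open(path, encoding="utf-8") as src:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "V":
                model["vocab_size"] = int(fields[1])
            elif fields[0] == "C":
                model["contexts"][fields[1]] = int(fields[2])
            elif fields[0] == "B":
                model["bigrams"][(fields[1], fields[2])] = int(fields[3])
    return model


def score(model, prev_word, candidate):
    """Add-one smoothed log P(candidate | prev_word)."""
    seen = model["bigrams"][(prev_word, candidate)]
    total = model["contexts"][prev_word] + max(model["vocab_size"], 1)
    return math.log((seen + 1) / total)


def pick_best(model, prev_word, candidates):
    """Return the candidate word that best fits after prev_word."""
    return max(candidates, key=lambda word: score(model, prev_word, word))
```

With a model trained on a corpus where "the cat" is common, pick_best(model, "the", ["eat", "cat", "oat"]) would prefer "cat", which is roughly the kind of context-based promotion described in approach (a). A flat text file was chosen here only because it is easy to parse from C/C++ as well; a real data file would likely need something more compact.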

