On Wed, Aug 22, 2012 at 06:58:19AM -0700, Jani Monoses wrote:
> thanks for the prompt answer!

You're welcome. As I said, it's nice to have clear, well-written questions ;)

> > 600DPI is generally recommended. You could try higher, but if you
> > say there were some improvements and some regressions, I'd just stay
> > at 600DPI.
>
> Alright, although there seemed to be more improvements than regressions at
> 1000dpi.

I don't think there are any fixed rules on this (someone else should
correct me if I'm wrong). So by all means use 1000dpi if it looks better.

> By the available language data I meant the already available
> /usr/share/tesseract-ocr/tessdata/ron.traineddata for Romanian
> that comes in Ubuntu/Debian's packaging of Tesseract.

Aah, OK, forgive me, I didn't realise there was a Romanian training that
you were already using. Good.

> I was wondering if the Romanian dataset needs further training - I am not sure
> what well-trained means in this context.

It probably wouldn't be worth training further. It isn't really feasible
to just "improve" an existing training at present; you would have to
create a wholly new one, which would take a lot of effort and probably
not have a big impact.

> I only meant spelling corrections in the post processing phase as I see quite
> a few non-words being recognized instead of
> what the original document has, usually one or two edit-distances away.
> Matching with dictionary words could fix these but
> then I wonder if it would not go against the intention of the OCR process,
> which is to recognize what is in the input, and not
> what the correct spelling of the input is. In my case the originals are all
> correctly spelled so I would need a post-processing step
> anyway but maybe it should not be a core part of Tesseract's pipeline.

OK, I see. One thing you could do would be to experiment with increasing
Tesseract's trust in its dictionary. I have done something similar with
my training.

Create a file containing these two lines:

language_model_penalty_non_freq_dict_word 0.2
language_model_penalty_non_dict_word 0.3

and save it to tessdata/configs/trustdict, wherever your tessdata
folder is (probably under /usr/share/tesseract-ocr/).

The original values for those configuration variables are 0.1 and 0.15
respectively. Play around with increasing them and see whether it helps.

Then when you run tesseract, do something like this:

tesseract input.png output -l ron trustdict

Hope this helps, and let us know how you get on.

Nick
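
If you end up trying several penalty values, it may be easier to script the steps above. Here is a minimal sketch in Python, not a tested recipe: the helper names write_trust_config and build_command are made up for illustration, the two variable names and the example values are the ones given above, and the configs directory is assumed to be one you can write to (the real tessdata/configs usually needs root).

```python
import tempfile
from pathlib import Path


def write_trust_config(configs_dir, non_freq_penalty=0.2, non_dict_penalty=0.3,
                       name="trustdict"):
    """Write a Tesseract config file with the two dictionary penalties.

    Hypothetical helper: configs_dir stands in for tessdata/configs.
    """
    configs_dir = Path(configs_dir)
    configs_dir.mkdir(parents=True, exist_ok=True)
    path = configs_dir / name
    # Config files are plain text, one "variable value" pair per line.
    path.write_text(
        f"language_model_penalty_non_freq_dict_word {non_freq_penalty}\n"
        f"language_model_penalty_non_dict_word {non_dict_penalty}\n"
    )
    return path


def build_command(image, output_base, lang="ron", config_name="trustdict"):
    """Return the tesseract command line as an argument list (it is not run here)."""
    return ["tesseract", str(image), str(output_base), "-l", lang, config_name]


if __name__ == "__main__":
    # Demo in a temporary directory so nothing under /usr/share is touched.
    demo_dir = Path(tempfile.mkdtemp()) / "configs"
    config_path = write_trust_config(demo_dir)
    print(config_path.read_text())
    print(" ".join(build_command("input.png", "output")))
```

The argument list from build_command could be passed to subprocess.run once the config file is in the real tessdata/configs folder. Comparing the output text for a few penalty pairs against a hand-checked page should show whether the extra trust in the dictionary actually helps.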

