Hi friends,

>> OCR code's now been tweaked and tested to work in both WinXP and
>> Win9x.
>> This should work in unix as well.
>>
>> Here is a summary:
>>
>> 1. Put ocrad 0.16 in the path
>
> As a note, for Windows you need a copy of ocrad with skip patch that
> opens pnm files in binary mode otherwise ocrad will fail on a lot of
> files.
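For anyone wondering why binary mode matters for pnm files, here is a minimal Python sketch of the general effect. It is not ocrad's code. It assumes the failures come from text-mode newline translation, which alters raw pixel bytes that happen to look like line endings. The file name and the tiny 2x2 image are made up for illustration.

```python
import os
import tempfile

# A made-up 2x2 binary PGM: ASCII header, then 4 raw pixel bytes.
# Two of the pixel bytes happen to be 0x0D 0x0A (CR LF).
header = b"P5\n2 2\n255\n"
pixels = b"\r\n\x00\x1a"

path = os.path.join(tempfile.mkdtemp(), "sample.pgm")
with open(path, "wb") as f:
    f.write(header + pixels)

# Binary mode returns the bytes exactly as stored.
with open(path, "rb") as f:
    binary_pixels = f.read()[len(header):]

# Text mode with newline translation collapses CR LF into a single LF,
# so one pixel byte silently disappears.
with open(path, "r", encoding="latin-1") as f:
    text_pixels = f.read()[len(header):]

print(len(binary_pixels), len(text_pixels))
```

A reader that expects width x height pixel bytes comes up short in the second case, which would match the "fails on a lot of files" symptom.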
Actually, you're probably referring to my "patch"? (Ocrad/CygWin1.dll)
http://mail.python.org/pipermail/spambayes/2006-October/019983.html

If you have MinGW experience (which I don't), I think you can compile an
exe-only build which doesn't need the dll. But then I don't know whether
it is actually working because of the POSIX emulation or because they did
change the source. (I did not...)

You're right, Skip pointed it out in the ocrad forum, but the developer
was reluctant to change this at the time, so I don't actually know why
0.16 is working... I just know it is, which is fine for me.

> Have you tried other ocr programs?

No, not yet. Tony Meyer suggested Tesseract:
http://mail.python.org/pipermail/spambayes-dev/2006-September/003750.html
but there seemed to be build issues... I haven't tried it.

I mailed with NoSpam Today! Support (SpamAssassin based) before I chose
SB. They were doing research on FuzzyOcr and ImageInfo. Maybe we could
ask again about their results. I believe FuzzyOcr is gocr-based?

> I tried gocr and I think that its result are somewhat better but version
> 0.41 + pgm patch almost hangs

OK, it probably needs some tweaking then. Since the ocr is working with
ocrad and, as you see below, I get very good results, I will be moving on
to the next area now. I think it is far more beneficial to do more
research into the actual processing, as you commented elsewhere, than to
start the whole testing/tweaking all over again with a new ocr engine.
Of course that is just my opinion...

>> 5. Finally I sugest you change the default scale from 1 to 2 like in
>> this line
>>
>> scale = options["Tokenizer", "ocrad_scale"] or 2
>
> changing this surely doesn't hurt but ocrad_scale it's already set to 2
> in Options.py

OK, I missed that. I don't know which one takes precedence:
ImageStripper.py, Options.py or bayescustomize.ini.

With 2 you should get image tokens of this quality:

watch out here comes the big one!
srrl about blow your minds add srrl your radar mon nov ob companu name:
stellar resource new (otc bb:srrl.ob) sumbol: srrl prlce: tl._ targe_:
tio skip:r 10 ueru s_rong buu our last feature, posted cains ouer __o_
the span weekithose- are ridiculous cainsl cet srrl nowl will makinc
stunninc skip:a 10 next weekl massiue campaicns are about startl watch
srrl trade monday nou obl don't left out!

That is about 90% recognition or so.

> probably should be removed (or set to 2 as you suggest)

Then I suggest removal, as you say. Better to avoid redundancy
(clutter :) )

Happy coding :)
Vibe
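A note on the `or 2` fallback in the line quoted earlier: the sketch below shows only what that Python expression does. It assumes a plain dict keyed by (section, option) as a stand-in for the SpamBayes options object, which is not a dict, and `pick_scale` is a made-up helper name. It does not show how Options.py and bayescustomize.ini are merged.

```python
# Hypothetical stand-in for the options object: a plain dict keyed by
# (section, option). The real SpamBayes object is not a dict.
def pick_scale(options):
    """Return the configured ocrad scale, or 2 if it is unset or falsy."""
    return options.get(("Tokenizer", "ocrad_scale")) or 2

print(pick_scale({("Tokenizer", "ocrad_scale"): 3}))  # configured value is kept
print(pick_scale({("Tokenizer", "ocrad_scale"): 0}))  # 0 is falsy, fallback applies
print(pick_scale({}))                                 # missing, fallback applies
```

So the hard-coded 2 only matters when the configured value is missing, 0 or otherwise empty. Any non-zero value that reaches the options object is used as-is.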
