|
Hi friends,
>> >> 1. Put ocrad 0.16 in the path

> I have no experience with mingw but I compiled ocrad
> using it and I'm using the result (without cygwin dll) with no problem,

OK, but note that the sources posted in spambayes-something were version 0.15!
New version 0.16 can be downloaded here: http://ftp.gnu.org/gnu/ocrad/ocrad-0.16.tar.bz2
According to the changelog, the character recognition was improved.
If you built a 0.16 exe without cygwin1.dll, I would like to test it.
Can you post it somewhere together with a short description of how it was built? "Pretty please with sugar on top".

>> > Have you tried other ocr programs?
>> >> No, not yet.
>> >> Tony Meyer suggested Tesseract:

> I built tesseract with no problem. ...
> I tested few spam images and the results were poor.

>> I mailed with NoSpam Today! Support (spamassasin based) before I chose SB.
>> They were doing research on FuzzyOcr and ImageInfo. Maybe we could ask
>> again about their results. I believe FuzzyOcr is gocr-based?
>
> Yes, they are using gocr. But as I said in my previous mail it has its
> own problems.

OK, then it has at least been tried ...

>> Since the ocr is working with ocrad and - as you see below - I get very
>> good results I will be moving on to the next area now.

> You are lucky. My results are so so. Probably I get a reduction of a
> 60/70% of spam with images (which in itself could be considered not bad)
> but way too much spam is not stopped.

I expect results to vary, and it is too early in my testing to tell, but today SB caught
17 of 18 spams. I changed the spam cutoff to 0.7; however, that didn't even seem necessary.
My database contains 845 spams and 1411 hams. Zero false positives!
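
For readers unfamiliar with the cutoff, here is a minimal sketch of how a score could be mapped to ham/unsure/spam, and how the 17-of-18 figure works out as a catch rate. The names `classify` and `catch_rate`, the ham cutoff of 0.2, and the handling of scores exactly on a boundary are assumptions for illustration; only the 0.7 spam cutoff and the 17/18 count come from this mail.

```python
def classify(score, ham_cutoff=0.2, spam_cutoff=0.7):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'.

    ham_cutoff=0.2 and the boundary handling are assumed values,
    not taken from the SpamBayes source.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def catch_rate(caught, total):
    """Fraction of spam messages that were classified as spam."""
    return caught / total


if __name__ == "__main__":
    print(classify(0.75))                       # above the 0.7 cutoff
    print(round(100 * catch_rate(17, 18), 1))   # percentage for 17 of 18
```

Lowering the spam cutoff moves borderline messages from "unsure" to "spam", which is why it is worth watching the false positive count whenever it is changed.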

>> I think it is far more beneficial to do more research into the actual processing
>> as you commented elsewhere than to start the whole testing/tweaking all over
>> again with a new ocr engine. Of course that is just my opinion...
>
> Yes and no. We need a decent ocr engine to start with than we may focus
> on better image manipulation.

Yes...

>> At the moment spambayes have trouble with image for the following
>> reason:
>
> - PIL sometimes fail to handle the image. I'm still investigating the
> issue but the images seems reasonably correct (IE, Firefox and many
> viewers, on linux and windows, are able to display them). It's quite
> rare and not a big issue

Not an OCR problem, but a preprocessing problem...
It's great that you are looking into this, because I for one don't know Python
well enough to debug such issues.

> - ocr results are poor. The worst case are when you get a sequenze of
> chars (char space char space ...) or a long word. both are ignored by
> spambayes

Tokenizer problem, configurable. Not related to the OCR engine.
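
To illustrate why such OCR output produces no tokens, here is a minimal sketch, assuming a tokenizer that drops words shorter than 3 or longer than 12 characters. The function `usable_tokens` and both limits are hypothetical stand-ins; the real SpamBayes tokenizer and its configurable options may behave differently.

```python
def usable_tokens(ocr_text, min_len=3, max_len=12):
    """Keep only whitespace-separated words whose length is within the limits.

    min_len and max_len are assumed values for illustration only.
    """
    return [w for w in ocr_text.lower().split() if min_len <= len(w) <= max_len]


if __name__ == "__main__":
    print(usable_tokens("watch out here"))             # normal words survive
    print(usable_tokens("w a t c h o u t"))            # single chars are dropped
    print(usable_tokens("watchoutherecomesthebigone")) # one long word is dropped
```

With limits like these, both failure modes from the quoted text end up as an empty token list, so the fix belongs in the tokenizer settings rather than in the OCR engine.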

> - There are images which contain more than words and in this case we may
> get no tokens.

I have seen many animations with moving artefacts. Usually not a problem,
but it may be in the future. Again, some filtering (which is preprocessing)
might be a brilliant idea.

> In few cases if the colors used inside the image are changed you get a
> different result.

We should work on filtering and histogram analysis to determine the correct
threshold level for the OCR. If we find a better way than what ocrad already
does, then we can pass it via the -T parameter. Advanced filtering can even
detect repetitive patterns or noise in the background and remove them.
Sure, a professional OCR engine like OmniPage Pro already does huge amounts
of preprocessing, e.g. automatic rotation correction, but that does not yet
seem necessary for our purpose...

> I have no knowledge of image processing but I tried few simple
> operations (like scaling, sharpening, convert to gray, ...) but I got no
> results. They were all quick tests and the result are in no way
> conclusive.

I did a course in image analysis. I don't know Python / PIL, but I could
probably do some tests in Matlab when my numbers start to deteriorate.
If you have a way to batch-extract images from emails or from a dbx file,
or if you send me a zip of 100+ problematic spam images, then I would be
happy to run some tests, e.g. on the best scale factor and scaling algorithm.
I can batch-convert them, so only worry about extraction.
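
For the extraction step, a minimal sketch using only the Python standard library `email` package might look like this. It handles raw RFC 822 messages only; reading Outlook Express dbx files is not covered and would need a separate converter. The function names `extract_images`, `save_images` and `build_sample_message` are made up for this example.

```python
import email
import os
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart


def extract_images(raw_message):
    """Return a list of (filename, data) pairs, one per image part."""
    msg = email.message_from_bytes(raw_message)
    images = []
    for index, part in enumerate(msg.walk()):
        if part.get_content_maintype() != "image":
            continue
        data = part.get_payload(decode=True)  # undo base64/quoted-printable
        if not data:
            continue
        # Fall back to a generated name when the part has no filename.
        name = part.get_filename() or "part%d.%s" % (index, part.get_content_subtype())
        images.append((os.path.basename(name), data))
    return images


def save_images(images, out_dir):
    """Write the extracted images into out_dir, creating it if needed."""
    os.makedirs(out_dir, exist_ok=True)
    for name, data in images:
        with open(os.path.join(out_dir, name), "wb") as fh:
            fh.write(data)


def build_sample_message():
    """Build a tiny multipart message with a fake GIF payload for testing."""
    msg = MIMEMultipart()
    msg["Subject"] = "sample"
    img = MIMEImage(b"GIF89a-not-a-real-image", _subtype="gif")
    img.add_header("Content-Disposition", "attachment", filename="spam.gif")
    msg.attach(img)
    return msg.as_bytes()


if __name__ == "__main__":
    for name, data in extract_images(build_sample_message()):
        print(name, len(data))
```

Running `extract_images` over every message in a spam folder and passing the result to `save_images` would give a directory of images that can be zipped and batch-converted.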

> from my understanding in Options.py you set the default values,
> bayescustomize.ini contain the values chosen by the user an in
> Imagestripper.py the programmer may embed it's values ignoring the user
> choice (joking)

Something like that, I think :) Did you try to change this in ImageStripper.py,
and did it make any change to the output?

>> With 2 you should get this quality image tokens:
>>
>> watch
>> out
>> here
>> comes
>> the
>> big
>> one!
...
>> That is about a 90% recognition or so.
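
The defaults-then-user-file precedence described in the quoted text can be sketched with the standard `configparser` module. This is only an illustration of the idea, not how SpamBayes itself loads options: the section name `Tokenizer`, the default value of 2 and the function `load_options` are assumptions; only the option name `ocrad_scale` and the file names come from this thread.

```python
import configparser

# Stand-in for the defaults that Options.py would provide (assumed values).
DEFAULTS = {"Tokenizer": {"ocrad_scale": "2"}}


def load_options(ini_text):
    """Load defaults first, then let the user's ini text override them."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    parser.read_string(ini_text)  # stand-in for reading bayescustomize.ini
    return parser


if __name__ == "__main__":
    user_ini = "[Tokenizer]\nocrad_scale: 3\n"
    print(load_options(user_ini).getint("Tokenizer", "ocrad_scale"))
    print(load_options("").getint("Tokenizer", "ocrad_scale"))
```

A value hard-coded in the calling module would bypass both layers, which is the situation the joke above is about.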

> Yes, sometimes the results are good and sometimes are much worst. In few
> cases a scaling factor of 3 it's better. Just now I'm doing a retraining
> with ocrad_scale set to 3. we will see in the next days if the result
> are better or worst

Yes, my initial suggestion was scaling by 4, but Skip argued for 2. He did tests
with different scales already. Intuitively a larger scale should be better.
I found, however, that it slowed down the analysis.
Now, I don't know what ocrad does, but resampling might be better than resizing.
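
The two knobs discussed in this mail, the scale factor and the -T threshold, could be explored together along these lines. The sketch uses Otsu's method as one possible form of histogram analysis (my substitution, nobody in the thread named a specific method) and then builds an ocrad command line. The helper names are hypothetical, and the exact value formats accepted by `-s` and `-T` should be checked against `ocrad --help` for the installed version.

```python
def otsu_threshold(histogram):
    """Return the gray level that maximizes between-class variance (Otsu).

    histogram is a list of pixel counts indexed by gray level.
    Pixels at or below the returned level form one class.
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_level, best_var = 0, -1.0
    weight_bg = 0
    sum_bg = 0
    for level, count in enumerate(histogram):
        weight_bg += count
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += level * count
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_level = var_between, level
    return best_level


def ocrad_command(image_path, scale, threshold_level, levels=256):
    """Build an ocrad argument list; the percentage format for -T is assumed."""
    percent = round(100 * threshold_level / (levels - 1))
    return ["ocrad", "-s", str(scale), "-T", "%d%%" % percent, image_path]


if __name__ == "__main__":
    hist = [0] * 256
    hist[50] = 100    # dark text pixels
    hist[200] = 100   # light background pixels
    level = otsu_threshold(hist)
    print(ocrad_command("spam.pgm", 2, level))
```

The argument list can be handed to `subprocess.run` once the image has been converted to a format ocrad reads; whether the computed level actually beats ocrad's built-in choice is something only tests on real spam images can show.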

Happy coding :)
Vibe

PS: How come my posts always show up as new threads?
Using OE. Don't want to subscribe. |
_______________________________________________ [email protected] http://mail.python.org/mailman/listinfo/spambayes Check the FAQ before asking: http://spambayes.sf.net/faq.html
