Re: FuzzyOcr 3.6.0 released

2009-05-28 Thread decoder

RW wrote:

AFAIK though it isn't possible to place a cap on the FuzzyOCR score. I
don't want to, but I detune it purely to reduce the likelihood of
something hitting my discard threshold by OCR alone.

  
If you consider this feature so important, then I could implement a 
max_score feature that caps the score done by word recognition. This is 
easy to implement.


Or should it rather be a cap to all FuzzyOcr rules, including the others 
like malformed file etc?
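
The two capping options can be sketched as follows. This is only an illustration in Python (FuzzyOcr itself is a Perl plugin); the names `cap_score`, `word_score`, `other_score` and `max_score` are hypothetical and not taken from the FuzzyOcr code.

```python
def cap_score(word_score, other_score, max_score, cap_all=False):
    """Return the total plugin score after applying a cap.

    cap_all=False: only the word-recognition part is capped.
    cap_all=True:  the cap applies to the sum of all plugin rules
                   (word recognition plus e.g. malformed-file rules).
    """
    if cap_all:
        return min(word_score + other_score, max_score)
    return min(word_score, max_score) + other_score
```

With a cap of 5.0, a word score of 12.0 and 1.5 points from other rules would give 6.5 in the first variant and 5.0 in the second.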



Cheers,


Chris


smime.p7s
Description: S/MIME Cryptographic Signature


Re: New spamassassin OCR plugin

2009-05-27 Thread decoder

alex k wrote:


If only FuzzyOCR's developer would read that ;)
Unfortunately he doesn't seem to be interested in his project anymore.
Maybe you could take care of this orphaned code.

  


Dear Alex,


I am reading exactly everything you write ;)


The code is not orphaned, but it is also not being extended at the 
moment. The SVN version runs stably with all SA 3.2.x releases. I 
respond to tickets and questions via email.



I am planning a new release, but my schedule is tight.


Best regards,


Chris




Re: New spamassassin OCR plugin

2009-05-27 Thread decoder

LuKreme wrote:

On 24-May-2009, at 18:40, Henrik K wrote:
I don't know why users are so afraid of words like SVN. You have to 
look at the project, not version numbers.



I don't have FuzzyOCR installed, and it's not because of the SVN. 
First, I don't think my server can take the processing hit and second 
it requires so much to be installed that I'm SURE my server can't take 
the hit.




May I ask how many mails you process per day? Please note that

a) FuzzyOcr runs last if properly installed
b) it doesn't do anything if the score exceeds a configurable threshold
c) it supports hashes and other things that make processing faster
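
Points b) and c) can be illustrated with a small sketch. It is written in Python purely for illustration; `should_scan`, `HashCache` and the use of a SHA-256 digest over the raw image bytes are assumptions made for this example and are not necessarily how FuzzyOcr computes or stores its hashes.

```python
import hashlib


def should_scan(current_score, skip_threshold):
    """Skip the expensive OCR step when the mail already scores high enough."""
    return current_score < skip_threshold


class HashCache:
    """Remember the score of images that were already scanned."""

    def __init__(self):
        self._known = {}

    @staticmethod
    def _key(image_bytes):
        # Stand-in digest; a real image hash would tolerate small changes.
        return hashlib.sha256(image_bytes).hexdigest()

    def lookup(self, image_bytes):
        """Return the stored score, or None if the image is unknown."""
        return self._known.get(self._key(image_bytes))

    def store(self, image_bytes, score):
        self._known[self._key(image_bytes)] = score
```

A mail that is already above the threshold never reaches the OCR step, and a known image is answered from the cache instead of being scanned again.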


Cheers,


Chris




FuzzyOcr 3.6.0 released

2009-05-27 Thread decoder

Hello all,


after quite some time, I've decided to release another version of 
FuzzyOcr. This version is only a tag from SVN revision 135 (+ a patch 
provided recently which fixes something in one of the sql utilities) 
that has been used quite some time with SA 3.2.x and is included in some 
major distributions already. If you are using FuzzyOcr from SVN (rev 
135), then there is no need for you to upgrade.


Since image spam seems on the rise again, lots of people have contacted 
me in the last 2 months, and I have been asked many times to release 
another tarball... So I hope someone will find it useful. No new 
features are added in this release, as I decided to first tag the 
version that is working without known problems for those that seemed to 
have a problem with checking out the version from SVN. The major version 
number increase is due to the fact that it breaks compatibility with SA 
3.1.x and now requires SA 3.2.x.


See http://fuzzyocr.own-hero.net/wiki/Downloads for more details.

Although I still can't invest that much time into the project at this 
point, there are some features I'd like to add though in the near 
future, such as regex support. I also considered rewriting the scoring 
engine because some people share the opinion that it is too sensitive 
(as opposed to others who consider it to be good).




Best regards,



Chris




Re: Experimental Plugin: MetaSVM

2009-03-15 Thread decoder

LuKreme wrote:


This is an excellent idea, but it also needs rule hits on ham, right?

You're right if you're saying that the method would work better if there 
were more ham rules. From what I have seen in my experiments however, 
the results are also very precise with the current SA ruleset. But any 
rule that adds some information to the feature set might further 
increase the performance (especially on unrecognized spam; on ham/spam 
that SA detects as well, the algorithm performs nearly as well as SA 
itself).




Regards,


Chris




Re: Experimental Plugin: MetaSVM

2009-03-15 Thread decoder

LuKreme wrote:
I don't see any need for the model to be dynamic.  Periodic 
recalculation of it should be just fine.  I bet even daily 
reprocessing will prove to be overzealous. Weekly, perhaps even monthly.

This is what I think as well :)


I'm thinking that FPs and FNs are a bayes problem anyway.  This tool 
needs to concentrate on seeing just what rules hit and building off 
that. I'd go so far to say that as far as SVM is concerned, there is 
no such thing as a false positive or negative.
What do you mean by that? Of course FPs and FNs might also be a problem 
for the SVM; every wrongly classified point is certainly a problem for a 
machine learning algorithm. However, I think that the SVM is quite 
robust to a certain amount of FPs/FNs if the majority of the training 
points are correct.



So, if you feel like trying out the plugin, let me know how well it 
works =) I'm especially interested in those cases where it increases the 
spam detection rate (reducing false negatives). Might be easy to extract 
this information from logs.




Thanks and regards,



Chris




Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

Hi all,


as a result of the recent 2+2 != 4 discussion on the list, here is a 
new plugin, which tries to learn ham/spam classification only by knowing 
which rules triggered and which did not. This is, so to say, an 
automatic meta rule.


The plugin is currently experimental and can only be checked out from 
SVN at:


   https://svn.own-hero.net/sysadmin/MetaSVM/trunk


For now I recommend not using it in a production environment, as it is 
still untested (apart from my own tests).
In order to use the plugin, you need to train your own model, which 
requires a certain amount of ham/spam.
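
As an illustration of "knowing which rules triggered and which did not", each mail can be turned into a 0/1 vector over a fixed rule list. This is a sketch in Python, not code from the plugin; `rules_to_vector` and the four-rule `ALL_RULES` list are made up for the example.

```python
def rules_to_vector(hit_rules, all_rules):
    """Encode one mail as a 0/1 vector over a fixed, ordered rule list."""
    hits = set(hit_rules)
    return [1 if rule in hits else 0 for rule in all_rules]


# Tiny example rule list; a real one would cover the whole ruleset.
ALL_RULES = ["BAYES_99", "HTML_MESSAGE", "RAZOR2_CHECK", "URIBL_BLACK"]

# A mail that hit two of the four rules.
vector = rules_to_vector(["URIBL_BLACK", "BAYES_99"], ALL_RULES)
```

One such vector per mail, together with its ham/spam label, is the kind of training input a classifier needs.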


I evaluated the plugin with my own ham/spam corpus (roughly 5000 spam, 
3000 ham) and the resulting model did not produce false positives with 
respect to the default scoring, but it caught approx. 30% of the mails 
that were not caught by SA itself. I'll probably release more detailed 
numbers in some whitepaper soon :)



Best regards,


Chris






Re: Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

AlexB wrote:

Chris

From the README it's not quite clear: will this work in autolearn?
If you mean that the plugin can automatically learn with the autolearn 
setting, the answer is no.


would it be enough to create the model.* files or is it a must to feed 
it?
You create one model file once by feeding it a large corpus of ham+spam. 
Once you have done that and evaluated it as described in the README, the 
model should be working accurately enough for your mail gateway, and I 
expect it to work for a long time, mainly because it doesn't depend that 
much on the type of spam (i.e. the results that the model produces are 
assumed to be more generalizable than, for example, your bayes db).




In cases of busy gateways, where manual training is highly impractical, 
it would need to feed itself with headers from SA report's score X
The problem is that feeding does not work with an SVM algorithm. You 
have to train on the _whole_ set _always_, so feeding mails is impractical.


That's why you do this process _once_ with a lot of ham and spam. You 
can repeat this process any time but it isn't necessary to do this 
permanently.


It is to be expected that the model accuracy will decrease with time ( 
a) because your rules change and b) because spam changes ) but I think 
this is a slow process.




It has yet to be evaluated how well the model performs over time :)



Best regards,


Chris




Re: Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

John Hardin wrote:


I assume it learns from full message corpora? And all it cares about is 
the rules that hit?


Per my earlier suggestion of learning off the logs + corpora to fix 
FP/FN, could there be an option to learn off generated minimal corpora 
files, with their structure being just the rules hit per message 
(msgid + hits on one possibly very long line)? e.g.:


kggbph.617...@localhost 
BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL 



Yes this is certainly possible. Basically all the algorithm requires for 
the SVM is the rules that hit and the classification (ham or spam) 
(actually the rules that did not hit are fed into the SVM as well, but 
they are taken from the global rules file underlying the model). The 
tool additionally requires the score to evaluate FP/FN properly when 
testing the model, and the message id would be helpful to find false 
positives if one wants to investigate. So you are right, all this info 
would be enough and I can easily modify the tool to use this kind of 
format. I'll try to come up with a code modification to switch the input 
format :)
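
A reader for the suggested one-line-per-message format could look roughly like this. It is a sketch in Python under the assumption that the message id and the comma-separated rule list are separated by a single space; `parse_hits_line` and the example id are hypothetical, and the score and ham/spam label mentioned above would need additional fields.

```python
def parse_hits_line(line):
    """Split 'msgid RULE_A,RULE_B,...' into (msgid, [rules])."""
    msgid, _, rules = line.strip().partition(" ")
    return msgid, [rule for rule in rules.strip().split(",") if rule]


msgid, hits = parse_hits_line("msg-1@localhost BAYES_99,RAZOR2_CHECK,URIBL_BLACK\n")
```

A line without any rule hits yields an empty list rather than an error.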


Then an external tool could generate and maintain these files from the 
SA log and the maintained training corpora, omitting FP/FN from the log 
data.
Yes, that's a good idea, certainly better than learning directly from 
the mail which might be scattered around several mailboxes. However, how 
do you want to exclude FP/FNs? The log certainly doesn't provide this 
information. On the other side, having some false positives in the 
training data did not spoil my results. The algorithm did even predict 
these correctly as spam later on :)




Cheers,


Chris




Re: Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

John Hardin wrote:


It needs the score, and not just Y/N Spam/Ham (i.e. from which corpus 
file it came)?
The SVM does not need the score. However, the evaluation tool needs the 
score because it uses it to calculate FP/FN rate.


I was thinking you'd generate a ham file and a spam file from the log, 
possibly dynamically appending rows as messages are processed. 
Naturally this would contain FPs and FNs.
If you want it to be dynamic, then the plugin could do the appending. 
However, the model cannot be extended, which means that to incorporate 
new lines, the whole model must be recalculated. So this can't be done 
per message but only maybe on a daily basis.




You'd have a routine to extract the ham file from your full ham 
corpus/corpora, and likewise for spam. The assumption is any FP or FN 
would be placed into these corpora for normal bayes training.


The tool would then combine them, omitting from the log-generated 
files any msgid that appears in the training corpora files. You'd end up 
with one clean spam file and one clean ham file.
That implies that people are indeed using bayes training, but it might 
be a suitable idea. However, I don't think that FPs and FNs spoil the 
SVM result anyway. SVMs are quite robust to outliers (which FPs and FNs 
essentially are) and if their number is low compared to the total amount 
of mail, the algorithm will have no problem predicting them properly 
anyway :)
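
The combining step described in the quoted proposal amounts to dropping every log-derived entry whose message id also appears in the hand-sorted corpora. A minimal Python sketch, with `clean_log_entries` and the dictionary layout being assumptions for the example:

```python
def clean_log_entries(log_entries, corrected_msgids):
    """Drop log entries whose message id appears in the training corpora.

    log_entries:      dict mapping msgid -> list of rules hit
    corrected_msgids: set of msgids that were manually re-sorted (FP/FN)
    """
    return {
        msgid: rules
        for msgid, rules in log_entries.items()
        if msgid not in corrected_msgids
    }


log_spam = {"a@x": ["BAYES_99"], "b@x": ["URIBL_BLACK"]}
clean_spam = clean_log_entries(log_spam, {"b@x"})
```

The same call would be made once for the spam file and once for the ham file.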





Er, don't you mean it predicted them as ham (FP = ham scored as spam)? 
It would be great if it was smart enough to recognize a near-boundary 
false result as what it *should* have been...


I mean that I had some unrecognized spam left in my inbox, and the 
algorithm did identify it as spam :) The SVM generally tries to find a 
hyperplane, however, if the wrongly labeled points (FPs and FNs) are of 
small count, the SVM will most likely produce a result where the FPs and 
FNs do not match the label they were trained with. The C-SVM uses a cost 
constraint (each label violation costs a certain value) and tries to 
minimize a given term which includes this cost. So if the dataset is 
sufficiently large but has _some_ wrongly labeled points, the chances 
that the result is still what you wanted to have are high :)
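
For reference, the cost term mentioned above appears in the standard soft-margin (C-SVM) formulation, given here in its textbook form. The symbols are the usual ones and are not taken from the plugin: $x_i$ is the rule-hit vector of mail $i$, $y_i \in \{-1, +1\}$ its label, $\xi_i$ the slack (label violation) of that mail, $\phi$ the kernel feature map and $C$ the cost per unit of violation.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 .
```

A few mislabeled mails only add their slack $\xi_i$ to the sum, so the optimizer can leave them on the wrong side of the hyperplane when that keeps the margin for the rest of the data wide.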



-- Chris





Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-11 Thread decoder

Marc Perkel wrote:

So - making any progress? :)


Yes, indeed. I am currently rewriting my code to be more generic and 
cleaner (you wouldn't want to see my initial poc code^^). Once I'm done 
with that, I can quickly repeat some of the experiments on other mail 
sets, such as the one that Justin sent me. After that, I'll write a 
small plugin for SA, so you guys can test around with it (that shouldn't 
be a big deal).




Chris


P.S.: If you want to provide me another ham/spam corpus or even a 
collection of false negatives, feel free to contact me :)






Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-11 Thread decoder

John Hardin wrote:


Chris:

Do you have any interest in writing an offline tool that generates 
static metarules based on the SA log and FP/FN corpora, as I mentioned?


Running some experiments for this kind of tool is at least on my todo 
list :) I don't know however, when I will have time to do that :)



Cheers,


Chris





Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

Justin Mason wrote:


Thanks for doing this!  couple of q's:

1. I can offer a bigger ham/spam corpus if you'd like to test against
that as well;
corpora from multiple contributors can sometimes expose training set bias.
  
That would be cool :) Is this corpus already processed by spamassassin 
(i.e. has SA headers)?


My poc code currently mines only the headers to find out what rules are 
triggered.

2. can you test it on spam that scored less than 10 points when it arrived?
low-scoring spam is, of course, more useful to hit than stuff that scored highly
on the existing rules.
  
Things like that should be possible easily. I need to check if I have 
enough mails to do a sufficiently reliable test here.


3. does it give an indication of confidence in its results? or just a
binary spam/ham
decision?
  
I'm currently working only with a binary classifier. However, libsvm 
supports probability estimates and regression (and to my knowledge, 
internally, most SVM algorithms relax classification output to real 
values and then use the sign to determine the classification; this can 
also be seen as some sort of confidence value).
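
The idea of reading the real-valued output as a confidence can be sketched like this. It is a Python illustration only; `classify` is a hypothetical helper, and which sign stands for spam is an assumption that depends on how the labels were encoded during training.

```python
def classify(decision_value):
    """Turn a real-valued SVM output into (label, confidence).

    The sign picks the class; the distance from zero serves as a rough,
    uncalibrated confidence. Positive = spam is an assumption here.
    """
    label = "spam" if decision_value > 0 else "ham"
    return label, abs(decision_value)
```

A value of 2.3 would read as a fairly confident "spam", while -0.1 would be a "ham" verdict close to the boundary.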



4. hey, if you're writing an SVM plugin, it might be worth making one
that _also_
supports body text tokens, similarly to the existing Bayes plugin. ;)
  
This would surely be possible somehow, but we'd first have to come up 
with a good representation of the problem for an SVM. I wouldn't want to 
mix this with the current experiment either, as these two things somehow 
represent different data.

One of the problems with text tokens is that there can always be new 
ones (which would increase the dimension of the problem and hence 
require the whole SVM to be remodeled, so a system as performant as 
bayes might not work directly).


5. btw one particularly tricky part of dealing with user-trainable
dbs, is supporting
expiry of old tokens.  but that can be deferred until later anyway.
  

I guess this is a question of implementation :)




Best regards,


Chris




Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

Marc Perkel wrote:


Good work so far but sounds like you need to throw more data at it. 
Also even though you indicate over 99% accuracy can you break that 
down better? 99.9% is 10 times as accurate as 99%.
What do you mean by more data? Of course, some additional data might 
help. One should consider that _most_ of the SA rules are designed to 
score on spam. For an SVM, you can use more general data like Mail has 
property XYZ although you don't know what this property means (ham or 
spam) or if it is even suitable to classify anything. This is of course 
an advantage.



With respect to the numbers:

I repeated the experiments today with slight modifications to provide a 
more solid setup:


The input is again the dataset I used yesterday. In one run, I permute 
the dataset, then split it (2/3 training vs. 1/3 testing, not stratified).
Then the training set is used to train an SVM, and it is applied to the 
1/3 testing set and additionally to my false negatives set.


The SVM outputs an accuracy value, but I wrote a tool that calculates 
precision and recall by hand because these values are more interesting as


1 - Precision = False Positive Rate (strictly the share of mails tagged as spam that are really ham; an important factor in SA)
1 - Recall = False Negative Rate (or, consider recall as the detection rate)
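
Computing the two values by hand takes only the counts of true positives, false positives and false negatives. The sketch below is in Python and is not the tool mentioned above; `precision_recall` is a hypothetical name. The example counts are the false negative set from the first run in the attached log (64 of 202 spam mails recognized, no ham in the set).

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts, with spam as the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# False negative set: 64 caught, 138 missed, no ham that could be mis-tagged.
precision, recall = precision_recall(tp=64, fp=0, fn=138)
```

This gives a precision of 1.0 and a recall of about 0.3168, matching the 100 % and 31.68 % reported for that set.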


I ran this 5 times, the output is attached as text file, there you will 
see the exact numbers :)


Taking the mean over the 5 runs:


False positive rate: 0.37908199952036 %
Detection Rate: 99.18104855859372 %

Detection Rate on False Negatives (my SA has 0% on this set): 
31.7821782178218 %



One should consider that my dataset might not be 100% accurate. It is 
combined from my inbox and my spam folder. Of course my spam folder is 
unlikely to contain ham, but it is surely possible that I forgot to 
delete one or another false negative from my inbox. I'm looking forward 
to getting Justin's set :)





Also - when it identifies messages do the numbers on the spam scores 
go up and ham goes down? If so that makes it more solid and starves 
the middle. I'm encouraged that the initial results are good.

What do you mean by that question, I don't really understand it :)


My feeling is that if this works that it will work better if we have 
more informational tokens. For example - is the from address a 
freemail address. Does the message contain a freemail address. By 
themselves these wouldn't score points. But spam coming from yahoo, 
hotmail, gmail, etc. is a different kind of spam than spam coming from 
spambots. Maybe country tokens from the received lines would be 
useful. Maybe names of banks in the message would be useful. For 
example Bank of America + Nigeria = spam.
Yes, this is exactly what I meant above. These tokens are of limited use 
for SA currently, but an SVM might be able to use them :)



Cheers,


Chris
Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 449
nu = 0.144606
obj = -529.640159, rho = -2.227729
nSV = 802, nBSV = 785
Total nSV = 802
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.8896856039713 %
Recall: 99.01585565883 %

Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %

=

Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 466
nu = 0.147031
obj = -539.132218, rho = -2.297470
nSV = 817, nBSV = 791
Total nSV = 817
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 32.1782% (65/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.6613995485327 %
Recall: 99.2134831460674 %

Results on false negative set:
Precision: 100 %
Recall: 32.1782178217822 %

=

Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 454
nu = 0.146568
obj = -535.034660, rho = -2.187959
nSV = 814, nBSV = 793
Total nSV = 814
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.3834080717489 %
Recall: 99.4391475042064 %

Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %

=

Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 447
nu = 0.144391
obj = -530.359839, rho = -2.219816
nSV = 802, nBSV = 781
Total nSV = 802
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on 

Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

Marc Perkel wrote:


I suppose what I was thinking was that you still used the SA result 
but added or subtracted from the SA result based on your SVM code, 
sort of the way bayes does. Or are you letting SVM make the final 
determination?
At the moment, I am only using the SVM answer. What you finally do with 
it is the next step. You can use it like a normal rule and give it a 
score, of course. You can also only use the SVM, but I think I'll go for 
the scoring idea :) It would also be possible to use an SVM model that 
supports confidence/probabilities. So far I have only evaluated the 
precision/recall of this method on its own, without any scoring.
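
Using the SVM answer "like a normal rule" would mean adding a fixed score to the SA total whenever the SVM says spam. A Python sketch of that idea follows; `combined_score` and `svm_rule_score` are hypothetical names, and 5.0 is used because it is SpamAssassin's default `required_score`.

```python
def combined_score(sa_score, svm_says_spam, svm_rule_score):
    """Add the SVM rule's score to the SA total only when the SVM fires."""
    return sa_score + (svm_rule_score if svm_says_spam else 0.0)


def is_spam(total_score, required_score=5.0):
    """Apply the usual threshold to the combined total."""
    return total_score >= required_score
```

A mail at 3.8 points would stay ham on its own but cross the threshold if the SVM rule contributed 2.0 points.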




Chris





Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

John Hardin wrote:
Would there be any benefit to having an offline version - i.e. 
something that evaluates the log or a corpus to generate new meta 
rules, that could be added onto the default ruleset? For instance:


cron @ 0200:
sa_meta_eval  /etc/mail/spamassassin/metarules.cf
/etc/init.d/spamassassin restart




This is definitely a good idea. You can create the SVM model offline from 
a logfile only, if it includes the rules that scored and the ham/spam 
status. However, you cannot generate metarules with SVMs, for that 
purpose you need a different learning algorithm (for example bayes, or 
decision trees).


However, SVM classification is very cheap, so once you created the model 
offline, you can use it online really quickly with a plugin.




Cheers,



Chris




Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread decoder

Justin Mason wrote:


So you're volunteering to code it up, then? ;)


I was planning to do at least some brainstorming+experiments as to what 
learning methods would seem suitable and how well the method performs, 
whenever I have time again. Unless someone else did that already?






Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread decoder

Marc Perkel wrote:


Justin Mason wrote:


So you're volunteering to code it up, then? ;)

--j.

  

I would if I were any good at perl.


I think we should evaluate if the suggested technique works and performs 
better or is at least of some benefit, before trying to implement it 
properly as a plugin. Such a test can be done offline with spam/ham 
easily... I started writing a script that mines some of my spam and ham, 
and then I'll evaluate how good the classifiers are that I get.



Cheers,


Chris




Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread decoder

decoder wrote:

Justin Mason wrote:


So you're volunteering to code it up, then? ;)


I was planning to do at least some brainstorming+experiments as to 
what learning methods would seem suitable and how well the method 
performs, whenever I have time again. Unless someone else did that 
already?




Ok, I did some short experiments: I've built an SVM classifier from a 
large mail corpus (8226 mails (5414 ham, 2812 spam)) and did a 5-fold 
cross validation. The resulting classifier has an accuracy of over 99%, 
so it performs as well as the regular system. I then applied this to a set 
of 202 False Negatives that I collected, and 69 of these are recognized 
as spam by the SVM. As a second test, I pulled 2707 mails from one of my 
other inboxes and applied the classifier, the accuracy was again over 
99% (and this is only ham).
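
For readers unfamiliar with the term, a 5-fold cross validation splits the corpus into five parts and uses each part once as the test set while training on the other four. The Python sketch below only produces the index splits; `k_fold_indices` is a hypothetical helper, not what was used for the experiment, and the data is assumed to be shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Corpus size from the experiment: 8226 mails, 5 folds.
splits = list(k_fold_indices(8226, 5))
```

Every mail ends up in exactly one test fold, so the reported accuracy covers the whole corpus.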


From my point of view, the results show that this approach has 
potential. It is highly accurate with respect to the current system, but 
additionally outperformed it on several false negatives.



There are other advantages that this system has over the common system: 
It allows everybody to train the whole spamfilter (not only Bayes) to 
the kind of spam that one receives, i.e. it is more adaptive than the 
common system.



Any opinions on this are greatly welcome. Maybe we should try to come up 
with a proof of concept plugin for SA?



Best regards,


Chris




Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-03 Thread decoder

Marc Perkel wrote:

LuKreme wrote:

On Mar 3, 2009, at 10:06, John Wilcock j...@tradoc.fr wrote:


On 03/03/2009 17:42, Matus UHLAR - fantomas wrote:
I have already been thinking about the possibility of combining every 
two rules and doing a masscheck over them. Then, optionally, repeating 
that again, skipping duplicates. Finally, gather all rules that scored 
>= 0.5 || <= -0.5 - we could have an interesting ruleset here.

But that's going to be a HUGE ruleset.
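
How huge can be estimated quickly: n rules give n(n-1)/2 unordered pairs. A short Python sketch (the rule names and `pair_count` are made up for the example):

```python
from itertools import combinations


def pair_count(n_rules):
    """Number of unordered rule pairs among n_rules rules."""
    return n_rules * (n_rules - 1) // 2


rules = ["A", "B", "C", "D"]
pairs = list(combinations(rules, 2))  # all 2-rule combinations
```

Four rules give 6 pairs, while 1000 rules would already give 499500, and a second round of combining would grow far beyond that.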


Not to mention that different combinations will suit different sites.

I wonder about the feasibility of a second Bayesian database, using 
the same learning mechanism as the current system, but keeping track 
of rule combinations instead of keywords.


It sounds like a really good idea to me, and also like the most 
reasonable way to manage self-learning meta rules.


It seems to me that the consensus is that it's worth a try. I don't 
know if it will work or not but I think there's a good chance this 
could be a significant advancement in how well SA works.


I had exactly the same idea as Marc quite a while ago, but didn't try it 
(yet) because I didn't have a big corpus of false positives/negatives to 
test on. Using such a system mainly makes sense to actually improve the 
performance, i.e. to minimize false positives and negatives, so one 
would need to show that it indeed does improve the performance.


Apart from that, it should be simple using machine learning algorithms 
(e.g. Bayes, or even something more complex, like an SVM) to learn 
meta rules and also reasonably fast once one has the model.



Chris






RBLs and Freemail Forwards

2008-06-29 Thread decoder

Hello,


on our private mail server we now have quite some forwards from freemail 
providers like yahoo, gmx and such. This wasn't a big problem previously 
but there is quite some spam arriving now over those forwards that isn't 
tagged as such (mainly I think because RBLs can't strike on those).


Is there a way to modify the trust path such that I can actually trust 
the Received header added by the freemailer MTA (so that RBLs can match 
the Received line which is before the freemailer MTAs) ? I wouldn't 
really add all those to trusted hosts (and for yahoo, there are tons of 
mtas it seems).




Thanks in advance,


Chris




Re: RBLs and Freemail Forwards

2008-06-29 Thread decoder

Matt Kettler wrote:
Nearly all positive-score RBLs will check all untrusted hosts in 
Received: headers, except the DUL RBLs and XBL which only check the 
first untrusted because they are designed to be used in that manner.


ie: SBL will be tested against *ALL* untrusted hosts, including the IP 
delivering mail to the freemailer, not just the freemailer itself.


Thanks for the clarification, I thought that all RBLs only hit on the 
first untrusted host for performance reasons. If that isn't the case, 
then I'll have to find another way to get rid of that specific spam type 
which is getting quite annoying.. :D



Best regards,


Chris




Re: ocr plugin

2008-05-02 Thread decoder

Matus UHLAR - fantomas wrote:

does it push the extracted text back to SA so it could be used by e.g.
bayes? This is how it imho should be used.

(and imho the same for .pdf and/or .doc - extract text _and_ images from
it, call OCR for images...)

  
That is a question that was very frequently asked around here and that's 
why I also included it in the FuzzyOcr FAQ:


If you take a look at the actual results of the OCR engines used, then 
you'll see that the output suffers from a lot of noise. Hence, it is not 
suited for common word analysis like bayes, and FuzzyOcr uses a special 
fuzzy matching algorithm to find the words.
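
To give an idea of what fuzzy matching on noisy OCR output means, here is a small Python sketch. It is not FuzzyOcr's algorithm: it uses the similarity ratio from Python's standard `difflib` module as a stand-in, and `fuzzy_contains` and the 0.75 threshold are assumptions for the example.

```python
from difflib import SequenceMatcher


def fuzzy_contains(word, ocr_text, threshold=0.75):
    """True if some window of ocr_text resembles word closely enough."""
    word = word.lower()
    text = ocr_text.lower()
    n = len(word)
    # Slide a window of the word's length over the text.
    for start in range(max(1, len(text) - n + 1)):
        window = text[start:start + n]
        if SequenceMatcher(None, word, window).ratio() >= threshold:
            return True
    return False
```

An OCR result such as "buy v1agra n0w" still matches the word "viagra", whereas an exact word lookup of the kind bayes relies on would miss it.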


Also, the SA plugin architecture is not designed to modify the message 
in any way, so you cannot push back the text into the normal processing 
line.


As to image spam in general: Yes, it has dropped dramatically and I 
haven't seen any actually for quite a long time now. I hope that my tool 
is one reason that this annoying technique is gone now :D



Best regards,


Chris





Re: ocr plugin

2008-05-02 Thread decoder

Theo Van Dinter wrote:

On Fri, May 02, 2008 at 09:12:12PM +0200, decoder wrote:
  
Also, the SA plugin architecture is not designed to modify the message 
in any way, so you cannot push back the text into the normal processing 
line.



Really?  Who says?  I made very specific modifications in 3.2 to allow for
just that.

Search the list archives for post_message_parse.
  
Ah ok, I was referring to the 3.1.x architecture. I haven't looked at the 
changes done in 3.2, but if this is technically possible now, then I 
apologize :D



Best regards,


Chris





Re: Returned mail spam

2008-04-09 Thread decoder

mouss wrote:


he's not the only one... seems there's a lot of backscatter coming in 
these days.
I guess the reason is that it is so easy to make a mistake in a 
mailserver configuration that enables backscatter...


We recently discovered that even our own mailserver (Postfix) was a 
backscatter source (and 1-2 weeks ago spammers started to actively use 
it). There were several reasons, and I'd like to share these points with 
the list so nobody makes the same mistakes.


1) With Virtual Domains, the recipient validation is not properly done 
anymore once you map one virtual domain to another, so do not do that. 
Also never use wildcards with domain names except if there is a catch 
all defined for this virtual domain entry.


2) By default, Postfix happily seems to accept email addresses referring 
to subdomains of domains listed in $mydestination. The option 
responsible for this cruel behavior is 
parent_domain_matches_subdomains which is by default not empty. We've 
set it to an empty string and after that, Postfix finally rejected mails 
to bogus recipients on our subdomains.



If any of that is wrong, feel free to correct me :)


Best regards,


Chris




Re: Spam abuse report plugin

2008-04-04 Thread decoder

Eddy Beliveau wrote:
- Original Message - From: Michael Scheidell 
[EMAIL PROTECTED]
To: ram [EMAIL PROTECTED]; spamassassin-users 
users@spamassassin.apache.org
Sent: March 27, 2008 10:04
Subject: Re: Spam abuse report plugin





From: ram [EMAIL PROTECTED]
Date: Thu, 27 Mar 2008 15:36:04 +0530
To: spamassassin-users users@spamassassin.apache.org
Subject: Spam abuse report plugin

I get a lot of spam on my servers which get detected by SA though are
generated by innocent mail servers.

We see a lot of mail users have insanely simple passwords; spammers are
using these accounts to send spam. By the time the administrator
realizes, the server has sent 1000's of spam

So you would spam the abuse@ account '-)



If spamassassin had an option to send abuse report to servers
automatically and send mails to abuse@server-admin the moment the
first sure spam comes in the admin could be warned before much damage
has been done. Obviously we limit to only 1 or 2 reports in an hour to a
particular id


Best is to set up something to use 'spamassassin -r' (report) feature.
Set up a SpamCop account, put that information in local.cf.
SpamCop will scan the emails for uri's add them to uri blacklists, add the
server to spamcop blacklists, track down the responsible isp, and pre-format
a complain email.

If you have DCC and RAZOR, it will also submit the information to those
databases.

NOTE: YOU DO NOT WANT TO AUTOMATICALLY SEND REPORTS AS THIS _WILL_ SPAM
INNOCENT, FORGED DOMAINS ADDING TO THE BACKSCATTER PROBLEMS.


Hi!

This subject is very interesting

I received many spams daily and have to manually analyse headers or 
email content to be able to send abuse report


Is there a tool which can do this for me ?

I imagine some web form (unix/windows) in which I can put a cut/paste 
of original email (including headers)

and that tool can prepare abuse complaint automagically.

Does that beast exist ?
There is a very basic problem with that. You normally report abuse for 
domains or IPs; however, in most cases you do not know the originating 
IP, because you cannot trust headers. There might be innocent relays 
(freemailers, for example) in the middle, and you cannot simply pick the 
first hop, since that one might be forged by spammers. So even determining 
a reliable source address is something that can hardly be automated.
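
As a rough illustration of how little can be trusted, here is a minimal
Python sketch that walks the Received headers from the newest one down
and stops at the first header that was not written by one of your own
relays. The relay name in TRUSTED_RELAYS and the function name
last_trusted_peer are assumptions for the example, and the regexes only
cover a common Received layout.

```python
import re
from email import message_from_string

# Hypothetical: hostnames of relays you run yourself and therefore trust.
TRUSTED_RELAYS = {"mx.example.org"}


def last_trusted_peer(raw_message, trusted=TRUSTED_RELAYS):
    """Return the IP that connected to the outermost chain of trusted relays.

    Received headers are listed newest first. Anything below the first
    header not written by a trusted relay may be forged, so it is ignored.
    """
    msg = message_from_string(raw_message)
    peer_ip = None
    for received in msg.get_all("Received", []):
        by = re.search(r"\bby\s+([\w.-]+)", received)
        if not by or by.group(1).lower() not in trusted:
            break  # from here on the sender controls the content
        ip = re.search(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", received)
        if ip:
            peer_ip = ip.group(1)
    return peer_ip
```

Note that the address found this way is only the host that handed the
mail to you. It may well be an innocent relay, which is exactly why an
automatic abuse report based on it is risky.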



Best regards,


Chris



Thanks,
Eddy







Re: Bye for good FuzzyOCR

2007-07-22 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

David Morton wrote:


 On Jul 22, 2007, at 9:43 AM, arni wrote:

 Loren Wilton wrote:
 I'm not recieving much of it anymore anyways.

 FWIW, about 20% of the spam I got today had either a GIF or PNG
  image attached to it.  Most advertizing viagra in clear text
 with no obfuscation, a few advertizing stocks.  FuzzyOCR still
 does quite well here.

 Loren

 I'm not saying that it doesnt work well anymore, i'm just saying
 that i dont need it anymore to bring my spam to above 10 points,
 what happened for me lately was the following: image spam was
 above 10 pts already and fuzzyocr didnt run so fuzzyocr only ran
 for ham with images completely wasting resources

 so i uninstalled it


 I upgraded a system to SA 3.2, which I see now is not compatible
 with FuzzyOCR yet.  I started getting a bunch of image spam again.
 :(

 I wish I had it again...
Try using the SVN Version (revision 132). This is basically the same
as the latest 3.5.x release but some issues with SA 3.2.x were fixed.


Best regards,


Chris




 David Morton Maia Mailguard http://www.maiamailguard.com
 [EMAIL PROTECTED]



-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGo7LoJQIKXnJyDxURAluRAJ9E2BMNncHnPymSY5BDCjr5uOOK+QCfZVll
6MOrbLP0OWQeveEi3raL9Nw=
=BkuK
-END PGP SIGNATURE-



FuzzyOcr and PDF files

2007-07-03 Thread decoder

Hello all,

because some people insisted on it, I added an experimental feature to
FuzzyOcr that allows you to scan PDFs as if they were images.

The feature was implemented in the latest SVN revision and is of
course disabled by default.

Personally, I would not use this feature because the risk of false
positives on important documents is really high, but if you really
want to test this, here are the steps to enable it:

1. Get the dependencies:
 - A netpbm version that includes pstopnm
 - Poppler (http://poppler.freedesktop.org/) for the pdfinfo and
   pdftops binaries

2. Add those binaries as helper apps in FuzzyOcr.cf (see the .cf file
   included in SVN).
3. Enable PDF scanning with focr_scan_pdfs 1 in the config.

Optionally, it is possible to skip PDFs which contain more than x
pages (focr_pdf_maxpages).
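
The page limit can be pictured with the following Python sketch. It is
not the plugin's code (FuzzyOcr is written in Perl); it only shows the
idea of reading the "Pages:" line that pdfinfo prints and skipping
documents above a limit. The function names are made up for the example.

```python
import re


def parse_page_count(pdfinfo_output):
    """Extract the page count from pdfinfo's text output, or None if absent."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def should_scan_pdf(pdfinfo_output, max_pages):
    """Scan only if the page count is known and does not exceed the limit."""
    pages = parse_page_count(pdfinfo_output)
    if pages is None:
        return False  # unreadable metadata: better to skip than to guess
    return pages <= max_pages
```

Skipping on unreadable metadata is a design choice of this sketch only;
how the plugin treats that case is not stated here.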

Currently, the parameters for pstopnm are hardcoded (-xsize=1000);
there might be better ways/values to translate PDFs into usable but
not too big pnm files.

If you know better ways, tell me. I am also missing some recent PDF
spam samples (which contain images), so if you could upload some
samples, that would help as well.


Best regards,


Chris



Re: Which version fuzzyocr

2007-07-03 Thread decoder

Gary V wrote:
 Hello,

 On the fuzzyocr site I see 3.5.1 version is not SA 3.2.X
 compatible ? Is this true, or can I safely ignore :-)

 We have an older server with SA 3.2.0 and Fuzzyocr 2.3b and it
 works.

 Greetings.. Richard

 http://marc.info/?l=spamassassin-users&m=118254092310213
The revision mentioned in this post is the correct one. I am sorry for
any confusion; I will make another release soon for 3.2 compatibility.
Until then, use the svn checkout command that Gary wrote about in his
reply. As for FuzzyOcr 2.3b, I recommend not using this version
anymore, as it has plenty of problems/bugs which remained unfixed
because they were design errors.


Best regards,


Chris


 Gary V






Re: FuzzyOCR Use of uninitialized value Hashing.pm errors

2007-06-26 Thread decoder

Russell Galpin wrote:
 Hi There

 I'm running SA 3.2.1 with the latest version of FuzzyOCR (from svn) and I'm
 receiving the same error over and over again in my mail logs:

 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 245.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 248.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 251.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 254.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 257.
 Jun 25 17:25:56 mta1 spamd[629]: Argument "" isn't numeric in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 245.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 248.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 251.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 254.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 257.
 Jun 25 17:25:56 mta1 spamd[629]: Argument "" isn't numeric in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.
 Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.

 My FuzzyOcr.cf and setup are pretty much stock, I'm sending the mail from
 spamc to spamd. I've tried sending a email (with spammy image attached)
 through spamassassin -D fuzzyocr from the command line and I can't get the
 error to reproduce itself. It seems to only occur on certain messages.

 It appears as though FuzzyOCR is still working, it's scoring messages and
 writing hashes to the MySQL database, I'm just not sure if it's working
 as well as it should.

 Anyone got any ideas where I can track down the problem?
Hi,

I replied on your ticket in our Trac System, you can follow the steps
there to get more information.


Best regards,

Chris





 TIA

 Russ




FuzzyOcr SVN version fixes formatting problems with SA 3.1.8 or higher

2007-06-22 Thread decoder

Hello all,


I've just committed some changes to our SVN that fix the ugly
formatting problems that came up with SA 3.1.8 and higher.

The new version should display results with a proper formatting in the
SA report, without screwing up the FuzzyOcr logging output.

Thanks to Justin Mason for pointing me to the correct function
(test_log) to achieve this :)



For those that want to try the newest version, read
http://fuzzyocr.own-hero.net/wiki/Downloads#SVN for information about
our SVN.

The current SVN version is not very different from the current 3.5.x
release, so overwriting a 3.5.x install will work in most cases. Please
note, however, that this API has only been tested with SA 3.2.0. I am not
sure whether it exists in older versions or when the function test_log
was introduced. If you know this, please tell me :)


Thanks in advance for testing and please report back problems to me
(only serious bug reports related to the SVN version, no general
problems).


Chris





SpamAssassin 3.2 compatibility

2007-05-27 Thread decoder

Hi all,


after I saw that there are incompatibilities between SA 3.2 and FuzzyOcr,
I decided to try to fix them although I'm still very busy (preparing
for my Bachelor thesis).


I made changes and the current SVN version fixes ticket #396 as well
as the good old Exporter.pm warnings bug.


There is still another problem though, the formatting of the rule
descriptions changed in SA 3.2 and I can't seem to get it to do the
old formatting (listing the words etc), the wrapper screws it up
totally.


I'd be glad if some people could try the SVN version with SA 3.2 and
report back. If someone knows how to fix those ugly output formatting
problems, tell me. This is a cosmetic issue, but it still looks bad.

If there are any other problems that can be fixed quickly, tell me and
I'll make those changes in SVN before releasing a bugfix version for 3.2.


Best regards,


Chris



Re: FuzzyOcr 3.5.1- error messages in logs

2007-01-15 Thread decoder


Frank Bures wrote:
 On Mon, 15 Jan 2007 20:34:38 +0100, Mark Martinec wrote:

 Frank Bures writes:

 Since I updated to 3.5.1 from 3.4.2, I am sometimes getting the
 following:

 FuzzyOcr: Error running preprocessor(pamthreshold): /usr/local/bin/pamthreshold -simple -threshold 0.5
 FuzzyOcr: Errors in Scanset ocrad-decolorize
 FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number - not a PAM, PPM, PGM, or PBM file

 Any explanation?
 No, but here is my complaint about the same problem:

 http://marc.theaimsgroup.com/?l=spamassassin-users&m=116837265504702



 Mark

 In my case the gif's are spam text gif's, advertising stock and
 containing words listed in FuzzyOcr.words.  After the above
 mentioned error, the Fuzzy Ocr does not trigger at all and the
 message is scored using any other rules except FuzzyOcr.

 If it would be helpful, please let me know where to post a message
  triggering this problem.

Send it directly to me in a tar.gz or some other form of archive (the
whole msg, not only the pic)

Chris

 Thanks


 Frank Bures, Dept. of Chemistry, University of Toronto, M5S 3H6
 [EMAIL PROTECTED] http://www.chem.utoronto.ca PGP public key:
  http://pgp.mit.edu:11371/pks/lookup?op=index&search=Frank+Bures



Re: FuzzyOcr 3.5.1 released

2007-01-10 Thread decoder


Len Conrad wrote:
 With the severe obfuscation of spam images with:

 1) low-contrast between f/g and b/g and

 2) random images/edges in the b/g,

 ... how effective is FuzzyOCR in OCR accuracy?
With these two factors, FuzzyOcr 3.5.x does not have many
problems.

1) Is covered by binarization, for example in ocrad with the -T
percent switch

2) Is covered by the pamthreshold/pamditherbw scansets
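
To show what binarization buys you against low contrast, here is a tiny
Python sketch. It is not what ocrad or pamthreshold do internally, just
the basic idea: every gray value (0.0 = black, 1.0 = white in this
example) is forced to 0 or 1 around a threshold, so a faint difference
between foreground and background becomes a hard edge. The function
names are made up.

```python
def mean_threshold(gray_rows):
    """Pick the average gray value of the image as a simple global threshold."""
    values = [value for row in gray_rows for value in row]
    return sum(values) / len(values)


def binarize(gray_rows, threshold=0.5):
    """Map every pixel to 1 (at or above the threshold) or 0 (below it)."""
    return [[1 if value >= threshold else 0 for value in row]
            for row in gray_rows]


# Low-contrast "text": 0.55 on a 0.45 background.
faint = [[0.45, 0.55, 0.45],
         [0.45, 0.55, 0.45]]
print(binarize(faint, threshold=0.5))
```

After thresholding, the two nearly identical grays end up as plain 0 and
1, which is what an OCR engine needs.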

However, there are other things that don't work as well.. which I
won't enumerate here for obvious reasons ;)


Chris


 Len






Re: FuzzyOcr 3.5.1 released

2007-01-08 Thread decoder


jdow wrote:
 From: Andy Dills [EMAIL PROTECTED]

 On Sun, 7 Jan 2007, Andy Dills wrote:

 On Sun, 7 Jan 2007, decoder wrote:



 Hello all,


 since 3.5.0 RC1 was released, we fixed many bugs, thanks to
 the many testers and bug reporters :) so big thanks.


 I have something I'm curious about, having run FuzzyOcr in a
 medium size (3-400k messages per day) mail cluster for about a
 week now.

 Why do you do database maintenance with every unmatched check?

 From Hashing.pm:

     unless ($match) {
         my $then = time - ($conf->{focr_db_max_days}*86400);
 --->    $sql = qq(select * from $db.$dbfile order by $dbfile.check);
         my $sth = $ddb->prepare($sql);
         $sth->execute;
         while (my @row = $sth->fetchrow_array) {
             my $hash2 = $row[1] || "0:0:0:0";
             $hash2 .= "::$row[0]";
             if (within_threshold($digest,$hash2)) {
                 $txt   = 'Approx';
                 $key   = $row[0];
                 $next  = $row[5] + 1;
                 $when  = $row[7] || $now;
                 $ret   = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
                 $dinfo = $row[9] || '';
                 infolog("Found[$dbfile]: Score='$row[8]' Info: '$row[9]'");
                 last;
             }
         }
         # Expire old records...
 --->    $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
         debuglog($sql,2);
         $ddb->do($sql);
     }


 Those two queries are extremely expensive in a larger
 envrionment...I have commented this code segment out on our
 cluster, and have written a quick maintenance script that runs
 once per day...dropped the response time from 2-3s to .01-.05s
 on queries, and eliminated the suddenly large and
 customer-annoying mailqueues.

 Sorry to follow up to my own post, but now that I read this
 segment a little closer I realize that I'm basically commenting
 out the matching capability of the Hashing mechanism, eliminating
 all value of the Hashing in the first place.

 So...I guess my point is, unless there is a better way of
 determining the match than checking every single hash in the
 database (hoping that you find one that is close enough along the
 way), it's more efficient (in larger environments at least) to
 just scan each mail message without hashing enabled.

 Thoughts?

 Andy

 Hash the hashes and store them in a suitable tree?
I explained before that you cannot hash the hashes, because a
cryptographic hash does not tolerate small differences: two similar
inputs produce completely unrelated digests. Fuzzy matching on such a
hash of the actual hash is therefore impossible.
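
A short Python sketch shows the effect. The two input strings are made
up (they only resemble the size:height:width:colors part of an image
hash and differ by a single unit); the helper names are hypothetical.

```python
import hashlib


def md5_hex(text):
    """MD5 digest of a string, as 32 hex digits."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def differing_hex_digits(a, b):
    """Count the positions at which two equally long hex strings differ."""
    return sum(x != y for x, y in zip(a, b))


first = "5120:100:200:16"
second = "5121:100:200:16"   # almost the same picture

print(md5_hex(first))
print(md5_hex(second))
print(differing_hex_digits(md5_hex(first), md5_hex(second)))
```

The inputs differ in one character, but the digests have nothing
recognizable in common, so "close" cannot be derived from them.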


Chris
 {^_^}




Re: Problems with FuzzyOcr 3.5.1

2007-01-08 Thread decoder


Ed Kasky wrote:
 I just upgraded to 3.5.1 and it seemed that everything was working
 until I tried using sa-learn on a few messages.  Running
 spamassassin -D --lint produces the following errors:

 [22986] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm
 [22986] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
 Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/lib/perl5/5.8.1/Exporter.pm line 60.
  at /usr/lib/perl5/5.8.1/i686-linux-thread-multi/POSIX.pm line 19
 [22986] dbg: plugin: registered FuzzyOcr=HASH(0x98343b0)
 [22986] dbg: plugin: FuzzyOcr=HASH(0x98343b0) implements 'parse_config'
 [22986] dbg: FuzzyOcr: focr_bin_helper: 'pnmnorm,pnminvert,pamthreshold,ppmtopgm,pamtopnm'
 [22986] info: FuzzyOcr: Adding 5 new helper apps
 [22986] dbg: FuzzyOcr: focr_bin_helper: 'tesseract'
 [22986] info: FuzzyOcr: Adding 1 new helper apps

 [22986] warn: config: failed to parse line, skipping: focr_bin_gifasm /usr/local/bin/gifasm
   -(this is installed at /usr/local/bin/gifasm)
 [22986] warn: config: failed to parse line, skipping: focr_bin_convert /usr/local/bin/convert
   -(this is installed at /usr/local/bin/convert)
 [22986] warn: config: failed to parse line, skipping: focr_bin_identify /usr/local/bin/identify
   -(this is installed at /usr/local/bin/identify)

Hi,

it seems like you are trying to define tools here that are no longer
needed in FuzzyOcr 3.5.

Neither ImageMagick (identify/convert) nor gifasm is required anymore
for FuzzyOcr. Please read the dependencies carefully and look at the
shipped FuzzyOcr.cf file. It contains everything that can be
defined :)


Best regards,


Chris



 I did not install the following executables but left them commented
  out in the cf.  Am I correct in assuming that I need to install
 them as well as the others that were required for previous
 versions?  If that's the case, I read that pamthreshold is part of
 the newer releases of Netpbm but what version?  I looked at 10.33
 and it's not there.

 [22986] warn: FuzzyOcr: Cannot find executable for gifsicle
 [22986] warn: FuzzyOcr: Cannot find executable for ocrad
 [22986] warn: FuzzyOcr: Cannot find executable for pamthreshold
 [22986] warn: FuzzyOcr: Cannot find executable for tesseract

 Thanks for any help on this one.  FuzzyOcr has been a great
 addition to the arsenal...

 Ed Kasky ~ Randomly Generated Quote (104 of 522): If you
 know how to spend less than you get, you have the philosopher's
 stone.   --Benjamin Franklin





FuzzyOcr 3.5.1 released

2007-01-07 Thread decoder


Hello all,


since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
testers and bug reporters :) so big thanks.

Now, the version seems stable enough to replace the 3.4.x branch, and
I recommend everyone to upgrade to it :)

For those who don't yet know what's new in the 3.5 branch, read the
changelog here:

http://fuzzyocr.own-hero.net/wiki/Changelog-3.x#version3.5.0

You can download version 3.5.1 at

http://fuzzyocr.own-hero.net/wiki/Downloads

For those that try to upgrade from 3.4.x or even 2.3b, please read the
installation manual carefully, the 3.5.x branch is very different to
earlier branches.

Unfortunately, I didn't have the time yet to create a FAQ, so if you
run into problems, try searching our ticket system and our mailing
list archives first. If you can't solve the problem then, please use
our mailing list to get help.

Please DO NOT use the ticket system to get help for your problems, the
ticket system is meant for bug reports, not for support requests. If
you think you've found a bug, feel free to create a ticket. The same
applies for errors or missing statements in documentation.


Best regards,


Chris




Re: FuzzyOcr 3.5.1 released

2007-01-07 Thread decoder


Giampaolo Tomassoni wrote:
 From: decoder [mailto:[EMAIL PROTECTED]
 Hello all,


 since 3.5.0 RC1 was released, we fixed many bugs, thanks to the
 many testers and bug reporters :) so big thanks.

 Excellent work. Thank you for your efforts in bringing it to us.

 Anyway, I'm wondering why the image hashing is made that way,
 leading to:

 1) a variable-length key and

 2) possibly even a very long one (depending on focr_hash_max).

 This pretty inefficient to handle on SQL backends and, infact,
 FuzzyOcr.mysql must define the key columns as varchar(255)...
If I had more time I'd develop a better hashing system, but I don't :(

 I see that the problem is due to the way the hashing is calculate
 in FuzzyOcr/Hashing.pm:

 code-snip
 my $cnt = 0; my $c = scalar(@stdout_data); my $s = (stat($pfile))[7] || 0;
 $hash = sprintf "%d:%d:%d:%d", $s,
     defined $pic->{height} ? $pic->{height} : 0,
     defined $pic->{width}  ? $pic->{width}  : 0, $c;
 if ($Threshold{max_hash}) {
     foreach (@stdout_data) {
         $_ =~ s/ +/ /g;
         my(@d) = split(' ', $_);
         $hash .= sprintf("::%d:%d:%d:%d:%d",@d);
         if ($cnt++ ge $Threshold{max_hash}) { last; }
     }
 }
 /code-snip

 Why not use some form of digest? In example, something like this
 could be more interesting to me:

 code-snip
 my $cnt = 0; my $c = scalar(@stdout_data); my $s = (stat($pfile))[7] || 0;
 $hash = sprintf "%d:%d:%d:%d", $s,
     defined $pic->{height} ? $pic->{height} : 0,
     defined $pic->{width}  ? $pic->{width}  : 0, $c;
 if ($Threshold{max_hash}) {
     use Digest;
     my $hctx = Digest->new('MD5');
     my $clrcnt = 0;
     foreach (@stdout_data) {
         my(@d) = split(/ +/, $_);
         $hctx->add(pack('CCCN', $d[0], $d[1], $d[2], $d[4]));
         if (++$clrcnt >= $Threshold{max_hash}) { last; }
     }
     $hctx->add(pack('N', $clrcnt));
     $hash .= '::' . $hctx->hexdigest;
 }
 /code-snip

 Which basicly creates a digest on the first (most frequent)
 $Threshold{max_hash} palette colors instead of simply enumerating
 them. The output will be around 40-45 characters and will stick
 with this length irregardless of the value of the focr_hash_max
 setting.

 Please note I'm not a perl wizard no a SA developer, so there is
 space for optimizations here. In example, Digest-new('MD5') could
 probably even be globally definited and there initialized, and a
 $hctx-reset issued when a new digest have to be computed.

 What are your thoughts about?
The point is, if you use a digest, then you need an exact match, no
matter if you digest the image directly, or any of the parameters,
because digests are designed to not accept any tolerance. But the
FuzzyOcr matching algorithm depends on accepting tolerance, the hashes
are never matched 100% exactly. That is why hashing any of the
parameters will not work.

Generally, any hashing algorithm is acceptable for FuzzyOcr as long as
it has tolerance built in. Spammers never send the same pictures
around, they are generated on the fly.
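
To give an idea of what "tolerance built in" means, here is a minimal
Python sketch of a tolerant comparison. It is an assumption-laden
illustration, not the plugin's matching code: the feature layout
(size:height:width:colors), the 5% tolerance and the function names are
all invented for the example.

```python
def parse_features(feature_hash):
    """Split a 'size:height:width:colors' style string into integers."""
    return [int(part) for part in feature_hash.split(":")]


def within_tolerance(hash_a, hash_b, tolerance=0.05):
    """True if every feature differs by at most `tolerance` (relative)."""
    a, b = parse_features(hash_a), parse_features(hash_b)
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        biggest = max(abs(x), abs(y))
        if biggest and abs(x - y) / biggest > tolerance:
            return False
    return True


# A regenerated picture with a slightly different file size still matches.
print(within_tolerance("5120:100:200:16", "5200:100:200:16"))
# A clearly different picture does not.
print(within_tolerance("5120:100:200:16", "9000:100:200:16"))
```

A comparison like this needs the raw feature values on both sides, which
is exactly what a digest throws away.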


Chris



 Regards,

 giampaolo





Re: Any modules use String::Approx?

2007-01-02 Thread decoder


Robert Nicholson wrote:
 Are there any plugins that use String::Approx as used by FuzzyOCR
 but used to match non-image spam?

Not that I know of, but it would definitely be possible. The only
problems are words which are too similar to spammy words, as
well as spammy words contained in normal words (e.g. specialist
vs. cialis), but such cases could be handled by adjusting the
threshold on a per-word basis and excluding some words.
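
As a rough sketch of that idea (in Python, and not using String::Approx
or its algorithm), the following looks for a spam word anywhere inside a
token with a small edit distance, uses a per-word threshold and skips an
exclusion list. The word list, thresholds and function names are
assumptions for the example.

```python
def best_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    previous = [0] * (len(text) + 1)  # a match may start anywhere in text
    for i, p in enumerate(pattern, start=1):
        current = [i]
        for j, t in enumerate(text, start=1):
            cost = 0 if p == t else 1
            current.append(min(previous[j] + 1,          # drop a pattern char
                               current[j - 1] + 1,       # skip a text char
                               previous[j - 1] + cost))  # match / substitute
        previous = current
    return min(previous)


SPAM_WORDS = {"cialis": 1, "viagra": 1}    # word -> allowed edit distance
EXCLUDED = {"specialist", "specialists"}   # harmless words never to flag


def spammy_hits(tokens):
    """Return (token, spam_word) pairs for tokens that match approximately."""
    hits = []
    for token in tokens:
        token = token.lower()
        if token in EXCLUDED:
            continue
        for word, max_distance in SPAM_WORDS.items():
            if best_substring_distance(word, token) <= max_distance:
                hits.append((token, word))
    return hits


print(spammy_hits(["Specialist", "c1alis", "hello"]))
```

Without the exclusion list, "specialist" would match "cialis" with
distance 0, which is the false positive mentioned above.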

What do others think?


Chris




Re: Error in FuzzyOcr 3.5.x branch

2006-12-28 Thread decoder


Jim Knuth wrote:
 Today (28.12.2006/05:10) Gary V wrote,

 Jim, I have been working on a doc for Debian. It is unfinished but
 may help you through some rough spots at this point. I have no idea
 when I'll have time to finish it. I have 3.5.0-rc1 running for two
 days now (works great).

 http://www200.pair.com/mecham/spam/image_spam2.html Gary V

 mmh, sorry. But the same game.

 spamassassin --lint
 Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /usr/lib/perl/5.8/POSIX.pm line 19
 Subroutine FuzzyOcr::debuglog redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 24
 Subroutine FuzzyOcr::parse_config redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 25
 Subroutine FuzzyOcr::check_image_hash_db redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 40
 Subroutine FuzzyOcr::add_image_hash_db redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 40
 Subroutine FuzzyOcr::calc_image_hash redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 40
 Subroutine FuzzyOcr::wrong_ctype redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 42
 Subroutine FuzzyOcr::corrupt_img redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 42
 Subroutine FuzzyOcr::known_img_hash redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 42
 Subroutine FuzzyOcr::max redefined at /usr/share/perl/5.8/Exporter.pm line 65. at /etc/mail/spamassassin/FuzzyOcr.pm line 43
 [2769] warn: Subroutine new redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 48.
 [2769] warn: Subroutine dummy_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 59.
 [2769] warn: Subroutine fuzzyocr_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 63.
 [2769] warn: config: failed to parse line, skipping: focr_end_config
 [2769] warn: lint: 1 issues detected, please rerun with debug enabled for more information


 -- Kind regards,

 And what version of SpamAssassin are you running?

 SpamAssassin version 3.1.7 running on Perl version 5.8.4

 Did you move all the old stuff out of the way and remove the
 loadplugin entry in v310.pre?

 no. ;) That was it! And now gets only:

 spamassassin --lint
 Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/share/perl/5.8/Exporter.pm line 65.
  at /usr/lib/perl/5.8/POSIX.pm line 19

 What is that still?
This is not a FuzzyOcr problem but a Perl core problem. Two core Perl
modules export the same constant(s) (in this case O_NONBLOCK). You can
safely ignore this. Upgrading Perl might remove this warning.


Chris


 In the 2.3 doc it had you comment out the loadplugin directive in
 FuzzyOcr.cf and add one to v310.pre. This doc does not do that.

 Gary V





Re: Despeckling images for OCR and anti-spam purposes

2006-12-23 Thread decoder


Kelly Jones wrote:
 Spammers are starting to put speckles in their images to defeat
 OCR-scanning plugins such as FuzzyOCR.
Which images are you referring to? If you can put up a sample, then I
can tell you which scanner setting will catch it :)


Best regards,

Chris



 I thought ImageMagick's -despeckle option would help, but it
 doesn't seem to, not even when applied multiple times, not even in
 conjunction with -monochrome.

 I want a filter that does this for each pixel X:

 1) if any of X's 8 neighbor pixels is the same color, turn X black
 2) otherwise, turn X white

 Can some combination of options to convert do this?

 I realize that:

 1. This will only work w/ indexed-color images (eg, GIFs) and not
    JPEGs, etc.
 2. Spammers will soon work around this, so this is just a short-term
    bandage.
 3. I could write something in libgd to do this (blech!)
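
For reference, the per-pixel rule from the quoted message (a neighbor of
the same color -> black, otherwise white) could be sketched like this in
Python. It only illustrates the rule as stated; it is neither an
ImageMagick option nor part of FuzzyOcr, and the names despeckle, BLACK
and WHITE are made up. The image is assumed to be a list of rows of
palette indexes.

```python
BLACK, WHITE = 0, 1


def despeckle(pixels):
    """Turn pixels with a same-colored neighbor black, isolated ones white."""
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            has_twin = any(
                pixels[ny][nx] == pixels[y][x]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            row.append(BLACK if has_twin else WHITE)
        result.append(row)
    return result


# One isolated speckle (color 7) in a uniform area of color 0.
image = [[0, 0, 0],
         [0, 7, 0],
         [0, 0, 0]]
print(despeckle(image))
```

Note that, taken literally, the rule also paints the uniform background
black, so text and background would both come out black.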





Re: Despeckling images for OCR and anti-spam purposes

2006-12-23 Thread decoder


Kenneth Porter wrote:
 --On Saturday, December 23, 2006 12:43 PM +0100 decoder
 [EMAIL PROTECTED] wrote:

 Which images are you referring to? If you can put up a sample,
 then I can tell you which scanner setting will catch it :)

 Does the SA wiki support uploading of images? Perhaps we could have
  a page of just problem images. Such a page is likely to grow large
  and consume a lot of bandwidth, so perhaps we could get a resource
  that thumbnails them and runs them through the Coral Cache.
I'm not sure about the SA wiki, but you can create a ticket for it on
our site and attach the picture :) Maybe I can also create a wiki page
on our site that allows uploading/attaching images. You
can find the site at fuzzyocr.own-hero.net.

Chris






Re: FuzzyOcr questions

2006-12-22 Thread decoder


Ronnie Tartar wrote:
 I have a Qmail Toaster setup.  I have everything working except the
  fuzzyocr.  Should it have information in the header about being
 scanned?

 Here is a header but I don't see the fuzzyocr plugin working

 X-Spam-Status: Yes, score=10.5 required=1.0 tests=EXTRA_MPART_TYPE,
   HTML_IMAGE_ONLY_16,HTML_MESSAGE,HTML_SHORT_LINK_IMG_2,INVALID_TZ_GMT,
   MY_CID_AND_ARIAL2,MY_CID_AND_CLOSING,MY_CID_AND_STYLE,
   MY_CID_ARIAL2_CLOSING,MY_CID_ARIAL_STYLE,SARE_GIF_ATTACH,
   SARE_GIF_STOX autolearn=no version=3.1.7
 X-Spam-Report:
   * 1.1 INVALID_TZ_GMT Invalid date in header (wrong GMT/UTC timezone)
   * 0.8 EXTRA_MPART_TYPE Header has extraneous Content-type:...type= entry
   * 0.0 HTML_MESSAGE BODY: HTML included in message
   * 0.6 HTML_IMAGE_ONLY_16 BODY: HTML: images with 1200-1600 bytes of words
   * 0.8 SARE_GIF_ATTACH FULL: Email has a inline gif
   * 1.1 MY_CID_ARIAL_STYLE SARE cid arial2 style
   * 1.0 HTML_SHORT_LINK_IMG_2 HTML is very short with a linked image
   * 0.9 MY_CID_AND_CLOSING SARE cid and closing
   * 0.7 MY_CID_AND_STYLE SARE cid and style
   * 0.7 MY_CID_AND_ARIAL2 SARE CID and Arial2
   * 1.2 MY_CID_ARIAL2_CLOSING SARE cid arial2 closing
   * 1.7 SARE_GIF_STOX Inline Gif with little HTML

 spamassassin -D --lint  shows fuzzyocr loading?

 [2261] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm
 [2261] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
 [2261] dbg: plugin: registered FuzzyOcr=HASH(0xb305908)
 [2261] dbg: plugin: FuzzyOcr=HASH(0xb305908) implements 'parse_config'
 [2261] dbg: FuzzyOcr: Option verbose = 1
 [2261] dbg: FuzzyOcr: Option logfile = /etc/mail/spamassassin/FuzzyOcr.log
 [2261] dbg: FuzzyOcr: Option global_wordlist = /etc/mail/spamassassin/FuzzyOcr.words
 [2261] dbg: FuzzyOcr: Valid search path: /usr/local/bin
 [2261] dbg: FuzzyOcr: Valid search path: /usr/bin
 [2261] dbg: config: allowing user rules!
 [2261] dbg: plugin: Mail::SpamAssassin::Plugin::ReplaceTags=HASH(0xac9d804) implements 'finish_parsing_end'
 [2261] dbg: plugin: FuzzyOcr=HASH(0xb305908) implements 'finish_parsing_end'
 [2261] dbg: replacetags: replacing tags
 [2261] dbg: replacetags: done replacing tags
 [2261] dbg: FuzzyOcr: Using gifsicle = /usr/bin/gifsicle
 [2261] dbg: FuzzyOcr: Cannot find executable for giffix
 [2261] dbg: FuzzyOcr: Cannot find executable for giftext
 [2261] dbg: FuzzyOcr: Cannot find executable for gifinter
 [2261] dbg: FuzzyOcr: Cannot find executable for giftopnm
 [2261] dbg: FuzzyOcr: Cannot find executable for jpegtopnm
 [2261] dbg: FuzzyOcr: Cannot find executable for pngtopnm
 [2261] dbg: FuzzyOcr: Cannot find executable for bmptopnm
 [2261] dbg: FuzzyOcr: Cannot find executable for tifftopnm
 [2261] dbg: FuzzyOcr: Cannot find executable for ppmhist
 [2261] dbg: FuzzyOcr: Cannot find executable for pamfile

Can't you read? You need to tell the plugin where those binaries are
located, if they are not in the standard locations. Did you even
satisfy the dependencies?


Chris

 [2261] dbg: FuzzyOcr: Using gocr = /usr/local/bin/gocr
 [2261] dbg: FuzzyOcr: Using ocrad = /usr/local/bin/ocrad
 [2261] dbg: FuzzyOcr: Loaded 49 words from /etc/mail/spamassassin/FuzzyOcr.words
 [2261] dbg: FuzzyOcr: Using scan: $gocr -i $pfile
 [2261] dbg: FuzzyOcr: Using scan: $gocr -l 180 -d 2 -i $pfile

 Any help would be greatly appreciated.

 Thanks




Re: fuzzyocr slowing up my server

2006-12-21 Thread decoder


pinoyskull wrote:
 decoder wrote:

 pinoyskull wrote:
 I've been using the fuzzyocr plugin for some time now, and what I've
 noticed is its high CPU/memory usage, resulting in delayed
 delivery of mails. The server is serving 2000+ clients.

 The server is a P4 2.6Ghz, 1GB memory running on FreeBSD 6.0.
  Should i upgrade the memory to 2GB or 4GB? Will it fix the
 problem?

 You need to give more details about the version of FuzzyOcr and the
  configuration. There are plenty of ways to lower the resource
 usage of FuzzyOcr effectively.

 Best regards,


 Chris

 Hi Chris,

 I'm using Spamassassin 3.17 and FuzzyOCR version 2.3b. I
 just edited the log location and the helper programs; the rest
 are all defaults.
First of all, the FuzzyOcr version is quite old, I recommend 3.4.x,
but most things I tell you now also work with 2.3b.

a) Try to use ocrad instead of gocr. Ocrad is known to be less
resource intensive. In 2.3b, you need to write your own ocrad
scansets, 3.4.2 includes examples/support for ocrad already
b) Use hashing, it decreases the amount of OCR scans actually done
(use the MLDBM database stuff)
c) Set the autodisable score setting to 5 (or whatever is enough at
your place to count as spam) to minimize the amount of mails scanned
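
To illustrate point c), here is a minimal Python sketch of the autodisable idea: OCR is skipped as soon as the score collected by the earlier rules already reaches the configured limit. This is not the plugin's code; the function name `should_run_ocr` and the exact boundary handling (scan only while the score is strictly below the limit) are assumptions made for the example.

```python
def should_run_ocr(current_score: float, autodisable_score: float) -> bool:
    """Return True if a mail still needs an OCR pass.

    Assumption: a mail whose score already reached the limit is
    treated as spam anyway, so the expensive OCR scan is skipped.
    """
    return current_score < autodisable_score


# Scores that four hypothetical mails collected before the OCR step.
scores = [1.2, 4.9, 5.0, 10.5]

# With a limit of 5, only the first two mails would still be scanned.
to_scan = [s for s in scores if should_run_ocr(s, autodisable_score=5.0)]
print(to_scan)
```

The lower the limit, the fewer mails reach the OCR step, which is why this setting only helps when the plugin runs after the other rules.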

I am aware of the resource problem and 3.5 includes lots of
enhancements which decrease used resources even more, but currently
only 3.5rc1 is available. It is still being tested by me and others
(but performs stable at the moment with all patchsets applied). If you
want, you can have a look at it as well (available at our download
page) :)

Best regards,


Chris

 I need your input guys. Thanks and Merry Christmas.







Re: FuzzyOCR hashdb tagging commonly-used images like spacer.gif as spam

2006-12-17 Thread decoder


Kelly Jones wrote:
 We turned on FuzzyOCR's experimental hashdb function, but had to
 turn it off again after it tagged the following images (hashes) as
 spam:

 8:1:1:1::1:1:1:1:1 14:1:1:1::0:0:0:0:1

 These appear to be spacer.gif-like images: small images commonly
 used in HTML messages for formatting purposes. Has anyone else run
 into this issue?

 Related questions:

 1. How does FuzzyOCR compute an image hash? Skimming FuzzyOcr.pm
 shows this isn't a SHA1/MD5 of the image, but instead depends on
 ppmhist and identify (ImageMagick)?
The hash is a feature fingerprint. The exact algorithm isn't disclosed
to make it harder for spammers.

 2. How do I FuzzyOCR-hash a given image? The naive way fails:

 perl -le 'require FuzzyOcr.pm; ($foo, $bar) =
 FuzzyOcr::calc_image_hash(filename.gif); print $foo,$bar'
Since version 3.4.x, there is a tool fuzzy-find which can search for
hashes in the db, and manage it. It will also show the image hash then.

 3. If a spammer attaches 1 spam image + 5 good images and the
 message gets flagged as spam, do all *six* images get entered into
 the hashdb? The log files imply so. Would this explain why
 commonly-used images are in the hashdb?
It was like that in 2.3b, but this was changed in 3.4.x. All images are
still hashed and saved, but the word count is saved per image, so
including good images won't do much: they won't have any counts and
therefore won't cause a score later... If you are still running the
old version, I'd recommend an upgrade. If you still have problems with
good images recognized as bad by hashes, write me and I'll see if the
code needs any change :)
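
The per-image bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the real database layout is not shown in this thread, so the plain dict, the function names and the hash string "spam-image-hash" are hypothetical. The limit of 3 is borrowed from the "counts_required = 3" line of a debug log posted elsewhere in this archive.

```python
from typing import Dict, List


def record_mail(hash_db: Dict[str, int], image_matches: Dict[str, int]) -> None:
    """Store every image hash of a mail together with its own word count."""
    for image_hash, matched_words in image_matches.items():
        previous = hash_db.get(image_hash, 0)
        hash_db[image_hash] = max(previous, matched_words)


def known_spam_hashes(hash_db: Dict[str, int],
                      image_hashes: List[str],
                      counts_required: int = 3) -> List[str]:
    """Return only those hashes whose stored count is high enough to score."""
    return [h for h in image_hashes if hash_db.get(h, 0) >= counts_required]


db: Dict[str, int] = {}
# One spam image with 4 matched words plus a spacer image with none.
record_mail(db, {"spam-image-hash": 4, "8:1:1:1::1:1:1:1:1": 0})

# A later, legitimate mail that only contains the spacer image is not hit.
print(known_spam_hashes(db, ["8:1:1:1::1:1:1:1:1"]))
# A mail carrying the spam image again is recognized by its hash.
print(known_spam_hashes(db, ["spam-image-hash", "8:1:1:1::1:1:1:1:1"]))
```

Because the count travels with each hash, a spacer image that rode along with a spam image stays in the database but can never trigger a score on its own.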



Best regards,


Chris





Re: Why don't my Fuzzyocr see some mails which has spam text in a jpeg file ?

2006-12-16 Thread decoder


Halid Faith wrote:
 I use spamassassin3.1.7 and fuzzyocr3.4.2

 Fuzzyocr usually works well. Yet it can't see some mails which contain
 a jpeg. Therefore fuzzyocr doesn't give them any FUZZY_OCR score.
Does the jpeg sample file provided within the tarball work? If that's
the case, isolate a mail that didn't work with FuzzyOcr, and run it
from the command line with debugging enabled. This isn't necessarily a
bug, there are some small spam jpegs that aren't recognized well with
the standard word list (namely 2 that I know of that need custom words).


Best regards,


Chris

 Here's my Fuzzyocr.cf
Your scanset line contains a small error:

$ocrad -s 0.5 -T 0.5 $pfile should be $ocrad -s 5 -T 0.4 $pfile

That will provide better results, just as a tweak.

Best regards,

Chris


 body FUZZY_OCR eval:fuzzyocr_check()
 describe FUZZY_OCR Mail contains an image with common spam text inside
 body FUZZY_OCR_WRONG_CTYPE eval:dummy_check()
 describe FUZZY_OCR_WRONG_CTYPE Mail contains an image with wrong content-type set
 body FUZZY_OCR_CORRUPT_IMG eval:dummy_check()
 describe FUZZY_OCR_CORRUPT_IMG Mail contains a corrupted image
 body FUZZY_OCR_KNOWN_HASH eval:dummy_check()
 describe FUZZY_OCR_KNOWN_HASH Mail contains an image with known hash

 priority FUZZY_OCR 900

 ### Plugin Configuration #

 # Logging options
 #
 # Verbosity level (see manual)
 # Attention: Don't set to 0, but to 0.0 for quiet operation, or
 # comment out the focr_logfile line. (Def
 focr_verbose 2.0
 #
 # Logfile (make sure it is writable by the plugin) (Default value: NONE)
 focr_logfile /usr/local/etc/mail/spamassassin/FuzzyOcr.log
 ##

 # Wordlists
 #
 # Here we defined the words to scan for
 # (Default value: /etc/mail/spamassassin/FuzzyOcr.words)
 focr_global_wordlist /usr/local/etc/mail/spamassassin/FuzzyOcr.words
 #
 # This is the path RELATIVE to the respective home directory for the personalized list
 # This list is merged with the global word list on execution
 # (Default value: .spamassassin/fuzzyocr.words)
 # If focr_personal_wordlist begins with '/', treats option as fixed path
 # and does not search HOME
 #focr_personal_wordlist .spamassassin/fuzzyocr.words
 #

 # These parameters can be used to change other detection settings
 # If you leave these commented out, the defaults will be used.
 # Do not use " " around any parameters!
 #
 # Location of helper applications (path + binary) (Default values: /usr/bin/app)
 #
 focr_bin_gifsicle /usr/local/bin/gifsicle
 focr_bin_giffix /usr/local/bin/giffix
 focr_bin_giftext /usr/local/bin/giftext
 focr_bin_gifinter /usr/local/bin/gifinter
 focr_bin_giftopnm /usr/local/bin/giftopnm
 focr_bin_jpegtopnm /usr/local/bin/jpegtopnm
 focr_bin_pngtopnm /usr/local/bin/pngtopnm
 focr_bin_bmptopnm /usr/local/bin/bmptopnm
 focr_bin_tifftopnm /usr/local/bin/tifftopnm
 focr_bin_ppmhist /usr/local/bin/ppmhist
 focr_bin_gocr /usr/local/bin/gocr
 focr_bin_ocrad /usr/local/bin/ocrad
 #
 focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
 #
 


 # Scansets, comma separated (Default value: $gocr -i -, $gocr -l 180 -d 2 -i -)
 #
 # Each scanset consists of one or more commands which make text out of pnm input.
 # Each scanset is run separately on the PNM data, results are combined in scoring.
 #focr_scansets $gocr -i $pfile, $gocr -l 180 -d 2 -i $pfile
 #
 # An example that involves ocrad as well
 focr_scansets $gocr -i $pfile, $gocr -l 180 -d 2 -i $pfile, $ocrad -s 0.5 -T 0.5 $pfile
 #
 # Another one for ocrad only
 #focr_scansets $ocrad -s 0.5 -T 0.5 $pfile
 #
 # To use only one scan with default values, uncomment the next line instead
 #focr_scansets $gocr -i $pfile







Re: How can I add to FuzzyOcr.hashdb manually a mail which contains spam text in gif/jpeg.

2006-12-15 Thread decoder


Halid Faith wrote:
 I use spamassassin3.1.7 and fuzzyocr-2.3b.
 it usually works well.
 
 However, fuzzyocr can't see some mails which contain spam in a
 gif/jpeg, so it doesn't give them any score as FUZZY_OCR.
 I want to add these mails to FuzzyOcr.hashdb manually.
 How can I do that?
That isn't possible with the version you are using. 3.4.x is the first
branch which introduced fuzzy-find, a tool to manage the database.
Using 3.4.2 + the tools from the SVN (only they have the command line
switches --learn-spam and --learn-ham), you can also add your own
hashes to the MLDBM database by passing the picture to the tool.


Best regards,


Chris
 Thanks.
 
 
  




Re: Released patchset 2

2006-12-13 Thread decoder


Hi,


can you provide me the message which triggered the 2 warnings + the
error? Also, are your files unchanged or did you add any
scanset/preprocessor?


Chris



Re: Released patchset 2

2006-12-13 Thread decoder


Ignore that msg... wasn't meant to go here, sorry :)



Re: Botnet 0.6 plugin for Spam Assassin availabile

2006-12-08 Thread decoder


John Rudd wrote:
 Michael Schaap wrote:
 John Rudd wrote:

 The next version of the Botnet plugin for Spam Assassin is
 ready. The install instructions are in the Botnet.txt file, and
 in the INSTALL file.


 Great work!


 To Do before 1.0:

 (...)


 There's another thing that would be really nice to have.  You
 know how the DNS rules' descriptions specify what actually
 matches?  e.g.:

 3.9 RCVD_IN_XBL RBL: Received via a relay in Spamhaus XBL
     [12.34.56.789 listed in sbl-xbl.spamhaus.org]
 1.6 URIBL_SBL Contains an URL listed in the SBL blocklist
     [URIs: example.com]

 It would be great if Botnet could do something similar, like:

 2.0 BOTNET The submitting mail server looks like part of a Botnet
     [ip=12.34.56.789 rdns=dhcp12.34.example.org]


 Any tips on how to do that? :-}
Have a look at the FuzzyOcr plugin, especially at Scoring.pm in the
SVN, found here:

http://fuzzyocr.own-hero.net/browser/trunk/devel/FuzzyOcr/Scoring.pm

In each of the functions, the mail is scored with a different rule, a
custom score and a custom description which is generated there.

That should be enough for you to reproduce that :)


Chris





Re: FuzzyOcr helper apps

2006-12-08 Thread decoder


Robert Fitzpatrick wrote:
 I have two gateways that filter using amavisd-new and SA 3.1.7 with
 the FuzzyOcr recipes used. On one of these FreeBSD servers, all the
 helper applications are present, but on the other, they're all
 missing. I just now realized this after a while and do not remember
 where those helper apps, like giffix, come from. All packages on
 both systems were installed using FreeBSD ports system. Can someone
 give me a pointer? Can I merely copy over the missing helper apps?
http://fuzzyocr.own-hero.net/wiki/OSSpecificNotes

At the bottom is a link to a FreeBSD tutorial, I'm sure it lists what
you need :)


Chris


 Thanks in advance!





Re: Installed FuzzyOCR - What am I missing?

2006-11-28 Thread decoder


Evan Platt wrote:
 Installed FuzzyOCR on my os/x box per
 http://fuzzyocr.own-hero.net/wiki/Installation-3.x  .

 Based on my reading of it, I don't need to do anything other than
 put the FuzzyOcr.cf file in my spamassassin directory (which on my
 install is /private/etc/opt/mail/spamassassin/ ) .

 So I have FuzzyOcr.cf, FuzzyOcr.pm (chmod +x'd) .

 The relevant (AFAICT) parts of .cf are:

 loadplugin FuzzyOcr FuzzyOcr.pm
 body FUZZY_OCR eval:fuzzyocr_check()
 describe FUZZY_OCR Mail contains an image with common spam text inside
 body FUZZY_OCR_WRONG_CTYPE eval:dummy_check()
 describe FUZZY_OCR_WRONG_CTYPE Mail contains an image with wrong content-type set
 body FUZZY_OCR_CORRUPT_IMG eval:dummy_check()
 describe FUZZY_OCR_CORRUPT_IMG Mail contains a corrupted image
 body FUZZY_OCR_KNOWN_HASH eval:dummy_check()
 describe FUZZY_OCR_KNOWN_HASH Mail contains an image with known hash

 focr_personal_wordlist ./spamassassin/FuzzyOcr.words (.words is in
 the same directory).

 I then ran spamassassin < animated-gif.eml > out

 out shows no FuzzyOCR hits.

 Am I missing something obvious?

 If I'm not providing enough details, please let me know.
You should try to run spamassassin with -D to see more debug output.
Watch out for FuzzyOcr lines :)

Best regards,


Chris

 Thanks.

 Evan





Re: Installed FuzzyOCR - What am I missing?

2006-11-28 Thread decoder



Evan Platt wrote:
 At 11:34 AM 11/28/2006, you wrote:
 You should try to run spamassassin with -D to see more debug
 output. Watch out for FuzzyOcr lines :)

 Didn't think of that.. :)

 Ok, did that.

 Only a few lines have Fuzzy:
I forgot to tell you that you also need to increase the verbosity
factor of the plugin:

focr_verbose 2

will make sure that you see more (i.e. everything ;))

Best regards,


Chris


 [554] dbg: config: read file /etc/opt/mail/spamassassin/FuzzyOcr.cf
 [554] dbg: plugin: fixed relative path: /etc/opt/mail/spamassassin/FuzzyOcr.pm
 [554] dbg: plugin: loading FuzzyOcr from /etc/opt/mail/spamassassin/FuzzyOcr.pm
 [554] dbg: plugin: FuzzyOcr=HASH(0x1d0e4a4) implements 'parse_config'

 Nothing there looks like a problem?

 I put the entire debug session at
 http://www.espphotography.com/sadebug.txt

 Thanks.

 Evan



Re: Installed FuzzyOCR - What am I missing?

2006-11-28 Thread decoder


Evan Platt wrote:
 At 12:37 PM 11/28/2006, you wrote:
 I forgot to tell you that you also need to increase the verbosity
 factor of the plugin:

 focr_verbose 2

 will make sure that you see more (i.e. everything ;))

 Best regards,


 Did that, reran spamassassin -D animated--gif.eml > out , same
 results :(
Did you specify a logfile? If not, do so and check for output there :)

Best regards,

Chris





Re: Fuzzy OCR - first time user

2006-11-18 Thread decoder

Marc Perkel wrote:
OK - trying out the FuzzyOCR plugin. So far it all the default stuff 
with minimal installation. I'm running Fedora Core 6. Used the gocr 
RPM and didn't patch the source. Everything is default and it doesn't 
seem to be complaining so .


If I like this what do I need to change to really do it right? Should 
I grab the devel code? Do I really need the gocr patch? Should I tweek 
the scores? What do the hard core users change?


My suggestion for the FuzzyOcr version is 3.4.x, since it is a lot better. I 
also recommend enabling image hashing, which is disabled by default.


About the patch for gocr: I highly suggest building it from source 
because I don't know if Fedora Core 6 has the proper bindings to netpbm 
compiled with gocr. Redhat does not. That leads to a dramatic decrease 
in effectiveness. Also, the patch prevents segmentation faults with some 
pictures, and afaik, this bug still hasn't been fixed.


The scores normally do not need change, unless you get serious problems 
with FPs..


And what do the hardcore users change? lol... well, experienced users have 
different scansets; for example, they invoke ocrad instead of gocr in 
their scansets because it runs faster and recognizes better in most 
situations. In the shipped config file, there is an example for a 
scanset which includes ocrad (if you want to try it out, make sure to 
read the "Notes about the config file" page on the FuzzyOcr download 
page, as the ocrad scanset contains a small typo which should be fixed 
first :))


Finally, if you run into problems, try our mailing list at 
http://lists.own-hero.net/mailman/listinfo/devel-spam



Best regards,


Chris


Re: FuzzyOCR words file

2006-11-18 Thread decoder

Marc Perkel wrote:
The words file needs a little documentation. Is it limited to single 
words or phrases too? What's with the colon and the numbers after the 
word?


Phrases are possible too, spaces and numbers are stripped out in both 
the wordlist and the OCR output before matching :)


The colon + the number after it indicates a custom matching threshold 
for this word. The default threshold is defined in the FuzzyOcr.cf, but 
it makes sense to override this setting for some specific words which 
often trigger FPs with the default threshold.
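
As a rough illustration of the format described above, the following Python sketch parses one word list entry and applies the same normalization (spaces and digits removed) to an entry and to OCR output. It is not taken from the plugin: the function names are made up, lowercasing is an assumption, the entry syntax follows the question ("word, colon, number") rather than the shipped file, and the fallback of 0.25 is only the value that shows up in a debug log elsewhere in this archive.

```python
import re
from typing import Tuple

# Placeholder fallback; the real default is whatever FuzzyOcr.cf defines.
DEFAULT_THRESHOLD = 0.25


def normalize(text: str) -> str:
    """Strip whitespace and digits; lowercasing is an assumption."""
    return re.sub(r"[\s\d]+", "", text).lower()


def parse_entry(line: str, default: float = DEFAULT_THRESHOLD) -> Tuple[str, float]:
    """Split 'phrase:threshold' into a normalized phrase and its threshold."""
    word, sep, number = line.strip().rpartition(":")
    if not sep:
        # No colon: the entry uses the global threshold.
        return normalize(line), default
    return normalize(word), float(number)


print(parse_entry("stock alert:0.2"))
print(parse_entry("viagra"))
print(normalize("St0ck  Al3rt"))
```

The threshold has to be split off before normalizing, otherwise the digit stripping would eat it together with any other numbers in the phrase.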



Best regards,

Chris


Re: image exception with FuzzyOCR??

2006-11-17 Thread decoder

Sietse van Zanen wrote:


Of course, save the image, calculate the hash and then use the 
fuzzy-find.pl script to delete it from the bad hash db.


Next you’ll have to use a little trick to get it into the good hash 
db, as that’s not possible from the fuzzy-find.pl script.


Simply make an empty word list and yank the image through FuzzyOcr 
again. It’ll put it into the known good db.




It is planned to include this feature, it is really something that is 
missing... maybe I'll hack it up right now and release it :)


Regards,

Chris


-Sietse

*From:* Thiago LPS [mailto:[EMAIL PROTECTED]
*Sent:* Friday, November 17, 2006 18:25
*To:* users@spamassassin.apache.org
*Subject:* image exception with FuzzyOCR??



Hello everybody...

there is a way to do a exception to some image that isn't a SPAM... 
but the FuzzyOCR thinks that it is a spam image??


i really dont want to disable the Hashdb...







Re: image exception with FuzzyOCR??

2006-11-17 Thread decoder

Thiago LPS wrote:



On 11/17/06, *Sietse van Zanen* [EMAIL PROTECTED] 
mailto:[EMAIL PROTECTED] wrote:


To be more exact, the procedure would be:

 


1.   Save the image file, and the message

2.   Calculate the hash and delete it from the bad hash db
with the fuzzy-find.pl script

3.


In the body of the mail marked as spam, I have the hash value...
so.. I removed this hash from the hashdb...
It happened because I hadn't yet applied the patch that only includes 
pictures matched as pic-spam in the hash db..
After removing the hash and applying the patch... the picture wasn't 
included in the hash db anymore..


but.. the question is: even with the patch applied, if some good picture 
gets included in the hashdb, nothing would be better than a white-hashdb 
to solve it.. :D

I'm not an expert with perl.. but it doesn't sound difficult to do.. :D
I'm not sure if I understand you correctly, but FuzzyOcr 3.x already has 
a whitelist hashdb :)



And for all the others, I just checked in revision 40, which contains a 
modified fuzzy-find script, to be found at


http://fuzzyocr.own-hero.net/browser/trunk/devel/Utils/fuzzy-find

Please note that this is bleeding edge, if you want to try it out, go 
for it, but backup the database first in case something breaks...



The script now features --learn-spam, and --learn-ham which will 
manually add the hash of a given image file, i.e. fuzzy-find --learn-ham 
somepic.gif



Best regards,

Chris




 


Create an empty wordlist, or fill it with some bogus words, that
don't appear in the image

4.   Update the FuzzyOcr.cf file to point to the new wordlist.
If you're using spamd don't restart, it'll keep using the correct
wordlist. Otherwise you might want to stop incoming mail for a
little while.

5.   Pipe the message through FuzzyOcr.pm directly, it'll put
the hash into the known good db.

6.   Correct the config. (and restart maild).

7.   Send in a feature request to update the fuzzy-find.pl
script to insert hashes into a db. ;-)

 


-Sietse

 


*From:* Sietse van Zanen [mailto:[EMAIL PROTECTED]
mailto:[EMAIL PROTECTED]]
*Sent:* Friday, November 17, 2006 20:09
*To:* Thiago LPS; users@spamassassin.apache.org
mailto:users@spamassassin.apache.org
*Subject:* RE: image exception with FuzzyOCR??

 


Of course, save the image, calculate the hash and then use the
fuzzy-find.pl script to delete it from the bad hash db.

 


Next you'll have to use a little trick to get it into the good
hash db, as that's not possible from the fuzzy-find.pl script.

Simply make an empty word list and yank the image through FuzzyOcr
again. It'll put it into the known good db.

 


-Sietse

 

 


*From:* Thiago LPS [mailto:[EMAIL PROTECTED]
mailto:[EMAIL PROTECTED]]
*Sent:* Friday, November 17, 2006 18:25
*To:* users@spamassassin.apache.org
mailto:users@spamassassin.apache.org
*Subject:* image exception with FuzzyOCR??

 




Hello everybody...

there is a way to do a exception to some image that isn't a
SPAM... but the FuzzyOCR thinks that it is a spam image??

i really dont want to disable the Hashdb...





--
--
Thiago LPS
C.E.S.A.R - Administrador de Sistemas
msn: [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
0xx 81 8735 2591
-- 




Re: Linked images in e-mail

2006-11-15 Thread decoder


John D. Hardin wrote:
 On the FuzzyOCR list (devel-spam) there was a question about OCR of
  remote images vs. embedded images.

 I asked there but didn't think to ask here:

 Does SA check URIBLs on IMG tags with remote sources?

 e.g. <IMG src="http://known.spammer.com/gibberish.jpg">
Yes, it seems to do this. I just searched for an email in my spam
folder that was caught by URIBL, took the url, edited another HTML
mail and inserted an <img src="theurl/blub.jpg"> and then ran it
through SA again. It listed theurl in the URIBL results as well :)
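
For anyone who wants to repeat that kind of test on their own mails, here is a small Python sketch that lists the hostnames referenced by IMG tags in an HTML body, i.e. the names one would expect to turn up in URIBL results. It only uses the standard library and says nothing about how SA extracts URIs internally; the class and function names are made up for the example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlparse


class ImgHostCollector(HTMLParser):
    """Collect the hostnames found in the src attribute of img tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hosts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                host = urlparse(value).hostname
                if host:
                    self.hosts.append(host)


def img_hosts(html: str) -> List[str]:
    collector = ImgHostCollector()
    collector.feed(html)
    return collector.hosts


body = '<p>hi</p><IMG src="http://known.spammer.com/gibberish.jpg">'
print(img_hosts(body))
```

Relative image paths have no hostname and are simply skipped.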

Regards,

Chris








New FuzzyOcr Development Release (3.4.x)

2006-11-12 Thread decoder
 
Hello all,

for those that are not on the devel-spam Mailing list, I'd like to
announce a new development release here.

If you are interested, our new website is located at
http://fuzzyocr.own-hero.net/

The branch has been tested by me and some other people and seems to be
very stable so far. This should be especially interesting for users of
2.3j or 2.3b which want to participate in testing.

Major Changes are:

For users of 2.3j:

- Logging Facility was fixed, you can specify a logfile again without
getting the SA debug output into the logfile
- New animated gifs are all deanimated properly now
- No ImageMagick dependency anymore
- Improved Utilities for the hash database a bit
- Ocrad support

For users of 2.3b:

- See http://www.joval.info/proj/FuzzyOcr-2.3j/CHANGES for changes
between 2.3b -> 2.3j, then read the above changes.

The main reason for this release was to give users a version which
also catches recent animated spam types, but also to show that we are
still alive ;)

Another major development branch is planned (3.5), it will hopefully
be the last release before we release a new version labeled as stable
with more features.

The main features which are planned for 3.4 -> 3.5 are:

- Splitting FuzzyOcr into multiple .pm files for easier maintenance
- Config switches to disable scanning of specific extensions (like
tiff... many people don't want this)
- Maximum image size and dimensions in configuration
- autodisable_score also for a minimum score, so messages which are
already considered ham aren't scanned anymore

If you have more feature requests, ideas, bugs, or anything, please
create a ticket on the mentioned website :)

Best regards,

Chris




Re: FuzzyOcr problem (Re: Relay Checker plugin v0.2)

2006-11-11 Thread decoder
 
John Rudd wrote:
 decoder wrote:
 
 John Rudd wrote:
 D.J. wrote:
 On 11/10/06, Patrick Sneyers [EMAIL PROTECTED] wrote:
 I get this warning: plugin: failed to create instance of plugin
  Mail::SpamAssassin::Plugin::RelayChecker: Can't locate object
 method new via package
 Mail::SpamAssassin::Plugin::RelayChecker at (eval 26) line 1.


 (This is my own build of SA 3.1.7 on Mac OS X Server 10.4 ppc)

 It seems to work OK though: *  3.0 RELAY_CHECKER RELAY: badrdns
  (I lowered the score)

 Patrick Sneyers Belgium

 I also received some weirdness.  When linting in debug mode, I
 found the following lines that seem to indicate that RelayChecker
 isn't playing nicely with FuzzyOCR:

 [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm
 [28058] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
 [28058] dbg: plugin: registered FuzzyOcr=HASH(0x9d04570)
 [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config'
 [28058] dbg: FuzzyOcr: Option logfile = /home/amavis/.spamassassin/FuzzyOcr.log
 [28058] dbg: FuzzyOcr: Found scan: $gocr -i $pfile
 [28058] dbg: FuzzyOcr: Found scan: $gocr -l 180 -d 2 -i $pfile
 [28058] dbg: FuzzyOcr: Found scan: $gocr -l 140 -d 2 -i $pfile
 [28058] dbg: FuzzyOcr: Option threshold = 0.25
 [28058] dbg: FuzzyOcr: Score{autodisable} = 10.01
 [28058] dbg: FuzzyOcr: Option counts_required = 3
 [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/RelayChecker.pm
 [28058] dbg: plugin: loading RelayChecker from /etc/mail/spamassassin/RelayChecker.pm
 [28058] dbg: plugin: registered RelayChecker=HASH(0x9d94a80)
 [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config'
 [28058] dbg: plugin: RelayChecker=HASH(0x9d94a80) implements 'parse_config'
 [28058] dbg: FuzzyOcr: unknown Score: relaychecker_score
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_nordns
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_badrdns
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_baddns
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_ipinhostname
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_dynhostname
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_clienthostname
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_ip
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_pass_auth

 Ok that really doesn't look nice... is the fault on our (FuzzyOcr's)
 side?

 Yes.

 If so, then maybe someone can explain me what the correct way
 would be to fix this :)

 When you encounter an option you don't own (ie. it's not a
 FuzzyOcr option), then parse_config should return 0.
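
The rule stated above can be shown independently of Perl. The following Python sketch is only a language-neutral illustration of the dispatch logic, not SpamAssassin's plugin API: the `settings` dict and the way the prefix is stripped are assumptions. The point is that a handler claims only keys carrying its own prefix and returns 0 for everything else, so the option stays available to the plugin that owns it.

```python
from typing import Dict

OWN_PREFIX = "focr_"


def parse_config(settings: Dict[str, str], key: str, value: str) -> int:
    """Return 1 if the option was handled here, 0 if it belongs to someone else."""
    name = key.lower()
    if not name.startswith(OWN_PREFIX):
        # Not ours: leave it untouched and report "not handled".
        return 0
    settings[name[len(OWN_PREFIX):]] = value
    return 1


settings: Dict[str, str] = {}
print(parse_config(settings, "relaychecker_score", "3.0"))
print(parse_config(settings, "focr_threshold", "0.25"))
print(settings)
```

With this check in place, an option such as relaychecker_score is never logged as an unknown FuzzyOcr score, because the handler never looks at it.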


 If you could verify that this also applies to the latest development
 version (3.4.1), then that would be nice


 Yup, I found this in your 3.4.1 code (my comments indicate the issues):
Thank you very much for the work, I will patch this into our SVN
version and the 3.4.x devel branch right now.

Best regards

Chris

 sub parse_config {
     my ( $self, $opts ) = @_;

     # this is good: you're restricting yourself to ^focr_bin_ keys

     if ( $opts->{key} =~ /^focr_bin_/i ) {
         my $p = lc $opts->{key};
         $p =~ s/focr_bin_//;
         if (grep {m/$p/} @bin_utils) {
             $App{$p} = $opts->{value};
             debuglog("App{$p} = $App{$p}");
         } else {
             debuglog("unknown App: $opts->{key}");
         }
         # you should tell SA you processed this config option:
         #$self->inhibit_further_callbacks();
     }

     # this is bad: you're processing _score configs that may not belong to
     # FuzzyOcr.  A better statement might be:
     #elsif (($opts->{key} =~ /^focr_/i) && ($opts->{key} =~ m/_score$/i)) {
     # that way you're only processing _score configs that belong to focr

     elsif ( $opts->{key} =~ m/_score$/i ) {
         my $o = lc $opts->{key};
         $o =~ s/focr_//;
         $o =~ s/_score//;
         if (grep {m/$o/} @pgm_scores) {
             $Score{$o} = $opts->{value};
             debuglog("Score{$o} = $Score{$o}");
         } else {
             debuglog("unknown Score: $opts->{key}");
         }
         # again, inhibit further callbacks here:
         #$self->inhibit_further_callbacks();
     }

     # same as above: now you're taking ANY key, from ANY plugin, and handling
     # it.  Bad bad bad.  This should be changed to:
     #elsif ($opts->{key} =~ /^focr_/i) {

     else {
         my $o = lc $opts->{key};
         $o =~ s/focr_//;
         if (grep {m/$o/} @pgm_opts) {
             if ($o eq 'scansets') {
                 @scansets = (); # remove
                 foreach my $s (split(',',$opts->{value})) {
                     $s =~ s/^\s*//; $s =~ s/\s*$//;
                     push @scansets,$s;
                     debuglog("Found scan: $s");
                 }
             } elsif ($o eq 'path_bin') {
                 @paths = (); # remove
                 foreach my $p (split

Re: FuzzyOCR

2006-11-11 Thread decoder
 
sokka wrote:
 Hi,

 Can anyone post me URL or PDF of clear documentation of the
 FuzzyOcr ?
The current URL for FuzzyOcr is http://fuzzyocr.own-hero.net/

The page (wiki) is still very much under construction, but you'll find
installation instructions inside the tarball. You can try version 3.4.1
if you want; it performs better than the stable version 2.3b, it just
hasn't been tested for as long yet. Installation itself is not hard if
you have all the dependencies installed :) If you need further
assistance, check out our list at
http://lists.own-hero.net/mailman/listinfo/devel-spam

Once I get more time, I will also be able to do more work on the wiki :)


Best regards,

Chris

 thanks in advance




Re: Questions about FuzzyOCR

2006-11-11 Thread decoder
 
Pascal Maes wrote:

 Version 2.3b


 1) Here is the output of the scanner (gocr -i):

 _

 date Informations



 9- 11-lO061O_30   Le __ek-end du 3-4r'11, les adresses de cou
 r_er jlectron_que des jtud_ants non ri_nscmts j _UCL ont jtj
 ddsact_vjes. La ra_son est pÄrement adm_n_strat_ve et I_je j Ia
 caNe j puce. Pour permeNre j ces jtud_ants de rjcupdrer leurs
 messaqes, nous avons fa_t en soNe qu'_Is pu_ssent encore accjder j
 leur boîte aux leNres jusqu'au l4.r l 1 ,/lo 06 . ANent_on, la
 consuttat_on se fera av_ un cI_ent de messager_e !Thunderb_rd.
 Eudora, Outlook.. .7 ou v_a le _IebMa_I ma_s plus v_a le poNa_I .


 We get almost the same result with gocr -l 180 -d 2 -i

 And FuzzyOCr says :

 13 FUZZY_OCR  BODY: Mail contains an image with common
 spam text inside Words found: wexe in 3 lines alert in 2 lines
 alert in 2 lines investor in 1 lines trade in 3 lines (11
 word occurrences found)

 But I don't find any of these words in the text above!

You can try lowering your fuzz threshold from 0.3 to 0.2. I have no
experience so far with how the plugin reacts to text in other
languages, so such text might well produce false positives.
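
To see why the threshold matters for short words, here is a rough sketch of approximate matching with a relative threshold. It assumes the threshold is compared (strictly) against edit distance divided by word length; the plugin's real matcher may differ in detail, and the function names and the sample line are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Smallest edit distance between $word and any substring of $text
# (Levenshtein recurrence in which a match may start anywhere in the text).
sub best_substring_distance {
    my ( $word, $text ) = @_;
    my @w    = split //, lc $word;
    my @t    = split //, lc $text;
    my @prev = ( 0 .. scalar @w );    # against empty text: delete i characters
    my $best = scalar @w;
    for my $j ( 1 .. scalar @t ) {
        my @cur = (0);                # starting a match here costs nothing
        for my $i ( 1 .. scalar @w ) {
            my $cost = $w[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            $cur[$i] = min(
                $prev[$i] + 1,               # extra character in the text
                $cur[ $i - 1 ] + 1,          # character of the word missing
                $prev[ $i - 1 ] + $cost      # match or substitution
            );
        }
        $best = min( $best, $cur[-1] );
        @prev = @cur;
    }
    return $best;
}

# A word "hits" when its relative distance is below the threshold.
sub fuzzy_hit {
    my ( $word, $text, $threshold ) = @_;
    return best_substring_distance( $word, $text ) / length($word) < $threshold;
}

# Made-up sample: a harmless French word that is one edit away from "trade".
my $line = 'la tirade';
for my $threshold ( 0.3, 0.2 ) {
    printf "threshold %.1f: %s\n", $threshold,
        fuzzy_hit( 'trade', $line, $threshold ) ? 'hit' : 'no hit';
}
```

Under these assumptions a five-letter word with a single wrong character has a relative distance of 0.2, which passes at 0.3 but not at 0.2, and a four-letter word such as "wexe" behaves the same way (0.25). Noisy OCR output in another language offers many such near misses.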

 2) How do I remove an image which has been stored by mistake in the hash
 database?
In version 2.3b, this is not possible yet with a tool, unfortunately.
But the database is only a textfile, so you can simply search the hash
there and delete the line. Version 3.4.1 brings a tool that removes a
given hash from the database, but I am still improving it a bit, so
one can also pass it an image file to look for.

Best regards,

Chris

 Thanks -- Pascal







Re: Questions about FuzzyOCR

2006-11-11 Thread decoder
 
decoder wrote:
 Pascal Maes wrote:
 Version 2.3b


 1) Here is the output of the scanner (gocr -i):

 _

 date Informations



 9- 11-lO061O_30   Le __ek-end du 3-4r'11, les adresses de cou
  r_er jlectron_que des jtud_ants non ri_nscmts j _UCL ont jtj
 ddsact_vjes. La ra_son est pÄrement adm_n_strat_ve et I_je j Ia
 caNe j puce. Pour permeNre j ces jtud_ants de rjcupdrer leurs
 messaqes, nous avons fa_t en soNe qu'_Is pu_ssent encore accjder
 j leur boîte aux leNres jusqu'au l4.r l 1 ,/lo 06 . ANent_on, la
  consuttat_on se fera av_ un cI_ent de messager_e !Thunderb_rd.
 Eudora, Outlook.. .7 ou v_a le _IebMa_I ma_s plus v_a le poNa_I .



 We get almost the same result with gocr -l 180 -d 2 -i

 And FuzzyOCr says :

 13 FUZZY_OCR  BODY: Mail contains an image with
 common spam text inside Words found: wexe in 3 lines alert in
 2 lines alert in 2 lines investor in 1 lines trade in 3
 lines (11 word occurrences found)

 But I don't find any of these words in the text above!

 You can try lowering your fuzz threshold from 0.3 to 0.2. I have no
 experience so far with how the plugin reacts to text in other
 languages, so such text might well produce false positives.
 2) How do I remove an image which has been stored by mistake in the
 hash database?
 In version 2.3b, this is not possible yet with a tool,
 unfortunately. But the database is only a textfile, so you can
 simply search the hash there and delete the line. Version 3.4.1
 brings a tool that removes a given hash from the database, but I am
 still improving it a bit, so one can also pass it an image file to
 look for.
I must correct myself there, passing it an image is already supported :)

Best regards,

Chris


 Best regards,

 Chris
 Thanks -- Pascal







Re: FuzzyOcr problem (Re: Relay Checker plugin v0.2)

2006-11-10 Thread decoder
 
John Rudd wrote:
 D.J. wrote:
 On 11/10/06, Patrick Sneyers [EMAIL PROTECTED] wrote:

 I get this warning: plugin: failed to create instance of plugin
 Mail::SpamAssassin::Plugin::RelayChecker: Can't locate object
 method "new" via package
 "Mail::SpamAssassin::Plugin::RelayChecker" at (eval 26) line 1.


 (This is my own build of SA 3.1.7 on Mac OS X Server 10.4 ppc)

 It seems to work OK though: *  3.0 RELAY_CHECKER RELAY: badrdns
  (I lowered the score)

 Patrick Sneyers Belgium


 I also received some weirdness.  When linting in debug mode, I
 found the following lines that seem to indicate that RelayChecker
 isn't playing nicely with FuzzyOCR:

 [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm
 [28058] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
 [28058] dbg: plugin: registered FuzzyOcr=HASH(0x9d04570)
 [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config'
 [28058] dbg: FuzzyOcr: Option logfile = /home/amavis/.spamassassin/FuzzyOcr.log
 [28058] dbg: FuzzyOcr: Found scan: $gocr -i $pfile
 [28058] dbg: FuzzyOcr: Found scan: $gocr -l 180 -d 2 -i $pfile
 [28058] dbg: FuzzyOcr: Found scan: $gocr -l 140 -d 2 -i $pfile
 [28058] dbg: FuzzyOcr: Option threshold = 0.25
 [28058] dbg: FuzzyOcr: Score{autodisable} = 10.01
 [28058] dbg: FuzzyOcr: Option counts_required = 3
 [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/RelayChecker.pm
 [28058] dbg: plugin: loading RelayChecker from /etc/mail/spamassassin/RelayChecker.pm
 [28058] dbg: plugin: registered RelayChecker=HASH(0x9d94a80)
 [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config'
 [28058] dbg: plugin: RelayChecker=HASH(0x9d94a80) implements 'parse_config'
 [28058] dbg: FuzzyOcr: unknown Score: relaychecker_score
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_nordns
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_badrdns
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_baddns
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_ipinhostname
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_dynhostname
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_clienthostname
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_ip
 [28058] dbg: FuzzyOcr: unknown Option: relaychecker_pass_auth

OK, that really doesn't look nice... Is the fault on our (FuzzyOcr's)
side? If so, then maybe someone can explain to me what the correct way
to fix this would be :)

If you could verify that this also applies to the latest development
version (3.4.1), then that would be nice

Best regards,

Chris


 That would seem to me to indicate that FuzzyOcr isn't returning the
  proper code when it finds an option it doesn't own.  It should
 be returning 0 if it's not a FuzzyOcr option.




Re: ocrtext vs FuzzyOCR?

2006-10-30 Thread decoder



James Lay wrote:
 On Mon, 30 Oct 2006 07:19:44 -0800 Jeff Chan [EMAIL PROTECTED]
 wrote:

 Does anyone have any opinions on which of these is better:

 http://wiki.apache.org/spamassassin/CustomPlugins

 OCR scanner and image validator SA-plugin
 Checks for specific keywords in gif/jpg/png attachments, using gocr.
 This can be used to detect spam that puts all the real content in an
 attached image, accompanied with random text and html (no URLs, etc).
 There are also various rules to validate attached images and detect
 forged content types or broken images. This plugin needs SpamAssassin
 3.1.1 or later. Version 2.0 is able to defeat recent gif animations
 which use gif tricks to avoid OCR.
 Created by: Martin Blapp
 Contact: mb -at- imp -dot- ch
 License Type: BSD
 Status: active
 Available at: http://antispam.imp.ch/patches/patch-ocrtext
 Note: Feedback and new sample images are welcome. Please test and
 send reports.


 Fuzzy OCR Plugin
 Derived from OcrPlugin (see above), but has many feature enhancements,
 including an approximate matching algorithm to compensate for
 recognition errors and obfuscation, support for broken gifs, jpeg and
 png, dynamic scoring, automatic content-type independent format
 detection and many more.
 Created by: Christian Holler
 Contact: decoder_at_own-hero_dot_net
 License Type: Same as SpamAssassin
 Status: active
 Available at: FuzzyOcrPlugin
 Note: Feedback and new sample images are welcome. Please test and
 send reports.

 Jeff C. --

 I'd like to see something on this myself.  The segfault patch for
 Fuzzy OCR failed, so I stopped right there as I wasn't sure what to
 do next.

This is not a patch for FuzzyOcr but for gocr. You will need it with
every OCR plugin that uses gocr. It should work with version 0.40.

Best regards,

Chris

 James



Re: This image is turning frequent..

2006-10-17 Thread decoder



Anders Norrbring wrote:
 This type of image spam is getting more common, and is not
 detected.. At least not here..
Yes, this picture is indeed hard to detect...


I'd need a blackbox like

Input: Animated gif of any kind
Output: NonAnimated gif which shows what the user will see

But that is a difficult task considering how many things are possible
with the GIF standard. This picture uses offsets and slow frame rates,
others use transparency etc. A simple way to block these images would
be to scan the GIF for offset frames. I don't think there is any valid
GIF which makes use of these techniques...
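
A scan for offset frames could look roughly like the following. This is a sketch written against the GIF block layout only (header, logical screen descriptor, extension blocks, image descriptors); it is not code from FuzzyOcr, the function names are invented, and it does no LZW decoding. Optimised legitimate animations can also crop and offset their frames, so the result is better treated as one signal among several than as a verdict.

```perl
use strict;
use warnings;

# Parse a GIF held in a byte string and return the logical screen size plus
# one [left, top, width, height] entry per frame. Pixel data is skipped.
sub gif_frames {
    my ($data) = @_;
    die "not a GIF\n"
        unless length($data) >= 13 && substr( $data, 0, 6 ) =~ /^GIF8[79]a$/;
    my ( $sw, $sh, $packed ) = unpack 'v v C', substr( $data, 6, 5 );
    my $pos = 13;
    $pos += 3 * 2**( ( $packed & 0x07 ) + 1 ) if $packed & 0x80;    # global colour table

    # Data sub-blocks: a length byte followed by that many bytes, ended by 0.
    my $skip_sub_blocks = sub {
        while ( $pos < length $data ) {
            my $len = ord substr( $data, $pos++, 1 );
            last if $len == 0;
            $pos += $len;
        }
    };

    my @frames;
    while ( $pos < length $data ) {
        my $type = ord substr( $data, $pos++, 1 );
        if ( $type == 0x21 ) {    # extension: label byte, then sub-blocks
            $pos++;
            $skip_sub_blocks->();
        }
        elsif ( $type == 0x2C ) {    # image descriptor
            last if $pos + 9 > length $data;
            my ( $l, $t, $w, $h, $p ) = unpack 'v4 C', substr( $data, $pos, 9 );
            push @frames, [ $l, $t, $w, $h ];
            $pos += 9;
            $pos += 3 * 2**( ( $p & 0x07 ) + 1 ) if $p & 0x80;      # local colour table
            $pos++;                  # LZW minimum code size
            $skip_sub_blocks->();
        }
        else { last }                # 0x3B trailer, or data we do not understand
    }
    return ( $sw, $sh, \@frames );
}

# Number of frames that are not placed at the top-left corner.
sub offset_frames {
    my ( undef, undef, $frames ) = gif_frames( $_[0] );
    return scalar grep { $_->[0] != 0 || $_->[1] != 0 } @$frames;
}

# Tiny synthetic two-frame GIF (dummy pixel data); the second frame is offset.
my $frame = sub { "\x2C" . pack( 'v4 C', @_, 0 ) . "\x02" . "\x02\x4C\x01" . "\x00" };
my $gif   = 'GIF89a' . pack( 'v v C C C', 10, 10, 0, 0, 0 )
          . $frame->( 0, 0, 10, 10 ) . $frame->( 3, 4, 5, 5 ) . "\x3B";
printf "offset frames: %d\n", offset_frames($gif);
```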


Best regards,

Chris


Re: FuzzyOCR/SpamAssassin questions

2006-10-17 Thread decoder


Bill wrote:
 I just installed FuzzyOCR and have questions about 2 things:

 1) I am getting the following errors in the fuzzy.log file. Are these
 something I should be concerned about? I have verbose enabled.

 FuzzyOcr received timeout after running 10 seconds.

 Unexpected error in pipe to external programs.
 Please check that all helper programs are installed and in the correct
path.
 (Pipe Command /usr/bin/giftopnm -, Pipe exit code 1 (), Temporary file:
 /tmp/.spamassassin4571WcpHuTtmp)
Does that happen with all pictures you scan or only with particular ones?

 2) The server I installed this on handles a fairly small volume of email
 each day. I would like to install it on higher volume mail server but am
 afraid FuzzyOCR would overload it. Both servers are Xeon 2.6 with about 2
 gig of RAM. I use MimeDefang and the latest SpamAssassin. Is there something
 I can adjust in FuzzyOCR to make it more efficient?
Yes, there are several options. The first one is the image hashing
db: it saves image features as a kind of hash, so that gocr does not
have to run twice on the same type of image. If that is not enough, you
can change the default scansets to contain only two gocr scans, or one.


 I have the Priority setting set to 900. I have MimeDefang set to reject
 emails that score over 14. Can I set FuzzyOCR/SpamAssassin to NOT scan
 the graphics if the email already scores over 14 and will be rejected anyway?
Yes, this is even the default behavior, but the default auto_disable
is 10. You can set it to 14 if you want, by modifying the
corresponding config entry :)
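
Put together, the two shortcuts mentioned in this reply (the auto-disable score and the hash db) amount to a decision like the one below. It is only an illustration of the order of the checks; the function and argument names are invented and do not correspond to FuzzyOcr's internals.

```perl
use strict;
use warnings;

# Decide what to do with one image attachment. All names are hypothetical.
sub classify_image {
    my (%arg) = @_;

    # The mail already scores high enough: skip the expensive work entirely.
    return 'skipped' if $arg{score_so_far} >= $arg{autodisable_score};

    # Image seen before: reuse the stored result instead of running gocr again.
    return 'known' if exists $arg{hashdb}{ $arg{digest} };

    # Otherwise the image has to go through the OCR scansets.
    return 'ocr';
}

my %hashdb = ( 'deadbeef' => 12 );    # digest => score, made-up content
for my $case (
    [ 15.2, 'cafebabe' ],             # already over the limit
    [ 3.1,  'deadbeef' ],             # known image
    [ 3.1,  'cafebabe' ],             # new image, low score
  )
{
    my ( $score, $digest ) = @$case;
    my $verdict = classify_image(
        score_so_far      => $score,
        autodisable_score => 14,
        hashdb            => \%hashdb,
        digest            => $digest,
    );
    print "score $score, digest $digest: $verdict\n";
}
```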

Best regards,

Chris


 Bill





Re: FuzzyOCR request

2006-10-05 Thread decoder

Duncan Hill wrote:
 On Wednesday 04 October 2006 22:23, Alan Munday wrote:

 I've been following your developments and looking at how to
 integrate with my (few) systems. But as I don't have a test
 environment (until I have built a VMWare one) I was cautious at
 trying this with one of the live boxes. Zero scoring seemed to be
 a good way round this.

 SA treats a 0-scored rule as a rule that must not be run.
FuzzyOcr does not use a standard rule to score but does the scoring
itself.

But to the main subject:

I haven't tried it out, but to achieve a zero score, you could just as
well try setting the configurable scores to 0, or to a very
small amount. Also, I recommend only using 2.3b in production
environments :)

Best regards,

Chris

 Score at 0.01 and the rule will fire, but should have a
 non-significant impact.




Re: FuzzyOCR seems to not like gif and png

2006-10-04 Thread decoder

Loren Wilton wrote:
 There are newer versions of FuzzyOCR that probably fix or at least
 get around this.  A lot of image spam mails have broken images in
 them, and this messes up a lot of stuff.  The latest versions use
 ImageMagick.  This is reputedly hard to install on many systems.  But
 if you can get it installed it seems to work much better in terms of
 the images that it can handle.
 
 You might want to join the FuzzyOCR mailing list:
 
 List-Id: devel-spam.lists.own-hero.net
 List-Unsubscribe:
 http://lists.own-hero.net/mailman/listinfo/devel-spam,
  mailto:[EMAIL PROTECTED]
 List-Archive: http://lists.own-hero.net/mailman/private/devel-spam
 List-Post: mailto:[EMAIL PROTECTED]
 List-Help: mailto:[EMAIL PROTECTED]
 List-Subscribe: http://lists.own-hero.net/mailman/listinfo/devel-spam,
  mailto:[EMAIL PROTECTED]
 If you search the list archive you will see a number of posts on the
 current release and where to get it.  I think the current version is
 something like J.
The current version is b. J is a devel version as are all versions
higher than b. Please note that when trying out these versions. A new
stable version will follow soon, once I get the time again.

Chris
 
 Loren
 




Re: Stock spam in images

2006-10-02 Thread decoder

Theo Van Dinter wrote:
 On Mon, Oct 02, 2006 at 03:18:58PM +0100, Randal, Phil wrote:
 undetected). Wouldn't it be better to inject the detected
 text back to SA? There should be enough variants of spam
 worlds to let SA fuzzily catch the ones from images.
 I think so.  Some of the words would be perfectly legitimate in the text
 of emails but rarely found in attached legitimate images.

 Quite apart from the fact that Spamassassin isn't designed for
 reinjection.

 FWIW, 3.2 adds in support to have rendering of non-text parts.  So a plugin
 could, for instance, OCR text from an image, and then the normal body rules
 and such would be able to use that information.

This sounds great. Once I am back to continue the development
of FuzzyOcr, I might add an option to pass the text back to SA.
Combined with a new, more precise OCR engine like tesseract, this will
probably work very well. Unfortunately, there is currently a lot of
picture spam being sent around which won't be caught at all by
FuzzyOcr, because it uses new obfuscation techniques with animated gifs
etc., and I don't have the time at the moment to adjust the plugin to these...

Best regards

Chris



Re: Stock spam in images

2006-10-02 Thread decoder

Randal, Phil wrote:
 This has been covered so many times on this list.

 1:  if you're not on spamassassin 3.1.5 get it now, and run
 sa-update (via a cron job daily, but test first with a manual
 sa-update -D)

 2:  pop over to http://www.rulesemporium.com and get an appropriate
  selection of their rules, and configure Rules du Jour (
 http://www.exit0.us/index.php?pagename=RulesDuJour ) to download
 them daily.

 3:  don't forget the additional rules here:
 http://www.rulesemporium.com/other-rules.htm I've found Fred's
 header rules helpful

 4:  add the ImageInfo plugin from
 http://www.rulesemporium.com/plugins.htm

 5:  if you want to be adventurous, make sure you have ImageMagick,
 ImageMagick-perl and other prerequisites installed and use the
 FuzzyOCR plugin ( latest version at
 http://www.joval.info/proj/FuzzyOcr.html , but see also
 http://wiki.apache.org/spamassassin/FuzzyOcrPlugin ). The FuzzyOCR
 mailing list is very helpful too.

What do you mean by adventurous? The versions published by joval
are all devel.

The stable version is available at
http://users.own-hero.net/~decoder/fuzzyocr/ and works fine.

There is nothing adventurous about them and the prerequisites are also
lower than for the devel stuff.

I am simply not able to continue development at the moment, but maybe
in a few weeks, I'll start again.

Best regards,

Chris


 In my experience here a well-trained Bayes plus the various
 RulesEmporium rulesets gets most of them.

 Cheers,

 Phil -- Phil Randal Network Engineer Herefordshire Council
 Hereford, UK

 -Original Message- From: Dylan Bouterse
 [mailto:[EMAIL PROTECTED] Sent: 02 October 2006 14:38 To:
 users@spamassassin.apache.org Subject: Stock spam in images

 I'm a newbie to the list and have been scanning recent posts to
 see if what I'm about to ask about has been covered but I haven't
  seen anything yet.

 Lately I have been getting more and more of the stock alert spam
 but now all the good info is in an image and typically following
 the image is random text to fool the Bayesian filter. I think the
 random text thing has been covered here recently. It's
 frustrating when sa is giving a -1.6 (or so) score to these
 emails right off the bat. Quite a few of these aren't even
 getting spam headers because they aren't scoring high enough. Is
 there some magical trick to help score these messages higher?
 Maybe a future version of sa will incorporate an OCR module? :)

 Dylan





FuzzyOcr development/support stop for 7 weeks

2006-09-01 Thread decoder

Hello all,


since I will have a very tight time schedule in the next 7 weeks for a
project at the university, I will not be able to release any new
versions of FuzzyOcr, fix bugs, reply to questions or give support.
Instead of writing me, you can write to either this mailing list or
the devel-spam mailing list and other people will try to answer your
questions.

Moderator privileges for the devel-spam mailing list will be given to
some people that helped with the development earlier.


Best regards,


Chris



Strange SPF problem/wrong result

2006-09-01 Thread decoder

Hello,

today I saw a strange SPF bug occurring. The original mail header was:

Return-Path: [EMAIL PROTECTED]
Received: from mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200])
by wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP id
k7T8rU6P012050;
Tue, 29 Aug 2006 10:53:30 +0200
Received: from mail-eur1.microsoft.com (mail-eur1.microsoft.com
[213.199.128.139])
by mail.cs.uni-sb.de (8.13.8/2006081400) with ESMTP id k7T8rT98004989;
Tue, 29 Aug 2006 10:53:29 +0200 (CEST)
Received: from x.europe.corp.microsoft.com ([65.53.193.xxx]) by
mail-eur1.microsoft.com with Microsoft SMTPSVC(6.0.3790.1830);
 Tue, 29 Aug 2006 09:53:29 +0100

(Some unrelated privacy details replaced with xxx).

Now what SPF should do is (as far as I understand):

- Get the mail server that sent this (mail-eur1.microsoft.com)
- Check that its IP is in the allowed SPF record of microsoft.com

This check passes as you can see here:
http://www.dnsstuff.com/tools/spf.ch?server=microsoft.comip=213.199.128.139

Now SpamAssassin did something else, it took mail.cs.uni-sb.de as the
mailserver that sent, and tried to match it against microsoft.com's
SPF records which produced a SOFTFAIL:

 1.4 SPF_SOFTFAIL   Sending host does not match SPF-record
(softfail)
[SPF failed: Please see
http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]
 2.4 SPF_HELO_SOFTFAIL  HELO-Name does not match SPF-record
(softfail)
[SPF failed: Please see
http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]

Can someone explain this failure to me?

Thanks


Chris



Re: Strange SPF problem/wrong result

2006-09-01 Thread decoder

So adding the line

trusted_networks 134.96.254.200

to local.cf will fix this problem and this mail would be recognized
correctly (as in pass SPF) ?

Thanks


Chris

Justin Mason wrote:
 it's trusted_networks -- SpamAssassin doesn't know that it can
 trust mail.cs.uni-sb.de.

 --j.

 decoder writes:
 today I saw a strange SPF bug occuring. The original mail header
 was:

 Return-Path: [EMAIL PROTECTED] Received: from
 mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200]) by
 wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP id
 k7T8rU6P012050; Tue, 29 Aug 2006 10:53:30 +0200 Received: from
 mail-eur1.microsoft.com (mail-eur1.microsoft.com
 [213.199.128.139]) by mail.cs.uni-sb.de (8.13.8/2006081400) with
 ESMTP id k7T8rT98004989; Tue, 29 Aug 2006 10:53:29 +0200 (CEST)
 Received: from x.europe.corp.microsoft.com ([65.53.193.xxx])
 by mail-eur1.microsoft.com with Microsoft SMTPSVC(6.0.3790.1830);
  Tue, 29 Aug 2006 09:53:29 +0100

 (Some unrelated privacy details replaced with xxx).

 Now what SPF should do is (as far as I understood):

 - Get the mail server that sent this (mail-eur1.microsoft.com)
 - Check that its IP is in the allowed SPF record of
 microsoft.com

 This check passes as you can see here:

http://www.dnsstuff.com/tools/spf.ch?server=microsoft.comip=213.199.128.139


 Now SpamAssassin did something else, it took mail.cs.uni-sb.de as
 the mailserver that sent, and tried to match it against
 microsoft.com's SPF records which produced a SOFTFAIL:

 1.4 SPF_SOFTFAIL   Sending host does not match SPF-record
  (softfail) [SPF failed: Please see

http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]
  2.4 SPF_HELO_SOFTFAIL  HELO-Name does not match SPF-record
 (softfail) [SPF failed: Please see

http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]


 Can someone explain me this failure?

 Thanks


 Chris




Re: Strange SPF problem/wrong result

2006-09-01 Thread decoder

Gino Cerullo wrote:
 On 1-Sep-06, at 7:18 AM, decoder wrote:


 Hello,

 today I saw a strange SPF bug occuring. The original mail header was:

 Return-Path: [EMAIL PROTECTED]
 Received: from mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200])
 by wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP id
 k7T8rU6P012050;
 Tue, 29 Aug 2006 10:53:30 +0200
 Received: from mail-eur1.microsoft.com (mail-eur1.microsoft.com
 [213.199.128.139])
 by mail.cs.uni-sb.de (8.13.8/2006081400) with ESMTP id
 k7T8rT98004989;
 Tue, 29 Aug 2006 10:53:29 +0200 (CEST)
 Received: from x.europe.corp.microsoft.com ([65.53.193.xxx]) by
 mail-eur1.microsoft.com with Microsoft SMTPSVC(6.0.3790.1830);
  Tue, 29 Aug 2006 09:53:29 +0100

 (Some unrelated privacy details replaced with xxx).

 Now what SPF should do is (as far as I understood):

 - Get the mail server that sent this (mail-eur1.microsoft.com)
 - Check that its IP is in the allowed SPF record of microsoft.com

 This check passes as you can see here:

http://www.dnsstuff.com/tools/spf.ch?server=microsoft.comip=213.199.128.139


 Now SpamAssassin did something else, it took mail.cs.uni-sb.de as the
 mailserver that sent, and tried to match it against microsoft.com's
 SPF records which produced a SOFTFAIL:

  1.4 SPF_SOFTFAIL   Sending host does not match SPF-record
 (softfail)
 [SPF failed: Please see

http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]

  2.4 SPF_HELO_SOFTFAIL  HELO-Name does not match SPF-record
 (softfail)
 [SPF failed: Please see

http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]


 Can someone explain me this failure?

 Spamassassin gave the correct result. It compared the IP address of
 the last received server mail.cs.uni-sb.de (mail.cs.uni-sb.de
 [134.96.254.200]) against the SPF record for Microsoft and did not
 see a match. Result SOFTFAIL

 Why do you think it should compare to mail-eur1.microsoft.com
 (mail-eur1.microsoft.com [213.199.128.139]).

 SPF compares the IP address of the last server to handle the message
 before it was handed off to a server on your receiving end. If the
 message was sent to someone who is using forwarding and forwarded
 through mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200]) then
 this would explain the SOFTFAIL. Forwarding breaks SPF.
This is not real forwarding: all mail for us is received by that
server first, and that server passes it on to us. This is a common
structure for a bigger mail setup. The trusted_networks option solved
my problem, but it should definitely be documented in the wiki somewhere.
Maybe we should add a note that trusted_networks is important for
SPF to the install manual, where SPF installation is explained.
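
For the archive, the effect of trusted_networks on this particular mail can be pictured as follows. The sketch only illustrates the idea of "check the first relay we do not trust"; it is not SpamAssassin's code, the function name is made up, and it compares plain IP strings where the real option also accepts network ranges. The third hop from the headers is left out because its address was masked.

```perl
use strict;
use warnings;

# Relays newest-first, taken from the Received: headers quoted in this thread.
my @relays = (
    { host => 'mail.cs.uni-sb.de',       ip => '134.96.254.200' },
    { host => 'mail-eur1.microsoft.com', ip => '213.199.128.139' },
);

# Walk from our side outwards and return the first relay that is not one
# of our own trusted hosts: that is the client an SPF check should look at.
sub spf_client {
    my ( $relays, $trusted ) = @_;
    my %is_trusted = map { $_ => 1 } @$trusted;
    for my $relay (@$relays) {
        return $relay unless $is_trusted{ $relay->{ip} };
    }
    return;    # every hop is trusted, nothing to check
}

my $before = spf_client( \@relays, [] );
my $after  = spf_client( \@relays, ['134.96.254.200'] );
print "without trusted_networks: $before->{host} ($before->{ip})\n";
print "with trusted_networks:    $after->{host} ($after->{ip})\n";
```

Without the setting, the university relay is taken for the sending client and compared against microsoft.com's record, which matches the softfail reported above; once it is listed as trusted, the check moves one hop further out.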


Chris



 --
 Gino Cerullo

 Pixel Point Studios
 21 Chesham Drive
 Toronto, ON  M3M 1W6

 416-247-7740







Re: Strange SPF problem/wrong result

2006-09-01 Thread decoder

Ramprasad wrote:
 Return-Path: [EMAIL PROTECTED] Received: from
 mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200]) by
 wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP
 id k7T8rU6P012050; Tue, 29 Aug 2006 10:53:30 +0200 Received:
 from mail-eur1.microsoft.com (mail-eur1.microsoft.com
 [213.199.128.139]) by mail.cs.uni-sb.de (8.13.8/2006081400)
 with ESMTP id k7T8rT98004989; Tue, 29 Aug 2006 10:53:29 +0200
 (CEST)

 snip
 This is no real forwarding, but all mail for us gets received by
 that server first, and this server passes it to us. This is a
 common structure for a bigger mail setup. The trusted_networks
 option solved my problems, but it should definetly be included in
 the wiki somewhere. Maybe we should add a note about
 trusted_networks being important for SPF in the install manual
 where SPF installation is explained
 snip

 If 134.96.254.200 is accepting mails for you then you must do all
 SPF checks on that host. SPF checks don't work unless you do the
 checks on the receiving host.
In a big infrastructure, this is hardly possible. This mailserver is
not under our control but belongs to the University directly, not to
our chair.

Chris


 Thanks Ram







-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE+ELcJQIKXnJyDxURAn12AJ9OSP19czmLi1KNEmunB37WkWC75wCffMa4
15iEKJqbZOzSycS3nwn4RKU=
=4Exp
-END PGP SIGNATURE-



Re: [Devel-spam] Hash Stats

2006-08-30 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

--[ UxBoD ]-- wrote:
 How many hits are you getting ?

 Database changed
 mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR%' and date = '2006-08-29';
 +----------+
 | count(*) |
 +----------+
 |      385 |
 +----------+
 1 row in set (0.10 sec)

 mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR_KNOWN_HASH%' and date = '2006-08-29';
 +----------+
 | count(*) |
 +----------+
 |        1 |
 +----------+
 1 row in set (0.05 sec)

 mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR_CORRUPT%' and date = '2006-08-29';
 +----------+
 | count(*) |
 +----------+
 |      298 |
 +----------+
 1 row in set (0.05 sec)

 --[ UxBoD ]-- // PGP Key: curl -s
 http://www.splatnix.net/uxbod.asc | gpg --import // Fingerprint:
 543A E778 7F2D 98F1 3E50  9C1F F190 93E0 E8E8 0CF8 // Keyserver:
 www.keyserver.net Key-ID: 0xE8E80CF8



Did you apply the patch I sent to the SA mailing list? There is a bug
in 2.3b which breaks the database completely. Please fix the
corresponding line:

line 492:


It says:

  print DB "$score::$digest\n";


Should be:

  print DB "${score}::${digest}\n";



As a result, the produced hashdb is corrupted; delete it and start
with a new one...


Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE9XUrJQIKXnJyDxURAoWOAJ9ej8U66qKCGiJSrPYM51ZP0WHGnQCfZWqa
8BxDIenQxw0JrGD/31hQshI=
=lDtr
-END PGP SIGNATURE-



Re: FuzzyOCR Install - Issues processing ONLY Gif images.

2006-08-30 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Michael Grey wrote:

 Installed FuzzyOCR and, I believe, all the dependencies.

 Using the sample images I get a pipe error ONLY on GIF images,
 resulting in no hits on FUZZY_OCR.

 Pipe Command /usr/bin/giftopnm -

 giftopnm exists in that path.

 Running giftopnm on the command line seems to work with no errors,
 spitting out a binary file to stdout as expected.

 Any ideas of what might be missing? (Fedora Core 4.)

You can try step-by-step debugging. First of all, which sample is
producing the error? (There are two GIF samples.)

If it doesn't work with the corrupted sample, try extracting the image
from that eml file (ripmime), then run the pipe manually:

cat filename.gif | giffix | giftopnm - > blah.pnm

If that fails, try splitting the commands up and trace down which
binary causes the problem.
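
The "split the pipe up" advice can also be automated. The following Python
sketch is only an illustration: run_stages and debug_gif are hypothetical
helper names, and it assumes that giffix and giftopnm are on the PATH and
read from stdin as in the pipe above. It runs each stage separately and
reports the exit code of every stage, stopping at the first one that fails.

```python
import subprocess


def run_stages(stages, data):
    """Run each command in turn, feeding it the previous stage's stdout.

    Returns a list of (command, exit_code) pairs and stops at the first
    failing stage, which identifies the binary that breaks the pipe.
    """
    results = []
    for cmd in stages:
        proc = subprocess.run(cmd, input=data,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        results.append((cmd, proc.returncode))
        if proc.returncode != 0:
            break
        data = proc.stdout
    return results


def debug_gif(path):
    """Print the exit code of each helper for one extracted GIF file."""
    with open(path, "rb") as fh:
        gif = fh.read()
    for cmd, code in run_stages([["giffix"], ["giftopnm", "-"]], gif):
        print(" ".join(cmd), "-> exit code", code)
```

Calling debug_gif("filename.gif") on the image extracted with ripmime would
then show whether giffix or giftopnm is the stage that returns a non-zero
exit code.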

Chris



 Thanks?


 Michael Grey











 - log / reports -

 Corrupted-gif.eml

 pts rule name              description
 --- ---------------------- --------------------------------------------------
 0.1 HTML_MESSAGE           BODY: HTML included in message
 3.0 BAYES_95               BODY: Bayesian spam probability is 95 to 99%
                            [score: 0.9694]
 1.5 FUZZY_OCR_WRONG_CTYPE  BODY: Mail contains an image with wrong
                            content-type set
                            Image has format GIF but content-type is
                            image/jpeg

 [2006-08-29 19:20:00] Debug mode: Image has format GIF but
 content-type is image/jpeg
 [2006-08-29 19:20:01] Debug mode: Image is single non-interlaced...
 [2006-08-29 19:20:01] Unexpected error in pipe to external programs.
 Please check that all helper programs are installed and in the
 correct path.
 (Pipe Command /usr/bin/giftopnm -, Pipe exit code 1 (),
 Temporary file: /tmp/.spamassassin23614sXR9Dltmp)
 [2006-08-29 19:20:01] Debug mode: FuzzyOcr ending successfully...
 bash-3.00$









 animated-gif.eml

 pts rule name              description
 --- ---------------------- --------------------------------------------------
 0.7 DATE_IN_PAST_06_12     Date: is 6 to 12 hours before Received: date
 0.1 HTML_MESSAGE           BODY: HTML included in message
 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5000]

 [2006-08-29 19:22:12] Debug mode: Analyzing file with content-type
 image/gif
 [2006-08-29 19:22:12] Debug mode: Image is single non-interlaced...
 [2006-08-29 19:22:12] Unexpected error in pipe to external programs.
 Please check that all helper programs are installed and in the
 correct path.
 (Pipe Command /usr/bin/giftopnm -, Pipe exit code 1 (),
 Temporary file: /tmp/.spamassassin23644bPPq3jtmp)
 [2006-08-29 19:22:12] Debug mode: FuzzyOcr ending successfully...










-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE9XXqJQIKXnJyDxURAuv0AKCNGLWfDNggpjyOLGLQiXQZHh4ukgCgtlBi
ptzwNcXJ4pIaQJGVhZ7yiKE=
=IH6h
-END PGP SIGNATURE-



Re: wrong ml, ignore ;)

2006-08-30 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

decoder wrote:
 --[ UxBoD ]-- wrote:
 How many hits are you getting ?

 Database changed
 mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR%' and date = '2006-08-29';
 +----------+
 | count(*) |
 +----------+
 |      385 |
 +----------+
 1 row in set (0.10 sec)

 mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR_KNOWN_HASH%' and date = '2006-08-29';
 +----------+
 | count(*) |
 +----------+
 |        1 |
 +----------+
 1 row in set (0.05 sec)

 mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR_CORRUPT%' and date = '2006-08-29';
 +----------+
 | count(*) |
 +----------+
 |      298 |
 +----------+
 1 row in set (0.05 sec)

 --[ UxBoD ]-- // PGP Key: curl -s
 http://www.splatnix.net/uxbod.asc | gpg --import //
 Fingerprint: 543A E778 7F2D 98F1 3E50  9C1F F190 93E0 E8E8 0CF8
 // Keyserver: www.keyserver.net Key-ID:
 0xE8E80CF8



 Did you apply the patch I sent to the SA mailing list? There is a
 bug in 2.3b which breaks the database completely. Please fix the
 corresponding line:

 line 492:


 It says:

 print DB "$score::$digest\n";


 Should be:

 print DB "${score}::${digest}\n";



 As a result, the produced hashdb is corrupted, delete it and start
 with a new one...


 Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE9XYVJQIKXnJyDxURAsR4AJ472simn6QDxPJOJiFMhgrWgJVNmgCgypsb
43SCSvXwBGAHNlTbJzrPKdE=
=Ez80
-END PGP SIGNATURE-



Silent bug in FuzzyOcr 2.3b, database feature - hotfix

2006-08-29 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello,


Someone discovered that the DB was not working properly in most cases.
Please fix line 492:


It says:

  print DB "$score::$digest\n";


Should be:

  print DB "${score}::${digest}\n";


As a result, the produced hashdb is unusable. Please delete it, or
convert it by adding a score followed by :: before each entry, so that:

190959:221:288:64::255:255:255:255:45599::18:26:58:27:3991::11:67:247:71:1875::254:254:253:254:1417::236:14:8:80:1180

becomes

10::190959:221:288:64::255:255:255:255:45599::18:26:58:27:3991::11:67:247:71:1875::254:254:253:254:1417::236:14:8:80:1180


The bug caused the score not to get written in front of the file size...
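
For anyone who prefers converting an existing hashdb over deleting it, the
manual edit described above could be scripted. This is a rough Python sketch
under stated assumptions: add_score_prefix is a hypothetical helper, the
database is assumed to hold one entry per line, and the score to prepend
(10 in the example above) has to be chosen by hand because the original
scores were never written.

```python
def add_score_prefix(lines, score):
    """Prefix every hashdb entry that lacks a score with 'score::'."""
    fixed = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        first_field = entry.split("::", 1)[0]
        # A broken entry starts with the colon-separated size block, while
        # a repaired entry starts with a bare score that contains no ':'.
        if ":" in first_field:
            entry = f"{score}::{entry}"
        fixed.append(entry)
    return fixed


broken = ("190959:221:288:64::255:255:255:255:45599::18:26:58:27:3991"
          "::11:67:247:71:1875::254:254:253:254:1417::236:14:8:80:1180")
print(add_score_prefix([broken], 10)[0])
```

Entries that already carry a score are left untouched, so running the
conversion twice does not add a second prefix.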

Thanks to all the people on the devel-spam list who helped find
this bug...

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE9HkvJQIKXnJyDxURAk6jAJ0eP/tTzBCiqwxHHSf/cJ0UXZmbUQCdHq1H
dLJKZ0yRDo968QVY6TN0Ek8=
=QDuC
-END PGP SIGNATURE-



Re: Hashcash

2006-08-29 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

Arik Raffael Funke wrote:
 Hello,

 how does spamassassin handle hashcash? It is turned on by default,
 right?
Yes but you still need to define your accept range as you tried to do
above:)

 I am using v3.1.2 and have in init.pre loadplugin
 Mail::SpamAssassin::Plugin::Hashcash. However, the hashcash
 contained in incoming mails seems to have been ignored. I added
 following to local.cf, but I am still out of luck:

 use_hashcash 1 hashcash_accept [EMAIL PROTECTED]
try [EMAIL PROTECTED]

 How do I get this to work? (Can it be a problem that I installed
 hashcash after spamassassin?)
That shouldn't matter.

Best regards,

Chris

 Cheers, Arik


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE9LfyJQIKXnJyDxURAo5YAJwJ9RYdpq8khY7lHOMXuMpU1gvNAQCfeMug
ZHh0X6YHdAqr/uLO8yJtp5A=
=v81o
-END PGP SIGNATURE-



Re: Hashcash

2006-08-29 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Arik Raffael Funke wrote:
 decoder wrote:
 Arik Raffael Funke wrote:
 Hello,

 how does spamassassin handle hashcash? It is turned on by
 default, right?
 Yes but you still need to define your accept range as you tried
 to do above:)
 I am using v3.1.2 and have in init.pre loadplugin
 Mail::SpamAssassin::Plugin::Hashcash. However, the hashcash
 contained in incoming mails seems to have been ignored. I added
  following to local.cf, but I am still out of luck:

 use_hashcash 1 hashcash_accept [EMAIL PROTECTED]
 try [EMAIL PROTECTED]

 That doesn't seem to help either. Any other ideas?

Run with -D on a hashcash-stamped message and check the output for
relevant data.

Chris

 Regards, Arik


 BTW: I am not using v.3.1.2 as said above but v.3.1.4...


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE9MEHJQIKXnJyDxURAsccAKCrJSIsOHABvlJiEE2xi2Kqbj/2AgCfcHX9
rtW8EHKC+x9gPaDA+AFsBZQ=
=MQwm
-END PGP SIGNATURE-



Re: Animated images in mails

2006-08-28 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Plenz wrote:

 decoder wrote:
 gifasm can split them into multiple files, etc.


 Thanks, gifasm works very well. Seems that I only have to choose
 the biggest one of the output files, it contains the text.
That is what FuzzyOcr does automatically for you :) (if you set the
GIF frame option in the cf file to a low value; with 1 it will
always be used).

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8rskJQIKXnJyDxURAiDsAJ0SuPpt+3SU+CZP6zx2BTrN0CsTawCfWEVf
sEyehX84ZiLrpvV/kTZwGMk=
=Ak2M
-END PGP SIGNATURE-



Now ascii spam instead of real pictures

2006-08-28 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello there,


A friend of mine recently received a mail containing an ASCII image
advertising meds. The mail is attached.


Anyone seen this before? Do rules exist already against this kind of spam?


Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8vb3JQIKXnJyDxURAlpDAKCcMJNgdznSZnga1uZst+Lhc2iCIwCgvJP3
41fvggKf/jHtrac0n+sAdQA=
=4Ehg
-END PGP SIGNATURE-

---BeginMessage---


Monty,
Y,our sourse for poopular M,EDS

__   __ (_)   __ ___ _   _ __   __ _   /$\$ $  ___   ___   __  __ 
\ \ / / | |  / _` |  / _` | | '__| / _` |  $ $   $$$$ / __| / _ \ |  \/  |
 \ V /  | | | (_| | | (_| | | |   | (_| |  \$$  $ $   $ $| (__ | (_) || |\/| |
  \_/   |_|  \__,_|  \__, | |_|\__,_|   $ $ $ \___| \___/ |_|  |_|
 |___/ $$ $ 0
 
inculcate MqY

---End Message---


Re: Now ascii spam instead of real pictures

2006-08-28 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Loren Wilton wrote:
 Ah.  Sig-file format.  That is I guess a slight new twist.  This
 sort of thing was popular for a month or two a couple of years ago.
 I suspect they gave up on it then because it was probably done by
 hand and not worth the effort.

 Probably not too hard to catch this sort of thing.

Loren


Yeah, this could probably be caught by looking for huge amounts of
typical ASCII art characters like \ | / ( ) etc...
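
As a rough illustration of that idea, the Python sketch below measures how
much of a message body consists of typical ASCII-art glyphs. It is not an
existing SpamAssassin or SARE rule: the character set, the 0.5 threshold and
the minimum length are assumptions picked only for this example.

```python
# Characters that dominate figlet-style banners (an assumed, incomplete set).
ART_CHARS = set("\\|/()_$`'")


def ascii_art_ratio(text):
    """Fraction of non-whitespace characters that are ASCII-art glyphs."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c in ART_CHARS for c in visible) / len(visible)


def looks_like_ascii_art(text, threshold=0.5, min_chars=40):
    """True if the text is long enough and mostly made of art glyphs."""
    visible = sum(not c.isspace() for c in text)
    return visible >= min_chars and ascii_art_ratio(text) >= threshold


banner = "\\ \\ / / | |  / _` |" * 10
prose = "Hello, this is an ordinary sentence with normal punctuation. " * 3
print(looks_like_ascii_art(banner), looks_like_ascii_art(prose))
```

A real rule would need tuning against ham that legitimately contains such
characters, for example source code or plain-text tables.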

Seems like a case for SARE ;D

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8xrvJQIKXnJyDxURAmzGAJ4idww42k0/f/P+7ah7Jg3+skFW/QCgitFW
rAp+8JQeCHf9m1/QUyaSw4s=
=gg81
-END PGP SIGNATURE-



Re: [Devel-spam] FuzzyOcr 2.3b released,fixes bugs and improves stability

2006-08-27 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

jdow wrote:
 From: decoder [EMAIL PROTECTED]
 -BEGIN PGP SIGNED MESSAGE- Hash: SHA1

 Expertsites, Inc. wrote:
 From: decoder [EMAIL PROTECTED]

 Hello,


 I just uploaded FuzzyOcr 2.3b to the download site. If you
 find bugs or run into problems, please mail back :)

 This release failed to recognize the sample png.eml file with
 logfile error message: Debug mode: Image type not recognized,
 unknown format. Skipping this image...

 I resolved this problem by changing one line in FuzzyOcr.pm

 Changed:
   elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" ) {
 To read:
   elsif ( substr($picture_data,0,4) eq "\x89\x50\x4e\x47" ) {

 Tom Green -- Expertsites, Inc.



 Thank you for reporting this... seems I can't count bytes anymore
 ;)

 For anyone who is downloading this past this message, the tarball
 has been updated...

 As someone else pointed out - it has not been updated. I just
 checked, Chris.

 {^_^}
Hrm what the hell... I am 1000% sure I uploaded it -_-

ok NOW it is fixed... if not, then there is some kind of gremlin in
our server...


Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8YOyJQIKXnJyDxURApY6AJsGyauiMoSbKvgAGQVUxr1iUqXASgCfd09k
bE/7zCyzwI8wGCFw9TZSwIw=
=OfOj
-END PGP SIGNATURE-



Re: Fuzzy 2.3b and PNG

2006-08-27 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Gary V wrote:
 Rose, Bobby wrote:
 What am I missing?  I updated but now PNG isn't working.  If I
 switch to debug logging 2 I see this in the log when I run the
 sample through.

 [2006-08-26 18:16:40] Debug mode: Analyzing file with
 content-type image/png [2006-08-26 18:16:40] Debug mode:
 Image type not recognized, unknown format. Skipping this
 image...

 Thanks Bobby
 Yes, I already posted this in this thread. There is a bug in this
 line:

 elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" )

 The correct version is:

 elsif ( substr($picture_data,0,4) eq "\x89\x50\x4e\x47" )


 The tarball which is available for download has been fixed
 already...


 Chris

 I just downloaded it from
 http://users.own-hero.net/~decoder/fuzzyocr/ and line 733 says:

 elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" ) {

 Gary V

Yeah, my mistake: it seems the tarball was not uploaded... now
it should be... ;)

Chris





-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8YPfJQIKXnJyDxURApZaAJ9c3DmDnJyBWM/7kUCGf0s2pCBlMQCfbBj8
C0yO4KQrMU3UIPrfNeyowtE=
=unf7
-END PGP SIGNATURE-



Re: Animated images in mails

2006-08-27 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Loren Wilton wrote:
 Sure.  giftopnm will do it.  The FuzzyOCR plugin is using some
 other tool that will also do it, I don't recall what just at the
 moment.

 Loren

giftopnm won't do it as far as I tested it... it only extracts the
first frame...

FuzzyOcr uses two different approaches: for few frames, it simply
glues them into one frame using ImageMagick;

for many frames, it picks the best one and tests that.
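
The two approaches can be sketched as a simple decision. The Python snippet
below is only an illustration, not FuzzyOcr's code: choose_frame_strategy and
the limit of 5 frames are hypothetical, and "best" is assumed to mean the
largest frame file, as suggested in the related thread about gifasm output.

```python
def choose_frame_strategy(frame_sizes, max_frames_to_merge=5):
    """Decide how to prepare an animated GIF for OCR.

    frame_sizes: byte size of each extracted frame, in order.
    Returns ("merge", all indices) for short animations, or
    ("pick", [index of the largest frame]) for long ones.
    """
    if not frame_sizes:
        raise ValueError("no frames extracted")
    if len(frame_sizes) <= max_frames_to_merge:
        return "merge", list(range(len(frame_sizes)))
    best = max(range(len(frame_sizes)), key=lambda i: frame_sizes[i])
    return "pick", [best]


print(choose_frame_strategy([100, 900, 120]))
print(choose_frame_strategy([10, 20, 999, 30, 40, 50, 60]))
```

The actual extraction and merging would still be done by external tools such
as ImageMagick or gifasm; the sketch only covers the choice between them.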

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8YQ9JQIKXnJyDxURAo+eAJ9Wk+gzU2jssvSYK+a8MfFtbiJJbgCgmrpi
4zx5qlGfVPqRqVxO/7HMFIY=
=Xu9s
-END PGP SIGNATURE-



Re: [Devel-spam] FuzzyOcr 2.3b released,fixes bugs and improves stability

2006-08-26 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Gary V wrote:
 Hello,


 I just uploaded FuzzyOcr 2.3b to the download site. If you find
 bugs or run into problems, please mail back :)

 The jpeg.eml and png.eml samples failed to provide FuzzyOcr hits on
 my system because the messages scored higher than the default
 focr_autodisable_score. You should mention in the README file in the
 samples directory that you may need to temporarily raise the
 focr_autodisable_score while testing.
Ah thanks... I didn't think about that... earlier, the score was 50 by
default and I lowered it to 10 without redoing the tests :)

Chris

 Gary V



-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8B+CJQIKXnJyDxURAvgMAJ9+zygJtk0qHNWjOoNwkKxfQMOanACeImox
I2+dh0H9UAtHxmkyHurPtfo=
=0TIT
-END PGP SIGNATURE-



Re: Animated images in mails

2006-08-26 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Plenz wrote:
 Today I got animated spam. The first frame only with dots and lines, the
 second frame with spam text, the third frame again with dots and lines. The
 duration of the text frame is very long, the others are very short.

 Is there a command line utility which can extract animated GIFs?
Various... imagemagick can either extract them or put them into one
image, gifasm can split them into multiple files, etc.


FuzzyOcr utilizes both as needed to scan animated GIFs.

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8CCgJQIKXnJyDxURAtr1AJ4/6ONiWg3t5mQJVt9MUcNpYfY3YACfcXW/
xQ4dD6PpT9CW79pekPvfQQw=
=PU49
-END PGP SIGNATURE-



Re: FuzzyOcr 2.3b release, broken with SA 3.1.0

2006-08-26 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello,

I was just informed that the latest FuzzyOcr version, 2.3b, uses a
module from SA which is only available in 3.1.4, not in 3.1.0. The
missing module is Mail::SpamAssassin::Timeout. Currently, the only way
to fix this is to upgrade to 3.1.4. I am still unsure whether I should
add my own timeout code with alarm() just to support 3.1.0.
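
For readers unfamiliar with the pattern, a self-made timeout of this kind
usually wraps the slow call in an alarm signal. The sketch below shows the
general idea in Python rather than Perl; run_with_timeout and PluginTimeout
are hypothetical names, and this is not how Mail::SpamAssassin::Timeout is
implemented. It relies on SIGALRM, so it only works on Unix and in the main
thread.

```python
import signal
import time


class PluginTimeout(Exception):
    """Raised inside the wrapped call when the time budget is used up."""


def run_with_timeout(func, seconds):
    """Run func(); return (True, result), or (False, None) on timeout."""
    def on_alarm(signum, frame):
        raise PluginTimeout()

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return True, func()
    except PluginTimeout:
        return False, None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel a pending alarm
        signal.signal(signal.SIGALRM, old_handler)


print(run_with_timeout(lambda: "done", 1.0))
print(run_with_timeout(lambda: time.sleep(1), 0.1))
```

The finally block matters: without it a leftover alarm could fire later in
unrelated code.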

Maybe someone else here has a better idea :)



Chris



decoder wrote:
 Hello,


 I just uploaded FuzzyOcr 2.3b to the download site. If you find
 bugs or run into problems, please mail back :)

 The major changes are:

 - Added a configurable timeout (maximum runtime) for the plugin, to
   avoid any lockups/unwanted delays
 - The default matching threshold (set in the config file) can now be
   overridden on a per-word basis in the wordlist

 An example, wordlist contains:

 word1
 word2::0
 word3::0.2


 Then word1 is matched with the default threshold set in the config
 file, word2 must be an exact match (threshold 0), and word 3 is
 matched with a threshold of 0.2.

 This is especially useful for words which trigger false positives
 very often like: penis, money or news.

 Note that the tendency to produce a FP is not directly connected to
 the word length. The word buy produces very few FP compared to
 penis, when both are being matched with the same threshold.

 The FuzzyOcr.words.sample contains some suggestions for word
 specific thresholds which I recommend.

 - The experimental MD5 database has been replaced by a custom hash
 database which is able to match very similar images.

 Often, you get the same image twice, or all your customers get the
 same spam mail. But even though the pictures look the same, they
 are not identical. That is why MD5 was useless. The newly
 introduced hash (self invented) is able to recognize almost
 identical images based on features that I won't explain here as it
 would make it easier for spammers :) If a message contains a
 picture previously registered in the database, the original score
 is reread from the database and the message is immediately tagged
 with this score and the plugin ends.

 - Some non-alpha-alpha translations are now used on the gocr
 output, that fix common mistakes, like i being misread as ; or
 a as 8.

 - There are now 2 scores for broken images, one is used when the
 picture is recognized as broken, but giffix was able to correct the
  errors and it gave some output that can be scanned, the other one
 is used if the image is unfixable (that means either too broken, or
  interlaced/animated and broken). The first one is set lower than
 the second one (2.5 vs. 5).

 -Various bugfixes

 TODO:

 - Write an external program to manage the database (add, remove and
   verify given pictures).
 - Rewrite the temp file system to do all external program operations
   on files (saves memory).


 Another wish: I'd like to create a database to ship with the plugin
 so it can be used out of the box, but I do not have many samples
 here, so it would be nice if you sent me picture samples of common
 picture spam you get with [picture sample] in the subject to my
 mail address. I will post here again once I have enough :).


 Thanks to Jorge Valdes, Michael Alan Dorman and UxBoD for finding
 bugs and sending improvement suggestions for this version

 Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8JRFJQIKXnJyDxURAgY1AJ97hGp6zw94H+eUCeH2lay9T2mVDgCdFWEE
4VOwP8X4yVlPguHD6S1m9tI=
=ufN9
-END PGP SIGNATURE-



Re: Fuzzy 2.3b and PNG

2006-08-26 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Rose, Bobby wrote:
 What am I missing?  I updated but now PNG isn't working.  If I switch to
 debug logging 2 I see this in the log when I run the sample through.

 [2006-08-26 18:16:40] Debug mode: Analyzing file with content-type
 image/png
 [2006-08-26 18:16:40] Debug mode: Image type not recognized, unknown
 format. Skipping this image...

 Thanks
 Bobby
Yes, I already posted this in this thread. There is a bug in this line:

elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" )

The correct version is:

elsif ( substr($picture_data,0,4) eq "\x89\x50\x4e\x47" )


The tarball which is available for download has been fixed already...
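
The bug is a length mismatch: a five-byte slice can never equal a four-byte
signature. The Python sketch below shows the same kind of magic-number check
with the slice length derived from the signature itself, which avoids this
class of error. detect_format is a hypothetical helper, not the function in
FuzzyOcr.pm.

```python
PNG_MAGIC = b"\x89PNG"      # the same four bytes as \x89\x50\x4e\x47
GIF_MAGIC = b"GIF8"         # common prefix of GIF87a and GIF89a
JPEG_MAGIC = b"\xff\xd8"    # JPEG start-of-image marker


def detect_format(data):
    """Guess the image type from the leading bytes, or None if unknown."""
    signatures = (("png", PNG_MAGIC), ("gif", GIF_MAGIC), ("jpeg", JPEG_MAGIC))
    for name, magic in signatures:
        # Slice exactly as many bytes as the signature is long.
        if data[:len(magic)] == magic:
            return name
    return None


png_header = b"\x89PNG\r\n\x1a\n"
print(png_header[:5] == PNG_MAGIC)   # the buggy comparison: 5 bytes vs 4
print(detect_format(png_header))
```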


Chris


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8MoeJQIKXnJyDxURAiVFAKCleKLAkgiklWw1yZdsWPmmXvibOgCfQa5K
eIWLLQcS1Lch1Rcd41tjB38=
=jYbC
-END PGP SIGNATURE-



Re: Broken images in mails

2006-08-25 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Plenz wrote:
 Adding a point for corrupted images is sounding better and better.

 I disagree. To check out what happens I converted a JPG picture into a
 GIF file and sent it to myself. One time I converted it with IrfanView
 and the second time with PaintShop Pro. Both GIF files had the result
 giftopnm: EOF or error reading data portion... So I produced a
 corrupt (?) image, but it was not spam.

 I have no idea what is wrong and how it could be fixed. Only this: a
 GIF file seems to be divided into several blocks. Perhaps one block
 (perhaps the last block) is too short and does not match its block
 header (if any exists?). Perhaps it is possible to read out the
 correct block length from a header and fill the block with 00h to get
 a valid GIF file.

 Ah... I just found that there is a program named GIFFIX. I should try
 it out.

FuzzyOcr will try to invoke Giffix if an image is broken. If giffix
does not completely fail, then it will only give a low score for the
picture being corrupted. If it isn't able to fix the image at all,
then it will give a higher score.
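
Expressed as code, the decision looks roughly like the Python sketch below.
The function name and parameters are hypothetical. The values 2.5 and 5 are
the ones given in the FuzzyOcr 2.3b release notes, but they are configurable
there, so treat them as example defaults.

```python
def corrupt_image_score(image_ok, fix_produced_output,
                        fixable_score=2.5, unfixable_score=5.0):
    """Score for a GIF depending on whether a repair tool could recover it."""
    if image_ok:
        return 0.0                      # nothing wrong, no penalty
    if fix_produced_output:
        return fixable_score            # broken, but repaired well enough to scan
    return unfixable_score              # too broken to recover anything


print(corrupt_image_score(False, True))
print(corrupt_image_score(False, False))
```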


Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE7sVkJQIKXnJyDxURAv29AJ9i/LjlLx1me4TZiwRrSuD0KasBYQCfagl2
95Nt5kXjo3v+WO7i2jngnCk=
=XN3X
-END PGP SIGNATURE-



Re: Discourage broken content

2006-08-25 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Kenneth Porter wrote:
 --On Friday, August 25, 2006 12:05 AM -0700 Plenz
 [EMAIL PROTECTED] wrote:

 I disagree. To check out what happens I converted a JPG picture
 into a GIF file and sent it to myself. One time I converted it with
 IrfanView and the second time with PaintShop Pro. Both GIF files
 had the result giftopnm: EOF or error reading data portion... So I
 produced a corrupt (?) image, but it was not spam.

 I think we should discourage all broken content in email and on the
 web.

 At one time we could assume that broken content was an honest
 mistake and make an attempt at fixing it. But with the rise of
 malicious content attempting to exploit bugs in content handlers
 (like overruns in image libraries), we should simply reject anything
 that fails to pass validation, on the assumption that it's out to
 get us.

 This includes not just broken images but also broken HTML, which is
 so commonly used to conceal spam.

 We need to stop giving a free pass to broken content creation
 software just because it's popular. When someone sends you broken
 content, you should react the same way you would if they sent you
 documents on dirt-smeared paper. Stop letting your emperor walk
 around naked.

I completely agree. The problem is that some implementations make this
impossible, for example MailScanner.

I've heard that it truncates the mail at 30kb, no matter if that is
within a MIME block or not... So my plugin gets a broken image..
though it was not broken originally...

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE705eJQIKXnJyDxURAiGZAJ4q2f5KIxWjrYN3U6vB4kFhLbZ2igCfVM1l
n13w21PXoSH7IethDVc3uio=
=IWPe
-END PGP SIGNATURE-



Re: Discourage broken content

2006-08-25 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Logan Shaw wrote:
 On Fri, 25 Aug 2006, enediel gonzalez wrote:
 From: decoder [EMAIL PROTECTED] Kenneth Porter wrote:

 I completely agree, the problem is, some implementations makes
 this impossible. For example MailScanner.

 I've heard that it truncates the mail at 30kb, no matter if
 that is within a MIME block or not... So my plugin gets a
 broken image.. though it was not broken originally...

 Yes, if you leave the default Max SpamAssassin Size = 30000
 setting in place, it will do this.

 Could somebody explain to me the reason why MailScanner acts this
 way?

 Performance.  The theory, I think, is that if a message is spam,
 there should be some evidence of that in the first 30000 bytes, so
 there is no need to pass the whole message to SpamAssassin.

 I think this was a good assumption and a good plan when
 SpamAssassin didn't check a lot of attachments.  Now that there are
 plugins which do check attachments, leaving the MIME structure of
 the message intact is more important, but MailScanner hasn't caught
 up with this reality.
I heard that a proposal for leaving the MIME structure intact has been
made... so at least if the message was truncated, it wouldn't be
truncated in the middle of an attachment (which would make absolutely
no sense: either you truncate before or after the attachment; a broken
attachment doesn't help anyone and will only cause unnecessary errors).
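
Truncating only at part boundaries could look like the Python sketch below.
It is a hypothetical illustration, not MailScanner code: parts_to_scan is an
invented name, and the message is reduced to a list of MIME part sizes in
bytes.

```python
def parts_to_scan(part_sizes, limit):
    """Return indices of the leading MIME parts that fit within `limit` bytes.

    A part that would be cut off is dropped as a whole instead of being
    passed on half-complete.
    """
    kept, used = [], 0
    for index, size in enumerate(part_sizes):
        if used + size > limit:
            break
        kept.append(index)
        used += size
    return kept


# Text part, small image, large image, with a 30000-byte budget.
print(parts_to_scan([2000, 25000, 40000], 30000))
```

With this approach an image attachment is either scanned completely or not
at all, so an image plugin never sees a file that was cut in half.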

Chris

 Of course, you can always just remove the limitation by changing
 the MailScanner configuration file.

 A good question would be whether you adapt this plugin to be
 compatible with MailScanner or whether the latter should change this
 practice.

 MailScanner calls SpamAssassin, so no adaptation needed in most
 cases.  Unless you are talking about workarounds for issues like
 the above.

 - Logan

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE71X+JQIKXnJyDxURAnGdAKC2aHFPzyX8lFhhsoSsrIgl+ci6QgCeJO4q
58fKQR01gJE0I/0P2Zpdprw=
=MU3c
-END PGP SIGNATURE-



Re: Discourage broken content

2006-08-25 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Rick Cooper wrote:

 -Original Message- From: decoder
 [mailto:[EMAIL PROTECTED] Sent: Friday, August 25, 2006 2:24
 PM To: users@spamassassin.apache.org Subject: Re: Discourage
 broken content


 -BEGIN PGP SIGNED MESSAGE- Hash: SHA1

 Kenneth Porter wrote:
 --On Friday, August 25, 2006 12:05 AM -0700 Plenz
 [EMAIL PROTECTED] wrote:

 I disagree. To check out what happens I converted a JPG
 picture into a GIF file and sent it to myself. One time I
 converted it with IrfanView and the second  time with
 PaintShop Pro. Both GIF files had the result giftopnm: EOF
 or error reading data portion... So I produced a corrupt (?)
 image, but it was not spam.
 I think we should discourage all broken content in email and on
 the web.

 At one time we could assume that broken content was an honest
 mistake and make an attempt at fixing it. But with the rise of
 malicious content attempting to exploit bugs in content
 handlers (like overruns in image libraries), we should simply
 reject anything that fails to pass validation, on the
 assumption that it's out to get us.

 This includes not just broken images but also broken HTML,
 which is so commonly used to conceal spam.

 We need to stop giving a free pass to broken content creation
 software just because it's popular. When someone sends you
 broken content, you should react the same way you would if they
 sent you documents on dirt-smeared paper. Stop letting your
 emperor walk around naked.
 I completely agree, the problem is, some implementations makes
 this impossible. For example MailScanner.

 I've heard that it truncates the mail at 30kb, no matter if that
 is within a MIME block or not... So my plugin gets a broken
 image.. though it was not broken originally...


 That is patently false. I have a graphics design/advertising
 department at one of my locations and these fellas send huge
 graphics files back and forth when they have emergency
 proofs/changes and MailScanner has *never* damaged anything, ever,
 anywhere. Now, there is a setting for scanning (much like exiscan
 IIRC) that allows you to truncate the message and only scan xxx
 amount; it's optional and doesn't modify the actual message in
 any way.

 Rick
I did not say it damages the mail. I said it feeds only a given amount
of the message to SpamAssassin and THAT breaks plugins requiring the
whole message, especially when MailScanner breaks messages in the
middle of attachments.

And as far as I know, it is the default setting of MailScanner to feed
only a given amount of kb to SpamAssassin. That does not mean it
truncates the message before delivering it.

Chris



 -- This message has been scanned for viruses and dangerous content
 by MailScanner, and is believed to be clean.



-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE71wLJQIKXnJyDxURAtxUAJ9/O5F4cC/1vlsE6EsRb6vLcepH+ACfcTCA
x4CmnLDyZbUFtAr2kWK9koY=
=Ckpc
-END PGP SIGNATURE-



FuzzyOcr 2.3b released, fixes bugs and improves stability

2006-08-25 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello,


I just uploaded FuzzyOcr 2.3b to the download site. If you find bugs
or run into problems, please mail back :)

The major changes are:

- Added a configurable timeout (maximum runtime) for the plugin, to
avoid any lockups/unwanted delays
- The default matching threshold (set in the config file) can now be
overridden on a per-word basis in the wordlist

An example, wordlist contains:

word1
word2::0
word3::0.2


Then word1 is matched with the default threshold set in the config
file,
word2 must be an exact match (threshold 0), and word 3 is matched
with a threshold of 0.2.

This is especially useful for words which trigger false positives
very often, like: penis, money or news.

Note that the tendency to produce an FP is not directly connected
to the word length.
The word buy produces very few FPs compared to penis when both
are matched with the same threshold.

The FuzzyOcr.words.sample contains some suggestions for word
specific thresholds which I recommend.
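
A minimal sketch of how such per-word thresholds could be read and
applied, written in Python for illustration only; it is not the
plugin's actual code. Assumptions: the default of 0.25 is a made-up
stand-in for the config-file value, and "threshold" is taken to mean
the maximum edit distance relative to the word length. The function
names are hypothetical.

```python
DEFAULT_THRESHOLD = 0.25  # assumed stand-in for the config-file default


def parse_wordlist(lines, default=DEFAULT_THRESHOLD):
    """Return {word: threshold} from 'word' or 'word::threshold' entries."""
    words = {}
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        word, sep, value = entry.partition("::")
        if not word:
            continue  # ignore entries without a word
        words[word.lower()] = float(value) if sep else default
    return words


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def matches(word, token, threshold):
    """Assumed rule: distance / word length must not exceed the threshold."""
    return edit_distance(word, token.lower()) / len(word) <= threshold
```

With the example wordlist above, an OCR token such as "w0rd2" would
not match word2 (threshold 0 demands an exact match), while "w0rd3"
would still match word3 under this assumed rule, since one wrong
character out of five gives a ratio of 0.2.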

- The experimental MD5 database has been replaced by a custom hash
database which is able to match very similar images.

Often, you get the same image twice, or all your customers get the
same spam mail. But even though the pictures look the same, they
are not identical. That is why MD5 was useless. The newly
introduced hash (my own invention) is able to recognize almost
identical images based on features that I won't explain here, as it
would make things easier for spammers :)
If a message contains a picture previously registered in the
database, the original score is read back from the database, the
message is immediately tagged with this score, and the plugin ends.
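
The lookup flow (not the hash itself) can be sketched as follows, in
Python for illustration only. Everything here is an assumption: the
real image features are deliberately not described, so the sketch
uses an opaque tuple of numbers, a hypothetical 5% tolerance, and
made-up names (similar, ScoreDb).

```python
def similar(a, b, tolerance=0.05):
    """True if every feature differs by at most `tolerance` (relative)."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= tolerance * max(abs(x), abs(y), 1)
               for x, y in zip(a, b))


class ScoreDb:
    """In-memory stand-in for the database of known images."""

    def __init__(self):
        self._entries = []  # list of (features, score) pairs

    def add(self, features, score):
        self._entries.append((tuple(features), score))

    def lookup(self, features):
        """Return the stored score of the first similar image, else None."""
        for known, score in self._entries:
            if similar(known, features):
                return score
        return None


db = ScoreDb()
db.add((400, 300, 16), 10.0)          # placeholder features of a known image
cached = db.lookup((402, 299, 16))    # nearly identical picture: score reused
unknown = db.lookup((800, 600, 16))   # different picture: None, scan as usual
```

A byte-level digest such as MD5 would treat the two near-identical
tuples above as unrelated, which is the problem described here.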

- Some non-alpha to alpha translations are now applied to the gocr
output to fix common mistakes, such as i being misread as ; or a as 8.
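
As an illustration only (Python, not the plugin's code), such a
translation step could look like the sketch below. Only the two
substitutions named above are included; the table name and function
name are hypothetical.

```python
# Map characters that OCR commonly produces by mistake back to letters.
OCR_FIXES = str.maketrans({";": "i", "8": "a"})


def normalize_ocr(text):
    """Apply the character fixes to raw OCR output before word matching."""
    return text.translate(OCR_FIXES)
```

Note that a blanket translation like this also rewrites genuine
digits and semicolons, so it only makes sense on text that is about
to be compared against the wordlist.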

- There are now 2 scores for broken images. One is used when the
picture is recognized as broken, but giffix was able to correct the
errors and produced output that can be scanned. The other is used if
the image is unfixable (that is, either too broken, or
interlaced/animated and broken). The first one is set lower than the
second one (2.5 vs. 5).
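
The choice between the two scores amounts to the following, shown as
a small Python sketch for illustration. The constant and function
names are hypothetical; only the values 2.5 and 5 come from the text
above.

```python
FIXED_SCORE = 2.5      # broken, but giffix repaired it well enough to scan
UNFIXABLE_SCORE = 5.0  # too broken, or interlaced/animated and broken


def broken_image_score(is_broken, repaired):
    """Return the extra score for a broken image, or 0.0 if it is intact."""
    if not is_broken:
        return 0.0
    return FIXED_SCORE if repaired else UNFIXABLE_SCORE
```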

- Various bugfixes

TODO:

- Write an external program to manage the database (add, remove and
verify given pictures).
- Rewrite the temp file system to do all external program operations on
files (saves memory).


Another wish: I'd like to create a database to ship with the plugin so
it can be used out of the box, but I do not have many samples here, so
it would be nice if you sent picture samples of common picture spam
you get, with [picture sample] in the subject, to my mail address. I
will post here again once I have enough :).


Thanks to Jorge Valdes, Michael Alan Dorman and UxBoD for finding bugs
and sending improvement suggestions for this version

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE72jaJQIKXnJyDxURApfeAJ47JcACEeIaYtEA8z6wDdFxGPhrUgCZAZSE
sdWROYeF8IFdbUX0njAdV+o=
=y7XM
-END PGP SIGNATURE-



Re: FuzzyOcr 2.3b released, fixes bugs and improves stability

2006-08-25 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

John Andersen wrote:
 On Friday 25 August 2006 13:17, decoder wrote:
 Another wish: I'd like to create a database to ship with the
 plugin so it can be used out of the box but I do not have much
 samples here, so it would be nice if you sent me picture samples
 of common picture spam you get with [picture sample] in the
 subject to my mail address. I will post here again if I got
 enough :).

 Wouldn't it be more productive to the community to work with SURBL
  to enable the centralized storage of these hashes?

 Or perhaps with Razor2?

 I'm not an expert on Razor, but my limited understanding of it is
 that it generates hashes of (portions of) message bodies and stores
  that hash for future comparison.

 It would seem that once someone decides something is spam, one could
 take your hash and wrap a minimal message around it and report THAT
 to razor.

 Then your engine could examine an image, generate your hash, and
 wrap it in the same minimal message and Query Razor.  Presumably
 getting a hit.

 No local database is needed, because a world wide one would be
 substituted. That way, if you get this spam and report it, It will
 already be known by the time I get the spam.

Maybe it would. But this kind of hash is not a real hash. It is just a
combination of picture features that I invented... but it seems
reliable in my tests so far. Once it has been tested in public, such a
cooperation with SURBL or Razor might be possible.


Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE724mJQIKXnJyDxURAuW6AKClt1V0/faPEJaTwjLRXChXqhtTkwCfc9Yp
UBsuigcaOac6pOZz2EP7Gkk=
=LJEa
-END PGP SIGNATURE-



Re: FuzzyOcr 2.3b released, fixes bugs and improves stability

2006-08-25 Thread decoder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Michael Scheidell wrote:
 Now if you could just ocr the whole thing as text, and pass it back to
 SA to score!
I explained before why this is not going to happen really soon:

a) It is VERY hard to implement. To preserve the message, you would
need two plugins: one that runs as the first rule and converts the
message to text only, and another that runs as the last rule and puts
the image back into the message (so the message stays unchanged).

b) The default gocr output is not reliable enough for text-only rules.
The current FuzzyOcr achieves better results by doing multiple scans
with different settings.

Chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE728SJQIKXnJyDxURAlaQAJ447+AJu7pHwnqfHR5MkdCRIf5zDQCfedAb
7PyOxUGE4oTuoVmd5JRGuGw=
=dMnX
-END PGP SIGNATURE-


