Re: FuzzyOcr 3.6.0 released
RW wrote:
> AFAIK though it isn't possible to place a cap on the FuzzyOCR score. I don't want to, but I detune it purely to reduce the likelihood of something hitting my discard threshold by OCR alone.
If you consider this feature so important, then I could implement a max_score feature that caps the score assigned by word recognition. This is easy to implement. Or should it rather be a cap on all FuzzyOcr rules, including the others such as malformed file etc.?
Cheers, Chris
smime.p7s Description: S/MIME Cryptographic Signature
Re: New spamassassin OCR plugin
alex k wrote:
> If only FuzzyOCR's developer would read that ;) Unfortunately he doesn't seem to be interested in his project anymore. Maybe you could take care of this orphaned code.
Dear Alex, I am reading exactly everything you write ;) The code is not orphaned, but it is also not being extended at the moment. The SVN version runs stably with all SA 3.2.x releases. I answer tickets and questions via email. I am planning a new release, but my schedule is tight.
Best regards, Chris
Re: New spamassassin OCR plugin
LuKreme wrote:
> On 24-May-2009, at 18:40, Henrik K wrote:
> > I don't know why users are so afraid of words like SVN. You have to look at the project, not version numbers.
> I don't have FuzzyOCR installed, and it's not because of the SVN. First, I don't think my server can take the processing hit and second it requires so much to be installed that I'm SURE my server can't take the hit.
May I ask how many mails you process per day? Please note that
a) FuzzyOcr runs last if properly installed
b) it doesn't do anything if the score exceeds a configurable threshold
c) it supports hashes and other things that make processing faster
Cheers, Chris
FuzzyOcr 3.6.0 released
Hello all,
after quite some time, I've decided to release another version of FuzzyOcr. This version is only a tag of SVN revision 135 (plus a recently provided patch that fixes something in one of the SQL utilities), which has been used for quite some time with SA 3.2.x and is already included in some major distributions. If you are using FuzzyOcr from SVN (rev 135), then there is no need for you to upgrade.
Since image spam seems to be on the rise again, lots of people have contacted me in the last 2 months, and I have been asked many times to release another tarball... So I hope someone will find it useful. No new features are added in this release, as I decided to first tag the version that is working without known problems, for those who seemed to have a problem with checking out the version from SVN. The major version number increase is due to the fact that this release breaks compatibility with SA 3.1.x and now requires SA 3.2.x. See http://fuzzyocr.own-hero.net/wiki/Downloads for more details.
Although I still can't invest that much time into the project at this point, there are some features I'd like to add in the near future, such as regex support. I have also considered rewriting the scoring engine, because some people share the opinion that it is too sensitive (as opposed to others who consider it to be good).
Best regards, Chris
Re: Experimental Plugin: MetaSVM
LuKreme wrote:
> This is an excellent idea, but it also needs rule hits on ham, right?
You're right if you're saying that the method would work better if there were more ham rules. From what I have seen in my experiments, however, the results are already very precise with the current SA ruleset. Still, any rule that adds some information to the feature set might further increase the performance, especially on unrecognized spam; on ham/spam that SA detects as well, the algorithm performs nearly as well as SA itself.
Regards, Chris
Re: Experimental Plugin: MetaSVM
LuKreme wrote:
> I don't see any need for the model to be dynamic. Periodic recalculation of it should be just fine. I bet even daily reprocessing will prove to be overzealous. Weekly, perhaps even monthly.
This is what I think as well :)
> I'm thinking that FPs and FNs are bayes problem anyway. This tool needs to concentrate on seeing just what rules hit and building off that. I'd go so far as to say that as far as SVM is concerned, there is no such thing as a false positive or negative.
What do you mean by that? Of course FPs and FNs might also be a problem for the SVM; every wrongly classified point is certainly a problem for a machine learning algorithm. However, I think that the SVM is quite robust to a certain amount of FPs/FNs if the majority of the training points are correct.
So, if you feel like trying out the plugin, let me know how well it works =) I'm especially interested in those cases where it increases the spam detection rate (reducing false negatives). It might be easy to extract this information from logs.
Thanks and regards, Chris
Experimental Plugin: MetaSVM
Hi all,
as a result of the recent 2+2 != 4 discussion on the list, here is a new plugin which tries to learn ham/spam classification only by knowing which rules triggered and which did not. This is, so to speak, an automatic meta rule. The plugin is currently experimental and can only be checked out from SVN at: https://svn.own-hero.net/sysadmin/MetaSVM/trunk
For now I recommend not using it in a production environment, as it is still untested (apart from my own tests). In order to use the plugin, you need to train your own model, which requires a certain amount of ham/spam. I evaluated the plugin with my own ham/spam corpus (roughly 5000 spam, 3000 ham), and the resulting model did not produce false positives with respect to the default scoring, but it caught approx. 30% of the mails that were not caught by SA itself. I'll probably release more detailed numbers in some whitepaper soon :)
Best regards, Chris
Re: Experimental Plugin: MetaSVM
AlexB wrote:
> Chris From the README it's not quite clear: will this work in autolearn?
If you mean that the plugin can automatically learn with the autolearn setting, the answer is no.
> would it be enough to create the model.* files or is it a must to feed it?
You create one model file once by feeding it a large corpus of ham+spam. Once you have done that, and evaluated it as described in the README, the model should work accurately enough for your mail gateway, and I expect it to work for a long time, mainly because it doesn't depend that much on the type of spam (i.e. the results that the model produces are assumed to be more generalizable than, for example, your bayes db).
> In cases of busy gateways, where manual training is highly impractical, it would need to feed itself with headers from SA report's score X
The problem is that feeding does not work with an SVM algorithm. You have to train on the _whole_ set _always_, so feeding mails is impractical. That's why you do this process _once_ with a lot of ham and spam. You can repeat this process at any time, but it isn't necessary to do so permanently. It is to be expected that the model accuracy will decrease with time ( a) because your rules change and b) because spam changes ), but I think this is a slow process. It has yet to be evaluated how well the model performs over time :)
Best regards, Chris
Re: Experimental Plugin: MetaSVM
John Hardin wrote:
> I assume it learns from full message corpora? And all it cares about is the rules that hit? Per my earlier suggestion of learning off the logs + corpora to fix FP/FN, could there be an option to learn off generated minimal corpora files, with their structure being just the rules hit per message (msgid + hits on one possibly very long line)? e.g.:
> kggbph.617...@localhost BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL
Yes, this is certainly possible. Basically, all the algorithm requires for the SVM is the rules that hit and the classification (ham or spam). (Actually, the rules that did not hit are fed into the SVM as well, but they are taken from the global rules file underlying the model.) The tool additionally requires the score to evaluate FP/FN properly when testing the model, and the message id would be helpful to find false positives if one wants to investigate. So you are right, all this info would be enough, and I can easily modify the tool to use this kind of format. I'll try to come up with a code modification to switch the input format :)
> Then an external tool could generate and maintain these files from the SA log and the maintained training corpora, omitting FP/FN from the log data.
Yes, that's a good idea, certainly better than learning directly from the mail, which might be scattered around several mailboxes. However, how do you want to exclude FP/FNs? The log certainly doesn't provide this information. On the other hand, having some false positives in the training data did not spoil my results. The algorithm even predicted these correctly as spam later on :)
Cheers, Chris
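
The one-line-per-message format discussed above could be turned into SVM input roughly as follows. This is only a sketch: the function names, the sample message id and the short `ALL_RULES` list are made up for illustration and are not taken from the MetaSVM code. A real rule list would come from the global rules file underlying the model.

```python
from typing import List, Tuple


def parse_hits_line(line: str) -> Tuple[str, List[str]]:
    """Split '<msgid> RULE_A,RULE_B,...' into the message id and its rule hits."""
    msgid, _, hits = line.strip().partition(" ")
    return msgid, [h for h in hits.split(",") if h]


def to_feature_vector(hits: List[str], all_rules: List[str]) -> List[int]:
    """Return 1 for every known rule that hit and 0 for every known rule that did not."""
    hit_set = set(hits)
    return [1 if rule in hit_set else 0 for rule in all_rules]


# Hypothetical global rule list (assumption, for illustration only)
ALL_RULES = ["BAYES_99", "RAZOR2_CHECK", "RCVD_IN_BRBL", "URIBL_BLACK", "HTML_MESSAGE"]

msgid, hits = parse_hits_line("msg-0001@localhost BAYES_99,RAZOR2_CHECK,URIBL_BLACK")
vector = to_feature_vector(hits, ALL_RULES)
print(msgid, vector)
```

Rules that are not in the global list are simply ignored here, so the vector length stays fixed. That fixed dimension is what lets a trained model be reused without retraining for every message.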
Re: Experimental Plugin: MetaSVM
John Hardin wrote:
> It needs the score, and not just Y/N Spam/Ham (i.e. from which corpora file it came)?
The SVM does not need the score. However, the evaluation tool needs the score because it uses it to calculate the FP/FN rate.
> I was thinking you'd generate a ham file and a spam file from the log, possibly dynamically appending rows as messages are processed. Naturally this would contain FPs and FNs.
If you want it to be dynamic, then the plugin could do the appending. However, the model cannot be extended; to incorporate new lines, the whole model must be recalculated. So this can't be done per message, but only perhaps on a daily basis.
> You'd have a routine to extract the ham file from your full ham corpus/corpora, and likewise for spam. The assumption is any FP or FN would be placed into these corpora for normal bayes training. The tool would then combine them, omitting from the log-generated files any msgid that appears in the training corpora files. You'd end up with one clean spam file and one clean ham file.
That implies that people are indeed using bayes training, but it might be a suitable idea. However, I don't think that FPs and FNs spoil the SVM result anyway. SVMs are quite robust to outliers (which FPs and FNs essentially are), and if their number is low compared to the total amount of mail, the algorithm will have no problem predicting them properly anyway :)
> Er, don't you mean it predicted them as ham (FP = ham scored as spam)? It would be great if it was smart enough to recognize a near-boundary false result as what it *should* have been...
I mean that I had some unrecognized spam left in my inbox, and the algorithm did identify it as spam :) The SVM generally tries to find a separating hyperplane; however, if the wrongly labeled points (FPs and FNs) are few, the SVM will most likely produce a result where the FPs and FNs do not match the label they were trained with. The C-SVM uses a cost constraint (each label violation costs a certain value) and tries to minimize a given term which includes this cost. So if the dataset is sufficiently large but has _some_ wrongly labeled points, the chances are high that the result is still what you wanted :)
-- Chris
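
For reference, the "term which includes this cost" is usually written as the soft-margin C-SVM primal below. This is the textbook form, shown for the linear case; the thread does not say which kernel was used, so treat the linear form as an assumption. Here $x_i$ is the rule-hit vector of message $i$, $y_i \in \{-1, +1\}$ its ham/spam label, $\xi_i$ the slack that measures how far the point violates its label, and $C$ the cost per unit of violation.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
```

A mislabeled point can be left on the wrong side of the hyperplane at a price of $C\,\xi_i$. When only a few points are mislabeled, paying that price is cheaper than bending the boundary around them, which is why a small number of FPs and FNs in the training set need not spoil the result.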
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote:
> So - making any progress? :)
Yes, indeed. I am currently rewriting my code to be more generic and cleaner (you wouldn't want to see my initial poc code^^). Once I'm done with that, I can quickly repeat some of the experiments on other mail sets, such as the one that Justin sent me. After that, I'll write a small plugin for SA, so you guys can test around with it (that shouldn't be a big deal).
Chris
P.S.: If you want to provide me with another ham/spam corpus or even a collection of false negatives, feel free to contact me :)
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
John Hardin wrote:
> Chris: Do you have any interest in writing an offline tool that generates static metarules based on the SA log and FP/FN corpora, as I mentioned?
Running some experiments for this kind of tool is at least on my todo list :) However, I don't know when I will have time to do that :)
Cheers, Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Justin Mason wrote:
> Thanks for doing this! couple of q's:
> 1. I can offer a bigger ham/spam corpus if you'd like to test against that as well; corpora from multiple contributors can sometimes expose training set bias.
That would be cool :) Is this corpus already processed by spamassassin (i.e. has SA headers)? My poc code currently mines only the headers to find out what rules are triggered.
> 2. can you test it on spam that scored less than 10 points when it arrived? low-scoring spam is, of course, more useful to hit than stuff that scored highly on the existing rules.
Things like that should be possible easily. I need to check if I have enough mails to do a sufficiently reliable test here.
> 3. does it give an indication of confidence in its results? or just a binary spam/ham decision?
I'm currently working only with a binary classifier. However, libsvm supports probability estimates and regression (and to my knowledge, internally, most SVM algorithms relax classification output to real values and then use the sign to determine the classification; this can also be seen as some sort of confidence value).
> 4. hey, if you're writing an SVM plugin, it might be worth making one that _also_ supports body text tokens, similarly to the existing Bayes plugin. ;)
This would surely be possible somehow, but we'd first have to come up with a good representation of the problem for an SVM. I wouldn't want to mix this with the current experiment either, as these two things somehow represent different data. One of the problems with text tokens is that there can always be new ones (which would increase the dimension of the problem and hence require the whole SVM to be remodeled, so a system as performant as bayes might not work directly).
> 5. btw one particularly tricky part of dealing with user-trainable dbs, is supporting expiry of old tokens. but that can be deferred until later anyway.
I guess this is a question of implementation :)
Best regards, Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote:
> Good work so far but sounds like you need to throw more data at it. Also even though you indicate over 99% accuracy can you break that down better? 99.9% is 10 times as accurate as 99%.
What do you mean by more data? Of course, some additional data might help. One should consider that _most_ of the SA rules are designed to score on spam. For an SVM, you can use more general data like "Mail has property XYZ", although you don't know what this property means (ham or spam) or whether it is even suitable for classifying anything. This is of course an advantage.
With respect to the numbers: I repeated the experiments today with slight modifications to provide a more solid setup. The input is again the dataset I used yesterday. In one run, I permute the dataset, then split it (2/3 training vs. 1/3 testing, not stratified). Then the training set is used to train an SVM, which is applied to the 1/3 testing set and additionally to my false negatives set. The SVM outputs an accuracy value, but I wrote a tool that calculates precision and recall by hand, because these values are more interesting:
1 - Precision = False Positive Rate (which is an important factor in SA)
1 - Recall = False Negative Rate (or, consider recall as the detection rate)
I ran this 5 times; the output is attached as a text file, where you will see the exact numbers :)
Taking the mean over the 5 runs:
False positive rate: 0.37908199952036 %
Detection Rate: 99.18104855859372 %
Detection Rate on False Negatives (my SA has 0% on this set): 31.7821782178218 %
One should consider that my dataset might not be 100% accurate. It is combined from my inbox and my spam folder. Of course my spam folder is unlikely to contain ham, but it is surely possible that I forgot to delete one or another false negative from my inbox. I'm looking forward to getting Justin's set :)
> Also - when it identifies messages do the numbers on the spam scores go up and ham goes down? If so that makes it more solid and starves the middle. I'm encouraged that the initial results are good.
What do you mean by that question? I don't really understand it :)
> My feeling is that if this works that it will work better if we have more informational tokens. For example - is the from address a freemail address. Does the message contain a freemail address. By themselves these wouldn't score points. But spam coming from yahoo, hotmail, gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the received lines would be useful. Maybe names of banks in the message would be useful. For example Bank of America + Nigeria = spam.
Yes, this is exactly what I meant above. These tokens are of limited use for SA currently, but an SVM might be able to use them :)
Cheers, Chris
Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 449
nu = 0.144606
obj = -529.640159, rho = -2.227729
nSV = 802, nBSV = 785
Total nSV = 802
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.8896856039713 %
Recall: 99.01585565883 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=
Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 466
nu = 0.147031
obj = -539.132218, rho = -2.297470
nSV = 817, nBSV = 791
Total nSV = 817
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 32.1782% (65/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.6613995485327 %
Recall: 99.2134831460674 %
Results on false negative set:
Precision: 100 %
Recall: 32.1782178217822 %
=
Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 454
nu = 0.146568
obj = -535.034660, rho = -2.187959
nSV = 814, nBSV = 793
Total nSV = 814
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.3834080717489 %
Recall: 99.4391475042064 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=
Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 447
nu = 0.144391
obj = -530.359839, rho = -2.219816
nSV = 802, nBSV = 781
Total nSV = 802
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on
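
The precision and recall figures in the output above can be recomputed by hand from the raw counts. The sketch below is not the author's evaluation tool; the function name is an assumption. It uses the false negative set of the first run, where 64 of 202 spam mails were flagged. Note that 1 - precision, which the post calls the false positive rate, is strictly the share of spam verdicts that were wrong.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# False negative set, first run: 64 of 202 spam mails were recognized.
# The set contains only spam, so there can be no false positives.
precision, recall = precision_recall(tp=64, fp=0, fn=202 - 64)
print(f"Precision: {100 * precision:.4f} %")  # Precision: 100.0000 %
print(f"Recall: {100 * recall:.4f} %")        # Recall: 31.6832 %
```

The recall value matches the 31.6832% that the first run reports for the false negative set.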
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote:
> I suppose what I was thinking was that you still used the SA result but added or subtracted from the SA result based on your SVM code, sort of the way bayes does. Or are you letting SVM make the final determination?
At the moment, I am only using the SVM answer. What you finally do with it is the next step. You can use it like a normal rule and give it a score, of course. You can also use only the SVM, but I think I'll go for the scoring idea :) It would also be possible to use an SVM model that supports confidence/probabilities. So far I have only evaluated the precision/recall of this method on its own, without any scoring.
Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
John Hardin wrote:
> Would there be any benefit to having an offline version - i.e. something that evaluates the log or a corpus to generate new meta rules, that could be added onto the default ruleset? For instance: cron @ 0200: sa_meta_eval /etc/mail/spamassassin/metarules.cf /etc/init.d/spamassassin restart
This is definitely a good idea. You can create the SVM model offline from a logfile only, if it includes the rules that scored and the ham/spam status. However, you cannot generate metarules with SVMs; for that purpose you need a different learning algorithm (for example bayes, or decision trees). That said, SVM classification is very cheap, so once you have created the model offline, you can use it online really quickly with a plugin.
Cheers, Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Justin Mason wrote:
> So you're volunteering to code it up, then? ;)
I was planning to do at least some brainstorming+experiments as to what learning methods would seem suitable and how well the method performs, whenever I have time again. Unless someone else did that already?
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote:
> Justin Mason wrote:
> > So you're volunteering to code it up, then? ;) --j.
> I would if I were any good at perl.
I think we should evaluate if the suggested technique works and performs better or is at least of some benefit, before trying to implement it properly as a plugin. Such a test can be done offline with spam/ham easily... I started writing a script that mines some of my spam and ham, and then I'll evaluate how good the classifiers are that I get.
Cheers, Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
decoder wrote:
> Justin Mason wrote:
> > So you're volunteering to code it up, then? ;)
> I was planning to do at least some brainstorming+experiments as to what learning methods would seem suitable and how well the method performs, whenever I have time again. Unless someone else did that already?
Ok, I did some short experiments: I've built an SVM classifier from a large mail corpus (8226 mails (5414 ham, 2812 spam)) and did a 5-fold cross validation. The resulting classifier has an accuracy of over 99%, so it performs as well as the regular system. Then I applied this to a set of 202 false negatives that I collected, and 69 of these are recognized as spam by the SVM. As a second test, I pulled 2707 mails from one of my other inboxes and applied the classifier; the accuracy was again over 99% (and this is only ham).
From my point of view, the results show that this approach has potential. It is highly accurate with respect to the current system, but additionally outperformed it on several false negatives. There are other advantages that this system has over the common system: It allows everybody to train the whole spamfilter (not only Bayes) on the kind of spam that one receives, i.e. it is more adaptive than the common system.
Any opinions on this are very welcome. Maybe we should try to come up with a proof of concept plugin for SA?
Best regards, Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote:
> LuKreme wrote:
> > On Mar 3, 2009, at 10:06, John Wilcock j...@tradoc.fr wrote:
> > > On 03/03/2009 17:42, Matus UHLAR - fantomas wrote:
> > > > I have been already thinking about possibility to combine every two rules and do a masscheck over them. Then, optionally repeating that again, skipping duplicates. Finally gather all rules that scored >=0.5 || <=-0.5 - we could have interesting ruleset here.
> > > But that's going to be a HUGE ruleset. Not to mention that different combinations will suit different sites. I wonder about the feasibility of a second Bayesian database, using the same learning mechanism as the current system, but keeping track of rule combinations instead of keywords.
> > It sounds like a really good idea to me, and also like the most reasonable way to manage self-learning meta rules.
> It seems to me that the consensus is that it's worth a try. I don't know if it will work or not but I think there's a good chance this could be a significant advancement in how well SA works.
I had exactly the same idea as Marc quite a while ago, but didn't try it (yet) because I didn't have a big corpus of false positives/negatives to test on. Using such a system mainly makes sense if it actually improves the performance, i.e. minimizes false positives and negatives, so one would need to show that it indeed does improve the performance. Apart from that, it should be simple to learn meta rules using machine learning algorithms (e.g. Bayes, or even something more complex, like an SVM), and also reasonably fast once one has the model.
Chris
RBLs and Freemail Forwards
Hello,
on our private mail server we now have quite a few forwards from freemail providers like yahoo, gmx and such. This wasn't a big problem previously, but quite some spam is now arriving over those forwards that isn't tagged as such (mainly, I think, because RBLs can't strike on those). Is there a way to modify the trust path such that I can actually trust the Received header added by the freemailer MTA (so that RBLs can match the Received line which is before the freemailer MTAs)? I wouldn't really want to add all those to trusted hosts (and for yahoo, there are tons of MTAs, it seems).
Thanks in advance, Chris
Re: RBLs and Freemail Forwards
Matt Kettler wrote:
> Nearly all positive-score RBLs will check all untrusted hosts in Received: headers, except the DUL RBLs and XBL which only check the first untrusted because they are designed to be used in that manner. ie: SBL will be tested against *ALL* untrusted hosts, including the IP delivering mail to the freemailer, not just the freemailer itself.
Thanks for the clarification, I thought that all RBLs only hit on the first untrusted host for performance reasons. If that isn't the case, then I'll have to find another way to get rid of that specific spam type which is getting quite annoying.. :D
Best regards, Chris
Re: ocr plugin
Matus UHLAR - fantomas wrote:
> does it push the extracted text back to SA so it could be used by e.g. bayes? This is how it imho should be used. (and imho the same for .pdf and/or .doc - extract text _and_ images from it, call OCR for images...)
That is a question that was asked very frequently around here, which is why I also included it in the FuzzyOcr FAQ: If you take a look at the actual results of the OCR engines used, you'll see that the output suffers from a lot of noise. Hence, it is not suited for common word analysis like bayes, and FuzzyOcr uses a special fuzzy matching algorithm to find the words. Also, the SA plugin architecture is not designed to modify the message in any way, so you cannot push the text back into the normal processing line.
As to image spam in general: Yes, it has dropped dramatically and I actually haven't seen any for quite a long time now. I hope that my tool is one reason that this annoying technique is gone now :D
Best regards, Chris
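
To illustrate why noisy OCR output calls for approximate rather than exact word matching, here is a minimal sketch based on edit distance. It is not FuzzyOcr's actual algorithm, which the message does not describe. The function names, the sample text and the `max_ratio` threshold are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca by cb
        prev = curr
    return prev[-1]


def fuzzy_contains(text: str, word: str, max_ratio: float = 0.25) -> bool:
    """True if any token of `text` is within max_ratio * len(word) edits of `word`."""
    limit = int(len(word) * max_ratio)
    return any(edit_distance(token, word) <= limit for token in text.lower().split())


noisy = "buy v1agra n0w ch3ap"  # made-up OCR output with misread characters
print(fuzzy_contains(noisy, "viagra"))    # True: one substitution away
print(fuzzy_contains(noisy, "mortgage"))  # False: no token is close enough
```

An exact lookup for "viagra" would miss the misread token, while the distance threshold tolerates a bounded number of character errors per word. The same noise is what makes raw OCR text a poor input for token-based learners such as bayes.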
Re: ocr plugin
Theo Van Dinter wrote:
> On Fri, May 02, 2008 at 09:12:12PM +0200, decoder wrote:
> > Also, the SA plugin architecture is not designed to modify the message in any way, so you cannot push back the text into the normal processing line.
> Really? Who says? I made very specific modifications in 3.2 to allow for just that. Search the list archives for post_message_parse.
Ah ok, I was referring to the 3.1.x architecture. I haven't looked at the changes done in 3.2, but if this is technically possible now, then I apologize :D
Best regards, Chris
Re: Returned mail spam
mouss wrote:
> he's not the only one... seems there's a lot of backscatter coming in these days.
I guess the reason is that it is so easy to make a mistake in a mailserver configuration that enables backscatter... We recently discovered that even our own mailserver (Postfix) was a backscatter source (and 1-2 weeks ago spammers started to actively use it). There were several reasons, and I'd like to share these points with the list so nobody makes the same mistakes.
1) With Virtual Domains, the recipient validation is not properly done anymore once you map one virtual domain to another, so do not do that. Also, never use wildcards with domain names unless there is a catch-all defined for this virtual domain entry.
2) By default, Postfix seems to happily accept email addresses referring to subdomains of domains listed in $mydestination. The option responsible for this cruel behavior is parent_domain_matches_subdomains, which is not empty by default. We've set it to an empty string, and after that Postfix finally rejected mails to bogus recipients on our subdomains.
If any of that is wrong, feel free to correct me :)
Best regards, Chris
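
The effect described in point 2 can be pictured with a toy model of domain-list matching. This is not Postfix code. It only mirrors the behaviour that postconf(5) documents for parent_domain_matches_subdomains: for a listed feature, a bare `example.com` pattern also covers subdomains, while otherwise an explicit `.example.com` pattern is required. The function name and the example domains are made up.

```python
from typing import List


def domain_matches(name: str, patterns: List[str], parent_matches_subdomains: bool) -> bool:
    """Toy model: does `name` match any pattern in a domain list?"""
    name = name.lower()
    for pattern in patterns:
        pattern = pattern.lower()
        if name == pattern:
            return True  # exact match always counts
        if parent_matches_subdomains:
            # Feature listed: a bare parent domain also matches its subdomains
            if name.endswith("." + pattern):
                return True
        elif pattern.startswith(".") and name.endswith(pattern):
            # Feature not listed: only an explicit '.domain' pattern matches subdomains
            return True
    return False


destinations = ["example.com"]  # hypothetical domain list
print(domain_matches("bogus.example.com", destinations, True))   # accepted
print(domain_matches("bogus.example.com", destinations, False))  # rejected
```

With the setting emptied, mail for `bogus.example.com` is only accepted if someone deliberately lists `.example.com` or the subdomain itself, which matches the behaviour reported above once the option was set to an empty string.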
Re: Spam abuse report plugin
Eddy Beliveau wrote:
> > - Original Message - From: Michael Scheidell [EMAIL PROTECTED] To: ram [EMAIL PROTECTED]; spamassassin-users users@spamassassin.apache.org Sent: March 27, 2008 10:04 Subject: Re: Spam abuse report plugin
> > > From: ram [EMAIL PROTECTED] Date: Thu, 27 Mar 2008 15:36:04 +0530 To: spamassassin-users users@spamassassin.apache.org Subject: Spam abuse report plugin
> > > I get a lot of spam on my servers which get detected by SA though are generated by innocent mail servers. We see a lot of mail users have insanely simple passwords, spammers are using these accounts and send spam. By the time the administrator realizes the server has sent 1000's of spam
> > So you would spam the abuse@ account '-)
> > > If spamassassin had an option to send abuse report to servers automatically and send mails to abuse@server-admin the moment the first sure spam comes in the admin could be warned before much damage has been done. Obviously we limit to only 1 or 2 reports in an hour to a particular id
> > Best is to set up something to use 'spamassassin -r' (report) feature. Set up a SpamCop account, put that information in local.cf. SpamCop will scan the emails for uri's add them to uri blacklists, add the server to spamcop blacklists, track down the responsible isp, and pre-format a complain email. If you have DCC and RAZOR, it will also submit the information to those databases. NOTE: YOU DO NOT WANT TO AUTOMATICALLY SEND REPORTS AS THIS _WILL_ SPAM INNOCENT, FORGED DOMAINS ADDING TO THE BACKSCATTER PROBLEMS.
> Hi! This subject is very interesting. I receive many spams daily and have to manually analyse headers or email content to be able to send an abuse report. Is there a tool which can do this for me? I imagine some web form (unix/windows) in which I can put a cut/paste of the original email (including headers) and that tool can prepare the abuse complaint automagically. Does that beast exist?
There is a very basic problem with that. You normally report abuse for domains or IPs; however, you do not know the originating IP in most cases, because you cannot trust headers. There might be innocent relays (freemailers for example) in the middle, and you cannot simply pick the first hop, as that one might be forged by spammers. So even determining a reliable source address is something that can hardly be automated.
Best regards, Chris
> Thanks, Eddy
Re: Bye for good FuzzyOCR
David Morton wrote:

> On Jul 22, 2007, at 9:43 AM, arni wrote:
>
>> Loren Wilton wrote:
>>
>>>> I'm not receiving much of it anymore anyway.
>>>
>>> FWIW, about 20% of the spam I got today had either a GIF or PNG image attached to it. Most advertised Viagra in clear text with no obfuscation; a few advertised stocks. FuzzyOCR still does quite well here.
>>>
>>> Loren
>>
>> I'm not saying that it doesn't work well anymore, I'm just saying that I don't need it anymore to bring my spam above 10 points. What happened for me lately was the following: image spam was above 10 points already, so FuzzyOcr didn't run. FuzzyOcr therefore only ran for ham with images, completely wasting resources, so I uninstalled it.
>
> I upgraded a system to SA 3.2, which I see now is not compatible with FuzzyOCR yet. I started getting a bunch of image spam again. :( I wish I had it again...

Try using the SVN version (revision 132). This is basically the same as the latest 3.5.x release, but some issues with SA 3.2.x were fixed.

Best regards,
Chris

> David Morton
> Maia Mailguard http://www.maiamailguard.com
> [EMAIL PROTECTED]
FuzzyOcr and PDF files
Hello all,

because some people insisted on it, I added an experimental feature to FuzzyOcr that allows you to scan PDFs as if they were images. The feature is implemented in the latest SVN revision and is, of course, disabled by default. Personally, I would not use this feature because the risk of false positives on important documents is really high, but if you really want to test it, here are the steps to enable it:

1. Get the dependencies:
   - A netpbm version that includes pstopnm
   - Poppler (http://poppler.freedesktop.org/) for the pdfinfo and pdftops binaries
2. Add those binaries as helper apps in FuzzyOcr.cf (see the .cf file included in SVN).
3. Enable PDF scanning with focr_scan_pdfs 1 in the config. Optionally, PDFs that contain more than x pages can be skipped (focr_pdf_maxpages).

Currently, the parameters for pstopnm are hardcoded (-xsize=1000). There might be better ways or values to translate PDFs into usable, but not too big, pnm files. If you know better ways, tell me. I am also missing recent PDF spam samples (ones that contain images), so if you could upload some samples, that would help as well.

Best regards,
Chris
Re: Which version fuzzyocr
Gary V wrote:

>> Hello,
>> On the fuzzyocr site I see that version 3.5.1 is not SA 3.2.x compatible? Is this true, or can I safely ignore it :-) We have an older server with SA 3.2.0 and FuzzyOcr 2.3b and it works.
>> Greetings,
>> Richard
>
> http://marc.info/?l=spamassassin-users&m=118254092310213

The revision mentioned in this post is the correct one. I am sorry for any confusion; I will make another release soon for 3.2 compatibility. Until then, use the svn checkout command that Gary wrote about in his reply.

As for FuzzyOcr 2.3b, I recommend not using this version anymore, as it has plenty of problems and bugs that remained unfixed because they were design errors.

Best regards,
Chris

> Gary V
Re: FuzzyOCR Use of uninitialized value Hashing.pm errors
Russell Galpin wrote:

> Hi there,
>
> I'm running SA 3.2.1 with the latest version of FuzzyOCR (from svn) and I'm receiving the same error over and over again in my mail logs:
>
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 245.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 248.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 251.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 254.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 257.
> Jun 25 17:25:56 mta1 spamd[629]: Argument isn't numeric in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 245.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 248.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in string eq at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 251.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 254.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 257.
> Jun 25 17:25:56 mta1 spamd[629]: Argument isn't numeric in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.
> Jun 25 17:25:56 mta1 spamd[629]: Use of uninitialized value in numeric eq (==) at /etc/mail/spamassassin/FuzzyOcr/Hashing.pm line 260.
>
> My FuzzyOcr.cf and setup are pretty much stock; I'm sending the mail from spamc to spamd. I've tried sending an email (with a spammy image attached) through spamassassin -D fuzzyocr from the command line and I can't get the error to reproduce itself. It seems to only occur on certain messages.
>
> It appears as though FuzzyOCR is still working: it's scoring messages and writing hashes to the MySQL database. I'm just not sure if it's working as well as it should. Anyone got any ideas where I can track down the problem?

Hi,

I replied on your ticket in our Trac system; you can follow the steps there to get more information.

Best regards,
Chris

> TIA
> Russ
FuzzyOcr SVN version fixes formatting problems with SA 3.1.8 or higher
Hello all,

I've just committed some changes to our SVN that fix the ugly formatting problems that came up with SA 3.1.8 and higher. The new version should display results with proper formatting in the SA report, without screwing up the FuzzyOcr logging output. Thanks to Justin Mason for pointing me to the correct function (test_log) to achieve this :)

For those who want to try the newest version, read http://fuzzyocr.own-hero.net/wiki/Downloads#SVN for information about our SVN. The current SVN version is not very different from the current 3.5.x release, so overwriting a 3.5.x install will work in most cases. Please note, however, that this API has only been tested with SA 3.2.0; I am not sure whether it exists in older versions or in which version the function test_log was introduced. If you know this, please tell me :)

Thanks in advance for testing, and please report problems back to me (only serious bug reports related to the SVN version, no general problems).

Chris
SpamAssassin 3.2 compatibility
Hi all,

after I saw that there are incompatibilities between SA 3.2 and FuzzyOcr, I decided to try to fix them, although I'm still very busy (preparing for my Bachelor thesis). I made changes, and the current SVN version fixes ticket #396 as well as the good old Exporter.pm warnings bug.

There is still another problem, though: the formatting of the rule descriptions changed in SA 3.2, and I can't seem to get it to do the old formatting (listing the words etc.); the wrapper screws it up totally.

I'd be glad if some people could try the SVN version with SA 3.2 and report back. If someone knows how to fix those ugly output formatting problems, tell me. This is a cosmetic issue, but it still looks bad. If there are any other problems that can be fixed quickly, tell me and I'll make those changes in SVN before releasing a bugfix version for 3.2.

Best regards,
Chris
Re: FuzzyOcr 3.5.1- error messages in logs
Frank Bures wrote:

> On Mon, 15 Jan 2007 20:34:38 +0100, Mark Martinec wrote:
>
>> Frank Bures writes:
>>
>>> Since I updated to 3.5.1 from 3.4.2, I am sometimes getting the following:
>>>
>>> FuzzyOcr: Error running preprocessor(pamthreshold): /usr/local/bin/pamthreshold -simple -threshold 0.5
>>> FuzzyOcr: Errors in Scanset ocrad-decolorize
>>> FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number - not a PAM, PPM, PGM, or PBM file
>>>
>>> Any explanation?
>>
>> No, but here is my complaint about the same problem:
>> http://marc.theaimsgroup.com/?l=spamassassin-users&m=116837265504702
>>
>> Mark
>
> In my case the GIFs are spam text GIFs, advertising stock and containing words listed in FuzzyOcr.words. After the above-mentioned error, FuzzyOcr does not trigger at all and the message is scored using the other rules only.
>
> If it would be helpful, please let me know where to post a message triggering this problem.

Send it directly to me in a tar.gz or some other form of archive (the whole message, not only the picture).

Chris

> Thanks
> Frank Bures, Dept. of Chemistry, University of Toronto, M5S 3H6
> [EMAIL PROTECTED]
> http://www.chem.utoronto.ca
> PGP public key: http://pgp.mit.edu:11371/pks/lookup?op=index&search=Frank+Bures
Re: FuzzyOcr 3.5.1 released
Len Conrad wrote:

> With the severe obfuscation of spam images with:
>
> 1) low contrast between f/g and b/g and
> 2) random images/edges in the b/g,
>
> ... how effective is FuzzyOCR in OCR accuracy?

With these two factors, FuzzyOcr 3.5.x does not have many problems:

1) is covered by binarization, for example in ocrad with the -T percent switch.
2) is covered by the pamthreshold/pamditherbw scansets.

However, there are other things that don't work as well, which I won't enumerate here for obvious reasons ;)

Chris

> Len
Re: FuzzyOcr 3.5.1 released
jdow wrote:

> From: Andy Dills [EMAIL PROTECTED]
>
>> On Sun, 7 Jan 2007, Andy Dills wrote:
>>
>>> On Sun, 7 Jan 2007, decoder wrote:
>>>
>>>> Hello all, since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many testers and bug reporters :) so big thanks.
>>>
>>> I have something I'm curious about, having run FuzzyOcr in a medium-size (3-400k messages per day) mail cluster for about a week now.
>>>
>>> Why do you do database maintenance with every unmatched check? From Hashing.pm:
>>>
>>>     unless ($match) {
>>>         my $then = time - ($conf->{focr_db_max_days}*86400);
>>> --->    $sql = qq(select * from $db.$dbfile order by $dbfile.check);
>>>         my $sth = $ddb->prepare($sql);
>>>         $sth->execute;
>>>         while (my @row = $sth->fetchrow_array) {
>>>             my $hash2 = $row[1] || "0:0:0:0";
>>>             $hash2 .= "::$row[0]";
>>>             if (within_threshold($digest,$hash2)) {
>>>                 $txt = 'Approx';
>>>                 $key = $row[0];
>>>                 $next = $row[5] + 1;
>>>                 $when = $row[7] || $now;
>>>                 $ret = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
>>>                 $dinfo = $row[9] || '';
>>>                 infolog("Found[$dbfile]: Score='$row[8]' Info: '$row[9]'");
>>>                 last;
>>>             }
>>>         }
>>>         # Expire old records...
>>> --->    $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
>>>         debuglog($sql,2);
>>>         $ddb->do($sql);
>>>     }
>>>
>>> Those two queries are extremely expensive in a larger environment... I have commented this code segment out on our cluster and have written a quick maintenance script that runs once per day. That dropped the response time from 2-3s to .01-.05s on queries and eliminated the suddenly large and customer-annoying mail queues.
>>
>> Sorry to follow up to my own post, but now that I read this segment a little closer, I realize that I'm basically commenting out the matching capability of the hashing mechanism, eliminating all value of the hashing in the first place.
>>
>> So... I guess my point is, unless there is a better way of determining the match than checking every single hash in the database (hoping that you find one that is close enough along the way), it's more efficient (in larger environments at least) to just scan each mail message without hashing enabled.
>>
>> Thoughts?
>>
>> Andy
>
> Hash the hashes and store them in a suitable tree?

I explained before that you cannot hash the hashes, because a cryptographic hash does not tolerate any variation in its input. Fuzzy matching on such a hash of the actual hash is therefore impossible.

Chris

> {^_^}
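The part of Andy's workaround that does not affect matching, moving only the expiry DELETE out of the per-message path and into a scheduled job, could look roughly like the sketch below. This is an illustration under stated assumptions, not FuzzyOcr code: it uses Python's sqlite3 instead of MySQL, and the table name image_hashes and column last_check are hypothetical stand-ins for the real schema.

```python
import sqlite3
import time


def expire_old_hashes(conn, max_days, now=None):
    """Delete hash records not seen for more than max_days days.

    Intended to run from a daily job rather than on every unmatched
    check. Table and column names are hypothetical.
    """
    now = int(time.time()) if now is None else now
    cutoff = now - max_days * 86400  # same arithmetic as $then in the quoted Perl
    cur = conn.execute(
        "DELETE FROM image_hashes WHERE last_check < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount  # number of expired rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image_hashes (hash TEXT, last_check INTEGER)")
    now = int(time.time())
    conn.executemany(
        "INSERT INTO image_hashes VALUES (?, ?)",
        [("old", now - 40 * 86400), ("fresh", now - 86400)],
    )
    print("expired rows:", expire_old_hashes(conn, max_days=30, now=now))
```

The approximate-match SELECT stays in the scan path in this sketch; as the follow-up in the thread notes, removing it as well disables hash matching entirely.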
Re: Problems with FuzzyOcr 3.5.1
Ed Kasky wrote:

> I just upgraded to 3.5.1 and it seemed that everything was working until I tried using sa-learn on a few messages. Running spamassassin -D --lint produces the following errors:
>
> [22986] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm
> [22986] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
> Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/lib/perl5/5.8.1/Exporter.pm line 60.
>  at /usr/lib/perl5/5.8.1/i686-linux-thread-multi/POSIX.pm line 19
> [22986] dbg: plugin: registered FuzzyOcr=HASH(0x98343b0)
> [22986] dbg: plugin: FuzzyOcr=HASH(0x98343b0) implements 'parse_config'
> [22986] dbg: FuzzyOcr: focr_bin_helper: 'pnmnorm,pnminvert,pamthreshold,ppmtopgm,pamtopnm'
> [22986] info: FuzzyOcr: Adding 5 new helper apps
> [22986] dbg: FuzzyOcr: focr_bin_helper: 'tesseract'
> [22986] info: FuzzyOcr: Adding 1 new helper apps
> [22986] warn: config: failed to parse line, skipping: focr_bin_gifasm /usr/local/bin/gifasm
>   -(this is installed at /usr/local/bin/gifasm)
> [22986] warn: config: failed to parse line, skipping: focr_bin_convert /usr/local/bin/convert
>   -(this is installed at /usr/local/bin/convert)
> [22986] warn: config: failed to parse line, skipping: focr_bin_identify /usr/local/bin/identify
>   -(this is installed at /usr/local/bin/identify)

Hi,

it seems you are trying to define tools here that are no longer needed in FuzzyOcr 3.5. Neither ImageMagick (identify/convert) nor gifasm is required by FuzzyOcr anymore. Please read the dependencies carefully and look at the shipped FuzzyOcr.cf file. It contains everything that can be defined :)

Best regards,
Chris

> I did not install the following executables but left them commented out in the cf. Am I correct in assuming that I need to install them as well as the others that were required for previous versions? If that's the case, I read that pamthreshold is part of the newer releases of Netpbm, but which version? I looked at 10.33 and it's not there.
>
> [22986] warn: FuzzyOcr: Cannot find executable for gifsicle
> [22986] warn: FuzzyOcr: Cannot find executable for ocrad
> [22986] warn: FuzzyOcr: Cannot find executable for pamthreshold
> [22986] warn: FuzzyOcr: Cannot find executable for tesseract
>
> Thanks for any help on this one. FuzzyOcr has been a great addition to the arsenal...
>
> Ed Kasky
> ~
> Randomly Generated Quote (104 of 522):
> If you know how to spend less than you get, you have the philosopher's stone. --Benjamin Franklin
FuzzyOcr 3.5.1 released
Hello all,

since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many testers and bug reporters :) so big thanks. The version now seems stable enough to replace the 3.4.x branch, and I recommend that everyone upgrade to it :)

For those who don't know yet what's new in the 3.5 branch, read the changelog here: http://fuzzyocr.own-hero.net/wiki/Changelog-3.x#version3.5.0

You can download version 3.5.1 at http://fuzzyocr.own-hero.net/wiki/Downloads

For those upgrading from 3.4.x or even 2.3b: please read the installation manual carefully, as the 3.5.x branch is very different from earlier branches.

Unfortunately, I haven't had the time yet to create a FAQ, so if you run into problems, try searching our ticket system and our mailing list archives first. If you still can't solve the problem, please use our mailing list to get help.

Please DO NOT use the ticket system to get help for your problems. The ticket system is meant for bug reports, not for support requests. If you think you've found a bug, feel free to create a ticket. The same applies to errors or missing statements in the documentation.

Best regards,
Chris
Re: FuzzyOcr 3.5.1 released
Giampaolo Tomassoni wrote:

> From: decoder [mailto:[EMAIL PROTECTED]
>
>> Hello all, since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many testers and bug reporters :) so big thanks.
>
> Excellent work. Thank you for your efforts in bringing it to us.
>
> Anyway, I'm wondering why the image hashing is done that way, leading to:
>
> 1) a variable-length key and
> 2) possibly even a very long one (depending on focr_hash_max).
>
> This is pretty inefficient to handle on SQL backends and, in fact, FuzzyOcr.mysql must define the key columns as varchar(255)...

If I had more time I'd develop a better hashing system, but I don't :(

> I see that the problem is due to the way the hash is calculated in FuzzyOcr/Hashing.pm:
>
> <code-snip>
> my $cnt = 0;
> my $c = scalar(@stdout_data);
> my $s = (stat($pfile))[7] || 0;
> $hash = sprintf "%d:%d:%d:%d",$s,
>     defined $pic->{height} ? $pic->{height} : 0,
>     defined $pic->{width} ? $pic->{width} : 0,
>     $c;
> if ($Threshold{max_hash}) {
>     foreach (@stdout_data) {
>         $_ =~ s/ +/ /g;
>         my(@d) = split(' ', $_);
>         $hash .= sprintf("::%d:%d:%d:%d:%d",@d);
>         if ($cnt++ ge $Threshold{max_hash}) { last; }
>     }
> }
> </code-snip>
>
> Why not use some form of digest? For example, something like this would be more interesting to me:
>
> <code-snip>
> my $cnt = 0;
> my $c = scalar(@stdout_data);
> my $s = (stat($pfile))[7] || 0;
> $hash = sprintf "%d:%d:%d:%d",$s,
>     defined $pic->{height} ? $pic->{height} : 0,
>     defined $pic->{width} ? $pic->{width} : 0,
>     $c;
> if ($Threshold{max_hash}) {
>     use Digest;
>     my $hctx = Digest->new('MD5');
>     my $clrcnt = 0;
>     foreach (@stdout_data) {
>         my(@d) = split(/ +/, $_);
>         $hctx->add(pack('CCCN', $d[0], $d[1], $d[2], $d[4]));
>         if (++$clrcnt >= $Threshold{max_hash}) { last; }
>     }
>     $hctx->add(pack('N', $clrcnt));
>     $hash .= '::' . $hctx->hexdigest;
> }
> </code-snip>
>
> This basically creates a digest of the first (most frequent) $Threshold{max_hash} palette colors instead of simply enumerating them. The output will be around 40-45 characters and will stay at this length regardless of the value of the focr_hash_max setting.
>
> Please note I'm neither a Perl wizard nor an SA developer, so there is room for optimization here. For example, Digest->new('MD5') could probably be defined and initialized globally, with a $hctx->reset issued whenever a new digest has to be computed.
>
> What are your thoughts?

The point is, if you use a digest, then you need an exact match, no matter whether you digest the image directly or any of the parameters, because digests are designed not to accept any tolerance. But the FuzzyOcr matching algorithm depends on accepting tolerance; the hashes are never matched 100% exactly. That is why hashing any of the parameters will not work. Generally, any hashing algorithm is acceptable for FuzzyOcr as long as it has tolerance built in. Spammers never send the same pictures around; they are generated on the fly.

Chris

> Regards,
> giampaolo
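The difference between a digest and a tolerant comparison can be shown with a small sketch. It assumes a fingerprint laid out like the string built in the quoted Hashing.pm snippet (a size:height:width:count header followed by ::-separated five-number groups); the sample values, the 5% tolerance, and the function names are invented for illustration, and within_tolerance is not FuzzyOcr's within_threshold.

```python
import hashlib


def parse_fingerprint(fp):
    """Split 'a:b:c:d::e:f:g:h:i::...' into a list of integer tuples."""
    return [tuple(int(v) for v in part.split(":")) for part in fp.split("::")]


def within_tolerance(fp_a, fp_b, tolerance=0.05):
    """Return True if every numeric field differs by at most `tolerance`
    relative to the larger value (hypothetical matching rule)."""
    a, b = parse_fingerprint(fp_a), parse_fingerprint(fp_b)
    if len(a) != len(b):
        return False
    for group_a, group_b in zip(a, b):
        if len(group_a) != len(group_b):
            return False
        for x, y in zip(group_a, group_b):
            if abs(x - y) > tolerance * max(x, y, 1):
                return False
    return True


def md5_hex(text):
    """Cryptographic digest of the fingerprint string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Two fingerprints of "the same" spam image after slight regeneration
# (made-up numbers).
original = "5120:300:400:12::255:255:255:0:90000"
variant = "5130:300:400:12::255:255:254:0:90210"

if __name__ == "__main__":
    print("tolerant match:", within_tolerance(original, variant))
    print("digests equal: ", md5_hex(original) == md5_hex(variant))
```

The tolerant comparison accepts the slightly changed variant, while the two MD5 digests have nothing in common that a lookup could exploit, which is the property Chris describes.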
Re: Any modules use String::Approx?
Robert Nicholson wrote:

> Are there any plugins that use String::Approx as used by FuzzyOCR but used to match non-image spam?

Not that I know of, but it would definitely be possible. The only problems are words that are too similar to spammy words, and spammy words contained in normal words (e.g. specialist vs. cialis). Such cases could be handled by adjusting the threshold on a per-word basis and excluding some words.

What do others think?

Chris
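As a rough sketch of the per-word threshold and exclusion idea, the following uses a plain Levenshtein distance on whole tokens. It is not String::Approx and not how FuzzyOcr matches words; the word list, thresholds, and function names are assumptions made for the example.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def find_spammy_words(text, thresholds, excluded=frozenset()):
    """Return (token, matched_word) pairs.

    thresholds maps each spammy word to its own maximum edit distance;
    tokens listed in `excluded` are never reported.
    """
    hits = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?")
        if not token or token in excluded:
            continue
        for word, max_dist in thresholds.items():
            if edit_distance(token, word) <= max_dist:
                hits.append((token, word))
    return hits


if __name__ == "__main__":
    thresholds = {"cialis": 1, "viagra": 2}  # hypothetical per-word limits
    text = "Our specialist sells cia1is and v1agra cheap"
    print(find_spammy_words(text, thresholds))
```

Comparing whole tokens keeps "specialist" (distance 4 from "cialis") below the radar with a limit of 1, while a substring search would find "cialis" inside it. The exclusion set covers the remaining cases where a legitimate word sits within a spammy word's threshold.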
Re: Error in FuzzyOcr 3.5.x branch
Jim Knuth wrote:

> Today (28.12.2006, 05:10), Gary V wrote:
>
>>>> Jim, I have been working on a doc for Debian. It is unfinished but may help you through some rough spots at this point. I have no idea when I'll have time to finish it. I have had 3.5.0-rc1 running for two days now (works great).
>>>> http://www200.pair.com/mecham/spam/image_spam2.html
>>>>
>>>> Gary V
>>>
>>> mmh, sorry. But the same game.
>>>
>>> spamassassin --lint
>>> Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /usr/lib/perl/5.8/POSIX.pm line 19
>>> Subroutine FuzzyOcr::debuglog redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 24
>>> Subroutine FuzzyOcr::parse_config redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 25
>>> Subroutine FuzzyOcr::check_image_hash_db redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 40
>>> Subroutine FuzzyOcr::add_image_hash_db redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 40
>>> Subroutine FuzzyOcr::calc_image_hash redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 40
>>> Subroutine FuzzyOcr::wrong_ctype redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 42
>>> Subroutine FuzzyOcr::corrupt_img redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 42
>>> Subroutine FuzzyOcr::known_img_hash redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 42
>>> Subroutine FuzzyOcr::max redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>>>  at /etc/mail/spamassassin/FuzzyOcr.pm line 43
>>> [2769] warn: Subroutine new redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 48.
>>> [2769] warn: Subroutine dummy_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 59.
>>> [2769] warn: Subroutine fuzzyocr_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 63.
>>> [2769] warn: config: failed to parse line, skipping: focr_end_config
>>> [2769] warn: lint: 1 issues detected, please rerun with debug enabled for more information
>>>
>>> --
>>> Kind regards,
>>
>> And what version of SpamAssassin are you running?
>
> SpamAssassin version 3.1.7
> running on Perl version 5.8.4
>
>> Did you move all the old stuff out of the way and remove the loadplugin entry in v310.pre?
>
> no. ;) That was it! Now I only get:
>
> spamassassin --lint
> Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/share/perl/5.8/Exporter.pm line 65.
>  at /usr/lib/perl/5.8/POSIX.pm line 19
>
> What is that still?

This is not a FuzzyOcr problem but a Perl core problem. Two core Perl modules export the same constant(s) (in this case O_NONBLOCK). You can safely ignore this. Upgrading Perl might remove this warning.

Chris

>> The 2.3 doc had you comment out the loadplugin directive in FuzzyOcr.cf and add one to v310.pre. This doc does not do that.
>>
>> Gary V
Re: Despeckling images for OCR and anti-spam purposes
Kelly Jones wrote:

> Spammers are starting to put speckles in their images to defeat OCR-scanning plugins such as FuzzyOCR.

Which images are you referring to? If you can put up a sample, then I can tell you which scanner setting will catch it :)

Best regards,
Chris

> I thought ImageMagick's -despeckle option would help, but it doesn't seem to, not even when applied multiple times, not even in conjunction with -monochrome.
>
> I want a filter that does this for each pixel X:
>
> 1) if any of X's 8 neighbor pixels is the same color, turn X black
> 2) otherwise, turn X white
>
> Can some combination of options to convert do this?
>
> I realize that:
>
> 1. This will only work w/ indexed-color images (e.g. GIFs) and not JPEGs, etc.
> 2. Spammers will soon work around this, so this is just a short-term bandage.
> 3. I could write something in libgd to do this (blech!)
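The per-pixel rule Kelly describes can be written down literally as in the sketch below. It works on a plain grid of palette indices rather than on a real image file; the grid representation and the BLACK/WHITE constants are assumptions for illustration, and this is neither an ImageMagick option nor part of FuzzyOcr.

```python
BLACK = 0
WHITE = 1


def isolate_speckles(pixels):
    """Apply the rule from the message to a 2-D grid of color indices.

    A pixel becomes BLACK if at least one of its (up to) 8 neighbours has
    the same color, and WHITE otherwise.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            has_same_neighbour = any(
                pixels[ny][nx] == pixels[y][x]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            row.append(BLACK if has_same_neighbour else WHITE)
        result.append(row)
    return result


if __name__ == "__main__":
    # A single off-color speckle (5) in a uniform area (0).
    grid = [
        [0, 0, 0],
        [0, 5, 0],
        [0, 0, 0],
    ]
    for line in isolate_speckles(grid):
        print(line)
```

Taken literally, the rule turns every connected region black, uniform background included, so only isolated pixels end up white. A usable despeckle step would probably need to treat background and text differently afterwards.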
Re: Despeckling images for OCR and anti-spam purposes
Kenneth Porter wrote:

> --On Saturday, December 23, 2006 12:43 PM +0100 decoder [EMAIL PROTECTED] wrote:
>
>> Which images are you referring to? If you can put up a sample, then I can tell you which scanner setting will catch it :)
>
> Does the SA wiki support uploading of images? Perhaps we could have a page of just problem images. Such a page is likely to grow large and consume a lot of bandwidth, so perhaps we could get a resource that thumbnails them and runs them through the Coral Cache.

I'm not sure about the SA wiki, but you can create a ticket for it on our side and attach the picture :) Maybe I can also create a wiki page on our site that allows uploading/appending of images. You can find the site at fuzzyocr.own-hero.net.

Chris
Re: FuzzyOcr questions
Ronnie Tartar wrote:

> I have a Qmail Toaster setup. I have everything working except FuzzyOcr. Should it have information in the header about being scanned? Here is a header, but I don't see the FuzzyOcr plugin working:
>
> X-Spam-Status: Yes, score=10.5 required=1.0 tests=EXTRA_MPART_TYPE,
>     HTML_IMAGE_ONLY_16,HTML_MESSAGE,HTML_SHORT_LINK_IMG_2,INVALID_TZ_GMT,
>     MY_CID_AND_ARIAL2,MY_CID_AND_CLOSING,MY_CID_AND_STYLE,
>     MY_CID_ARIAL2_CLOSING,MY_CID_ARIAL_STYLE,SARE_GIF_ATTACH,
>     SARE_GIF_STOX autolearn=no version=3.1.7
> X-Spam-Report:
>     * 1.1 INVALID_TZ_GMT Invalid date in header (wrong GMT/UTC timezone)
>     * 0.8 EXTRA_MPART_TYPE Header has extraneous Content-type:...type= entry
>     * 0.0 HTML_MESSAGE BODY: HTML included in message
>     * 0.6 HTML_IMAGE_ONLY_16 BODY: HTML: images with 1200-1600 bytes of words
>     * 0.8 SARE_GIF_ATTACH FULL: Email has a inline gif
>     * 1.1 MY_CID_ARIAL_STYLE SARE cid arial2 style
>     * 1.0 HTML_SHORT_LINK_IMG_2 HTML is very short with a linked image
>     * 0.9 MY_CID_AND_CLOSING SARE cid and closing
>     * 0.7 MY_CID_AND_STYLE SARE cid and style
>     * 0.7 MY_CID_AND_ARIAL2 SARE CID and Arial2
>     * 1.2 MY_CID_ARIAL2_CLOSING SARE cid arial2 closing
>     * 1.7 SARE_GIF_STOX Inline Gif with little HTML
>
> spamassassin -D --lint shows fuzzyocr loading?
>
> [2261] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm
> [2261] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
> [2261] dbg: plugin: registered FuzzyOcr=HASH(0xb305908)
> [2261] dbg: plugin: FuzzyOcr=HASH(0xb305908) implements 'parse_config'
> [2261] dbg: FuzzyOcr: Option verbose = 1
> [2261] dbg: FuzzyOcr: Option logfile = /etc/mail/spamassassin/FuzzyOcr.log
> [2261] dbg: FuzzyOcr: Option global_wordlist = /etc/mail/spamassassin/FuzzyOcr.words
> [2261] dbg: FuzzyOcr: Valid search path: /usr/local/bin
> [2261] dbg: FuzzyOcr: Valid search path: /usr/bin
> [2261] dbg: config: allowing user rules!
> [2261] dbg: plugin: Mail::SpamAssassin::Plugin::ReplaceTags=HASH(0xac9d804) implements 'finish_parsing_end'
> [2261] dbg: plugin: FuzzyOcr=HASH(0xb305908) implements 'finish_parsing_end'
> [2261] dbg: replacetags: replacing tags
> [2261] dbg: replacetags: done replacing tags
> [2261] dbg: FuzzyOcr: Using gifsicle = /usr/bin/gifsicle
> [2261] dbg: FuzzyOcr: Cannot find executable for giffix
> [2261] dbg: FuzzyOcr: Cannot find executable for giftext
> [2261] dbg: FuzzyOcr: Cannot find executable for gifinter
> [2261] dbg: FuzzyOcr: Cannot find executable for giftopnm
> [2261] dbg: FuzzyOcr: Cannot find executable for jpegtopnm
> [2261] dbg: FuzzyOcr: Cannot find executable for pngtopnm
> [2261] dbg: FuzzyOcr: Cannot find executable for bmptopnm
> [2261] dbg: FuzzyOcr: Cannot find executable for tifftopnm
> [2261] dbg: FuzzyOcr: Cannot find executable for ppmhist
> [2261] dbg: FuzzyOcr: Cannot find executable for pamfile

Can't you read? You need to tell the plugin where those binaries are located if they are not in the standard locations. Did you even satisfy the dependencies?

Chris

> [2261] dbg: FuzzyOcr: Using gocr = /usr/local/bin/gocr
> [2261] dbg: FuzzyOcr: Using ocrad = /usr/local/bin/ocrad
> [2261] dbg: FuzzyOcr: Loaded 49 words from /etc/mail/spamassassin/FuzzyOcr.words
> [2261] dbg: FuzzyOcr: Using scan: $gocr -i $pfile
> [2261] dbg: FuzzyOcr: Using scan: $gocr -l 180 -d 2 -i $pfile
>
> Any help would be greatly appreciated.
>
> Thanks
Re: fuzzyocr slowing up my server
pinoyskull wrote:

> decoder wrote:
>
>> pinoyskull wrote:
>>
>>> I've been using the FuzzyOcr plugin for some time now, and I have noticed its high CPU/memory usage, which results in delayed delivery of mail. The server is serving 2000+ clients. It is a P4 2.6GHz with 1GB memory running FreeBSD 6.0. Should I upgrade the memory to 2GB or 4GB? Will that fix the problem?
>>
>> You need to give more details about the version of FuzzyOcr and the configuration. There are plenty of ways to lower the resource usage of FuzzyOcr effectively.
>>
>> Best regards,
>> Chris
>
> Hi Chris,
>
> I'm using Spamassassin 3.17 and FuzzyOCR version 2.3b. I just edited the log location and the helper programs; the rest are all defaults.

First of all, that FuzzyOcr version is quite old. I recommend 3.4.x, but most of what I tell you now also works with 2.3b.

a) Try to use ocrad instead of gocr. Ocrad is known to be less resource intensive. In 2.3b, you need to write your own ocrad scansets; 3.4.2 already includes examples of and support for ocrad.
b) Use hashing. It decreases the number of OCR scans actually done (use the MLDBM database stuff).
c) Set the autodisable score setting to 5 (or whatever is enough at your place to count as spam) to minimize the number of mails scanned.

I am aware of the resource problem, and 3.5 includes lots of enhancements which decrease resource usage even more, but currently only 3.5rc1 is available. It is still being tested by me and others (but runs stably at the moment with all patchsets applied). If you want, you can have a look at it as well (available on our download page) :)

Best regards,
Chris

>>> I need your input guys. Thanks and Merry Christmas.
Re: FuzzyOCR hashdb tagging commonly-used images like spacer.gif as spam
Kelly Jones wrote:
> We turned on FuzzyOCR's experimental hashdb function, but had to turn it off again after it tagged the following images (hashes) as spam:
>
> 8:1:1:1::1:1:1:1:1
> 14:1:1:1::0:0:0:0:1
>
> These appear to be spacer.gif-like images: small images commonly used in HTML messages for formatting purposes. Has anyone else run into this issue? Related questions:
>
> 1. How does FuzzyOCR compute an image hash? Skimming FuzzyOcr.pm shows this isn't a SHA1/MD5 of the image, but instead depends on ppmhist and identify (ImageMagick)?

The hash is a feature fingerprint. The exact algorithm isn't disclosed to make it harder for spammers.

> 2. How do I FuzzyOCR-hash a given image? The naive way fails:
>
> perl -le 'require FuzzyOcr.pm; ($foo, $bar) = FuzzyOcr::calc_image_hash(filename.gif); print $foo,$bar'

Since version 3.4.x, there is a tool fuzzy-find which can search for hashes in the db and manage it. It will also show the image hash.

> 3. If a spammer attaches 1 spam image + 5 good images and the message gets flagged as spam, do all *six* images get entered into the hashdb? The log files imply so. Would this explain why commonly-used images are in the hashdb?

It was like that in 2.3b, but was changed in 3.4.x. All images are still hashed and saved, but the word count is saved per image, so including good images won't do much: they won't have any counts and therefore won't cause a score later...

If you are still running the old version, I'd recommend an upgrade. If you still have problems with good images recognized as bad by hashes, write me and I'll see if the code needs any change :)

Best regards,
Chris
Re: Why don't my Fuzzyocr see some mails which has spam text in a jpeg file ?
Halid Faith wrote:
> I use spamassassin3.1.7 and fuzzyocr3.4.2. Fuzzyocr usually works well. Yet it can't see some mails which contain a jpeg. Therefore fuzzyocr doesn't give them any score as FUZZY_OCR.

Does the jpeg sample file provided within the tarball work? If that's the case, isolate a mail that didn't work with FuzzyOcr, and run it from the command line with debugging enabled. This isn't necessarily a bug; there are some small spam jpegs that aren't recognized well with the standard word list (namely 2 that I know of that need custom words).

Best regards,
Chris

> Here's my Fuzzyocr.cf

Your scanset line contains a small error:

$ocrad -s 0.5 -T 0.5 $pfile

should be

$ocrad -s 5 -T 0.4 $pfile

That will provide better results, just as a tweak.

Best regards,
Chris

> body FUZZY_OCR eval:fuzzyocr_check()
> describe FUZZY_OCR Mail contains an image with common spam text inside
> body FUZZY_OCR_WRONG_CTYPE eval:dummy_check()
> describe FUZZY_OCR_WRONG_CTYPE Mail contains an image with wrong content-type set
> body FUZZY_OCR_CORRUPT_IMG eval:dummy_check()
> describe FUZZY_OCR_CORRUPT_IMG Mail contains a corrupted image
> body FUZZY_OCR_KNOWN_HASH eval:dummy_check()
> describe FUZZY_OCR_KNOWN_HASH Mail contains an image with known hash
> priority FUZZY_OCR 900
>
> ### Plugin Configuration
> # Logging options
> #
> # Verbosity level (see manual) Attention: Don't set to 0, but to 0.0 for quiet operation, or comment out the focr_logfile line. (Def
> focr_verbose 2.0
> #
> # Logfile (make sure it is writable by the plugin) (Default value: NONE)
> focr_logfile /usr/local/etc/mail/spamassassin/FuzzyOcr.log
> ##
> # Wordlists
> #
> # Here we define the words to scan for (Default value: /etc/mail/spamassassin/FuzzyOcr.words)
> focr_global_wordlist /usr/local/etc/mail/spamassassin/FuzzyOcr.words
> #
> # This is the path RELATIVE to the respective home directory for the personalized list
> # This list is merged with the global word list on execution (Default value: .spamassassin/fuzzyocr.words)
> # If focr_personal_wordlist begins with '/', treats option as fixed path and does not search HOME
> #focr_personal_wordlist .spamassassin/fuzzyocr.words
> #
> # These parameters can be used to change other detection settings
> # If you leave these commented out, the defaults will be used.
> # Do not use quotes around any parameters!
> #
> # Location of helper applications (path + binary) (Default values: /usr/bin/app)
> #
> focr_bin_gifsicle /usr/local/bin/gifsicle
> focr_bin_giffix /usr/local/bin/giffix
> focr_bin_giftext /usr/local/bin/giftext
> focr_bin_gifinter /usr/local/bin/gifinter
> focr_bin_giftopnm /usr/local/bin/giftopnm
> focr_bin_jpegtopnm /usr/local/bin/jpegtopnm
> focr_bin_pngtopnm /usr/local/bin/pngtopnm
> focr_bin_bmptopnm /usr/local/bin/bmptopnm
> focr_bin_tifftopnm /usr/local/bin/tifftopnm
> focr_bin_ppmhist /usr/local/bin/ppmhist
> focr_bin_gocr /usr/local/bin/gocr
> focr_bin_ocrad /usr/local/bin/ocrad
> #
> focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
> #
> # Scansets, comma separated (Default value: $gocr -i -, $gocr -l 180 -d 2 -i -)
> #
> # Each scanset consists of one or more commands which make text out of pnm input.
> # Each scanset is run separately on the PNM data, results are combined in scoring.
> #focr_scansets $gocr -i $pfile, $gocr -l 180 -d 2 -i $pfile
> #
> # An example that involves ocrad as well
> focr_scansets $gocr -i $pfile, $gocr -l 180 -d 2 -i $pfile, $ocrad -s 0.5 -T 0.5 $pfile
> #
> # Another one for ocrad only
> #focr_scansets $ocrad -s 0.5 -T 0.5 $pfile
> #
> # To use only one scan with default values, uncomment the next line instead
> #focr_scansets $gocr -i $pfile
Re: How can I add to FuzzyOcr.hashdb manually a mail which contains spam text in gif/jpeg.
Halid Faith wrote:
> I use spamassassin3.1.7 and fuzzyocr-2.3b. It usually works well. However, fuzzyocr can't see some mails which contain spam in gif/jpeg, so it doesn't give them any score as FUZZY_OCR. I want to add these mails to FuzzyOcr.hashdb manually. How can I do that?

That isn't possible with the version you are using. 3.4.x is the first branch which introduced fuzzy-find, a tool to manage the database. Using 3.4.2 + the tools from the SVN (only they have the command line switches --learn-spam and --learn-ham), you can also add your own hashes to the MLDBM database by passing the picture to the tool.

Best regards,
Chris

> Thanks.
Re: Released patchset 2
Hi, can you provide me the message which triggered the 2 warnings + the error? Also, are your files unchanged or did you add any scanset/preprocessor?

Chris
Re: Released patchset 2
Ignore that msg... wasn't meant to go here, sorry :)
Re: Botnet 0.6 plugin for Spam Assassin availabile
John Rudd wrote:
> Michael Schaap wrote:
>> John Rudd wrote:
>>> The next version of the Botnet plugin for Spam Assassin is ready. The install instructions are in the Botnet.txt file, and in the INSTALL file.
>>
>> Great work!
>>
>>> To Do before 1.0: (...)
>>
>> There's another thing that would be really nice to have. You know how the DNS rules' descriptions specify what actually matches? e.g.:
>>
>> 3.9 RCVD_IN_XBLRBL: Received via a relay in Spamhaus XBL [12.34.56.789 listed in sbl-xbl.spamhaus.org]
>> 1.6 URIBL_SBL Contains an URL listed in the SBL blocklist [URIs: example.com]
>>
>> It would be great if Botnet could do something similar, like:
>>
>> 2.0 BOTNET The submitting mail server looks like part of a Botnet [ip=12.34.56.789 rdns=dhcp12.34.example.org]
>
> Any tips on how to do that? :-}

Have a look at the FuzzyOcr plugin, especially at Scoring.pm in the SVN, found here:

http://fuzzyocr.own-hero.net/browser/trunk/devel/FuzzyOcr/Scoring.pm

In each of the functions, the mail is scored with a different rule, a custom score and a custom description which is generated there. That should be enough for you to reproduce that :)

Chris
Re: FuzzyOcr helper apps
Robert Fitzpatrick wrote:
> I have two gateways that filter using amavisd-new and SA 3.1.7 with the FuzzyOcr recipes used. On one of these FreeBSD servers, all the helper applications are present, but on the other, they're all missing. I just now realized this after a while and do not remember where those helper apps, like giffix, come from. All packages on both systems were installed using the FreeBSD ports system. Can someone give me a pointer? Can I merely copy over the missing helper apps?

http://fuzzyocr.own-hero.net/wiki/OSSpecificNotes

At the bottom is a link to a FreeBSD tutorial, I'm sure it lists what you need :)

Chris

> Thanks in advance!
Re: Installed FuzzyOCR - What am I missing?
Evan Platt wrote:
> Installed FuzzyOCR on my OS X box per http://fuzzyocr.own-hero.net/wiki/Installation-3.x . Based on my reading of it, I don't need to do anything other than put the FuzzyOcr.cf file in my spamassassin directory (which on my install is /private/etc/opt/mail/spamassassin/ ). So I have FuzzyOcr.cf and FuzzyOcr.pm (chmod +x'd). The relevant (AFAICT) parts of the .cf are:
>
> loadplugin FuzzyOcr FuzzyOcr.pm
> body FUZZY_OCR eval:fuzzyocr_check()
> describe FUZZY_OCR Mail contains an image with common spam text inside
> body FUZZY_OCR_WRONG_CTYPE eval:dummy_check()
> describe FUZZY_OCR_WRONG_CTYPE Mail contains an image with wrong content-type set
> body FUZZY_OCR_CORRUPT_IMG eval:dummy_check()
> describe FUZZY_OCR_CORRUPT_IMG Mail contains a corrupted image
> body FUZZY_OCR_KNOWN_HASH eval:dummy_check()
> describe FUZZY_OCR_KNOWN_HASH Mail contains an image with known hash
> focr_personal_wordlist ./spamassassin/FuzzyOcr.words
>
> (.words is in the same directory). I then ran spamassassin < animated-gif.eml > out. out shows no FuzzyOCR hits. Am I missing something obvious? If I'm not providing enough details, please let me know.

You should try to run spamassassin with -D to see more debug output. Watch out for FuzzyOcr lines :)

Best regards,
Chris

> Thanks.
> Evan
Re: Installed FuzzyOCR - What am I missing?
Evan Platt wrote:
> At 11:34 AM 11/28/2006, you wrote:
>> You should try to run spamassassin with -D to see more debug output. Watch out for FuzzyOcr lines :)
>
> Didn't think of that.. :) Ok, did that. Only a few lines have Fuzzy:

I forgot to tell you that you also need to increase the verbosity factor of the plugin:

focr_verbose 2

will make sure that you see more (i.e. everything ;))

Best regards,
Chris

> [554] dbg: config: read file /etc/opt/mail/spamassassin/FuzzyOcr.cf
> [554] dbg: plugin: fixed relative path: /etc/opt/mail/spamassassin/FuzzyOcr.pm
> [554] dbg: plugin: loading FuzzyOcr from /etc/opt/mail/spamassassin/FuzzyOcr.pm
> [554] dbg: plugin: FuzzyOcr=HASH(0x1d0e4a4) implements 'parse_config'
>
> Nothing there looks like a problem? I put the entire debug session at http://www.espphotography.com/sadebug.txt
>
> Thanks.
> Evan
Re: Installed FuzzyOCR - What am I missing?
Evan Platt wrote:
> At 12:37 PM 11/28/2006, you wrote:
>> I forgot to tell you that you also need to increase the verbosity factor of the plugin: focr_verbose 2 will make sure that you see more (i.e. everything ;))
>>
>> Best regards,
>
> Did that, reran spamassassin -D < animated--gif.eml > out, same results :(

Did you specify a logfile? If not, do so and check for output there :)

Best regards,
Chris
Re: Fuzzy OCR - first time user
Marc Perkel wrote:
> OK - trying out the FuzzyOCR plugin. So far it's all the default stuff with a minimal installation. I'm running Fedora Core 6. Used the gocr RPM and didn't patch the source. Everything is default and it doesn't seem to be complaining so far. If I like this, what do I need to change to really do it right? Should I grab the devel code? Do I really need the gocr patch? Should I tweak the scores? What do the hard core users change?

My suggestion for the FuzzyOcr version is 3.4.x, since it is a lot better. I also recommend enabling image hashing, which is disabled by default.

About the patch for gocr: I highly suggest building it from source because I don't know if Fedora Core 6 has the proper bindings to netpbm compiled with gocr. Redhat does not. That leads to a dramatic decrease in effectiveness. Also, the patch prevents segmentation faults with some pictures, and afaik, this bug still hasn't been fixed.

The scores normally do not need changing, unless you get serious problems with FPs..

And what do the hardcore users change? lol... well, experienced users have different scansets; for example, they invoke ocrad instead of gocr in their scansets because it runs faster and recognizes better in most situations. In the shipped config file, there is an example of a scanset which includes ocrad. (If you want to try it out, make sure to read the notes about the config file on the FuzzyOcr download page, as the ocrad scanset contains a small typo which should be fixed first :))

Finally, if you run into problems, try our mailing list at http://lists.own-hero.net/mailman/listinfo/devel-spam

Best regards,
Chris
Re: FuzzyOCR words file
Marc Perkel wrote:
> The words file needs a little documentation. Is it limited to single words or phrases too? What's with the colon and the numbers after the word?

Phrases are possible too; spaces and numbers are stripped out in both the wordlist and the OCR output before matching :)

The colon + the number after it indicates a custom matching threshold for this word. The default threshold is defined in the FuzzyOcr.cf, but it makes sense to override this setting for some specific words which often trigger FPs with the default threshold.

Best regards,
Chris
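The word-list rules described above (optional per-word threshold after a colon, spaces and numbers stripped before matching) can be sketched roughly as follows. This is not the plugin's parser: the function names, the `#` comment handling, the lowercasing, and the default of 0.25 (taken from a debug log elsewhere in this archive) are assumptions made for illustration, and the exact separator syntax should be checked against the shipped FuzzyOcr.words.

```python
import re

DEFAULT_THRESHOLD = 0.25  # assumed default; the real one comes from FuzzyOcr.cf


def normalize(text):
    """Strip whitespace and digits, as described for both word list and OCR output.

    Lowercasing is an extra assumption so that matching is case-insensitive.
    """
    return re.sub(r"[\s\d]+", "", text.lower())


def parse_entry(line, default_threshold=DEFAULT_THRESHOLD):
    """Parse one word-list line into (normalized_word, threshold).

    Returns None for blank lines and '#' comments (comment syntax is assumed).
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    word, sep, tail = line.rpartition(":")
    if sep:
        try:
            # A numeric tail after the last colon is a custom threshold.
            return normalize(word.rstrip(":")), float(tail)
        except ValueError:
            pass  # tail was not a number: treat the whole line as the word
    return normalize(line), default_threshold


if __name__ == "__main__":
    for raw in ["viagra:0.2", "stock alert 2006", "# a comment"]:
        print(repr(raw), "->", parse_entry(raw))
```

Because the same normalization is applied to the OCR output, a phrase such as "stock alert" is matched as one run of letters, which is why phrases work at all.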
Re: image exception with FuzzyOCR??
Sietse van Zanen wrote:
> Of course, save the image, calculate the hash and then use the fuzzy-find.pl script to delete it from the bad hash db. Next you'll have to use a little trick to get it into the good hash db, as that's not possible from the fuzzy-find.pl script. Simply make an empty word list and yank the image through FuzzyOcr again. It'll put it into the known good db.

It is planned to include this feature, it is really something that is missing... maybe I'll hack it up right now and release it :)

Regards,
Chris

> -Sietse
>
> *From:* Thiago LPS [mailto:[EMAIL PROTECTED]
> *Sent:* Friday, November 17, 2006 18:25
> *To:* users@spamassassin.apache.org
> *Subject:* image exception with FuzzyOCR??
>
> Hello everybody... is there a way to make an exception for some image that isn't SPAM... but that FuzzyOCR thinks is a spam image?? I really don't want to disable the Hashdb...
Re: image exception with FuzzyOCR??
Thiago LPS wrote:
> On 11/17/06, Sietse van Zanen [EMAIL PROTECTED] wrote:
>> To be more exact, the procedure would be:
>> 1. Save the image file, and the message
>> 2. Calculate the hash and delete it from the bad hash db with the fuzzy-find.pl script
>
> In the body of the mail marked as spam, I have the hash value... so I removed this hash from the hashdb... It happened because I hadn't yet applied the patch to only include pictures matched as pic-spam in the hash db.. After I removed the hash and applied the patch, the picture wasn't included in the hash db anymore.. but the question is: even with the patch applied, if some good picture gets included in the hashdb, nothing is better than a white-hashdb to solve it.. :D I'm no expert with perl.. but it doesn't sound difficult to do.. :D

I'm not sure if I understand you correctly, but FuzzyOcr 3.x already has a whitelist hashdb :)

And for all the others, I just checked in revision 40, which contains a modified fuzzy-find script, to be found at http://fuzzyocr.own-hero.net/browser/trunk/devel/Utils/fuzzy-find

Please note that this is bleeding edge. If you want to try it out, go for it, but back up the database first in case something breaks... The script now features --learn-spam and --learn-ham, which will manually add the hash of a given image file, i.e.

fuzzy-find --learn-ham somepic.gif

Best regards,
Chris

>> 3. Create an empty wordlist, or fill it with some bogus words that don't appear in the image
>> 4. Update the FuzzyOcr.cf file to point to the new wordlist. If you're using spamd don't restart, it'll keep using the correct wordlist. Otherwise you might want to stop incoming mail for a little while.
>> 5. Pipe the message through FuzzyOcr.pm directly, it'll put the hash into the known good db.
>> 6. Correct the config. (and restart maild).
>> 7. Send in a feature request to update the fuzzy-find.pl script to insert hashes into a db.
>> ;-)
>> -Sietse
>>
>> *From:* Sietse van Zanen [mailto:[EMAIL PROTECTED]
>> *Sent:* Friday, November 17, 2006 20:09
>> *To:* Thiago LPS; users@spamassassin.apache.org
>> *Subject:* RE: image exception with FuzzyOCR??
>>
>> Of course, save the image, calculate the hash and then use the fuzzy-find.pl script to delete it from the bad hash db. Next you'll have to use a little trick to get it into the good hash db, as that's not possible from the fuzzy-find.pl script. Simply make an empty word list and yank the image through FuzzyOcr again. It'll put it into the known good db.
>> -Sietse
>>
>> *From:* Thiago LPS [mailto:[EMAIL PROTECTED]
>> *Sent:* Friday, November 17, 2006 18:25
>> *To:* users@spamassassin.apache.org
>> *Subject:* image exception with FuzzyOCR??
>>
>> Hello everybody... is there a way to make an exception for some image that isn't SPAM... but that FuzzyOCR thinks is a spam image?? I really don't want to disable the Hashdb...
>
> --
> Thiago LPS
> C.E.S.A.R - Systems Administrator
> msn: [EMAIL PROTECTED]
> 0xx 81 8735 2591
Re: Linked images in e-mail
John D. Hardin wrote:
> On the FuzzyOCR list (devel-spam) there was a question about OCR of remote images vs. embedded images. I asked there but didn't think to ask here: Does SA check URIBLs on IMG tags with remote sources? e.g. <IMG src="http://known.spammer.com/gibberish.jpg">

Yes, it seems to do this. I just searched for an email in my spam folder that was caught by URIBL, took the url, edited another HTML mail and inserted an <img src="theurl/blub.jpg">, and then ran it through SA again. It listed theurl in the URIBL results as well :)

Regards,
Chris

> --
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
New FuzzyOcr Development Release (3.4.x)
Hello all,

for those that are not on the devel-spam mailing list, I'd like to announce a new development release here. If you are interested, our new website is located at http://fuzzyocr.own-hero.net/

The branch has been tested by me and some other people and seems to be very stable so far. This should be especially interesting for users of 2.3j or 2.3b who want to participate in testing.

Major changes are:

For users of 2.3j:
- Logging facility was fixed, you can specify a logfile again without getting the SA debug output into the logfile
- New animated gifs are all deanimated properly now
- No ImageMagick dependency anymore
- Improved utilities for the hash database a bit
- Ocrad support

For users of 2.3b:
- See http://www.joval.info/proj/FuzzyOcr-2.3j/CHANGES for changes between 2.3b -> 2.3j, then read the above changes.

The main reason for this release was to give users a version which also catches recent animated spam types, but also to show that we are still alive ;)

Another major development branch is planned (3.5); it will hopefully be the last release before we release a new version labeled as stable with more features. The main features which are planned for 3.4 -> 3.5 are:
- Splitting FuzzyOcr into multiple .pm files for better maintenance
- Config switches to disable scanning of specific extensions (like tiff... many people don't want this)
- Maximum image size and dimensions in configuration
- autodisable_score also for a minimum score, so messages which are already considered ham aren't scanned anymore

If you have more feature requests, ideas, bugs, or anything, please create a ticket on the mentioned website :)

Best regards,
Chris
Re: FuzzyOcr problem (Re: Relay Checker plugin v0.2)
John Rudd wrote:
> decoder wrote:
>> John Rudd wrote:
>>> D.J. wrote:
>>>> On 11/10/06, Patrick Sneyers [EMAIL PROTECTED] wrote:
>>>>> I get this warning:
>>>>>
>>>>> plugin: failed to create instance of plugin Mail::SpamAssassin::Plugin::RelayChecker: Can't locate object method new via package Mail::SpamAssassin::Plugin::RelayChecker at (eval 26) line 1.
>>>>>
>>>>> (This is my own build of SA 3.1.7 on Mac OS X Server 10.4 ppc)
>>>>> It seems to work OK though:
>>>>> * 3.0 RELAY_CHECKER RELAY: badrdns
>>>>> (I lowered the score)
>>>>>
>>>>> Patrick Sneyers
>>>>> Belgium
>>>>
>>>> I also received some weirdness. When linting in debug mode, I found the following lines that seem to indicate that RelayChecker isn't playing nicely with FuzzyOCR:
>>>>
>>>> [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm
>>>> [28058] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
>>>> [28058] dbg: plugin: registered FuzzyOcr=HASH(0x9d04570)
>>>> [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config'
>>>> [28058] dbg: FuzzyOcr: Option logfile = /home/amavis/.spamassassin/FuzzyOcr.log
>>>> [28058] dbg: FuzzyOcr: Found scan: $gocr -i $pfile
>>>> [28058] dbg: FuzzyOcr: Found scan: $gocr -l 180 -d 2 -i $pfile
>>>> [28058] dbg: FuzzyOcr: Found scan: $gocr -l 140 -d 2 -i $pfile
>>>> [28058] dbg: FuzzyOcr: Option threshold = 0.25
>>>> [28058] dbg: FuzzyOcr: Score{autodisable} = 10.01
>>>> [28058] dbg: FuzzyOcr: Option counts_required = 3
>>>> [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/RelayChecker.pm
>>>> [28058] dbg: plugin: loading RelayChecker from /etc/mail/spamassassin/RelayChecker.pm
>>>> [28058] dbg: plugin: registered RelayChecker=HASH(0x9d94a80)
>>>> [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config'
>>>> [28058] dbg: plugin: RelayChecker=HASH(0x9d94a80) implements 'parse_config'
>>>> [28058] dbg: FuzzyOcr: unknown Score: relaychecker_score
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_nordns
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_badrdns
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_baddns
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_ipinhostname
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_dynhostname
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_clienthostname
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_ip
>>>> [28058] dbg: FuzzyOcr: unknown Option: relaychecker_pass_auth
>>
>> Ok, that really doesn't look nice... is the fault on our (FuzzyOcr's) side?
>
> Yes.
>
>> If so, then maybe someone can explain to me what the correct way would be to fix this :)
>
> When you encounter an option you don't own (i.e. it's not a FuzzyOcr option), then parse_config should return 0.
>
>> If you could verify that this also applies to the latest development version (3.4.1), then that would be nice
>
> Yup, I found this in your 3.4.1 code (my comments indicate the issues):

Thank you very much for the work, I will patch this into our SVN version and the 3.4.x devel branch right now.

Best regards
Chris

> sub parse_config {
>     my ( $self, $opts ) = @_;
>
>     # this is good: you're restricting yourself to ^focr_bin_ keys
>     if ( $opts->{key} =~ /^focr_bin_/i ) {
>         my $p = lc $opts->{key};
>         $p =~ s/focr_bin_//;
>         if (grep {m/$p/} @bin_utils) {
>             $App{$p} = $opts->{value};
>             debuglog("App{$p} = $App{$p}");
>         } else {
>             debuglog("unknown App: $opts->{key}");
>         }
>         # you should tell SA you processed this config option:
>         #$self->inhibit_further_callbacks();
>     }
>     # this is bad: you're processing _score configs that may not belong to
>     # FuzzyOcr. A better statement might be:
>     #elsif (($opts->{key} =~ /^focr_/i) && ($opts->{key} =~ m/_score$/i)) {
>     # that way you're only processing _score configs that belong to focr
>     elsif ( $opts->{key} =~ m/_score$/i ) {
>         my $o = lc $opts->{key};
>         $o =~ s/focr_//;
>         $o =~ s/_score//;
>         if (grep {m/$o/} @pgm_scores) {
>             $Score{$o} = $opts->{value};
>             debuglog("Score{$o} = $Score{$o}");
>         } else {
>             debuglog("unknown Score: $opts->{key}");
>         }
>         # again, inhibit further callbacks here:
>         #$self->inhibit_further_callbacks();
>     }
>     # same as above: now you're taking ANY key, from ANY plugin, and handling
>     # it. Bad bad bad. This should be changed to:
>     #elsif ($opts->{key} =~ /^focr_/i) {
>     else {
>         my $o = lc $opts->{key};
>         $o =~ s/focr_//;
>         if (grep {m/$o/} @pgm_opts) {
>             if ($o eq 'scansets') {
>                 @scansets = (); # remove
>                 foreach my $s (split(',',$opts->{value})) {
>                     $s =~ s/^\s*//;
>                     $s =~ s/\s*$//;
>                     push @scansets,$s;
>                     debuglog("Found scan: $s");
>                 }
>             } elsif ($o eq 'path_bin') {
>                 @paths = (); # remove
>                 foreach my $p (split
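The rule John describes (handle only keys carrying your own prefix, and report everything else as "not mine") can be shown in isolation. The sketch below is a Python stand-in for the idea, not SpamAssassin's Perl plugin API and not the patch that went into FuzzyOcr: the function signature, the `settings` dictionary and its layout are all hypothetical.

```python
PREFIX = "focr_"


def parse_config(key, value, settings):
    """Consume a config option only if it belongs to this plugin.

    Returns True when the option was handled, False otherwise so that the
    option stays available to whichever plugin really owns it.
    """
    key = key.lower()
    if not key.startswith(PREFIX):
        return False  # e.g. relaychecker_score: not ours, leave it alone

    name = key[len(PREFIX):]
    if name.startswith("bin_"):
        # focr_bin_gocr -> settings["bin"]["gocr"]
        settings.setdefault("bin", {})[name[len("bin_"):]] = value
    elif name.endswith("_score"):
        # focr_autodisable_score -> settings["score"]["autodisable"]
        settings.setdefault("score", {})[name[: -len("_score")]] = float(value)
    else:
        settings.setdefault("option", {})[name] = value
    return True


if __name__ == "__main__":
    settings = {}
    for k, v in [
        ("focr_bin_gocr", "/usr/local/bin/gocr"),
        ("focr_autodisable_score", "10.01"),
        ("relaychecker_score", "3.0"),
    ]:
        print(k, "->", "handled" if parse_config(k, v, settings) else "ignored")
    print(settings)
```

With the prefix check done first, a foreign key such as relaychecker_skip_ip never reaches the "unknown Option" branch, which is exactly the noise visible in the debug log.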
Re: FuzzyOCR
sokka wrote:
> Hi, can anyone post me a URL or PDF with clear documentation of FuzzyOcr?

The current URL for FuzzyOcr is http://fuzzyocr.own-hero.net/

The page (wiki) is still quite under construction, but you'll find installation instructions inside the tarball (you can try version 3.4.1 if you want; it performs better than the stable version 2.3b, it just hasn't been tested as long yet..). Installation itself is not hard if you have all the dependencies installed :)

If you need further assistance, check out our list at http://lists.own-hero.net/mailman/listinfo/devel-spam

Once I get more time, I will also be able to do more work on the wiki :)

Best regards,
Chris

> thanks in advance
Re: Questions about FuzzyOCR
Pascal Maes wrote:
> Version 2.3b
>
> 1) Here is the output of the scanner (gocr -i):
>
> _ date Informations 9- 11-lO061O_30 Le __ek-end du 3-4r'11, les adresses de cou r_er jlectron_que des jtud_ants non ri_nscmts j _UCL ont jtj ddsact_vjes. La ra_son est pÄrement adm_n_strat_ve et I_je j Ia caNe j puce. Pour permeNre j ces jtud_ants de rjcupdrer leurs messaqes, nous avons fa_t en soNe qu'_Is pu_ssent encore accjder j leur boîte aux leNres jusqu'au l4.r l 1 ,/lo 06 . ANent_on, la consuttat_on se fera av_ un cI_ent de messager_e !Thunderb_rd. Eudora, Outlook.. .7 ou v_a le _IebMa_I ma_s plus v_a le poNa_I .
>
> We get almost the same result with gocr -l 180 -d 2 -i
>
> And FuzzyOcr says:
>
> 13 FUZZY_OCR BODY: Mail contains an image with common spam text inside
> Words found:
> wexe in 3 lines
> alert in 2 lines
> alert in 2 lines
> investor in 1 lines
> trade in 3 lines
> (11 word occurrences found)
>
> But I don't find any of these words in the text above!

You can try lowering your fuzz from 0.3 to 0.2. I don't have any experience so far with how the plugin reacts to text in other languages, so this might produce false positives.

> 2) How do I remove an image which has been stored by mistake in the hash database?

In version 2.3b, this is not possible yet with a tool, unfortunately. But the database is only a textfile, so you can simply search for the hash there and delete the line. Version 3.4.1 brings a tool that removes a given hash from the database, but I am still improving it a bit, so one can also pass it an image file to look for.

Best regards,
Chris

> Thanks
> --
> Pascal
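Why lowering the fuzz from 0.3 to 0.2 helps can be seen with a toy model of relative-threshold matching. FuzzyOcr's real matching code is not shown in this thread, so everything here is an assumption: the sketch uses plain Levenshtein distance and accepts a match when the distance is at most `threshold * len(word)`. The French token "inventer" is just a made-up example of innocent text that sits close to a spam word.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def fuzzy_match(word, token, threshold):
    """Accept token as a hit for word if the distance is small relative to the word."""
    return edit_distance(word, token) <= threshold * len(word)


if __name__ == "__main__":
    word, token = "investor", "inventer"
    print("distance:", edit_distance(word, token))
    for fuzz in (0.3, 0.2):
        print(f"fuzz {fuzz}: match={fuzzy_match(word, token, fuzz)}")
```

In this model an 8-letter word tolerates two wrong characters at 0.3 but only one at 0.2, so noisy OCR output in another language has fewer chances to land near a list word. The price is that heavily obfuscated spam words are also missed more often.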
Re: Questions about FuzzyOCR
decoder wrote:
> Pascal Maes wrote:
>> 2) How do I remove an image which has been stored by mistake in the hash database?
>
> In version 2.3b, this is not possible yet with a tool, unfortunately. But the database is only a textfile, so you can simply search for the hash there and delete the line. Version 3.4.1 brings a tool that removes a given hash from the database, but I am still improving it a bit, so one can also pass it an image file to look for.

I must correct myself there, passing it an image is already supported :)

Best regards,
Chris
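For the 2.3b case, where the hash database is described as a plain text file, the "search the hash and delete the line" step could look roughly like this. The file layout is not documented in this thread, so the sketch simply drops every line that contains the hash string and keeps a backup copy first. The function name and the `.bak` suffix are made up, and substring matching can over-match (a short hash may be contained in a longer one), so check the result by hand.

```python
import shutil


def remove_hash(db_path, image_hash):
    """Remove all lines containing image_hash from a text hash database.

    A copy of the original file is written to db_path + '.bak' first.
    Returns the number of lines removed.
    """
    shutil.copy2(db_path, db_path + ".bak")  # backup before touching the db
    with open(db_path, encoding="utf-8") as fh:
        lines = fh.readlines()
    kept = [line for line in lines if image_hash not in line]
    with open(db_path, "w", encoding="utf-8") as fh:
        fh.writelines(kept)
    return len(lines) - len(kept)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        removed = remove_hash(sys.argv[1], sys.argv[2])
        print(f"removed {removed} line(s); backup written next to the database")
    else:
        print("usage: remove_hash.py <hashdb file> <hash>")
```

On 3.4.x and later the fuzzy-find tool is the better route, since the database there is no longer a simple text file.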
Re: FuzzyOcr problem (Re: Relay Checker plugin v0.2)
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 John Rudd wrote: D.J. wrote: On 11/10/06, Patrick Sneyers [EMAIL PROTECTED] wrote: I get this warning: plugin: failed to create instance of plugin Mail::SpamAssassin::Plugin::RelayChecker: Can't locate object method new via package Mail::SpamAssassin::Plugin::RelayChecker at (eval 26) line 1. (This is my own build of SA 3.1.7 on Max OS X Server 10.4 ppc) It seems to work OK though: * 3.0 RELAY_CHECKER RELAY: badrdns (I lowered the score) Patrick Sneyers Belgium I also received some weirdness. When linting in debug mode, I found the following lines that seem to indicate that RelayChecker isn't playing nicely with FuzzyOCR: [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/FuzzyOcr.pm [28058] dbg: plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm [28058] dbg: plugin: registered FuzzyOcr=HASH(0x9d04570) [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config' [28058] dbg: FuzzyOcr: Option logfile = /home/amavis/.spamassassin/FuzzyOcr.log [28058] dbg: FuzzyOcr: Found scan: $gocr -i $pfile [28058] dbg: FuzzyOcr: Found scan: $gocr -l 180 -d 2 -i $pfile [28058] dbg: FuzzyOcr: Found scan: $gocr -l 140 -d 2 -i $pfile [28058] dbg: FuzzyOcr: Option threshold = 0.25 [28058] dbg: FuzzyOcr: Score{autodisable} = 10.01 [28058] dbg: FuzzyOcr: Option counts_required = 3 [28058] dbg: plugin: fixed relative path: /etc/mail/spamassassin/RelayChecker.pm [28058] dbg: plugin: loading RelayChecker from /etc/mail/spamassassin/RelayChecker.pm [28058] dbg: plugin: registered RelayChecker=HASH(0x9d94a80) [28058] dbg: plugin: FuzzyOcr=HASH(0x9d04570) implements 'parse_config' [28058] dbg: plugin: RelayChecker=HASH(0x9d94a80) implements 'parse_config' [28058] dbg: FuzzyOcr: unknown Score: relaychecker_score [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_nordns [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_badrdns [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_baddns [28058] dbg: 
FuzzyOcr: unknown Option: relaychecker_skip_ipinhostname [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_dynhostname [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_clienthostname [28058] dbg: FuzzyOcr: unknown Option: relaychecker_skip_ip [28058] dbg: FuzzyOcr: unknown Option: relaychecker_pass_auth Ok that really doesn't look nice... is the fault on our (FuzzyOcr's) side? If so, then maybe someone can explain me what the correct way would be to fix this :) If you could verify that this also applies to the latest development version (3.4.1), then that would be nice Best regards, Chris That would seem to me to indicate that FuzzyOcr isn't returning the proper code when it finds an option it doesn't own. It should be returning 0 if it's not a FuzzyOcr option. -BEGIN PGP SIGNATURE- Version: GnuPG v1.2.2 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFVPhzJQIKXnJyDxURAqXUAKC0gAy2TH0JvheiRuAGdcEV/y+7sACgn4tL VhzEF71Q2wCP5gI87DiTYtg= =geVq -END PGP SIGNATURE-
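The suggestion that a plugin should return 0 for options it does not own can be sketched with a toy model. This is not the real Mail::SpamAssassin::Plugin interface; the class and the dispatch function are hypothetical, and only the option names are taken from the debug log above.

```python
class ToyPlugin:
    """Stand-in for a plugin; not the real SpamAssassin plugin API."""

    def __init__(self, name, known_options):
        self.name = name
        self.known_options = set(known_options)
        self.settings = {}

    def parse_config(self, key, value):
        # Decline quietly (return 0) when the option belongs to someone else,
        # instead of reporting it as an unknown option.
        if key not in self.known_options:
            return 0
        self.settings[key] = value
        return 1


def dispatch_option(plugins, key, value):
    """Offer the option to each plugin in turn; return the owner's name or None."""
    for plugin in plugins:
        if plugin.parse_config(key, value):
            return plugin.name
    return None


fuzzy = ToyPlugin("FuzzyOcr", ["threshold", "counts_required", "logfile"])
relay = ToyPlugin("RelayChecker", ["relaychecker_skip_nordns", "relaychecker_skip_ip"])
owner = dispatch_option([fuzzy, relay], "relaychecker_skip_ip", "1")
```

In this model the first plugin simply passes on options it does not recognise, so the second plugin still gets to handle them and nothing is logged as unknown.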
Re: ocrtext vs FuzzyOCR?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 James Lay wrote: On Mon, 30 Oct 2006 07:19:44 -0800 Jeff Chan [EMAIL PROTECTED] wrote: Does anyone have any opinions on which of these is better: http://wiki.apache.org/spamassassin/CustomPlugins OCR scanner and image validator SA-plugin Checks for specific keywords in gif/jpg/png attachments, using gocr. This can be used to detect spam that puts all the real contect in an attached image, accompanied with random text and html (no URL's, etc). There are also various rules to validate attached images and detect forged content types or broken images. This plugin needs SpamAssassin 3.1.1 or later. The version 2.0 is able to defeat recent gif animations which use gif tricks to avoid OCR. Created by: Martin Blapp Contact: mb -at- imp -dot- ch License Type: BSD Status: active Available at: [WWW] http://antispam.imp.ch/patches/patch-ocrtext Note: Feedback and new sample images are welcome. Please test and send reports. Fuzzy OCR Plugin Derived from OcrPlugin (see above), but has many feature enhancements, including an approximate matching algorithm to compensate recognition errors and obfuscation, support for broken gifs, jpeg and png, dynamic scoring, automatic content-type independant format detection and many more. Created by: Christian Holler Contact: decoder_at_own-hero_dot_net License Type: Same as SpamAssassin Status: active Available at: FuzzyOcrPlugin Note: Feedback and new sample images are welcome. Please test and send reports. Jeff C. -- I'd like to see something on this myself. The segfault patch for Fuzzy OCR failed, so I stopped right there as I wasn't sure what to do next. This is no patch for FuzzyOcr but for gocr. You will need it with every OCR plugin that uses gocr... 
It should work with version 0.40 Best regards, Chris James -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFRiU2JQIKXnJyDxURAhB4AJ4vDRdlck+1I0D0HSNu0AFikgn13QCffOyi 0Tq0HJJvW7lrUGUKEKwX/EE= =xWpz -END PGP SIGNATURE-
Re: This image is turning frequent..
Anders Norrbring wrote:
> This type of image spam is getting more common, and is not detected. At least not here.

Yes, this picture is indeed hard to detect. I'd need a black box like this:

Input: animated GIF of any kind
Output: non-animated GIF which shows what the user will see

But that is a difficult task considering how many things are possible with the GIF standard. This picture uses offsets and slow frame rates, others use transparency etc. A simple way to block these images would be to scan the GIF for offset frames. I don't think there is any valid GIF which makes use of these techniques...

Best regards,
Chris
Re: FuzzyOCR/SpamAssassin questions
Bill wrote:
> I just installed FuzzyOCR and have questions about 2 things:
>
> 1) I am getting the following errors in the fuzzy.log file. Are these something I should be concerned about? I have verbose enabled.
>
> FuzzyOcr received timeout after running 10 seconds.
> Unexpected error in pipe to external programs. Please check that all helper programs are installed and in the correct path. (Pipe Command /usr/bin/giftopnm -, Pipe exit code 1 (), Temporary file: /tmp/.spamassassin4571WcpHuTtmp)

Does that happen with all pictures you scan or only with particular ones?

> 2) The server I installed this on handles a fairly small volume of email each day. I would like to install it on a higher-volume mail server but am afraid FuzzyOCR would overload it. Both servers are Xeon 2.6 with about 2 gig of RAM. I use MimeDefang and the latest SpamAssassin. Is there something I can adjust in FuzzyOCR to make it more efficient?

Yes, there are several options. The first one is the image hashing db: it saves image features as a kind of hash, so that gocr does not have to run twice on the same type of image. If that is not enough, you can also change the default scansets to contain only 2 or 1 gocr scans.

> I have the Priority setting set to 900. I have MimeDefang set to reject emails that score over 14. Can I set FuzzyOCR/SpamAssassin to NOT scan the graphics if the email already scores over 14 and will be rejected anyway?

Yes, this is even the default behavior, but the default auto_disable is 10. You can set it to 14 if you want by modifying the corresponding config entry :)

Best regards,
Chris

> Bill
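The two cost savers mentioned above, the auto-disable score and the image hash db, can be sketched as follows. This is only a model of the control flow: the function and parameter names are hypothetical, and the SHA-256 digest is a stand-in that matches identical files only, whereas the message describes a hash built from image features.

```python
import hashlib


def score_image(image_bytes, current_score, known_hashes, run_ocr,
                autodisable_score=10.0):
    """Return (extra_score, reason) while avoiding OCR runs where possible."""
    # 1. The mail already scores high enough: do not scan at all.
    if current_score > autodisable_score:
        return 0.0, "skipped"
    # 2. Image seen before: reuse the stored score instead of running OCR.
    key = hashlib.sha256(image_bytes).hexdigest()  # stand-in for a feature hash
    if key in known_hashes:
        return known_hashes[key], "known hash"
    # 3. Otherwise pay for the OCR run and remember spammy images.
    score = run_ocr(image_bytes)
    if score > 0:
        known_hashes[key] = score
    return score, "ocr"


ocr_calls = []


def fake_ocr(image_bytes):
    """Pretend OCR engine that records how often it was invoked."""
    ocr_calls.append(image_bytes)
    return 5.0


db = {}
first = score_image(b"image-1", 2.0, db, fake_ocr)
second = score_image(b"image-1", 2.0, db, fake_ocr)
third = score_image(b"image-2", 15.0, db, fake_ocr, autodisable_score=14.0)
```

In this run the OCR stand-in is called once: the second call is answered from the hash table and the third is skipped because the mail already scores above the limit of 14.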
Re: FuzzyOCR request
Duncan Hill wrote:
> On Wednesday 04 October 2006 22:23, Alan Munday wrote:
>> I've been following your developments and looking at how to integrate with my (few) systems. But as I don't have a test environment (until I have built a VMWare one) I was cautious about trying this on one of the live boxes. Zero scoring seemed to be a good way round this.
>
> SA treats a 0-scored rule as a rule that must not be run.

FuzzyOcr does not use a standard rule to score but does the scoring itself. But to the main subject: I haven't tried it out, but to achieve a zero score, you could also try setting the configurable scores to 0, or to a very small amount. Also, I recommend using only 2.3b in production environments :)

Best regards,
Chris

> Score at 0.01 and the rule will fire, but should have a non-significant impact.
Re: FuzzyOCR seems to not like gif and png
Loren Wilton wrote:
> There are newer versions of FuzzyOCR that probably fix or at least get around this. A lot of image spam mails have broken images in them, and this messes up a lot of stuff. The latest versions use ImageMagick. This is reputedly hard to install on many systems. But if you can get it installed it seems to work much better in terms of the images that it can handle.
>
> You might want to join the FuzzyOCR mailing list:
>
> List-Id: devel-spam.lists.own-hero.net
> List-Unsubscribe: http://lists.own-hero.net/mailman/listinfo/devel-spam, mailto:[EMAIL PROTECTED]
> List-Archive: http://lists.own-hero.net/mailman/private/devel-spam
> List-Post: mailto:[EMAIL PROTECTED]
> List-Help: mailto:[EMAIL PROTECTED]
> List-Subscribe: http://lists.own-hero.net/mailman/listinfo/devel-spam, mailto:[EMAIL PROTECTED]
>
> If you search the list archive you will see a number of posts on the current release and where to get it. I think the current version is something like J.

The current version is b. J is a devel version, as are all versions higher than b. Please keep that in mind when trying out these versions. A new stable version will follow soon, once I have the time again.

Chris

> Loren
Re: Stock spam in images
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Theo Van Dinter wrote: On Mon, Oct 02, 2006 at 03:18:58PM +0100, Randal, Phil wrote: undetected). Wouldn't it be better to inject the detected text back to SA? There should be enough variants of spam worlds to let SA fuzzily catch the ones from images. I think so. Some of the words would be perfectly legitimate in the text of emails but rarely found in attached legitimate images. Quite apart from the fact that Spamassassin isn't designed for reinjection. FWIW, 3.2 adds in support to have rendering of non-text parts. So a plugin could, for instance, OCR text from an image, and then the normal body rules and such would be able to use that information. This sounds great. Once I am back to continue the developing process of FuzzyOcr, I might add an option to pass the text back to SA. Combined with a new, more precise OCR engine like tesseract, this will probably work very well. Unfortunately, there is currently a lot of picture spam being sent around which won't be caught at all by FuzzyOcr because they use new obfuscation technics with animated gifs etc and I don't have the time atm to adjust the plugin to these... Best regards Chris -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFIVIfJQIKXnJyDxURAlIlAKCCcaD5O43KmvAHUxcew85d7cE82wCgwbGG NAd6j8vgv1pvV9zVBN+5oqE= =LB3n -END PGP SIGNATURE-
Re: Stock spam in images
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Randal, Phil wrote: This has been covered so many times on this list. 1: if you're not on spamassassin 3.1.5 get it now, and run sa-update (via a cron job daily, but test first with a manual sa-update -D) 2: pop over to http://www.rulesemporium.com and get an appropriate selection of their rules, and configure Rules du Jour ( http://www.exit0.us/index.php?pagename=RulesDuJour ) to download them daily. 3: don't forget the additional rules here: http://www.rulesemporium.com/other-rules.htm I've found Fred's header rules helpful 4: add the ImageInfo plugin from http://www.rulesemporium.com/plugins.htm 5: if you want to be adventurous, make sure you have ImageMagick, ImageMagick-perl and other prerequisites installed and use the FuzzyOCR plugin ( latest version at http://www.joval.info/proj/FuzzyOcr.html , but see also http://wiki.apache.org/spamassassin/FuzzyOcrPlugin ). The FuzzyOCR mailing list is very helpful too. What do you mean with adventurous? Those versions published by joval are all devel. The stable version is available at http://users.own-hero.net/~decoder/fuzzyocr/ and works fine. There is nothing adventurous about them and the prerequisites are also lower than for the devel stuff. I am simply not able to continue development at the moment, but maybe in a few weeks, I'll start again. Best regards, Chris In my experience here a well-trained Bayes plus the various RulesEmporium rulesets gets most of them. Cheers, Phil -- Phil Randal Network Engineer Herefordshire Council Hereford, UK -Original Message- From: Dylan Bouterse [mailto:[EMAIL PROTECTED] Sent: 02 October 2006 14:38 To: users@spamassassin.apache.org Subject: Stock spam in images I'm a newbie to the list and have been scanning recent posts to see if what I'm about to ask about has been covered but I haven't seen anything yet. 
Lately I have been getting more and more of the stock alert spam but now all the good info is in an image and typically following the image is random text to fool the Bayesian filter. I think the random text thing has been covered here recently. It's frustrating when sa is giving a -1.6 (or so) score to these emails right off the bat. Quite a few of these aren't even getting spam headers because they aren't scoring high enough. Is there some magical trick to help score these messages higher? Maybe a future version of sa will incorporate an OCR module? :) Dylan -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFIVpDJQIKXnJyDxURAoTiAJ0SS12lfncMkv/vaLpPX2dscSMkWwCfftby uosbxGicE+jBtHgaYCd0Klc= =RRVE -END PGP SIGNATURE-
FuzzyOcr development/support stop for 7 weeks
Hello all,

since I will have a very tight schedule in the next 7 weeks for a project at the university, I will not be able to release any new versions of FuzzyOcr, fix bugs, reply to questions or give support. Instead of writing to me, you can write to either this mailing list or the devel-spam mailing list, and other people will try to answer your questions. Moderator privileges for the devel-spam mailing list will be given to some people who helped with the development earlier.

Best regards,
Chris
Strange SPF problem/wrong result
Hello,

today I saw a strange SPF bug occurring. The original mail header was:

Return-Path: [EMAIL PROTECTED]
Received: from mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200])
    by wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP id k7T8rU6P012050;
    Tue, 29 Aug 2006 10:53:30 +0200
Received: from mail-eur1.microsoft.com (mail-eur1.microsoft.com [213.199.128.139])
    by mail.cs.uni-sb.de (8.13.8/2006081400) with ESMTP id k7T8rT98004989;
    Tue, 29 Aug 2006 10:53:29 +0200 (CEST)
Received: from x.europe.corp.microsoft.com ([65.53.193.xxx])
    by mail-eur1.microsoft.com with Microsoft SMTPSVC(6.0.3790.1830);
    Tue, 29 Aug 2006 09:53:29 +0100

(Some unrelated privacy details replaced with xxx.)

Now what SPF should do is (as far as I understand):

- Get the mail server that sent this (mail-eur1.microsoft.com)
- Check that its IP is in the allowed SPF record of microsoft.com

This check passes, as you can see here: http://www.dnsstuff.com/tools/spf.ch?server=microsoft.comip=213.199.128.139

SpamAssassin did something else: it took mail.cs.uni-sb.de as the sending mail server and tried to match it against microsoft.com's SPF records, which produced a SOFTFAIL:

1.4 SPF_SOFTFAIL Sending host does not match SPF-record (softfail)
    [SPF failed: Please see http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]
2.4 SPF_HELO_SOFTFAIL HELO-Name does not match SPF-record (softfail)
    [SPF failed: Please see http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available]

Can someone explain this failure to me?

Thanks
Chris
Re: Strange SPF problem/wrong result
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 So adding the line trusted_networks 134.96.254.200 to local.cf will fix this problem and this mail would be recognized correctly (as in pass SPF) ? Thanks Chris Justin Mason wrote: it's trusted_networks -- SpamAssassin doesn't know that it can trust mail.cs.uni-sb.de. --j. decoder writes: today I saw a strange SPF bug occuring. The original mail header was: Return-Path: [EMAIL PROTECTED] Received: from mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200]) by wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP id k7T8rU6P012050; Tue, 29 Aug 2006 10:53:30 +0200 Received: from mail-eur1.microsoft.com (mail-eur1.microsoft.com [213.199.128.139]) by mail.cs.uni-sb.de (8.13.8/2006081400) with ESMTP id k7T8rT98004989; Tue, 29 Aug 2006 10:53:29 +0200 (CEST) Received: from x.europe.corp.microsoft.com ([65.53.193.xxx]) by mail-eur1.microsoft.com with Microsoft SMTPSVC(6.0.3790.1830); Tue, 29 Aug 2006 09:53:29 +0100 (Some unrelated privacy details replaced with xxx). Now what SPF should do is (as far as I understood): - - Get the mail server that sent this (mail-eur1.microsoft.com) - - Check that its IP is in the allowed SPF record of microsoft.com This check passes as you can see here: http://www.dnsstuff.com/tools/spf.ch?server=microsoft.comip=213.199.128.139 Now SpamAssassin did something else, it took mail.cs.uni-sb.de as the mailserver that sent, and tried to match it against microsoft.com's SPF records which produced a SOFTFAIL: 1.4 SPF_SOFTFAIL Sending host does not match SPF-record (softfail) [SPF failed: Please see http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available] 2.4 SPF_HELO_SOFTFAIL HELO-Name does not match SPF-record (softfail) [SPF failed: Please see http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available] Can someone explain me this failure? 
Thanks Chris -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFE+BcYJQIKXnJyDxURAl22AJ9D1gsr9/mjmevWVe63mRcdOkeWqACgxYs8 S2NysNSm5mdscg2H2OsSsiI= =ghdo -END PGP SIGNATURE- -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFE+BxWJQIKXnJyDxURAhQ1AKCsicr906Fy7RkBZtU3TduR/cgFHgCfWJGe 2KZKNwn4ZfYBx4yh/xUwoHw= =AtZw -END PGP SIGNATURE-
Re: Strange SPF problem/wrong result
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Gino Cerullo wrote: On 1-Sep-06, at 7:18 AM, decoder wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hello, today I saw a strange SPF bug occuring. The original mail header was: Return-Path: [EMAIL PROTECTED] Received: from mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200]) by wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP id k7T8rU6P012050; Tue, 29 Aug 2006 10:53:30 +0200 Received: from mail-eur1.microsoft.com (mail-eur1.microsoft.com [213.199.128.139]) by mail.cs.uni-sb.de (8.13.8/2006081400) with ESMTP id k7T8rT98004989; Tue, 29 Aug 2006 10:53:29 +0200 (CEST) Received: from x.europe.corp.microsoft.com ([65.53.193.xxx]) by mail-eur1.microsoft.com with Microsoft SMTPSVC(6.0.3790.1830); Tue, 29 Aug 2006 09:53:29 +0100 (Some unrelated privacy details replaced with xxx). Now what SPF should do is (as far as I understood): - - Get the mail server that sent this (mail-eur1.microsoft.com) - - Check that its IP is in the allowed SPF record of microsoft.com This check passes as you can see here: http://www.dnsstuff.com/tools/spf.ch?server=microsoft.comip=213.199.128.139 Now SpamAssassin did something else, it took mail.cs.uni-sb.de as the mailserver that sent, and tried to match it against microsoft.com's SPF records which produced a SOFTFAIL: 1.4 SPF_SOFTFAIL Sending host does not match SPF-record (softfail) [SPF failed: Please see http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available] 2.4 SPF_HELO_SOFTFAIL HELO-Name does not match SPF-record (softfail) [SPF failed: Please see http://www.openspf.org/why.html?sender=xxx%40microsoft.comip=134.96.254.200receiver=This%20account%20is%20currently%20not%20available] Can someone explain me this failure? Spamassassin gave the correct result. 
It compared the IP address of the last received server, mail.cs.uni-sb.de [134.96.254.200], against the SPF record for Microsoft and did not see a match. Result: SOFTFAIL. Why do you think it should compare to mail-eur1.microsoft.com [213.199.128.139]? SPF compares the IP address of the last server to handle the message before it was handed off to a server on your receiving end. If the message was sent to someone who is using forwarding and was forwarded through mail.cs.uni-sb.de [134.96.254.200], then this would explain the SOFTFAIL. Forwarding breaks SPF.

This is no real forwarding; all mail for us is received by that server first, and that server passes it on to us. This is a common structure for a bigger mail setup. The trusted_networks option solved my problem, but it should definitely be documented in the wiki somewhere. Maybe we should add a note to the part of the install manual that explains SPF installation, saying that trusted_networks is important for SPF.

Chris

-- Gino Cerullo Pixel Point Studios 21 Chesham Drive Toronto, ON M3M 1W6 416-247-7740

-BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFE+C2ZJQIKXnJyDxURAp3eAJ9qvVbNz2OaPygoLghms+3KiPc1SQCgpCpD splrSRz31hg6UjCgJPWVKhY= =Sb9E -END PGP SIGNATURE-
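Why trusted_networks changes the SPF result can be shown with a simplified model: walk the Received chain from the newest hop backwards and check SPF against the first relay that is not trusted. This is only a sketch of the idea; SpamAssassin's real trust-path logic is more involved, and the function name is made up. The two addresses are the ones from the headers in this thread.

```python
import ipaddress


def relay_for_spf(received_ips, trusted_networks):
    """Return the first relay, newest hop first, that is not in a trusted network."""
    networks = [ipaddress.ip_network(net) for net in trusted_networks]
    for ip in received_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in net for net in networks):
            return ip
    return None  # every hop is trusted, nothing to check


# Newest Received header first, as it appears at the top of the message.
hops = ["134.96.254.200",   # mail.cs.uni-sb.de, the university's front server
        "213.199.128.139"]  # mail-eur1.microsoft.com

without_trust = relay_for_spf(hops, [])
with_trust = relay_for_spf(hops, ["134.96.254.200"])
```

Without any trusted network the university's own server is treated as the sender and fails microsoft.com's SPF record; once it is listed as trusted, the check moves one hop back to the Microsoft relay.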
Re: Strange SPF problem/wrong result
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Ramprasad wrote: Return-Path: [EMAIL PROTECTED] Received: from mail.cs.uni-sb.de (mail.cs.uni-sb.de [134.96.254.200]) by wjpserver.cs.uni-sb.de (8.12.11.20060308/8.12.11) with ESMTP id k7T8rU6P012050; Tue, 29 Aug 2006 10:53:30 +0200 Received: from mail-eur1.microsoft.com (mail-eur1.microsoft.com [213.199.128.139]) by mail.cs.uni-sb.de (8.13.8/2006081400) with ESMTP id k7T8rT98004989; Tue, 29 Aug 2006 10:53:29 +0200 (CEST) snip This is no real forwarding, but all mail for us gets received by that server first, and this server passes it to us. This is a common structure for a bigger mail setup. The trusted_networks option solved my problems, but it should definetly be included in the wiki somewhere. Maybe we should add a note about trusted_networks being important for SPF in the install manual where SPF installation is explained snip If 134.96.254.200 is accepting mails for you then you must do all SPF checks on that host. SPF checks dont work unless you do the checks on the receiving host. In a big infrastructure, this is hardly possible. This mailserver is not under our control but belongs to the University directly, not to our chair. Chris Thanks Ram -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFE+ELcJQIKXnJyDxURAn12AJ9OSP19czmLi1KNEmunB37WkWC75wCffMa4 15iEKJqbZOzSycS3nwn4RKU= =4Exp -END PGP SIGNATURE-
Re: [Devel-spam] Hash Stats
--[ UxBoD ]-- wrote:
> How many hits are you getting?
>
> Database changed
> mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR%' and date = '2006-08-29';
> +----------+
> | count(*) |
> +----------+
> |      385 |
> +----------+
> 1 row in set (0.10 sec)
>
> mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR_KNOWN_HASH%' and date = '2006-08-29';
> +----------+
> | count(*) |
> +----------+
> |        1 |
> +----------+
> 1 row in set (0.05 sec)
>
> mysql> select count(*) from maillog where spamreport like '%FUZZY_OCR_CORRUPT%' and date = '2006-08-29';
> +----------+
> | count(*) |
> +----------+
> |      298 |
> +----------+
> 1 row in set (0.05 sec)
>
> --[ UxBoD ]--
> // PGP Key: curl -s http://www.splatnix.net/uxbod.asc | gpg --import
> // Fingerprint: 543A E778 7F2D 98F1 3E50 9C1F F190 93E0 E8E8 0CF8
> // Keyserver: www.keyserver.net Key-ID: 0xE8E80CF8

Did you apply the patch I sent to the SA mailing list? There is a bug in 2.3b which breaks the database completely. Please fix line 492. It says:

print DB "$score::$digest\n";

It should be:

print DB "${score}::${digest}\n";

Because of the bug, the hashdb produced so far is corrupted; delete it and start with a new one.

Chris
Re: FuzzyOCR Install - Issues processing ONLY Gif images.
Michael Grey wrote:
> Installed FuzzyOCR and, I believe, all the dependencies. Using the sample images I get a pipe error ONLY on GIF images, resulting in no hits on FUZZY_OCR.
>
> Pipe Command /usr/bin/giftopnm -
>
> giftopnm exists in that path. Running giftopnm on the command line seems to work with no errors, spitting out a binary file to stdout as expected. Any ideas of what might be missing? (Fedora Core 4)

You can try step-by-step debugging. First of all, which sample is producing the error? (There are two GIF samples.) If it doesn't work with the corrupted sample, try extracting the image from that eml file (ripmime), then run the pipe manually:

cat filename.gif | giffix | giftopnm - blah.gif

If that fails, split the commands up and track down which binary causes the problem.

Chris

> Thanks,
> Michael Grey
>
> - log / reports -
>
> Corrupted-gif.eml
>
> pts rule name              description
> ---- ---------------------- --------------------------------------------------
> 0.1 HTML_MESSAGE           BODY: HTML included in message
> 3.0 BAYES_95               BODY: Bayesian spam probability is 95 to 99% [score: 0.9694]
> 1.5 FUZZY_OCR_WRONG_CTYPE  BODY: Mail contains an image with wrong content-type set
>                            Image has format GIF but content-type is image/jpeg
>
> [2006-08-29 19:20:00] Debug mode: Image has format GIF but content-type is image/jpeg
> [2006-08-29 19:20:01] Debug mode: Image is single non-interlaced...
> [2006-08-29 19:20:01] Unexpected error in pipe to external programs. Please check that all helper programs are installed and in the correct path. (Pipe Command /usr/bin/giftopnm -, Pipe exit code 1 (), Temporary file: /tmp/.spamassassin23614sXR9Dltmp)
> [2006-08-29 19:20:01] Debug mode: FuzzyOcr ending successfully...
> bash-3.00$
>
> animated-gif.eml
>
> pts rule name              description
> ---- ---------------------- --------------------------------------------------
> 0.7 DATE_IN_PAST_06_12     Date: is 6 to 12 hours before Received: date
> 0.1 HTML_MESSAGE           BODY: HTML included in message
> 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60% [score: 0.5000]
>
> [2006-08-29 19:22:12] Debug mode: Analyzing file with content-type image/gif
> [2006-08-29 19:22:12] Debug mode: Image is single non-interlaced...
> [2006-08-29 19:22:12] Unexpected error in pipe to external programs. Please check that all helper programs are installed and in the correct path. (Pipe Command /usr/bin/giftopnm -, Pipe exit code 1 (), Temporary file: /tmp/.spamassassin23644bPPq3jtmp)
> [2006-08-29 19:22:12] Debug mode: FuzzyOcr ending successfully...
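The advice to split the pipe up and find the failing binary can be automated with a small helper, sketched below. It runs each stage on its own, feeds the previous stage's output into the next one, and stops at the first non-zero exit code. The helper name is made up; the commented usage assumes, as in the pipe above, that giffix reads from stdin and that giftopnm reads stdin when given "-".

```python
import subprocess


def run_stages(stages, data):
    """Run each command separately; return a list of (program, exit code, stderr).

    Stops after the first stage that exits with a non-zero code.
    """
    results = []
    for command in stages:
        proc = subprocess.run(command, input=data,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        results.append((command[0], proc.returncode, proc.stderr.decode(errors="replace")))
        if proc.returncode != 0:
            break
        data = proc.stdout  # output of this stage feeds the next one
    return results


# Possible usage with the helpers from this thread (not run here):
# with open("filename.gif", "rb") as handle:
#     for program, code, errors in run_stages([["giffix"], ["giftopnm", "-"]], handle.read()):
#         print(program, code, errors)
```

The last entry of the result names the binary that failed, together with whatever it wrote to stderr. A missing helper program raises FileNotFoundError instead, which also answers the question of whether it is installed in the expected path.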
Silent bug in FuzzyOcr 2.3b, database feature - hotfix
Hello,

someone discovered that the DB was not working properly in most cases. Please fix line 492. It says:

print DB "$score::$digest\n";

It should be:

print DB "${score}::${digest}\n";

Because of this bug, the hashdb produced so far is unusable. Please delete it, or convert it by adding a score and :: in front of each entry. For example,

190959:221:288:64::255:255:255:255:45599::18:26:58:27:3991::11:67:247:71:1875::254:254:253:254:1417::236:14:8:80:1180

becomes

10::190959:221:288:64::255:255:255:255:45599::18:26:58:27:3991::11:67:247:71:1875::254:254:253:254:1417::236:14:8:80:1180

The bug caused the score not to be written in front of the file size.

Thanks to all the people on the devel-spam list who helped find this bug.

Chris
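For anyone who prefers converting the hashdb over deleting it, the manual step above can be scripted. The sketch below prefixes a score and :: to every entry that lacks one. It assumes, based only on the example entry above, that a healthy line starts with a bare score followed by ::, while a damaged line starts directly with the colon-separated size field. The function name is made up and the default score of 10 is the one from the example.

```python
def repair_hashdb_lines(lines, score="10"):
    """Prefix 'score::' to entries that were written without a score."""
    repaired = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # drop empty lines
        first_field = entry.split("::", 1)[0]
        # A healthy entry starts with a bare score; a damaged one starts with
        # the size field, which itself contains single colons.
        if ":" in first_field:
            entry = f"{score}::{entry}"
        repaired.append(entry)
    return repaired


damaged = "190959:221:288:64::255:255:255:255:45599::18:26:58:27:3991"
fixed_once = repair_hashdb_lines([damaged])
fixed_twice = repair_hashdb_lines(fixed_once)
```

Running the repair twice gives the same result as running it once, so it is safe to apply to a file that already contains some correct entries. Work on a copy of the database file in any case.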
Re: Hashcash
Hi,

Arik Raffael Funke wrote:
> Hello,
>
> how does spamassassin handle hashcash? It is turned on by default, right?

Yes but you still need to define your accept range as you tried to do above :)

> I am using v3.1.2 and have in init.pre
> loadplugin Mail::SpamAssassin::Plugin::Hashcash.
> However, the hashcash contained in incoming mails seems to have been ignored. I added following to local.cf, but I am still out of luck:
>
> use_hashcash 1
> hashcash_accept [EMAIL PROTECTED]

try [EMAIL PROTECTED]

> How do I get this to work? (Can it be a problem that I installed hashcash after spamassassin?)

That shouldn't matter.

Best regards,
Chris

> Cheers,
> Arik
Re: Hashcash
Arik Raffael Funke wrote:
> decoder wrote:
>> Arik Raffael Funke wrote:
>>> Hello,
>>>
>>> how does spamassassin handle hashcash? It is turned on by default, right?
>>
>> Yes but you still need to define your accept range as you tried to do above :)
>>
>>> I am using v3.1.2 and have in init.pre
>>> loadplugin Mail::SpamAssassin::Plugin::Hashcash.
>>> However, the hashcash contained in incoming mails seems to have been ignored. I added following to local.cf, but I am still out of luck:
>>>
>>> use_hashcash 1
>>> hashcash_accept [EMAIL PROTECTED]
>>
>> try [EMAIL PROTECTED]
>
> That doesn't seem to help either. Any other ideas?

Run with -D on a hashcash stamped message and check the output for relevant data..

Chris

> Regards,
> Arik
>
> BTW: I am not using v.3.1.2 as said above but v.3.1.4...
Re: Animated images in mails
Plenz wrote:
> decoder wrote:
>> gifasm can split them into multiple files, etc.
>
> Thanks, gifasm works very well. Seems that I only have to choose the biggest one of the output files, it contains the text.

That is what FuzzyOcr does automatically for you :) (If you set the gif frame option in the cf file to a low value... with 1 it will always be used..)

Chris
Now ascii spam instead of real pictures
Hello there,

A friend of mine recently received a mail containing an ASCII image advertising meds. The mail is attached. Anyone seen this before? Do rules exist already against this kind of spam?

Chris

---BeginMessage---
Monty, Y,our sourse for poopular M,EDS
__ __ (_) __ ___ _ _ __ __ _ /$\$ $ ___ ___ __ __ \ \ / / | | / _` | / _` | | '__| / _` | $ $ $$$$ / __| / _ \ | \/ | \ V / | | | (_| | | (_| | | | | (_| | \$$ $ $ $ $| (__ | (_) || |\/| | \_/ |_| \__,_| \__, | |_|\__,_| $ $ $ \___| \___/ |_| |_| |___/ $$ $ 0
inculcate MqY
---End Message---
Re: Now ascii spam instead of real pictures
Loren Wilton wrote:
> Ah. Sig-file format. That is I guess a slight new twist.
>
> This sort of thing was popular for a month or two a couple of years ago. I suspect they gave up on it then because it was probably done by hand and not worth the effort.
>
> Probably not too hard to catch this sort of thing.
>
> Loren

Yeah, this could probably be caught by looking for huge amounts of typical ASCII art characters like \ | / ( ) etc... Seems like a case for SARE ;D

Chris
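The idea of counting typical ASCII art characters can be sketched roughly as follows. This is a Python illustration only, not an existing SARE or SpamAssassin rule: the character set is just the five characters named in the message, and the 0.3 threshold and both function names are invented for the example and would need tuning against real mail.

```python
# Characters named in the thread as typical for ASCII art: \ | / ( )
ASCII_ART_CHARS = set("\\|/()")


def ascii_art_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are typical ASCII art characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    hits = sum(1 for ch in visible if ch in ASCII_ART_CHARS)
    return hits / len(visible)


def looks_like_ascii_art(text: str, threshold: float = 0.3) -> bool:
    """Flag a body whose ASCII art ratio reaches the (hypothetical) threshold."""
    return ascii_art_ratio(text) >= threshold
```

A ratio over the whole body keeps ordinary text with the odd parenthesis or path separator well below the threshold, while a block of drawn letters pushes it up quickly.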
Re: [Devel-spam] FuzzyOcr 2.3b released,fixes bugs and improves stability
jdow wrote:
> From: decoder [EMAIL PROTECTED]
>> Expertsites, Inc. wrote:
>>> From: decoder [EMAIL PROTECTED]
>>>> Hello,
>>>>
>>>> I just uploaded FuzzyOcr 2.3b to the download site. If you find bugs or run into problems, please mail back :)
>>>
>>> This release failed to recognize the sample png.eml file with logfile error message:
>>>
>>> Debug mode: Image type not recognized, unknown format. Skipping this image...
>>>
>>> I resolved this problem by changing one line in FuzzyOcr.pm
>>>
>>> Changed:
>>> elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" ) {
>>> To read:
>>> elsif ( substr($picture_data,0,4) eq "\x89\x50\x4e\x47" ) {
>>>
>>> Tom Green -- Expertsites, Inc.
>>
>> Thank you for reporting this... seems I can't count bytes anymore ;)
>>
>> For anyone who is downloading this past this message, the tarball has been updated...
>
> As someone else pointed out - it has not been updated. I just checked, Chris.
>
> {^_^}

Hrm what the hell... I am 1000% sure I uploaded it -_-

OK, NOW it is fixed... if not, then there is some kind of gremlin in our server...

Chris
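The off-by-one in the PNG check is easy to see in isolation: a five-byte slice can never be equal to a four-byte signature, so every PNG fell through to "unknown format". A small Python sketch of the same comparison follows (the plugin itself is Perl; `PNG_MAGIC` and `is_png` are names chosen for this example, and the sample bytes are just the standard start of a PNG file):

```python
# The four bytes the plugin compares against: 0x89 'P' 'N' 'G'.
PNG_MAGIC = b"\x89\x50\x4e\x47"


def is_png(picture_data: bytes) -> bool:
    """Compare exactly as many leading bytes as the signature is long."""
    return picture_data[:len(PNG_MAGIC)] == PNG_MAGIC


# Standard 8-byte PNG file signature followed by some placeholder payload.
sample = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

buggy_match = sample[:5] == PNG_MAGIC  # 5 bytes vs 4 bytes: never equal
fixed_match = is_png(sample)           # 4 bytes vs 4 bytes
```

Deriving the slice length from the signature itself, rather than typing it as a separate literal, avoids this class of mistake.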
Re: Fuzzy 2.3b and PNG
Gary V wrote:
>> Rose, Bobby wrote:
>>> What am I missing? I updated but now png isn't working. If I switch to debug logging 2 I see in the log when I run the sample thru.
>>>
>>> [2006-08-26 18:16:40] Debug mode: Analyzing file with content-type image/png
>>> [2006-08-26 18:16:40] Debug mode: Image type not recognized, unknown format. Skipping this image...
>>>
>>> Thanks
>>> Bobby
>>
>> Yes, I already posted this in this thread, there is a bug in this line:
>>
>> elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" )
>>
>> correct is:
>>
>> elsif ( substr($picture_data,0,4) eq "\x89\x50\x4e\x47" )
>>
>> The tarball which is available for download has been fixed already...
>>
>> Chris
>
> I just downloaded it from http://users.own-hero.net/~decoder/fuzzyocr/ and line 733 says:
> elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" ) {
>
> Gary V

Yeah, my mistake: it seems the tarball was not uploaded... now it should be... ;)

Chris
Re: Animated images in mails
Loren Wilton wrote:
> Sure. giftopnm will do it. The FuzzyOCR plugin is using some other tool that will also do it, I don't recall what just at the moment.
>
> Loren

giftopnm won't do it as far as I tested it... it only extracts the first frame...

FuzzyOcr is using two different tests... for few frames, it simply glues them into one frame using imagemagick; for many frames, it picks the best one and tests that..

Chris
Re: [Devel-spam] FuzzyOcr 2.3b released,fixes bugs and improves stability
Gary V wrote:
>> Hello,
>>
>> I just uploaded FuzzyOcr 2.3b to the download site. If you find bugs or run into problems, please mail back :)
>
> The jpeg.eml and png.eml samples failed to provide FuzzyOcr hits on my system because the messages scored higher than the default focr_autodisable_score. You should mention in the README file in the samples directory that you may need to temporarily raise the focr_autodisable_score while testing.
>
> Gary V

Ah thanks... I didn't think about that... earlier, the score was 50 by default and I lowered it to 10 without redoing the tests :)

Chris
Re: Animated images in mails
Plenz wrote:
> Today I got animated spam. The first frame only with dots and lines, the second frame with spam text, the third frame again with dots and lines. The duration of the text frame is very long, the others are very short.
>
> Is there a command line utility which can extract animated GIFs?

Various... imagemagick can either extract them or put them into one image, gifasm can split them into multiple files, etc.

FuzzyOcr utilizes both as needed to scan animated gifs.

Chris
Re: FuzzyOcr 2.3b release, broken with SA 3.1.0
Hello,

I was just informed that the latest FuzzyOcr version, 2.3b, includes a function (module from SA) which is only available in 3.1.4, not in 3.1.0.

The missing module is Mail::SpamAssassin::Timeout. Currently, the only way to fix this is to upgrade to 3.1.4. I am still unsure whether I should add my own timeout stuff with alert() only to support 3.1.0.

Maybe someone else here has a better idea :)

Chris
Re: Fuzzy 2.3b and PNG
Rose, Bobby wrote:
> What am I missing? I updated but now png isn't working. If I switch to debug logging 2 I see in the log when I run the sample thru.
>
> [2006-08-26 18:16:40] Debug mode: Analyzing file with content-type image/png
> [2006-08-26 18:16:40] Debug mode: Image type not recognized, unknown format. Skipping this image...
>
> Thanks
> Bobby

Yes, I already posted this in this thread, there is a bug in this line:

elsif ( substr($picture_data,0,5) eq "\x89\x50\x4e\x47" )

correct is:

elsif ( substr($picture_data,0,4) eq "\x89\x50\x4e\x47" )

The tarball which is available for download has been fixed already...

Chris
Re: Broken images in mails
Plenz wrote:
>> Adding a point for corrupted images is sounding better and better.
>
> I disagree. To check out what happens I converted a JPG picture into a GIF file and sent it to myself. One time I converted it with IrfanView and the second time with PaintShop Pro. Both GIF files had the result
>
> giftopnm: EOF or error reading data portion...
>
> So I produced a corrupt (?) image, but it was not spam.
>
> I have no idea what is wrong and how it could be fixed. Only this: a GIF file seems to be divided into several blocks. Perhaps one block (perhaps the last block) is too short and does not match to its block header (if any exists?). Perhaps it is possible to read out the correct block length from a header and fill the block with 00h to get a valid GIF file.
>
> Ah... I just found that there is a program named GIFFIX. I should try it out.

FuzzyOcr will try to invoke Giffix if an image is broken. If giffix does not completely fail, then it will only give a low score for the picture being corrupted. If it isn't able to fix the image at all, then it will give a higher score.

Chris
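The two-tier scoring for broken images described above boils down to a small decision. The Python sketch below is only an illustration of that decision, not the plugin's code: the function and constant names are hypothetical, and the values 2.5 and 5 are the defaults quoted in the 2.3b release announcement elsewhere in this thread.

```python
# Default scores quoted in the FuzzyOcr 2.3b announcement (configurable in practice).
SCORE_CORRUPT_FIXED = 2.5      # image was broken, but giffix produced scannable output
SCORE_CORRUPT_UNFIXABLE = 5.0  # image was broken and giffix could not repair it


def corrupt_image_score(is_corrupt: bool, giffix_recovered: bool) -> float:
    """Return the extra score for a corrupted image, or 0.0 for an intact one."""
    if not is_corrupt:
        return 0.0
    if giffix_recovered:
        return SCORE_CORRUPT_FIXED
    return SCORE_CORRUPT_UNFIXABLE
```

With this split, an image that a sloppy converter merely truncated costs the sender less than one that is broken beyond repair.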
Re: Discourage broken content
Kenneth Porter wrote:
> --On Friday, August 25, 2006 12:05 AM -0700 Plenz [EMAIL PROTECTED] wrote:
>> I disagree. To check out what happens I converted a JPG picture into a GIF file and sent it to myself. One time I converted it with IrfanView and the second time with PaintShop Pro. Both GIF files had the result
>>
>> giftopnm: EOF or error reading data portion...
>>
>> So I produced a corrupt (?) image, but it was not spam.
>
> I think we should discourage all broken content in email and on the web. At one time we could assume that broken content was an honest mistake and make an attempt at fixing it. But with the rise of malicious content attempting to exploit bugs in content handlers (like overruns in image libraries), we should simply reject anything that fails to pass validation, on the assumption that it's out to get us.
>
> This includes not just broken images but also broken HTML, which is so commonly used to conceal spam.
>
> We need to stop giving a free pass to broken content creation software just because it's popular. When someone sends you broken content, you should react the same way you would if they sent you documents on dirt-smeared paper. Stop letting your emperor walk around naked.

I completely agree. The problem is that some implementations make this impossible. For example MailScanner: I've heard that it truncates the mail at 30kb, no matter if that is within a MIME block or not... So my plugin gets a broken image.. though it was not broken originally...

Chris
Re: Discourage broken content
Logan Shaw wrote:
> On Fri, 25 Aug 2006, enediel gonzalez wrote:
>> From: decoder [EMAIL PROTECTED]
>>> Kenneth Porter wrote:
>>>
>>> I completely agree, the problem is, some implementations makes this impossible. For example MailScanner. I've heard that it truncates the mail at 30kb, no matter if that is within a MIME block or not... So my plugin gets a broken image.. though it was not broken originally...
>
> Yes, if you leave the default Max SpamAssassin Size = 3 setting in place, it will do this.
>
>> Could somebody explain to me the reason why MailScanner acts this way?
>
> Performance. The theory, I think, is that if a message is spam, there should be some evidence of that in the first 3 bytes, so there is no need to pass the whole message to SpamAssassin.
>
> I think this was a good assumption and a good plan when SpamAssassin didn't check a lot of attachments. Now that there are plugins which do check attachments, leaving the MIME structure of the message intact is more important, but MailScanner hasn't caught up with this reality.

I heard that a proposal on leaving the MIME structure intact has been made... so at least if the message was truncated, it wouldn't be truncated in the middle of an attachment (which would make absolutely no sense: either you truncate before or after the attachment; a broken attachment doesn't help anyone and will only cause unnecessary errors)

Chris

> Of course, you can always just remove the limitation by changing the MailScanner configuration file.
>
>> A good question could be decide if you adapt this plugin to be compatible with MailScanner or the last one should change this practice.
>
> MailScanner calls SpamAssassin, so no adaptation needed in most cases. Unless you are talking about workarounds for issues like the above.
>
> - Logan
Re: Discourage broken content
Rick Cooper wrote:
>> -----Original Message-----
>> From: decoder [mailto:[EMAIL PROTECTED]
>> Sent: Friday, August 25, 2006 2:24 PM
>> To: users@spamassassin.apache.org
>> Subject: Re: Discourage broken content
>>
>> Kenneth Porter wrote:
>>> --On Friday, August 25, 2006 12:05 AM -0700 Plenz [EMAIL PROTECTED] wrote:
>>>> I disagree. To check out what happens I converted a JPG picture into a GIF file and sent it to myself. One time I converted it with IrfanView and the second time with PaintShop Pro. Both GIF files had the result
>>>>
>>>> giftopnm: EOF or error reading data portion...
>>>>
>>>> So I produced a corrupt (?) image, but it was not spam.
>>>
>>> I think we should discourage all broken content in email and on the web. At one time we could assume that broken content was an honest mistake and make an attempt at fixing it. But with the rise of malicious content attempting to exploit bugs in content handlers (like overruns in image libraries), we should simply reject anything that fails to pass validation, on the assumption that it's out to get us.
>>>
>>> This includes not just broken images but also broken HTML, which is so commonly used to conceal spam.
>>>
>>> We need to stop giving a free pass to broken content creation software just because it's popular. When someone sends you broken content, you should react the same way you would if they sent you documents on dirt-smeared paper. Stop letting your emperor walk around naked.
>>
>> I completely agree, the problem is, some implementations makes this impossible. For example MailScanner. I've heard that it truncates the mail at 30kb, no matter if that is within a MIME block or not... So my plugin gets a broken image.. though it was not broken originally...
>
> That is patently false. I have a graphics design/advertising department at one of my locations and these fellas send huge graphics files back and forth when they have emergency proofs/changes and MailScanner has *never* damaged anything, ever, anywhere. Now, there is a setting for scanning (much like exiscan IIRC) that allows you to truncate the message and only scan xxx amount, it's optional and doesn't modify the actual message in any way.
>
> Rick

I did not say it damages the mail. I said it feeds only a given amount of the message to SpamAssassin, and THAT breaks plugins requiring the whole message, especially when MailScanner breaks messages in the middle of attachments. And as far as I know, it is the default setting of MailScanner to feed only a given amount of kb to SpamAssassin. That does not mean it truncates the message before delivering it.

Chris
FuzzyOcr 2.3b released, fixes bugs and improves stability
Hello,

I just uploaded FuzzyOcr 2.3b to the download site. If you find bugs or run into problems, please mail back :)

The major changes are:

- Added a configurable timeout (maximum runtime) for the plugin, to avoid any lockups/unwanted delays.

- The default matching threshold (set in the config file) can now be overridden on a per-word basis in the wordlist. An example, wordlist contains:

  word1
  word2::0
  word3::0.2

  Then word1 is matched with the default threshold set in the config file, word2 must be an exact match (threshold 0), and word3 is matched with a threshold of 0.2. This is especially useful for words which trigger false positives very often, like: penis, money or news. Note that the tendency to produce a FP is not directly connected to the word length. The word buy produces very few FP compared to penis, when both are being matched with the same threshold. The FuzzyOcr.words.sample contains some suggestions for word specific thresholds which I recommend.

- The experimental MD5 database has been replaced by a custom hash database which is able to match very similar images. Often, you get the same image twice, or all your customers get the same spam mail. But even though the pictures look the same, they are not identical. That is why MD5 was useless. The newly introduced hash (self invented) is able to recognize almost identical images based on features that I won't explain here as it would make it easier for spammers :) If a message contains a picture previously registered in the database, the original score is reread from the database, the message is immediately tagged with this score, and the plugin ends.

- Some non-alpha to alpha translations are now applied to the gocr output to fix common mistakes, like i being misread as ; or a as 8.

- There are now 2 scores for broken images. One is used when the picture is recognized as broken, but giffix was able to correct the errors and it gave some output that can be scanned. The other one is used if the image is unfixable (that means either too broken, or interlaced/animated and broken). The first one is set lower than the second one (2.5 vs. 5).

- Various bugfixes

TODO:

- Write an external program to manage the database (add, remove and verify given pictures).
- Rewrite the temp file system to do all external program operations on files (saves memory).

Another wish: I'd like to create a database to ship with the plugin so it can be used out of the box, but I do not have many samples here, so it would be nice if you sent me picture samples of common picture spam you get, with [picture sample] in the subject, to my mail address. I will post here again if I got enough :).

Thanks to Jorge Valdes, Michael Alan Dorman and UxBoD for finding bugs and sending improvement suggestions for this version.

Chris
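The per-word threshold syntax from the announcement (word, or word::threshold) can be read with very little code. The Python sketch below only illustrates the format as described in the message; it is not the plugin's parser, the function name is invented, one entry per line is assumed, and the default of 0.25 in the usage lines is a placeholder for whatever the config file sets.

```python
def parse_wordlist(lines, default_threshold: float) -> dict:
    """Map each word to its matching threshold.

    An entry is either 'word' (use the default threshold) or 'word::threshold'.
    Blank lines are skipped.
    """
    thresholds = {}
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        word, separator, value = entry.partition("::")
        thresholds[word] = float(value) if separator else default_threshold
    return thresholds


# The example wordlist from the announcement, with a placeholder default of 0.25.
wordlist = parse_wordlist(["word1", "word2::0", "word3::0.2"], default_threshold=0.25)
```

Here word2 ends up with threshold 0, meaning an exact match is required, while word1 falls back to the default.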
Re: FuzzyOcr 2.3b released, fixes bugs and improves stability
John Andersen wrote:
> On Friday 25 August 2006 13:17, decoder wrote:
>> Another wish: I'd like to create a database to ship with the plugin so it can be used out of the box but I do not have much samples here, so it would be nice if you sent me picture samples of common picture spam you get with [picture sample] in the subject to my mail address. I will post here again if I got enough :).
>
> Wouldn't it be more productive to the community to work with SURBL to enable the centralized storage of these hashes?
>
> Or perhaps with Razor2?
>
> I'm not an expert on Razor, but my limited understanding of it is that it generates hashes of (portions of) message bodies and stores that hash for future comparison.
>
> It would seem that once someone decides something is spam, one could take your hash and wrap a minimal message around it and report THAT to Razor.
>
> Then your engine could examine an image, generate your hash, and wrap it in the same minimal message and query Razor. Presumably getting a hit.
>
> No local database is needed, because a world wide one would be substituted. That way, if you get this spam and report it, it will already be known by the time I get the spam.

Maybe it would. But this kind of hash is no real hash. It is just a combination of picture features that I invented... but it seems reliable in my tests so far. Once it has been tested in public, such a cooperation with SURBL or Razor might be possible.

Chris
Re: FuzzyOcr 2.3b released, fixes bugs and improves stability
Michael Scheidell wrote:
> Now if you could just ocr the whole thing as text, and pass it back to SA to score!

I explained before why this is not going to happen really soon:

a) It is VERY hard to realize. To preserve the message, you would need two plugins: one that runs as the first rule and converts the message to text only, and another one that runs as the last rule and puts the image back into the message (so the message stays unchanged).

b) The default gocr output is not reliable enough for text only rules. The current FuzzyOcr achieves better results by doing multiple scans with different settings.

Chris