[PHP] detecting spam keywords with stripos
Hi there,

I am matching text against an array of keywords to detect spam. Unfortunately, there are some false positives because stripos also finds a keyword inside a longer word, e.g. the keyword "Werbung" is found in "Bewerbung".

First thought: use strpos, but this does not help in all cases.
Second thought: split the text into words and use in_array, but this does not find things like "zu Hause" or "flexible/Arbeit".

Does somebody have an idea how to improve my function so that it does not detect the keyword inside a word? Here is the code:

    // Load all keywords and their weights from the query result.
    while ($row = db_get_row($result)) {
        $keyword[] = $row->keyword;
        $weight[]  = $row->weight;
    }
    $num_results = db_numrows($result);

    // Check every keyword against the posting text and its title.
    for ($i = 0; $i < $num_results; $i++) {
        $findme = $keyword[$i];
        $pos  = stripos($data['txt'], $findme);
        $pos2 = stripos($data['title'], $findme);
        if ($pos !== false || $pos2 !== false) {
            // spam!
            $spam_level += $weight[$i];
            $triggered_keywords .= $keyword[$i] . ', ';
        }
    }
    $spam['score'] += $spam_level;

Thank you for any help!

Merlin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] detecting spam keywords with stripos
Merlin Morgenstern wrote:

> I am matching text against an array of keywords to detect spam.
> Unfortunately, there are some false positives because stripos also
> finds a keyword inside a longer word, e.g. "Werbung" in "Bewerbung".
>
> First thought: use strpos, but this does not help in all cases.
> Second thought: split the text into words and use in_array, but this
> does not find things like "zu Hause" or "flexible/Arbeit".

First thought: use SpamAssassin.
Second thought: use regexes.

/Per

--
Per Jessen, Zürich (17.1°C)
Re: [PHP] detecting spam keywords with stripos
On 5/29/09 5:36 AM, Merlin Morgenstern merli...@fastmail.fm wrote:

> Does somebody have an idea on how to make my function better in terms
> of not detecting the string inside a word?

I agree with Per. Learn PCRE: http://us.php.net/manual/en/book.pcre.php

As for successfully filtering spam by keyword matching: good luck!
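A minimal sketch of the regex approach suggested above, assuming PHP 7 or later and UTF-8 input. The function name contains_keyword is hypothetical and not from the thread. It only accepts a match when no letter or digit sits directly before or after the keyword, so "Werbung" is no longer found inside "Bewerbung", while phrases such as "zu Hause" or "flexible/Arbeit" still match as a whole.

```php
<?php
// Hypothetical helper (not from the thread): true when $keyword occurs in
// $text as a whole word or phrase, case-insensitively.
function contains_keyword(string $text, string $keyword): bool
{
    // preg_quote() escapes regex metacharacters in the keyword; the second
    // argument also escapes the "/" delimiter (needed for "flexible/Arbeit").
    $quoted = preg_quote($keyword, '/');

    // Lookbehind and lookahead: no Unicode letter (\p{L}) or digit (\p{N})
    // may touch the keyword on either side.
    // Flags: i = case-insensitive, u = treat pattern and subject as UTF-8.
    $pattern = '/(?<![\p{L}\p{N}])' . $quoted . '(?![\p{L}\p{N}])/iu';

    // preg_match() returns 1 on a match, 0 on no match and false on error
    // (for example when $text is not valid UTF-8).
    return preg_match($pattern, $text) === 1;
}
```

In the loop from the original posting, the two stripos() calls could then be replaced by contains_keyword($data['txt'], $findme) and contains_keyword($data['title'], $findme). With a long keyword list it may be worth building the patterns once instead of on every call.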
Re: [PHP] detecting spam keywords with stripos
Per Jessen wrote:

> First thought - use Spamassassin.
> Second thought - use regexes.

Sorry, this is a different scenario. I do need to do it this way in my case. It is about spam inside user postings.

Any ideas?
Re: [PHP] detecting spam keywords with stripos
2009/5/29 Merlin Morgenstern merli...@fastmail.fm:

> Sorry, this is a different scenario. I do need to do it this way in my
> case. It is about spam inside user postings.
>
> Any ideas?

I've had to solve this problem before, and the conclusion I came to is that with this kind of simple matching you have to accept either false positives or false negatives.

Alternatives include implementing Bayesian filtering or some other algorithm that is more complex than simple matching, or using a pre-existing solution. I'm sure you could integrate SpamAssassin or similar, because at the end of the day all those systems expect is a bunch of text. If they require the headers of an email, you can supply fake ones and remove any effect the headers have on the score.

Whether that's worth it depends on the volume you're talking about and how many manual moderation checks you want to have to do.

-Stuart

--
http://stut.net/
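To give an idea of what the Bayesian alternative mentioned above involves, here is a rough sketch of a naive Bayes classifier with add-one smoothing, assuming PHP 7.4 or later and UTF-8 text. The class name, the 'spam'/'ham' labels and the tokenizer are all made up for illustration; a real filter would need a large set of hand-labelled postings and more careful tokenizing.

```php
<?php
// Hypothetical minimal naive Bayes text classifier (illustration only).
final class NaiveBayesSketch
{
    private array $wordCounts = ['spam' => [], 'ham' => []]; // word => count, per label
    private array $totalWords = ['spam' => 0, 'ham' => 0];   // tokens seen per label
    private array $docCounts  = ['spam' => 0, 'ham' => 0];   // postings seen per label
    private array $vocabulary = [];                          // every word seen in training

    private function tokenize(string $text): array
    {
        // Lower-case with mbstring when available, otherwise ASCII-only strtolower().
        $lower = function_exists('mb_strtolower') ? mb_strtolower($text, 'UTF-8') : strtolower($text);
        // Split on every run of characters that are neither letters nor digits.
        $tokens = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
        return $tokens === false ? [] : $tokens;
    }

    // $label must be 'spam' or 'ham'.
    public function train(string $text, string $label): void
    {
        $this->docCounts[$label]++;
        foreach ($this->tokenize($text) as $word) {
            $this->wordCounts[$label][$word] = ($this->wordCounts[$label][$word] ?? 0) + 1;
            $this->totalWords[$label]++;
            $this->vocabulary[$word] = true;
        }
    }

    // Returns an estimate of P(spam | text) between 0 and 1.
    public function spamProbability(string $text): float
    {
        $tokens    = $this->tokenize($text);
        $totalDocs = $this->docCounts['spam'] + $this->docCounts['ham'];
        $vocabSize = count($this->vocabulary);
        $logScore  = [];
        foreach (['spam', 'ham'] as $label) {
            // Smoothed prior, then one smoothed likelihood term per token.
            // Logarithms avoid floating-point underflow on long texts.
            $logScore[$label] = log(($this->docCounts[$label] + 1) / ($totalDocs + 2));
            foreach ($tokens as $word) {
                $count = $this->wordCounts[$label][$word] ?? 0;
                $logScore[$label] += log(($count + 1) / ($this->totalWords[$label] + $vocabSize + 1));
            }
        }
        // Turn the two log scores into a probability for the 'spam' label.
        return 1 / (1 + exp($logScore['ham'] - $logScore['spam']));
    }
}
```

Because it counts whole tokens, "Bewerbung" and "Werbung" are simply different words here, but it is still blind to anything it has not seen in training.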
Re: [PHP] detecting spam keywords with stripos
On Fri, May 29, 2009 at 10:02 AM, Merlin Morgenstern merli...@fastmail.fm wrote:

> Sorry, this is a different scenario. I do need to do it this way in my
> case. It is about spam inside user postings.
>
> Any ideas?

Regex is your best bet, but nothing will be foolproof. Case in point (shit, shiite, sh*t, s**t, merde, Scheiße! as/a and so on).

--
Bastien

Cat, the other other white meat
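To illustrate why it is never foolproof, here is a sketch that extends whole-word matching to one single obfuscation trick: letters replaced by "*". It assumes PHP 7 or later and UTF-8 input; masked_keyword_pattern is a hypothetical name. It uses the neutral keyword "Werbung" from the original posting, and it does nothing about synonyms, other languages or other substitution characters.

```php
<?php
// Hypothetical helper: builds a whole-word pattern in which every character
// of $keyword after the first may also be written as "*".
function masked_keyword_pattern(string $keyword): string
{
    // Split the keyword into single UTF-8 characters.
    $chars = preg_split('//u', $keyword, -1, PREG_SPLIT_NO_EMPTY);
    $parts = [];
    foreach ($chars as $i => $char) {
        $quoted = preg_quote($char, '/');
        // First character must match literally; later ones may be "*".
        $parts[] = $i === 0 ? $quoted : '(?:' . $quoted . '|\*)';
    }
    // No letter, digit or "*" may touch the match on either side.
    return '/(?<![\p{L}\p{N}*])' . implode('', $parts) . '(?![\p{L}\p{N}*])/iu';
}
```

Every additional trick (spaces between letters, look-alike characters, deliberate misspellings) needs yet another rule, and each rule adds new ways to hit innocent text.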
Re: [PHP] detecting spam keywords with stripos
Stuart wrote:

> I'm sure you could integrate SpamAssassin or similar because at the
> end of the day all those systems expect is a bunch of text.

Exactly. You can run SA as a daemon (spamd) and feed data to it using spamc. Works very well. The full ruleset is probably too much, but it's easy to roll your own too.

> If they require the headers of an email you can supply fake ones and
> remove any effect headers have on the score.

SA doesn't require them, and without them scoring would (obviously) be based on the text only.

/Per

--
Per Jessen, Zürich (20.9°C)
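A rough sketch of feeding a posting to spamd from PHP, assuming PHP 7.1 or later, a running spamd, and spamc on the PATH. Both function names are hypothetical. It relies on spamc's -c (check) option, which prints a "score/threshold" line such as "7.5/5.0"; it is worth checking what your spamc version prints when it cannot reach spamd before trusting the result.

```php
<?php
// Hypothetical helper: parses the "score/threshold" line printed by `spamc -c`.
// Returns null when the output does not have that shape.
function parse_spamc_check(string $output): ?array
{
    if (preg_match('~^\s*(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)\s*$~', $output, $m) !== 1) {
        return null;
    }
    return ['score' => (float) $m[1], 'threshold' => (float) $m[2]];
}

// Hypothetical helper: pipes $text to spamc and returns the parsed result,
// or null if spamc could not be started or printed something unexpected.
function spamc_check(string $text): ?array
{
    // stdin and stdout are pipes; stderr is inherited from the PHP process.
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']];
    $proc = proc_open('spamc -c', $spec, $pipes);
    if (!is_resource($proc)) {
        return null;
    }
    fwrite($pipes[0], $text);   // send the posting text
    fclose($pipes[0]);          // signal end of input
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return parse_spamc_check((string) $output);
}
```

The returned score could then be added to $spam['score'] alongside, or instead of, the keyword weights.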