[PHP] detecting spam keywords with stripos

2009-05-29 Thread Merlin Morgenstern

Hi there,

I am matching text against an array of keywords to detect spam. 
Unfortunately there are some false positives due to the fact that 
stripos also finds the keyword inside a word.

E.G. Bewerbung - Werbung

First thought: use strpos, but this does not help in all cases
Second thought: split text into words and use in_array, but this does 
not find things like zu Hause or flexible/Arbeit


Does somebody have an idea on how to make my function better in terms of 
not detecting the string inside a word? Here is the code:


// Load the keywords and their weights from the database.
while ($row = db_get_row($result)) {
    $keyword[] = $row->keyword;
    $weight[]  = $row->weight;
}
$num_results = db_numrows($result);

// Check every keyword against the posting's text and title.
for ($i = 0; $i < $num_results; $i++) {
    $findme = $keyword[$i];
    $pos  = stripos($data['txt'], $findme);
    $pos2 = stripos($data['title'], $findme);
    if ($pos !== false OR $pos2 !== false) { // spam!
        $spam_level += $weight[$i];
        $triggered_keywords .= $keyword[$i] . ', ';
    }
}
$spam['score'] += $spam_level;

Thank you for any help!

Merlin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] detecting spam keywords with stripos

2009-05-29 Thread Per Jessen
Merlin Morgenstern wrote:

> Hi there,
>
> I am matching text against an array of keywords to detect spam.
> Unfortunately there are some false positives due to the fact that
> stripos also finds the keyword inside a word.
> E.G. Bewerbung - Werbung
>
> First thought: use strpos, but this does not help in all cases
> Second thought: split text into words and use in_array, but this does
> not find things like zu Hause or flexible/Arbeit

First thought - use Spamassassin.
Second thought - use regexes.
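
A minimal sketch of what the regex approach could look like, assuming PHP 7 or later with PCRE built with UTF-8 support. The helper name contains_keyword is hypothetical and not from this thread. Instead of \b it uses lookarounds on \p{L} and \p{N}, so a keyword only matches when it is not directly preceded or followed by a letter or digit. That rejects Werbung inside Bewerbung while still finding phrases such as zu Hause or flexible/Arbeit:

```php
<?php
// Hypothetical helper (not from the thread): returns true if $keyword occurs
// in $text as a whole word or phrase, case-insensitively.
function contains_keyword(string $text, string $keyword): bool
{
    // preg_quote() escapes regex metacharacters and the "/" delimiter,
    // so keywords like "flexible/Arbeit" are matched literally.
    $pattern = '/(?<![\p{L}\p{N}])' . preg_quote($keyword, '/') . '(?![\p{L}\p{N}])/iu';

    // preg_match() returns 1 on a match, 0 on no match, false on error
    // (for example invalid UTF-8 input); only 1 counts as a hit here.
    return preg_match($pattern, $text) === 1;
}
```

In the original loop this could stand in for the two stripos calls, e.g. contains_keyword($data['txt'], $findme). Obfuscated spellings would still slip through.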

/Per

-- 
Per Jessen, Zürich (17.1°C)





Re: [PHP] detecting spam keywords with stripos

2009-05-29 Thread Tom Worster
On 5/29/09 5:36 AM, Merlin Morgenstern merli...@fastmail.fm wrote:

> Does somebody have an idea on how to make my function better in terms of
> not detecting the string inside a word?

i agree with per. learn pcre: http://us.php.net/manual/en/book.pcre.php

as for successfully filtering spam by keyword matching: good luck!






Re: [PHP] detecting spam keywords with stripos

2009-05-29 Thread Merlin Morgenstern



Per Jessen wrote:
> Merlin Morgenstern wrote:
>
>> Hi there,
>>
>> I am matching text against an array of keywords to detect spam.
>> Unfortunately there are some false positives due to the fact that
>> stripos also finds the keyword inside a word.
>> E.G. Bewerbung - Werbung
>>
>> First thought: use strpos, but this does not help in all cases
>> Second thought: split text into words and use in_array, but this does
>> not find things like zu Hause or flexible/Arbeit
>
> First thought - use Spamassassin.
> Second thought - use regexes.
>
> /Per

Sorry, this is a different scenario. I do need to do it this way in my 
case. It is about spam inside user postings.


Any ideas?




Re: [PHP] detecting spam keywords with stripos

2009-05-29 Thread Stuart
2009/5/29 Merlin Morgenstern merli...@fastmail.fm:


> Per Jessen wrote:
>
>> Merlin Morgenstern wrote:
>>
>>> Hi there,
>>>
>>> I am matching text against an array of keywords to detect spam.
>>> Unfortunately there are some false positives due to the fact that
>>> stripos also finds the keyword inside a word.
>>> E.G. Bewerbung - Werbung
>>>
>>> First thought: use strpos, but this does not help in all cases
>>> Second thought: split text into words and use in_array, but this does
>>> not find things like zu Hause or flexible/Arbeit
>>
>> First thought - use Spamassassin.
>> Second thought - use regexes.
>>
>> /Per
>
> Sorry, this is a different scenario. I do need to do it this way in my case.
> It is about spam inside user postings.
>
> Any ideas?

I've had to solve this problem before and the conclusion I came to is
that when doing this kind of simple matching you either accept false
positives or false negatives. Alternatives include implementing
Bayesian filtering or some other algorithm that's more complex than
simple matching, or using a pre-existing solution.
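
To illustrate the Bayesian idea, a very small naive Bayes classifier might look like the sketch below. It is not taken from any implementation mentioned in this thread: the class name TinyBayes, the labels 'spam' and 'ham', the tokenizer and the Laplace smoothing are all assumptions made for the example. It also assumes at least one posting has been trained for each label before spamProbability() is called.

```php
<?php
// Hypothetical minimal naive Bayes text classifier (illustration only).
class TinyBayes
{
    private $wordCounts = ['spam' => [], 'ham' => []]; // word => count, per label
    private $docCounts  = ['spam' => 0, 'ham' => 0];   // trained postings per label
    private $totalWords = ['spam' => 0, 'ham' => 0];   // total tokens per label
    private $vocabulary = [];                          // set of all known words

    private function tokenize(string $text): array
    {
        // Split on anything that is not a letter or digit.
        // strtolower() only folds ASCII letters; good enough for a sketch.
        $tokens = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        return $tokens === false ? [] : $tokens;
    }

    // Learn from one posting; $label must be 'spam' or 'ham'.
    public function train(string $text, string $label): void
    {
        $this->docCounts[$label]++;
        foreach ($this->tokenize($text) as $word) {
            $this->wordCounts[$label][$word] = ($this->wordCounts[$label][$word] ?? 0) + 1;
            $this->totalWords[$label]++;
            $this->vocabulary[$word] = true;
        }
    }

    // Estimated probability (0..1) that $text is spam.
    public function spamProbability(string $text): float
    {
        $totalDocs = $this->docCounts['spam'] + $this->docCounts['ham'];
        $vocabSize = count($this->vocabulary);
        $tokens    = $this->tokenize($text);
        $logScore  = [];
        foreach (['spam', 'ham'] as $label) {
            // Log prior plus the log likelihood of every word, with add-one smoothing.
            $logScore[$label] = log($this->docCounts[$label] / $totalDocs);
            foreach ($tokens as $word) {
                $count = $this->wordCounts[$label][$word] ?? 0;
                $logScore[$label] += log(($count + 1) / ($this->totalWords[$label] + $vocabSize));
            }
        }
        // Turn the two log scores into a normalised probability for 'spam'.
        return 1 / (1 + exp($logScore['ham'] - $logScore['spam']));
    }
}
```

Because whole tokens are counted, Bewerbung and Werbung are simply different words here, but such a filter is only as good as the postings it has been trained on.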

I'm sure you could integrate SpamAssassin or similar because at the
end of the day all those systems expect is a bunch of text. If they
require the headers of an email you can supply fake ones and remove
any effect headers have on the score. Whether that's worth it depends
on the volume you're talking about and how many manual moderation checks
you want to have to do.

-Stuart

-- 
http://stut.net/




Re: [PHP] detecting spam keywords with stripos

2009-05-29 Thread Bastien Koert
On Fri, May 29, 2009 at 10:02 AM, Merlin Morgenstern
merli...@fastmail.fm wrote:



> Per Jessen wrote:
>
>> Merlin Morgenstern wrote:
>>
>>> Hi there,
>>>
>>> I am matching text against an array of keywords to detect spam.
>>> Unfortunately there are some false positives due to the fact that
>>> stripos also finds the keyword inside a word.
>>> E.G. Bewerbung - Werbung
>>>
>>> First thought: use strpos, but this does not help in all cases
>>> Second thought: split text into words and use in_array, but this does
>>> not find things like zu Hause or flexible/Arbeit
>>
>> First thought - use Spamassassin.
>> Second thought - use regexes.
>>
>> /Per
>
> Sorry, this is a different scenario. I do need to do it this way in my case.
> It is about spam inside user postings.
>
> Any ideas?


Regex is your best bet, but nothing will be foolproof. Case in point (shit,
shiite, sh*t, s**t, merde, Scheiße! as/a and so on)




-- 

Bastien

Cat, the other other white meat


Re: [PHP] detecting spam keywords with stripos

2009-05-29 Thread Per Jessen
Stuart wrote:

> I'm sure you could integrate SpamAssassin or similar because at the
> end of the day all those systems expect is a bunch of text.

Exactly.  You can run SA as a daemon (spamd) and feed data to it using
spamc. Works very well. The full ruleset is probably too much, but it's
easy to roll your own too.
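
A rough sketch of how a PHP script might hand a posting to spamc, assuming spamd is already running and spamc is on the PATH. The function names are hypothetical. It relies on spamc's -c (check) option, which prints score/threshold to stdout, or 0/0 if there was an error. Prefixing the text with a blank line, to stand in for an empty header section, is an assumption of this example:

```php
<?php
// Parse the "score/threshold" line printed by `spamc -c`, e.g. "7.3/5.0".
// Returns null if the output does not have that shape.
function parse_spamc_check(string $output): ?array
{
    if (preg_match('~^\s*(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)\s*$~', $output, $m) !== 1) {
        return null;
    }
    return ['score' => (float) $m[1], 'threshold' => (float) $m[2]];
}

// Pipe a user posting through spamc and return the parsed result,
// or null if spamc could not be started or gave unexpected output.
function spamc_check(string $text): ?array
{
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']]; // stdin and stdout
    $proc = proc_open('spamc -c', $spec, $pipes);
    if (!is_resource($proc)) {
        return null;
    }
    // Assumption: a leading blank line marks an empty header section,
    // so the whole posting is treated as message body.
    fwrite($pipes[0], "\n" . $text);
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return parse_spamc_check((string) $output);
}
```

A caller could then compare the returned score with the threshold, or with a limit of its own, before deciding whether a posting needs moderation.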

> If they require the headers of an email you can supply fake ones and
> remove any effect headers have on the score.

SA doesn't require them, and without them scoring would (obviously) be
based on the text only.


/Per

-- 
Per Jessen, Zürich (20.9°C)

