php-general Digest 29 May 2009 18:12:18 -0000 Issue 6148

Topics (messages 293340 through 293355):

detecting spam keywords with stripos
        293340 by: Merlin Morgenstern
        293341 by: Per Jessen
        293343 by: Tom Worster
        293346 by: Merlin Morgenstern
        293347 by: Stuart
        293348 by: Bastien Koert
        293352 by: Per Jessen

Re: Hebrew Directory Names
        293342 by: Tom Worster
        293345 by: Nitsan Bin-Nun

Re: cURL loop?
        293344 by: Daniel Brown

recipes anyone?
        293349 by: PJ
        293350 by: Richard Heyes
        293351 by: Bob McConnell

Numerical Recipe - Scheduling Question
        293353 by: bruce
        293354 by: kyle.smith
        293355 by: Stuart

Administrivia:

To subscribe to the digest, e-mail:
        php-general-digest-subscr...@lists.php.net

To unsubscribe from the digest, e-mail:
        php-general-digest-unsubscr...@lists.php.net

To post to the list, e-mail:
        php-gene...@lists.php.net


----------------------------------------------------------------------
--- Begin Message ---
Hi there,

I am matching text against an array of keywords to detect spam. Unfortunately there are some false positives because stripos also finds the keyword inside a word.
E.G. "Bewerbung" -> "Werbung"

First thought: use strpos, but this does not help in all cases
Second thought: split text into words and use in_array, but this does not find things like "zu Hause" or "flexible/Arbeit"

Does somebody have an idea on how to make my function better in terms of not detecting the string inside a word? Here is the code:

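// db_get_row() and db_numrows() are my own database helpers
$keyword = array();
$weight  = array();
while ($row = db_get_row($result)) {
        $keyword[] = $row->keyword;
        $weight[]  = $row->weight;
}
$num_results = db_numrows($result);

$spam_level = 0;
$triggered_keywords = '';
for ($i = 0; $i < $num_results; $i++) {
        $findme = $keyword[$i];
        // stripos() matches anywhere, including inside a longer word
        $pos  = stripos($data['txt'], $findme);
        $pos2 = stripos($data['title'], $findme);
        if ($pos !== false || $pos2 !== false) { // spam!
                $spam_level += $weight[$i];
                $triggered_keywords .= $keyword[$i] . ', ';
        }
}
$spam['score'] += $spam_level;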

Thank you for any help!

Merlin

--- End Message ---
--- Begin Message ---
Merlin Morgenstern wrote:

> Hi there,
> 
> I am matching text against an array of keywords to detect spam.
> Unfortunately there are some false positives due to the fact that
> stripos also finds the keyword inside a word.
> E.G. "Bewerbung" -> "Werbung"
> 
> First thought: use strpos, but this does not help in all cases
> Second thought: split text into words and use in_array, but this does
> not find things like "zu Hause" or "flexible/Arbeit"

First thought - use Spamassassin.
Second thought - use regexes.

/Per

-- 
Per Jessen, Zürich (17.1°C)


--- End Message ---
--- Begin Message ---
On 5/29/09 5:36 AM, "Merlin Morgenstern" <merli...@fastmail.fm> wrote:

> Does somebody have an idea on how to make my function better in terms of
> not detecting the string inside a word?

i agree with per. learn pcre: http://us.php.net/manual/en/book.pcre.php
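
A minimal sketch of what the regex approach could look like, assuming PHP 7 or later and UTF-8 input. The function name and the sample strings are made up for illustration. The idea is to accept a keyword only when no letter or digit touches it on either side, which rules out "Werbung" inside "Bewerbung" but still finds phrases such as "zu Hause" or "flexible/Arbeit".

```php
<?php
// Sketch only: contains_keyword() and the sample texts are illustration data.

/**
 * Return true if $keyword occurs in $text as a whole word or phrase,
 * i.e. not embedded inside a longer word.
 */
function contains_keyword(string $text, string $keyword): bool
{
    // preg_quote() escapes regex metacharacters in the keyword, including
    // the "/" delimiter. The lookarounds require that no letter or digit
    // touches the match; /i ignores case and /u treats the input as UTF-8.
    $pattern = '/(?<![\p{L}\p{N}])' . preg_quote($keyword, '/') . '(?![\p{L}\p{N}])/iu';
    return preg_match($pattern, $text) === 1;
}

var_dump(contains_keyword('Meine Bewerbung liegt bei', 'Werbung'));        // bool(false)
var_dump(contains_keyword('Hier ist Werbung!', 'werbung'));                // bool(true)
var_dump(contains_keyword('Arbeiten von zu Hause', 'zu Hause'));           // bool(true)
var_dump(contains_keyword('flexible/Arbeit gesucht', 'flexible/Arbeit'));  // bool(true)
```

In the original loop this would replace the two stripos() calls. It does nothing against deliberate obfuscation, so false negatives remain.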

as for successfully filtering spam by keyword matching: good luck!



--- End Message ---
--- Begin Message ---


Per Jessen wrote:
> Merlin Morgenstern wrote:
>
>> Hi there,
>>
>> I am matching text against an array of keywords to detect spam.
>> Unfortunately there are some false positives due to the fact that
>> stripos also finds the keyword inside a word.
>> E.G. "Bewerbung" -> "Werbung"
>>
>> First thought: use strpos, but this does not help in all cases
>> Second thought: split text into words and use in_array, but this does
>> not find things like "zu Hause" or "flexible/Arbeit"
>
> First thought - use Spamassassin.
> Second thought - use regexes.
>
> /Per



Sorry, this is a different scenario. I do need to do it this way in my case. It is about spam inside user postings.

Any ideas?

--- End Message ---
--- Begin Message ---
2009/5/29 Merlin Morgenstern <merli...@fastmail.fm>:
>
>
> Per Jessen wrote:
>>
>> Merlin Morgenstern wrote:
>>
>>> Hi there,
>>>
>>> I am matching text against an array of keywords to detect spam.
>>> Unfortunately there are some false positives due to the fact that
>>> stripos also finds the keyword inside a word.
>>> E.G. "Bewerbung" -> "Werbung"
>>>
>>> First thought: use strpos, but this does not help in all cases
>>> Second thought: split text into words and use in_array, but this does
>>> not find things like "zu Hause" or "flexible/Arbeit"
>>
>> First thought - use Spamassassin.
>> Second thought - use regexes.
>>
>> /Per
>>
>
>
> Sorry, this is a different scenario. I do need to do it this way in my case.
> It is about spam inside user postings.
>
> Any ideas?

I've had to solve this problem before and the conclusion I came to is
that when doing this kind of simple matching you either accept false
positives or false negatives. Alternatives include implementing
Bayesian filtering or some other algorithm that's more complex than
simple matching or use a pre-existing solution.

I'm sure you could integrate SpamAssassin or similar because at the
end of the day all those systems expect is a bunch of text. If they
require the headers of an email you can supply fake ones and remove
any effect headers have on the score. Whether that's worth it depends
on the volume you're talking about and how much manual moderation checks
you want to have to do.

-Stuart

-- 
http://stut.net/

--- End Message ---
--- Begin Message ---
On Fri, May 29, 2009 at 10:02 AM, Merlin Morgenstern
<merli...@fastmail.fm>wrote:

>
>
> Per Jessen wrote:
>
>> Merlin Morgenstern wrote:
>>
>>  Hi there,
>>>
>>> I am matching text against an array of keywords to detect spam.
>>> Unfortunately there are some false positives due to the fact that
>>> stripos also finds the keyword inside a word.
>>> E.G. "Bewerbung" -> "Werbung"
>>>
>>> First thought: use strpos, but this does not help in all cases
>>> Second thought: split text into words and use in_array, but this does
>>> not find things like "zu Hause" or "flexible/Arbeit"
>>>
>>
>> First thought - use Spamassassin.
>> Second thought - use regexes.
>>
>> /Per
>>
>>
>
> Sorry, this is a different scenario. I do need to do it this way in my case.
> It is about spam inside user postings.
>
> Any ideas?
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>
Regex is your best bet, but nothing will be foolproof. Case in point (shit,
shiite, sh*t, s**t, merde, Scheiße! <a>s</a> and so on)




-- 

Bastien

Cat, the other other white meat

--- End Message ---
--- Begin Message ---
Stuart wrote:

> I'm sure you could integrate SpamAssassin or similar because at the
> end of the day all those systems expect is a bunch of text. 

Exactly.  You can run SA as a daemon (spamd) and feed data to it using
spamc. Works very well. The full ruleset is probably too much, but it's
easy to "roll your own" too.

> If they require the headers of an email you can supply fake ones and
> remove any effect headers have on the score. 

SA doesn't require them, and without them scoring would (obviously) be
based on the text only.
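
A rough sketch of feeding a posting to spamc from PHP, assuming spamd is running and spamc is on the PATH. As far as I read the spamc man page, "-c" prints a single "score/threshold" line (and "0/0" on error); treat that as an assumption to verify. The function name and the $command parameter are made up for illustration.

```php
<?php
// Sketch only: assumes a running spamd and a spamc client on the PATH.

/**
 * Pipe $text to a checker command and return [score, threshold],
 * or null if the command could not be started or printed something else.
 */
function spam_score(string $text, string $command = 'spamc -c'): ?array
{
    $pipes = [];
    $spec = [
        0 => ['pipe', 'r'], // child's stdin: the message is written here
        1 => ['pipe', 'w'], // child's stdout: the verdict is read here
    ];
    $proc = proc_open($command, $spec, $pipes);
    if (!is_resource($proc)) {
        return null;
    }
    fwrite($pipes[0], $text);
    fclose($pipes[0]); // signal end of input
    $out = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);

    // Expect something like "7.5/5.0" at the start of the output.
    if (preg_match('~^\s*(-?[\d.]+)/(-?[\d.]+)~', (string) $out, $m) !== 1) {
        return null;
    }
    return [(float) $m[1], (float) $m[2]];
}
```

A caller would treat the posting as spam when the first value reaches the second, and fall back to manual moderation when the function returns null.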


/Per

-- 
Per Jessen, Zürich (20.9°C)


--- End Message ---
--- Begin Message ---
On 5/28/09 2:06 PM, "Nitsan Bin-Nun" <nit...@binnun.co.il> wrote:

> preg_replace("/([\xE0-\xFA])/e","chr(215).chr(ord(\${1})-80)",$s);

...

> The preg_replace() above convert the Hebrew chars into UTF8.

that preg_replace takes a byte string $s and:

- leaves bytes with value 0-127 intact
- converts bytes with value 224-250 to the utf8 multibyte character code of
the corresponding win-1255 character
- produces an invalid utf8 output for all other code points, for which see:
http://en.wikipedia.org/wiki/Windows-1255

so it's a win-1255 to utf-8 converter only so long as you are confident that
code points 128-223 and 251-255 are never in the subject string.

try iconv('CP1255', 'UTF-8', $s) instead.
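
A small sketch of the iconv() route, assuming the local iconv build knows the CP1255 charset (glibc and GNU libiconv do, others may not). The wrapper name and the sample bytes are made up for illustration.

```php
<?php
// Sketch only: win1255_to_utf8() is a hypothetical wrapper around iconv().

/** Convert a Windows-1255 byte string to UTF-8, or return null if iconv rejects it. */
function win1255_to_utf8(string $bytes): ?string
{
    // @ silences the notice iconv() raises on bytes it cannot convert;
    // the false return value is checked instead.
    $utf8 = @iconv('CP1255', 'UTF-8', $bytes);
    return $utf8 === false ? null : $utf8;
}

// "shalom" in Windows-1255: shin, lamed, vav, final mem.
$s = "\xF9\xEC\xE5\xED";
$out = win1255_to_utf8($s);
echo $out === null ? "conversion failed\n" : bin2hex($out) . "\n"; // d7a9d79cd795d79d
```

Unlike the preg_replace() above, this also handles the code points outside 224-250 instead of emitting invalid UTF-8 for them.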





--- End Message ---
--- Begin Message ---
Your point is right, but these code points do not exist in the subject
string, so this isn't the issue here.

I'm really stuck at this one :S

Thank you again for trying to help!

On Fri, May 29, 2009 at 2:40 PM, Tom Worster <f...@thefsb.org> wrote:

> On 5/28/09 2:06 PM, "Nitsan Bin-Nun" <nit...@binnun.co.il> wrote:
>
> > preg_replace("/([\xE0-\xFA])/e","chr(215).chr(ord(\${1})-80)",$s);
>
> ...
>
> > The preg_replace() above convert the Hebrew chars into UTF8.
>
> that preg_replace takes a byte string $s and:
>
> - leaves bytes with value 0-127 intact
> - converts bytes with value 224-250 to the utf8 multibyte character code of
> the corresponding win-1255 character
> - produces an invalid utf8 output for all other code points, for which see:
> http://en.wikipedia.org/wiki/Windows-1255
>
> so it's a win-1255 to utf-8 converter only so long as you are confident
> that
> code points 128-223 and 251-255 are never in the subject string.
>
> try iconv('CP1255', 'UTF-8', $s) instead.
>
>
>
>
>

--- End Message ---
--- Begin Message ---
On Thu, May 28, 2009 at 23:31, espontaneo <acohln...@gmail.com> wrote:
>
> Hello! I am currently working on a script that will scrape data from a
> property advertising web page. The web page has multiple pages. What I'm
> getting is only the first page. What I wanted to do is to use curl to scrape
> all the data from that page. I just learned php so I don't know how I can do
> this.

    There are a variety of ways, but rather than writing your own
spider script, you may want to look into the built-in spidering
capabilities of `wget` and `curl` from the command line.  Both have
Windows and *NIX builds, so platform isn't an issue; so long as you
have access to the shell, you should be fine.
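
    If you would rather stay inside PHP, a loop over page numbers with the cURL extension might look like the sketch below. The base URL and the "page" query parameter are hypothetical; the real site may number its pages differently.

```php
<?php
// Sketch only: the base URL and the "page" query parameter are hypothetical.

/** Build the list of page URLs to fetch. */
function page_urls(string $base, int $first, int $last): array
{
    $urls = [];
    for ($page = $first; $page <= $last; $page++) {
        $urls[] = $base . '?' . http_build_query(['page' => $page]);
    }
    return $urls;
}

/** Fetch one URL with the cURL extension; returns the body or null on failure. */
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

/** Fetch every page in turn, pausing between requests to go easy on the server. */
function scrape_all(string $base, int $first, int $last): array
{
    $pages = [];
    foreach (page_urls($base, $first, $last) as $url) {
        $pages[$url] = fetch_page($url);
        sleep(1);
    }
    return $pages;
}
```

    Something like scrape_all('http://example.com/listings', 1, 5) would then return the five page bodies keyed by URL, ready for parsing.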

-- 
</Daniel P. Brown>
daniel.br...@parasane.net || danbr...@php.net
http://www.parasane.net/ || http://www.pilotpig.net/
50% Off All Shared Hosting Plans at PilotPig: Use Coupon DOW10000

--- End Message ---
--- Begin Message ---
I'd like to get some input on how to deal with recipes.
use html pages to store and display, XML or db or... ? And what about
clips, like flvs ? TIA.

-- 
Hervé Kempf: "Pour sauver la planète, sortez du capitalisme."
-------------------------------------------------------------
Phil Jourdan --- p...@ptahhotep.com
   http://www.ptahhotep.com
   http://www.chiccantine.com/andypantry.php


--- End Message ---
--- Begin Message ---
Hi,

> I'd like to get some input on how to deal with recipes.
> use html pages to store and display, XML or db or... ? And what about
> clips, like flvs ? TIA.

Actual recipes? As in a pork roast? I would put them on the file
system in .html files. You could use a PHP file to serve them, and
have a URL like this:

http://www.pig-supper.com/recipe/pork-roast.html

"recipe" could be a PHP file that adds a common header and footer. I
do similar with my site. Eg:

http://www.phpguru.org/static/canvas.html
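
A rough sketch of what such a "recipe" script could do, assuming the web server passes the rest of the URL as PATH_INFO and that the recipes sit next to a header.html and footer.html in one directory. All names and paths here are hypothetical.

```php
<?php
// Sketch only: the directory layout, header.html and footer.html are hypothetical.

/**
 * Map a request path such as "/pork-roast.html" to a file inside $dir,
 * or return null if the name is not a plain slug (this blocks "../" tricks).
 */
function recipe_file(string $dir, string $pathInfo): ?string
{
    if (preg_match('~^/([a-z0-9-]+)\.html$~', $pathInfo, $m) !== 1) {
        return null;
    }
    return rtrim($dir, '/') . '/' . $m[1] . '.html';
}

/** Print header, recipe body and footer, or a 404 message if the recipe is missing. */
function serve_recipe(string $dir, string $pathInfo): void
{
    $file = recipe_file($dir, $pathInfo);
    if ($file === null || !is_file($file)) {
        http_response_code(404);
        echo 'Recipe not found';
        return;
    }
    readfile(rtrim($dir, '/') . '/header.html');
    readfile($file);
    readfile(rtrim($dir, '/') . '/footer.html');
}
```

The script itself would then only need a call like serve_recipe(__DIR__ . '/recipes', $_SERVER['PATH_INFO'] ?? '').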

Or did you mean something else entirely...?

-- 
Richard Heyes
HTML5 graphing: RGraph (www.rgraph.net - updated 23rd May)
PHP mail: RMail (www.phpguru.org/rmail)
PHP datagrid: RGrid (www.phpguru.org/rgrid)
PHP Template: RTemplate (www.phpguru.org/rtemplate)
PHP SMTP: http://www.phpguru.org/smtp

--- End Message ---
--- Begin Message ---
From: PJ
> 
> I'd like to get some input on how to deal with recipes.
> use html pages to store and display, XML or db or... ? And what about
> clips, like flvs ? TIA.
> 

There are as many ways to do cookbooks as there are cooks. I am familiar
with half a dozen, without counting the professional packages put out by
another department here where I work.

RecipeML is one option, but it is seriously incomplete if you need to
include nutritional information.

Qookbooks, Gourmet (GNOME), Krecipes (KDE), MealMaster, Master Cook,
Recipants, etc. all have different storage formats and display formats.
Some are well documented, some are buried in the code, and some are
still kept secret. You can take your pick, or combine them and roll your
own.

A bigger issue is how to import existing recipe files. I have several
years of messages collected from newsgroups like rec.food.recipes,
r.f.cooking, r.f.baking, etc. that I would like to put into a usable,
and searchable format. But there are too many variations in the formats
and naming conventions used to be able to write a single routine to
handle them all. It is much easier just to use those already published
in MealMaster formats. At least that one is documented clearly now that
they are out of business.

Bob McConnell

--- End Message ---
--- Begin Message ---
Hi..

Got a need to be able to allow a user to specify the frequency to run
certain apps/processes.. I need to be able to have the user specify a start
time, as well as a periodic frequency (once, hourly, daily, weekly...) as
well as allow the user to specify every XX minutes...

So I basically need to be able to determine when the future
events/occurrences are, based on the user input.

I've searched the net for algorithms dealing with scheduling and haven't
come up with any PHP-based solutions.. I've also looked at numerical recipes
and some other sources (freshmeat/sourceforge/etc..) with no luck..

I have found an approach in another language that I could port to PHP.. But
before I code/recreate this, I figured I'd see if anyone here has pointers
or suggestions...

Cron doesn't work for me, as it can run a process at a given time.. but it
doesn't tell me when the next 'X' occurrence would be...

Thoughts/Comments..

Thanks


--- End Message ---
--- Begin Message ---
I'm confused as to why cron doesn't work for you.  It doesn't explicitly
tell you when the next X occurrences will be, but math does.  If you
schedule something to run every 5 minutes starting at 1:45 PM, it's
simple work to be able to report that the next times would be 1:50 PM,
1:55 PM, 2:00 PM etc.
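
The arithmetic can be sketched as follows, assuming PHP 7 or later, Unix timestamps and a fixed interval in seconds. The function name and the sample times are made up for illustration.

```php
<?php
// Sketch only: next_occurrences() and the sample times are illustration data.

/**
 * Return the next $count run times (Unix timestamps) strictly after $now
 * for a job that first runs at $start and repeats every $interval seconds.
 */
function next_occurrences(int $start, int $interval, int $now, int $count): array
{
    if ($interval <= 0 || $count <= 0) {
        return [];
    }
    if ($now < $start) {
        $next = $start; // the first run is still ahead
    } else {
        // Whole intervals already elapsed since $start, plus one.
        $next = $start + (intdiv($now - $start, $interval) + 1) * $interval;
    }
    $times = [];
    for ($i = 0; $i < $count; $i++) {
        $times[] = $next + $i * $interval;
    }
    return $times;
}

$start = gmmktime(13, 45, 0, 5, 29, 2009); // first run at 1:45 PM
$now   = gmmktime(13, 47, 0, 5, 29, 2009); // "now" is 1:47 PM
foreach (next_occurrences($start, 300, $now, 3) as $t) {
    echo gmdate('g:i A', $t), "\n"; // 1:50 PM, 1:55 PM, 2:00 PM
}
```

Hourly, daily and weekly schedules fit the same function with intervals of 3600, 86400 and 604800 seconds, as long as daylight saving shifts do not matter.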

Is this running in a web browser, somehow?  If not, PHP is not the
solution.

HTH,
Kyle

-----Original Message-----
From: bruce [mailto:bedoug...@earthlink.net] 
Sent: Friday, May 29, 2009 1:11 PM
To: php-gene...@lists.php.net
Subject: [PHP] Numerical Recipe - Scheduling Question

Hi..

Got a need to be able to allow a user to specify the frequency to run
certain apps/processes.. I need to be able to have the user specify a
start time, as well as a periodic frequency (once, hourly, daily,
weekly...) as well as allow the user to specify every XX minutes...

So I basically need to be able to determine when the future
events/occurrences are, based on the user input.

I've searched the net for algorithms dealing with scheduling and
haven't come up with any PHP-based solutions.. I've also looked at
numerical recipes and some other sources (freshmeat/sourceforge/etc..)
with no luck..

I have found an approach in another language that I could port to PHP..
But before I code/recreate this, I figured I'd see if anyone here has
pointers or suggestions...

Cron doesn't work for me, as it can run a process at a given time.. but
it doesn't tell me when the next 'X' occurrence would be...

Thoughts/Comments..

Thanks



--- End Message ---
--- Begin Message ---
2009/5/29 kyle.smith <kyle.sm...@inforonics.com>:
> I'm confused as to why cron doesn't work for you.  It doesn't explicitly
> tell you when the next X occurrences will be, but math does.  If you
> schedule something to run every 5 minutes starting at 1:45 PM, it's
> simple work to be able to report that the next times would be 1:50 PM,
> 1:55 PM, 2:00 PM etc.

You can be a lot more intelligent than that. I have a job queue system
running on several sites I maintain that uses a simple run_at
timestamp. A cron job runs every minute and essentially does this...

* Locks the job queue.

* Does the equivalent of "select job from job_queue where run_at <=
unix_timestamp() order by run_at asc limit 1".

* If no jobs need running it simply exits otherwise it locks the job
it got back and unlocks the queue.

* Runs the job (wrapped in a safe environment that catches output and
errors and does something useful with them).

* Marks the job as completed or with an error status.

* If the job is marked as recurring it creates a new job by cloning
the job it just ran, sets run_at based upon the schedule definition
(which can be a time of day, a time of day + a day of week, a time of
day + a day of month or simply a number of seconds) and sets the
status to new.

* Either removes the completed job from the queue or archives it
complete with errors and output for later inspection depending on the
job config and status.

* If this processor has been running for > 60 minutes it exits,
otherwise it looks for another job to run.
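
One pass of the steps above can be sketched roughly like this. It is not the actual system: an in-memory array stands in for the shared job_queue table, locking and the 60-minute exit are left out, and only the "number of seconds" schedule is handled. All field names are assumptions.

```php
<?php
// Sketch only: an in-memory array replaces the job_queue table, and the
// field names (run_at, every, task, status) are made up for illustration.

/** Run every job whose run_at is due, oldest first; returns the finished jobs. */
function run_due_jobs(array &$queue, int $now): array
{
    $finished = [];
    while (true) {
        // Equivalent of: where run_at <= now order by run_at asc limit 1
        $pick = null;
        foreach ($queue as $id => $job) {
            if ($job['status'] === 'new' && $job['run_at'] <= $now
                && ($pick === null || $job['run_at'] < $queue[$pick]['run_at'])) {
                $pick = $id;
            }
        }
        if ($pick === null) {
            return $finished; // no jobs need running
        }
        $job = $queue[$pick];
        unset($queue[$pick]); // remove from the queue (archiving would go here)

        // Run the job, capturing its output and any exception.
        ob_start();
        try {
            ($job['task'])();
            $job['status'] = 'completed';
        } catch (Throwable $e) {
            $job['status'] = 'error';
            $job['error'] = $e->getMessage();
        }
        $job['output'] = ob_get_clean();
        $finished[] = $job;

        // Recurring job: clone it with a new run_at and status "new".
        if (!empty($job['every'])) {
            $queue[] = [
                'run_at' => $job['run_at'] + $job['every'],
                'every'  => $job['every'],
                'task'   => $job['task'],
                'status' => 'new',
            ];
        }
    }
}
```

Because the next run_at is derived from the previous one, a job that fell behind catches up within the same pass; basing it on the current time instead would skip missed runs.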

This system will automatically scale up to 60 job processors per hour,
but obviously you can modify the cron config to run more or less as
your requirements dictate. Assuming the job queue is on a shared
resource such as a database this can also scale across machines
effectively infinitely.

There's also a whole bunch of stuff around catching crashed jobs and
doing something useful with them, but I'll leave how to handle those
as an exercise for the reader.

> Is this running in a web browser, somehow?  If not, PHP is not the
> solution.

Total codswallop! PHP is no more tied to web browsers than a
hovercraft is tied to water. My job queue system is 100% PHP (although
it can run jobs not written in PHP, but that's a topic for another
day) and beyond initial development it's never given me any problems.

Hmm, might have to write that lot up as a blog post with some example
code. Sometime...

-Stuart

-- 
http://stut.net/


--- End Message ---
