Re: [PHP] Fast search

2006-05-18 Thread Richard Lynch
On Wed, May 17, 2006 11:37 am, René Fournier wrote:
 Looking for suggestions on the most compute-efficient way to search
 variable-length strings (~200 characters) for the occurrence of one
 of about 100 possible needles. In other words:

 $needles = array ( 1 = Hello Jim , 2 = Yellow Banana , 3 =
 Red Car, ... etc  (100 elements)

 $haystack = Once upon a time there was a programming language that
 everyone loved. Its name was PHP, and the people that worked with it
 also liked Yellow Bananas. One day...;

 Now perhaps I'm blind, but I can't see an obvious string or array
 function that simply allows you to use an array as a needle while
 searching a string haystack. What I'm really looking for is something
 like the reverse of in_array.

 The obvious thing would be to loop through the needles and run
 substr_count on the haystack... But is there a faster way?

//Kind of a hack, I guess, but might be decently fast:
$pattern = implode('|', $needles);
if (preg_match(/($pattern)/, $haystack, $matches)){
  var_dump($matches);
}

-- 
Like Music?
http://l-i-e.com/artists.htm

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Fast search

2006-05-17 Thread Robin Vickery

On 17/05/06, René Fournier [EMAIL PROTECTED] wrote:

Looking for suggestions on the most compute-efficient way to search
variable-length strings (~200 characters) for the occurrence of one
of about 100 possible needles. In other words:






$needles = array ( 1 = Hello Jim , 2 = Yellow Banana , 3 =
Red Car, ... etc  (100 elements)

$haystack = Once upon a time there was a programming language that
everyone loved. Its name was PHP, and the people that worked with it
also liked Yellow Bananas. One day...;



What version of php?

If you're using php5  then you could do:

?php
  $count = 0;
  str_replace($needles, '', $haystack, $count);
  if ($count)  { /* there was a match */ }
?

-robin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Fast search

2006-05-17 Thread Evan Priestley
You can make a pretty effective time/memory tradeoff -- particularly  
if your array of patterns is relatively fixed -- but unless you re- 
implement this in C, it will probably be slower than a brute force  
approach with the str* functions unless you are searching for a  
fairly huge number of needles.


Below, I build a tree of pattern matches, which you navigate branch- 
by-branch by using successive letters of the pattern. So, after the  
tree is built, the index $tree[ 'h' ][ 'a' ][ 't' ] exists and has  
the value `true'. To search the string, you push cursors onto a  
stack, advance them as you advance through the string, and destroy  
them when they hit a dead end.


This has a worst case of O( N^2 ) (on the length of the string),  
which happens when you search for patterns like a  
and b in the string h. The  
average case for random or typical strings should be very close to O 
( N ). Critically, search time is a function only of the length of  
the string, not of the number of patterns. Memory usage is also only  
a function of the size of the character set and the length of the  
longest pattern, not of the total number of patterns.


This particular implementation has some nice side-effects; uncomment  
the var_dump() line and note how 'pirate' and 'firm' have been  
optimized out of the tree in favor of 'pir' and 'fir'.


Evan

?php

$pattern = array(
 'hat'
,'hut'
,'blue dog'
,'blue cat'
,'pirate'
,'pir'
,'fir'
,'firm'
);

$tree = array( );
foreach( $pattern as $str ) {
$len = strlen( $str );
$cursor = $tree;
for( $ii = 0; $ii  $len; $ii++ ) {
if( !isset( $cursor[ $str[ $ii ] ] ) ) {
$cursor[ $str[ $ii ] ] = array( );
}
$cursor = $cursor[ $str[ $ii ] ];
if( $cursor === true ) break;
}
$cursor = true;
}

/*  var_dump 
( $tree );  */


$corpus = 'lorem ipsum dolor sur amet pi blue do he hi h fipir  
slog';

$clen   = strlen( $corpus );

$stack  = array( );
for( $ii = 0; $ii  $clen; $ii++ ) {
$c = $corpus[ $ii ];
foreach( $stack as $k = $_ ) {
if( isset( $stack[ $k ][ $c ] ) ) {
if( $stack[ $k ][ $c ] === true ) {
die( 'Matched at position '.$ii );
}
$stack[ $k ] = $stack[ $k ][ $c ];
}
else {
unset( $stack[ $k ] );
}
}
if( isset( $tree[ $c ] ) ) {
$stack[] = $tree[ $c ];
}
}
die( 'Did not find a match.' );

?



On May 17, 2006, at 12:37 PM, René Fournier wrote:

Looking for suggestions on the most compute-efficient way to search  
variable-length strings (~200 characters) for the occurrence of one  
of about 100 possible needles. In other words:




$needles = array ( 1 = Hello Jim , 2 = Yellow Banana , 3 =  
Red Car, ... etc  (100 elements)


$haystack = Once upon a time there was a programming language that  
everyone loved. Its name was PHP, and the people that worked with  
it also liked Yellow Bananas. One day...;




Now perhaps I'm blind, but I can't see an obvious string or array  
function that simply allows you to use an array as a needle while  
searching a string haystack. What I'm really looking for is  
something like the reverse of in_array.


The obvious thing would be to loop through the needles and run  
substr_count on the haystack... But is there a faster way?


...Rene

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Fast search

2006-05-17 Thread Robert Cummings
On Wed, 2006-05-17 at 16:38, Robin Vickery wrote:
 On 17/05/06, René Fournier [EMAIL PROTECTED] wrote:
  Looking for suggestions on the most compute-efficient way to search
  variable-length strings (~200 characters) for the occurrence of one
  of about 100 possible needles. In other words:
 
 
 
 
  $needles = array ( 1 = Hello Jim , 2 = Yellow Banana , 3 =
  Red Car, ... etc  (100 elements)
 
  $haystack = Once upon a time there was a programming language that
  everyone loved. Its name was PHP, and the people that worked with it
  also liked Yellow Bananas. One day...;
 
 
 What version of php?
 
 If you're using php5  then you could do:
 
 ?php
$count = 0;
str_replace($needles, '', $haystack, $count);
if ($count)  { /* there was a match */ }
 ?

Surely you meant strpos( $haystack, $needle ) ???

Incidentally do you need to match on words beginnings or endings? For
instance should 'put' match all of the following:

I put the apple core in the compost.
Someone always puts the fire out.
The car sputtered and died.

Cheers,
Rob.
-- 
..
| InterJinn Application Framework - http://www.interjinn.com |
::
| An application and templating framework for PHP. Boasting  |
| a powerful, scalable system for accessing system services  |
| such as forms, properties, sessions, and caches. InterJinn |
| also provides an extremely flexible architecture for   |
| creating re-usable components quickly and easily.  |
`'

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php