[PHP] looking for a PHP text indexer

2012-06-11 Thread Mihamina Rakotomandimby

Hi all,

I have a small job ad website, where some posters flood the site with the 
same ad, just to stay at the top of the most-recent sort.


To get around the strict duplicate detection (yes, it's weak), they add 
one or two words so that the text differs.


The result is many near-duplicate ads.

I would like to find duplicates by looking for ads that share 80%-90% 
of their words and treating them as the same ad, so that I can group them.
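
One way to read "80%-90% same words" is as the overlap between the sets of 
distinct words in two ads (the Jaccard index). The following is only a minimal 
sketch under that assumption; the helper names adWords() and wordOverlap() are 
hypothetical, and the 0.8 threshold is simply the lower bound mentioned above.

```php
<?php
// Hypothetical helper: split an ad into its distinct lowercase words.
// Assumption: words are runs of letters and digits; everything else separates them.
function adWords(string $text): array
{
    // strtolower() only folds ASCII letters; that is enough for a sketch.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_unique($words ?: []);
}

// Hypothetical helper: share of distinct words two ads have in common
// (Jaccard index), from 0.0 (nothing shared) to 1.0 (same word set).
function wordOverlap(string $a, string $b): float
{
    $wordsA = adWords($a);
    $wordsB = adWords($b);
    $union  = count(array_unique(array_merge($wordsA, $wordsB)));
    if ($union === 0) {
        return 0.0; // both texts are empty
    }
    return count(array_intersect($wordsA, $wordsB)) / $union;
}

// Made-up sample ads, differing by one added word.
$ad1 = 'Senior PHP developer wanted in Antananarivo, full time';
$ad2 = 'Senior PHP developer wanted in Antananarivo, full time, urgent';

// Treat the pair as duplicates when at least 80% of the words are shared.
$isDuplicate = wordOverlap($ad1, $ad2) >= 0.8;
```

Comparing every ad against every other ad this way is quadratic in the number 
of ads, which may be acceptable for a one-off cleanup of a small site.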


Of course, adding a rate-limiting mechanism or even moderation is 
planned, but I want to process the existing ads first.


I don't want to use MySQL for indexing; I believe text indexers are the 
best tools for this. Am I wrong?


What would you suggest for processing the ads and looking up duplicates 
in this situation?


--
RMA.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] looking for a PHP text indexer

2012-06-11 Thread ma...@behnke.biz


Mihamina Rakotomandimby miham...@rktmb.org wrote on 11 June 2012 at 11:12:

 Hi all,

 I have a small job ad website, where some posters flood the site with the
 same ad, just to stay at the top of the most-recent sort.

 To get around the strict duplicate detection (yes, it's weak), they add
 one or two words so that the text differs.

 The result is many near-duplicate ads.

 I would like to find duplicates by looking for ads that share 80%-90%
 of their words and treating them as the same ad, so that I can group them.

 Of course, adding a rate-limiting mechanism or even moderation is
 planned, but I want to process the existing ads first.

 I don't want to use MySQL for indexing; I believe text indexers are the
 best tools for this. Am I wrong?

 What would you suggest for processing the ads and looking up duplicates
 in this situation?

Maybe take a look at

http://de.php.net/manual/de/function.similar-text.php
http://de.php.net/manual/de/function.levenshtein.php
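
As a rough illustration of how those two functions could be applied to a pair 
of ads, here is a minimal sketch. The function name looksLikeDuplicate(), the 
sample texts and the 80% threshold are assumptions made for the example, not 
something prescribed by the manual.

```php
<?php
// Hypothetical helper: flag two ads as duplicates when similar_text()
// reports at least $threshold percent similarity.
function looksLikeDuplicate(string $a, string $b, float $threshold = 80.0): bool
{
    // similar_text() returns the number of matching characters and
    // writes the similarity percentage into its third argument.
    similar_text($a, $b, $percent);
    return $percent >= $threshold;
}

// Made-up sample ads: the second only appends a word.
$a = 'PHP developer wanted, full time';
$b = 'PHP developer wanted, full time, urgent';

$duplicate = looksLikeDuplicate($a, $b);

// levenshtein() returns the number of single-character insertions,
// deletions and substitutions needed to turn $a into $b.
$distance = levenshtein($a, $b);
```

Both functions compare characters, not words, and similar_text() can get slow 
on long texts. It is also worth checking the manual for the PHP version in 
use: as far as I know, levenshtein() in PHP releases before 8.0 returns -1 
when either string is longer than 255 characters, which a full ad text would 
easily exceed.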




Marco Behnke
Dipl. Informatiker (FH), SAE Audio Engineer Diploma
Zend Certified Engineer PHP 5.3

Tel.: 0174 / 9722336
e-Mail: ma...@behnke.biz

Softwaretechnik Behnke
Heinrich-Heine-Str. 7D
21218 Seevetal

http://www.behnke.biz
