Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-20 Thread Richard Quadling
On 19 July 2010 19:46, tedd tedd.sperl...@gmail.com wrote:
 At 12:39 PM +0100 7/19/10, Richard Quadling wrote:

 I'm using MS SQL, not mySQL.

 Found a extended stored procedure with a UDF.

 Testing it looks excellent.

 Searching for a match on 30,000 vehicles next to no additional time -
 a few seconds in total, compared to the over 3 minutes to search using
 SQL code.

 That seems a bit slow.

 For example, currently I'm searching over 4,000 records (which contains
 4,000 paragraphs taken from the text of the King James version of the Bible)
 for matching words, such as %created% and the times are typically around
 0.009 seconds.

 As such, searching ten times that amount should be in the range of tenths of
 a second and not seconds -- so taking a few seconds to search 30,000 records
 seems excessive to me.


Tedd,

I'm not looking for a word. I'm looking for similar wrds.

Word is closer to the misspelled wrds that it is to wars.

select dbo.DamerauLevenshteinDistance('words', 'wars'),
dbo.DamerauLevenshteinDistance('words', 'wrds')

(No column name)(No column name)
2   1

Lower is better.

Also, I have to compare every row in the set and then sort it to find
the lowest values for the Damerau-Levenshtein or the highest for the
Jaro–Winkler distance.

As the value entered is always going to be the unknown, I can't
pre-calculate the distances.

I do an exact match test first.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-19 Thread tedd

At 12:39 PM +0100 7/19/10, Richard Quadling wrote:


I'm using MS SQL, not mySQL.

Found a extended stored procedure with a UDF.

Testing it looks excellent.

Searching for a match on 30,000 vehicles next to no additional time -
a few seconds in total, compared to the over 3 minutes to search using
SQL code.


That seems a bit slow.

For example, currently I'm searching over 4,000 records (which 
contains 4,000 paragraphs taken from the text of the King James 
version of the Bible) for matching words, such as %created% and the 
times are typically around 0.009 seconds.


As such, searching ten times that amount should be in the range of 
tenths of a second and not seconds -- so taking a few seconds to 
search 30,000 records seems excessive to me.


Cheers,

tedd
--
---
http://sperling.com  http://ancientstones.com  http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-19 Thread Andrew Ballard
On Mon, Jul 19, 2010 at 2:46 PM, tedd tedd.sperl...@gmail.com wrote:
 At 12:39 PM +0100 7/19/10, Richard Quadling wrote:

 I'm using MS SQL, not mySQL.

 Found a extended stored procedure with a UDF.

 Testing it looks excellent.

 Searching for a match on 30,000 vehicles next to no additional time -
 a few seconds in total, compared to the over 3 minutes to search using
 SQL code.

 That seems a bit slow.

 For example, currently I'm searching over 4,000 records (which contains
 4,000 paragraphs taken from the text of the King James version of the Bible)
 for matching words, such as %created% and the times are typically around
 0.009 seconds.

 As such, searching ten times that amount should be in the range of tenths of
 a second and not seconds -- so taking a few seconds to search 30,000 records
 seems excessive to me.

 Cheers,

 tedd

I would be surprised if a Levenshtein or similar_text comparison in a
database were NOT slower than even a wildcard search because of the
calculations that have to be performed on each row in the column being
compared. That, and the fact that user-defined functions in SQL Server
often have a performance penalty of their own.

Just for kicks, you could try loading the values in that column into
an array in PHP and then time iterating the array to calculate the
Levenshtein distances for each value to see how it compares.

Andrew

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-17 Thread tedd

At 12:09 PM -0400 7/15/10, Daniel P. Brown wrote:

On Thu, Jul 15, 2010 at 11:54, Richard Quadling rquadl...@gmail.com wrote:


 I'm looking for is a way to compare what they've entered against a
 known list and to provide my 10 best guesses.


Look into the following functions and families:

levenshtein()
similar_text()


What would one use to have MySQL do the heavy lifting here?

In other words, are there similar MySQL functions?

Cheers,

tedd

PS: I thought of this Friday, but finally got my fingers to work today.

--
---
http://sperling.com  http://ancientstones.com  http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-17 Thread shiplu
There is an algorithm called longest common sub sequence.
If you can find the longest common sub sequence of the strings of
database for the given string and sort it, you'll get the most matched
word.
But I think this algo is developed already and available in your
context. It's name can be different.
As a developer I am not sure actually what function in php or mysql
serves the purpose.

Shiplu Mokadd.im
My talks, http://talk.cmyweb.net
Follow me, http://twitter.com/shiplu
SUST Programmers, http://groups.google.com/group/p2psust
Innovation distinguishes bet ... ... (ask Steve Jobs the rest)

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-16 Thread Richard Quadling
On 15 July 2010 17:09, Daniel P. Brown daniel.br...@parasane.net wrote:
    Look into the following functions and families:

        levenshtein()
        similar_text()


Having just found a levenshtein() UDF for MS SQL [1]

I'm very impressed.

Thank you for the suggestion.

Regards,

Richard.

[1] http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540whichpage=2#425160

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-16 Thread Daniel P. Brown
On Fri, Jul 16, 2010 at 07:09, Richard Quadling rquadl...@gmail.com wrote:

 Having just found a levenshtein() UDF for MS SQL [1]

 I'm very impressed.

 Thank you for the suggestion.

 Regards,

 Richard.

 [1] http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540whichpage=2#425160

Dear dearest sir;

Thank you for your formal reply.  Your email is important to me.
Please contact me at your convenience to discuss this reply to your
reply.

Sincerely,

Bozo.



-- 
/Daniel P. Brown
UNADVERTISED DEDICATED SERVER SPECIALS
SAME-DAY SETUP
Just ask me what we're offering today!
daniel.br...@parasane.net || danbr...@php.net
http://www.parasane.net/ || http://www.pilotpig.net/

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-16 Thread Richard Quadling
On 16 July 2010 13:47, Daniel P. Brown daniel.br...@parasane.net wrote:
 On Fri, Jul 16, 2010 at 07:09, Richard Quadling rquadl...@gmail.com wrote:

 Having just found a levenshtein() UDF for MS SQL [1]

 I'm very impressed.

 Thank you for the suggestion.

 Regards,

 Richard.

 [1] http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540whichpage=2#425160

    Dear dearest sir;

    Thank you for your formal reply.  Your email is important to me.
 Please contact me at your convenience to discuss this reply to your
 reply.

    Sincerely,

    Bozo.



 --
 /Daniel P. Brown
 UNADVERTISED DEDICATED SERVER SPECIALS
 SAME-DAY SETUP
 Just ask me what we're offering today!
 daniel.br...@parasane.net || danbr...@php.net
 http://www.parasane.net/ || http://www.pilotpig.net/


Oh. It must be Friday!

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-15 Thread Richard Quadling
Hi.

It seems that users cannot enter a vehicle registration 100% accurately.

We have recently released a small mobile web app which allows service
engineers/inspectors to enter a vehicle registration number and a pin
number to get service history for the vehicle.

We are getting around a 40% fail rate on the registrations for the
first time of entry. This drops to around a 1.5% error rate on the
second attempt.

Most of the time it is simply a case of 2 letters/numbers being
swapped. Sometimes a letter/number is entered for a number/letter.

0/O
1/I/l
2/Z
3/E
4/A
5/S
6/G
7/T
8/B
9/q

Some of the registrations are private and don't obey any format (NOTE:
Trailer registrations aren't the same as vehicle registrations - they
can be anything the owner wants. In some cases we have them as
straight numbers - 1, 2, 3, etc.).

I'm looking for is a way to compare what they've entered against a
known list and to provide my 10 best guesses.

What I'm stuck on is what criteria do I use.

I think something like the old style colour Mastermind game (right
colour in the right place, right colour wrong place, wrong colour).
But that's going to be slow. One of the contracts has over 30,000
vehicles/trailers available to them.

Any suggestions really.

Regards,

Richard.

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-15 Thread Ashley Sheridan
On Thu, 2010-07-15 at 16:54 +0100, Richard Quadling wrote:

 Hi.
 
 It seems that users cannot enter a vehicle registration 100% accurately.
 
 We have recently released a small mobile web app which allows service
 engineers/inspectors to enter a vehicle registration number and a pin
 number to get service history for the vehicle.
 
 We are getting around a 40% fail rate on the registrations for the
 first time of entry. This drops to around a 1.5% error rate on the
 second attempt.
 
 Most of the time it is simply a case of 2 letters/numbers being
 swapped. Sometimes a letter/number is entered for a number/letter.
 
 0/O
 1/I/l
 2/Z
 3/E
 4/A
 5/S
 6/G
 7/T
 8/B
 9/q
 
 Some of the registrations are private and don't obey any format (NOTE:
 Trailer registrations aren't the same as vehicle registrations - they
 can be anything the owner wants. In some cases we have them as
 straight numbers - 1, 2, 3, etc.).
 
 I'm looking for is a way to compare what they've entered against a
 known list and to provide my 10 best guesses.
 
 What I'm stuck on is what criteria do I use.
 
 I think something like the old style colour Mastermind game (right
 colour in the right place, right colour wrong place, wrong colour).
 But that's going to be slow. One of the contracts has over 30,000
 vehicles/trailers available to them.
 
 Any suggestions really.
 
 Regards,
 
 Richard.
 


I'm not sure how well it would work for things like a registration
number, but metaphone keys are pretty good for detecting similar words
based on phonetic sounds. If you don't get a match first time, maybe
convert all the digits in the reg no. to letters, and create a metaphone
key that you compare against the list of other keys in your DB that you
already created in the same manner (you'll need to create these first
time)

Like I said, I'm not sure how well it will work, but it might possibly
reduce the failure rate a bit.

Also, there's the electric shock treatment. Find out which engineers are
the worst typists, and... well you get the idea!

Thanks,
Ash
http://www.ashleysheridan.co.uk




Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.

2010-07-15 Thread Daniel P. Brown
On Thu, Jul 15, 2010 at 11:54, Richard Quadling rquadl...@gmail.com wrote:

 I'm looking for is a way to compare what they've entered against a
 known list and to provide my 10 best guesses.

Look into the following functions and families:

levenshtein()
similar_text()

You wouldn't want to use metaphone() or soundex(), because you're
working with alphanumeric combinations, but if you want to be a
glutton for punishment, it could be fun.  Something along the lines of
using levenshtein() to get the best matches, then similar_text() to
score each of the matches would probably be a viable solution for you.
 The last time I incorporated something like that was for my own use
to track down a record in a MySQL database for which I could remember
just a small blip of text.  It looked through about 23 million records
in under two minutes and I found exactly what it was for which I was
searching --- even though I was off by one word.  Would've taken me
forever to find it using alternative methods.

-- 
/Daniel P. Brown
UNADVERTISED DEDICATED SERVER SPECIALS
SAME-DAY SETUP
Just ask me what we're offering today!
daniel.br...@parasane.net || danbr...@php.net
http://www.parasane.net/ || http://www.pilotpig.net/

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php