[issue21344] save scores or ratios in difflib get_close_matches

2014-10-23 Thread Michael Ohlrogge

Michael Ohlrogge added the comment:

This is my first time posting here, so apologies if I'm breaking rules.

I'd like to put in a vote in favor of this patch to get the matching scores.

I am a researcher at Stanford University using this tool to match up about 
100,000 different names of companies/entities in two different datasets that I 
have.  The names reflect the same underlying entities but because they come 
from different datasets, the spellings, abbreviations, etc. differ.

It would be helpful to me to be able to run the get_scored_close_matches() 
function and then sort the results by how close the matches were.  If I could 
for instance determine, based on some spot checking / sampling of the results, 
that everything with a match above a certain threshold is almost certainly 
correct, whereas those below a certain threshold need to be reviewed by hand, 
that would be helpful for me.  
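
To make the request concrete, here is a minimal sketch of what a scored variant could look like. It is not the patch itself: the name `scored_close_matches` and the `(score, match)` return shape are assumptions for illustration, built only on the existing `difflib.SequenceMatcher` and `heapq.nlargest`.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def scored_close_matches(word, possibilities, n=3, cutoff=0.6):
    """Hypothetical helper: like difflib.get_close_matches, but return
    (score, match) pairs, best score first."""
    if n <= 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    matcher = SequenceMatcher()
    matcher.set_seq2(word)
    scored = []
    for candidate in possibilities:
        matcher.set_seq1(candidate)
        # The two quick ratios are cheap upper bounds on ratio(), so they
        # can rule out a candidate before the expensive comparison runs.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            score = matcher.ratio()
            if score >= cutoff:
                scored.append((score, candidate))
    # nlargest returns the pairs sorted from highest to lowest score.
    return nlargest(n, scored)


# Made-up company names, for illustration only.
names = ["Acme Corp", "Acme Corporation", "Apex Holdings"]
for score, name in scored_close_matches("Acme Corp.", names, cutoff=0.5):
    print("%.3f  %s" % (score, name))
```

Because the pairs come back ordered by score, the output can be spot checked from the top down.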

I suppose I can accomplish something similar by playing around with setting the 
matching threshold at different levels.  Nevertheless, with as many possible 
matches as I am doing, the algorithm takes a decent amount of time to run, and 
I don't have a good way to know ex ante what a reasonable threshold would be.
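
With saved scores, the threshold could be chosen after a single run instead of by re-running the match at each level. A rough sketch, assuming the scores are already available as `(score, name)` pairs; the helper names `count_by_threshold` and `split_for_review` are hypothetical, and the scores below are made up.

```python
def count_by_threshold(scored_pairs, thresholds):
    """Return {threshold: number of pairs scoring at or above it}."""
    return {t: sum(1 for score, _ in scored_pairs if score >= t)
            for t in thresholds}


def split_for_review(scored_pairs, accept_at):
    """Split pairs into (accepted, needs_review) around one threshold."""
    accepted = [p for p in scored_pairs if p[0] >= accept_at]
    needs_review = [p for p in scored_pairs if p[0] < accept_at]
    return accepted, needs_review


# Illustrative scores only; real ones would come from one matching run
# made with a deliberately permissive cutoff.
scored = [(0.95, "Acme Corp"), (0.82, "Acme Inc"), (0.61, "Apex Ltd")]
print(count_by_threshold(scored, [0.6, 0.8, 0.9]))
accepted, needs_review = split_for_review(scored, accept_at=0.8)
print(accepted)
print(needs_review)
```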

Just in general, I think it can be useful information for users to know how 
much confidence to have in the matches produced by the algorithm.  Users could 
choose to formulate this confidence either as a direct function of the score or 
perhaps based on some other factors, such as a statistical analysis procedure 
that takes the score into account.  

Thanks to everyone who put this package together and who suggested the patch.

--
nosy: +michaelohlrogge
versions: +Python 2.7 -Python 3.5

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21344
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21344] save scores or ratios in difflib get_close_matches

2014-10-23 Thread Michael Ohlrogge

Michael Ohlrogge added the comment:

Another way the scores could be useful would be to write an algorithm that 
would give you a number of possible answers based on the scores that you get.  
For example, if one of the possible matches had a score above .9, it would 
give you only that one, but if all were below .8, it would 
give you several.  Or, if the highest score were at least .1 greater than the 
next highest, it would only give you one, but if there were a bunch that were 
close together, it would return those.  
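
A sketch of that idea, assuming the scores are available as `(score, candidate)` pairs. The function name `adaptive_matches` and the parameters `sure` and `margin` are hypothetical, and folding both rules into one function is just one possible reading of the description above.

```python
def adaptive_matches(scored_pairs, sure=0.9, margin=0.1):
    """Return a single match when the best score is decisive, otherwise
    every candidate whose score is within `margin` of the best."""
    ranked = sorted(scored_pairs, reverse=True)
    if not ranked:
        return []
    best_score = ranked[0][0]
    # Rule 1: a very high top score is trusted on its own.
    if best_score >= sure or len(ranked) == 1:
        return ranked[:1]
    # Rule 2: a clear gap to the runner-up is also decisive.
    if best_score - ranked[1][0] >= margin:
        return ranked[:1]
    # Otherwise return the cluster of candidates close to the best.
    return [pair for pair in ranked if best_score - pair[0] < margin]


# Illustrative scores only.
print(adaptive_matches([(0.95, "Acme Corp"), (0.93, "Acme Inc")]))
print(adaptive_matches([(0.80, "Acme Corp"), (0.78, "Acme Inc"),
                        (0.50, "Apex Ltd")]))
```

The comparisons use floating-point subtraction, so scores that differ by almost exactly `margin` may fall on either side of the cut.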

I'm not saying these specific applications should be part of the package; they 
are just more examples of how you could productively use the scores.

--
