Re: How to pick out the same titles.

2016-10-16 Thread duncan smith
On 16/10/16 16:16, Seymore4Head wrote:
> How to pick out the same titles.
> 
> I have a long text file that has movie titles in it and I would like
> to find dupes.
> 
> The thing is that sometimes I have one called "The Killing Fields" and
> it also could be listed as "Killing Fields".  Sometimes the title will
> have the date a year off.
> 
> What I would like to do is output to another file that shows those two
> as a match.
> 
> I don't know the best way to tackle this.  I would think you would
> have to pair the titles with the most consecutive letters in a row.
> 
> Anyone want this as a practice exercise?  I don't really use
> programming enough to remember how.
> 

Tokenize, generate (token) set similarity scores and cluster on
similarity score.


>>> import tokenization
>>> bigrams1 = tokenization.n_grams("The Killing Fields".lower(), 2, pad=True)
>>> bigrams1
['_t', 'th', 'he', 'e ', ' k', 'ki', 'il', 'll', 'li', 'in', 'ng', 'g ',
' f', 'fi', 'ie', 'el', 'ld', 'ds', 's_']
>>> bigrams2 = tokenization.n_grams("Killing Fields".lower(), 2, pad=True)
>>> import pseudo
>>> pseudo.Jaccard(bigrams1, bigrams2)
0.7


You could probably just generate token sets, then iterate through all
title pairs and manually review those with similarity scores above a
suitable threshold. The code I used above is very simple (and pasted below).


def n_grams(s, n, pad=False):
    # n >= 1
    # returns a list of n-grams
    # or an empty list if n > len(s)
    if pad:
        s = '_' * (n-1) + s + '_' * (n-1)
    return [s[i:i+n] for i in range(len(s)-n+1)]

def Jaccard(tokens1, tokens2):
    # returns exact Jaccard
    # similarity measure for
    # two token sets
    tokens1 = set(tokens1)
    tokens2 = set(tokens2)
    return len(tokens1&tokens2) / len(tokens1|tokens2)
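
A rough sketch of the "iterate through all title pairs" step, for
anyone who wants something to start from. It repeats the two functions
so it runs on its own. The similar_pairs name, the 0.6 threshold, the
sample titles and the guard against two empty token sets are my
assumptions for illustration, not part of the code above. It compares
every pair, so it is quadratic in the number of titles.

```python
from itertools import combinations

def n_grams(s, n, pad=False):
    # Same idea as the function above: character n-grams, optionally padded.
    if pad:
        s = '_' * (n - 1) + s + '_' * (n - 1)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def jaccard(tokens1, tokens2):
    # Jaccard similarity of two token collections.
    # Assumption: two empty collections score 0.0 rather than raising.
    tokens1, tokens2 = set(tokens1), set(tokens2)
    union = tokens1 | tokens2
    return len(tokens1 & tokens2) / len(union) if union else 0.0

def similar_pairs(titles, threshold=0.6):
    # Hypothetical helper: score every pair of titles on padded bigrams
    # and keep the pairs at or above the (arbitrary) threshold.
    token_sets = [set(n_grams(t.lower(), 2, pad=True)) for t in titles]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(titles), 2):
        score = jaccard(token_sets[i], token_sets[j])
        if score >= threshold:
            pairs.append((score, a, b))
    # Highest-scoring candidates first, ready for manual review.
    return sorted(pairs, reverse=True)

# Made-up sample data; in practice read one title per line from the file.
titles = ["The Killing Fields", "Killing Fields", "Alien", "Aliens",
          "Casablanca"]
for score, a, b in similar_pairs(titles):
    print(f"{score:.2f}  {a!r}  {b!r}")
```

Note that a low threshold also pairs titles such as "Alien" and
"Aliens", which is why the manual review is still needed.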


Duncan


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to pick out the same titles.

2016-10-16 Thread Alain Ketterlin
Seymore4Head  writes:

[...]
> I have a long text file that has movie titles in it and I would like
> to find dupes.
>
> The thing is that sometimes I have one called "The Killing Fields" and
> it also could be listed as "Killing Fields".  Sometimes the title will
> have the date a year off.
>
> What I would like to do is output to another file that shows those two
> as a match.

Try the difflib module (read the doc, its default behavior may be
surprising).
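
A minimal sketch of what that could look like, using only
difflib.SequenceMatcher and difflib.get_close_matches from the standard
library. The ratio helper and the sample titles are made up for
illustration. Defaults worth knowing about: get_close_matches returns
at most n=3 candidates with a cutoff of 0.6 and compares
case-sensitively, and SequenceMatcher applies an automatic "junk"
heuristic to sequences of 200 or more items unless autojunk=False is
passed. Whether these are the surprises meant above is my guess.

```python
import difflib

def ratio(a, b):
    # Hypothetical helper: case-insensitive similarity in [0, 1].
    # autojunk=False switches off the popular-element heuristic; it only
    # kicks in for long sequences, so it is harmless for short titles.
    matcher = difflib.SequenceMatcher(None, a.lower(), b.lower(),
                                      autojunk=False)
    return matcher.ratio()

# Made-up sample data.
titles = ["The Killing Fields", "Alien", "Casablanca"]

print(ratio("The Killing Fields", "Killing Fields"))

# Up to n candidates scoring at least cutoff, best match first.
close = difflib.get_close_matches("Killing Fields", titles, n=3, cutoff=0.6)
print(close)
```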

-- Alain.


How to pick out the same titles.

2016-10-16 Thread Seymore4Head
How to pick out the same titles.

I have a long text file that has movie titles in it and I would like
to find dupes.

The thing is that sometimes I have one called "The Killing Fields" and
it also could be listed as "Killing Fields".  Sometimes the title will
have the date a year off.

What I would like to do is output to another file that shows those two
as a match.

I don't know the best way to tackle this.  I would think you would
have to pair the titles with the most consecutive letters in a row.

Anyone want this as a practice exercise?  I don't really use
programming enough to remember how.