Hi,
I have created a CSV file that lists how often each word in the Internet Movie
Database occurs with different star ratings and in different genres. The input
file looks something like this. Since movies can have multiple genres, there
are three genre columns. (This is fake, simplified data.)
ID    | Genre1  | Genre2   | Genre3    | Star-rating | Word | Count
film1 | Drama   | Thriller | Western   | 1           | the  | 20
film2 | Comedy  | Musical  | NA        | 2           | the  | 20
film3 | Musical | History  | Biography | 1           | the  | 20
film4 | Drama   | Thriller | Western   | 1           | the  | 10
film5 | Drama   | Thriller | Western   | 9           | the  | 20
I can get the program to tell me how many occurrences of "the" there are in
Thrillers (50), how many in 1-star ratings (50), and how many in 1-star Dramas
(30). But I need to go beyond a particular word and ask: how many words in
total are in "Drama"? How many total words are in 1-star ratings? How many
words are there in the whole corpus? On these all-word totals, I'm stumped.
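As a sanity check on the three numbers above, here is a minimal sketch that recomputes them straight from the fake table. The tuple layout and the variable names are only for illustration, and the ratings are assumed to be integers.

```python
# Rows copied from the fake table: (id, genres, rating, word, count).
# "NA" genre slots are simply left out of the genres tuple.
rows = [
    ("film1", ("Drama", "Thriller", "Western"), 1, "the", 20),
    ("film2", ("Comedy", "Musical"), 2, "the", 20),
    ("film3", ("Musical", "History", "Biography"), 1, "the", 20),
    ("film4", ("Drama", "Thriller", "Western"), 1, "the", 10),
    ("film5", ("Drama", "Thriller", "Western"), 9, "the", 20),
]

# "the" in Thrillers
the_in_thrillers = sum(
    n for _id, genres, rating, word, n in rows
    if word == "the" and "Thriller" in genres
)

# "the" in 1-star ratings
the_in_one_star = sum(
    n for _id, genres, rating, word, n in rows
    if word == "the" and rating == 1
)

# "the" in 1-star Dramas
the_in_one_star_drama = sum(
    n for _id, genres, rating, word, n in rows
    if word == "the" and rating == 1 and "Drama" in genres
)

print(the_in_thrillers, the_in_one_star, the_in_one_star_drama)
```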
What I've done so far:
I used the shelve module to store my input CSV in a database-like format.
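To make the layout concrete, here is a rough sketch of how such a shelf could be built. The layout (one key per word, each value a list of dicts with 'genres', 'rating' and 'count') is inferred from the lookup code in this message. The file name, the ROWS sample and the build_shelf helper are made up for illustration, and ratings are assumed to be stored as integers.

```python
import os
import shelve
import tempfile

# Sample rows mirroring the fake table:
# (id, genre1, genre2, genre3, rating, word, count)
ROWS = [
    ("film1", "Drama", "Thriller", "Western", 1, "the", 20),
    ("film2", "Comedy", "Musical", "NA", 2, "the", 20),
    ("film3", "Musical", "History", "Biography", 1, "the", 20),
    ("film4", "Drama", "Thriller", "Western", 1, "the", 10),
    ("film5", "Drama", "Thriller", "Western", 9, "the", 20),
]

def build_shelf(path, rows):
    """Store rows in a shelf keyed by word (hypothetical helper)."""
    with shelve.open(path) as db:
        for film_id, g1, g2, g3, rating, word, count in rows:
            genres = [g for g in (g1, g2, g3) if g != "NA"]
            # Read the list, change it, and assign it back: without
            # writeback=True, in-place changes to a stored list are not saved.
            entries = db.get(word, [])
            entries.append({"genres": genres, "rating": rating, "count": count})
            db[word] = entries

# Hypothetical location for the shelf file.
shelf_path = os.path.join(tempfile.mkdtemp(), "imdb_words")
build_shelf(shelf_path, ROWS)

with shelve.open(shelf_path) as db:
    stored_words = list(db.keys())
    the_entries = db["the"]
```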
Here's how I get count information so far:
def get_word_count(word, db, genre=None, rating=None):
    c = 0
    vals = db[word]
    for val in vals:
        if not genre and not rating:
            c += val['count']
        elif genre and not rating:
            if genre in val['genres']:
                c += val['count']
        elif rating and not genre:
            if rating == val['rating']:
                c += val['count']
        else:
            if rating == val['rating'] and genre in val['genres']:
                c += val['count']
    return c
(I think there's something a little wrong with the rating stuff, here, but this
code generally works and produces the right counts.)
With "get_word_count" I can do stuff like this to figure out how many times
"the" appears in a particular genre.
vals = db[word]
for val in vals:
    genre_ct_for_word = get_word_count(word, db, genre, rating=None)
return genre_ct_for_word
I've tried to extend this thinking to get TOTAL genre/rating counts for all
words, but it doesn't work. I get a TypeError saying that string indices must
be integers. I'm not sure how to overcome this.
# Doesn't work:
def get_full_rating_count(db, rating=None):
    full_rating_ct = 0
    vals = db
    for val in vals:
        if not rating:
            full_rating_ct += val['count']
        elif rating == val['rating']:
            # Um, I know this looks dumb, but in the other code it seems
            # to be necessary for things to work.
            if rating == val['rating']:
                full_rating_ct += val['count']
    return full_rating_ct
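A likely cause of the TypeError: iterating over a shelf, like iterating over a dict, yields its keys. Here the keys are the words themselves, so val is a string and val['count'] tries to index a string with another string. Below is a minimal sketch of an all-word total that loops over the words first and then over each word's entries. A plain dict stands in for the shelf, the "wow" entries are made up so the corpus has more than one word, get_total_count is a hypothetical name, and ratings are assumed to be integers.

```python
# A plain dict stands in for the shelf; both yield their keys when iterated.
# The "wow" entries are made-up data, added only for illustration.
db = {
    "the": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 20},
        {"genres": ["Comedy", "Musical"], "rating": 2, "count": 20},
        {"genres": ["Musical", "History", "Biography"], "rating": 1, "count": 20},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 10},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 9, "count": 20},
    ],
    "wow": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 3},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 9, "count": 1},
    ],
}

def get_total_count(db, genre=None, rating=None):
    """Sum counts over ALL words, optionally filtered by genre and/or rating."""
    total = 0
    for word in db:               # each key is a word (a string)
        for val in db[word]:      # each val is a dict for one film
            if genre is not None and genre not in val["genres"]:
                continue
            if rating is not None and rating != val["rating"]:
                continue
            total += val["count"]
    return total

corpus_total = get_total_count(db)
drama_total = get_total_count(db, genre="Drama")
one_star_total = get_total_count(db, rating=1)
one_star_drama_total = get_total_count(db, genre="Drama", rating=1)
print(corpus_total, drama_total, one_star_total, one_star_drama_total)
```

Note that if the ratings were read from the CSV as strings, a comparison like rating == 1 would never match, so the stored type is worth checking.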
Can anyone suggest how to do this?
Thanks!
Tyler
Background for the curious:
What I really want to know is which words are over- or under-represented in
different Genre x Rating categories. "The" should be flat, but something like
"wow" should be over-represented in 1-star and 10-star ratings and
under-represented in 5-star ratings. Something like "gross" may be
over-represented in low-star ratings for romances but if grossness is a good
thing in horror movies, then we'll see "gross" over-represented in HIGH-star
ratings for horror.
To figure out over-representation and under-representation I need to compare
"observed" counts to "expected" counts. The expected counts are derived from
probabilities, so I need to know how many words I have in the whole corpus,
how many words are in each rating category, and how many words are in each
genre category.
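For illustration, one common way to get an expected count is to assume that word and category are independent, so that expected = (total for the word) * (total for the category) / (corpus total). This is an assumption about the statistic, not something settled above. The numbers are made up and expected_count is a hypothetical helper. The sketch assumes Python 3 division.

```python
def expected_count(word_total, category_total, corpus_total):
    """Expected count of a word in a category if the two were independent."""
    return word_total * category_total / corpus_total

# Made-up totals, for illustration only.
corpus_total = 94                            # all words in the corpus
one_star_total = 53                          # all words in 1-star ratings
word_totals = {"the": 90, "wow": 4}          # corpus-wide count per word
observed_one_star = {"the": 50, "wow": 3}    # count per word in 1-star ratings

ratios = {}
for word, total in word_totals.items():
    expected = expected_count(total, one_star_total, corpus_total)
    # A ratio above 1 suggests over-representation, below 1 under-representation.
    ratios[word] = observed_one_star[word] / expected
    print(word, round(expected, 2), round(ratios[word], 2))
```

With these made-up numbers "the" comes out close to flat in 1-star ratings, while "wow" comes out over-represented.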