[Tutor] Getting total counts

aeneas24 Fri, 01 Oct 2010 13:40:55 -0700

Hi,
 
I have created a csv file that lists how often each word in the Internet Movie 
Database occurs with different star-ratings and in different genres. The input 
file looks something like this. Since movies can have multiple genres, there 
are three genre columns. (This is fake, simplified data.)
 
ID    | Genre1  | Genre2   | Genre3    | Star-rating | Word | Count
film1 | Drama   | Thriller | Western   | 1           | the  | 20
film2 | Comedy  | Musical  | NA        | 2           | the  | 20
film3 | Musical | History  | Biography | 1           | the  | 20
film4 | Drama   | Thriller | Western   | 1           | the  | 10
film5 | Drama   | Thriller | Western   | 9           | the  | 20
 
I can get the program to tell me how many occurrences of "the" there are in 
Thrillers (50), how many "the"'s in 1-stars (50), and how many 1-star drama 
"the"'s there are (30). But I need to be able to expand beyond a particular 
word and ask: how many words total are in "Drama"? How many total words are in 
1-star ratings? How many words are there in the whole corpus? On these all-word 
totals, I'm stumped. 
 
What I've done so far:
I used the shelve module to store my input csv in a database format. 
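 
For anyone trying to reproduce this, here is a rough sketch of how the csv rows 
might end up in a shelf. The record layout (one list of dicts per word, with 
'genres', 'rating' and 'count' keys) is inferred from get_word_count; the names 
RAW, build_records and imdb_words are made up for illustration, the data is 
assumed to be tab-separated, and the real loading code may well differ.

```python
import csv
import io
import os
import shelve
import tempfile
from collections import defaultdict

# The fake data from the table above, assumed here to be tab-separated.
RAW = (
    "film1\tDrama\tThriller\tWestern\t1\tthe\t20\n"
    "film2\tComedy\tMusical\tNA\t2\tthe\t20\n"
    "film3\tMusical\tHistory\tBiography\t1\tthe\t20\n"
    "film4\tDrama\tThriller\tWestern\t1\tthe\t10\n"
    "film5\tDrama\tThriller\tWestern\t9\tthe\t20\n"
)


def build_records(lines):
    """Group rows by word: {word: [{'genres', 'rating', 'count'}, ...]}."""
    records = defaultdict(list)
    for film_id, g1, g2, g3, rating, word, count in csv.reader(lines, delimiter="\t"):
        # Drop the "NA" placeholder used when a film has fewer than three genres.
        genres = [g for g in (g1, g2, g3) if g != "NA"]
        # Convert to int so that a later test like rating == 1 can succeed.
        records[word].append(
            {"genres": genres, "rating": int(rating), "count": int(count)}
        )
    return dict(records)


records = build_records(io.StringIO(RAW))

# Write to a shelf in a temporary directory, then read one entry back.
with tempfile.TemporaryDirectory() as tmp:
    with shelve.open(os.path.join(tmp, "imdb_words")) as db:
        db.update(records)
        stored = db["the"]
```

If the ratings were stored as strings instead, a comparison such as 
rating == val['rating'] with an integer rating would always be False, which 
could be one source of trouble with the rating handling.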
 
Here's how I get count information so far:
def get_word_count(word, db, genre=None, rating=None):
    c = 0
    vals = db[word]
    for val in vals:
        if not genre and not rating:
            c += val['count']
        elif genre and not rating:
            if genre in val['genres']:            
                c += val['count']
        elif rating and not genre:
            if rating == val['rating']:
                c += val['count']        
        else:
            if rating == val['rating'] and genre in val['genres']:
                c += val['count']            
    return c
 
(I think there's something a little wrong with the rating stuff, here, but this 
code generally works and produces the right counts.)
 
With "get_word_count" I can do stuff like this to figure out how many times 
"the" appears in a particular genre. 
def get_genre_count_for_word(word, db, genre):
    # get_word_count already loops over db[word], so no extra loop is needed.
    return get_word_count(word, db, genre=genre)
 
I've tried to extend this thinking to get TOTAL genre/rating counts for all 
words, but it doesn't work. I get a TypeError saying that string indices must 
be integers. I'm not sure how to overcome this.
 
# Doesn't work:
def get_full_rating_count(db, rating=None):
    full_rating_ct = 0
    vals = db
    for val in vals:
        if not rating:
            full_rating_ct += val['count']
        elif rating == val['rating']:
            # Um, I know this looks dumb, but in the other code it seems to be
            # necessary for things to work.
            if rating == val['rating']:
                full_rating_ct += val['count']
    return full_rating_ct
 
Can anyone suggest how to do this? 
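 
One possible direction, sketched under the assumption that db maps each word to 
a list of records as in get_word_count: iterating over a shelf (or a dict) 
yields its keys, which are strings, so val['count'] in get_full_rating_count is 
indexing a string, and that would explain the TypeError. Looping over the words 
first and then over each word's records avoids it. A plain dict stands in for 
the shelf here; get_total_count and the "wow" entry are hypothetical.

```python
# Stand-in for the shelf, using the assumed layout {word: [record, ...]}.
db = {
    "the": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 20},
        {"genres": ["Comedy", "Musical"], "rating": 2, "count": 20},
        {"genres": ["Musical", "History", "Biography"], "rating": 1, "count": 20},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 10},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 9, "count": 20},
    ],
    # Hypothetical second word, only to show that totals span all words.
    "wow": [
        {"genres": ["Drama"], "rating": 1, "count": 3},
    ],
}


def get_total_count(db, genre=None, rating=None):
    """Sum counts over every word, optionally filtered by genre and/or rating."""
    total = 0
    for word in db:             # keys are words (strings)
        for val in db[word]:    # each val is one record dict
            if genre is not None and genre not in val["genres"]:
                continue
            if rating is not None and rating != val["rating"]:
                continue
            total += val["count"]
    return total


corpus_total = get_total_count(db)
drama_total = get_total_count(db, genre="Drama")
one_star_total = get_total_count(db, rating=1)
one_star_drama_total = get_total_count(db, genre="Drama", rating=1)
```

Testing with "is not None" rather than "not rating" keeps the filters 
independent, so the four-way if/elif chain is no longer needed.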
 
Thanks!
 
Tyler
 
 
Background for the curious:
What I really want to know is which words are over- or under-represented in 
different Genre x Rating categories. "The" should be flat, but something like 
"wow" should be over-represented in 1-star and 10-star ratings and 
under-represented in 5-star ratings. Something like "gross" may be 
over-represented in low-star ratings for romances but if grossness is a good 
thing in horror movies, then we'll see "gross" over-represented in HIGH-star 
ratings for horror. 
 
To figure out over-representation and under-representation I need to compare 
"observed" counts to "expected" counts. The expected counts are derived from 
probabilities, so I need to know how many words I have in the whole corpus, 
how many words in each rating category, and how many words in each genre 
category.
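 
To make the comparison concrete, here is a minimal sketch that assumes the usual 
independence-based estimate: expected = (word total) * (category total) / 
(corpus total). The function names and the numbers in the example are made up; 
they are not taken from the real data.

```python
def expected_count(word_total, category_total, corpus_total):
    """Count we would expect if the word were spread evenly across categories."""
    return word_total * category_total / corpus_total


def representation_ratio(observed, word_total, category_total, corpus_total):
    """Observed / expected: above 1 is over-represented, below 1 is under-represented."""
    expected = expected_count(word_total, category_total, corpus_total)
    return observed / expected


# Hypothetical numbers: a word seen 100 times overall, a category holding
# 2000 of the corpus's 10000 words, and 30 sightings inside that category.
expected = expected_count(100, 2000, 10000)
ratio = representation_ratio(30, 100, 2000, 10000)
```

With these made-up figures the expected count is 20, so 30 observed occurrences 
give a ratio of 1.5, meaning the word is over-represented in that category.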

_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
