Here's one way to iterate over that to get the counts. I'm sure there are dozens. ###
x = 'asdf foo bar foo' counts = {} for word in x.split():
... counts[word] = x.count(word)
...
counts
{'foo': 2, 'bar': 1, 'asdf': 1} ### The dictionary takes care of duplicates. If you are using a really big file, it might pay to eliminate duplicates from the list before running x.count(word)
Be wary of using the count() function for this, it can be very slow. The problem is that every time you call count(), Python has to look at every element of the list to see if it matches the word you passed to count(). So if the list has n words in it, you will make n*n comparisons. In contrast, the method that directly accumulates counts in a dictionary just makes one pass over the list. For small lists this doesn't matter much, but for a longer list you will definitely see the difference.
For example, I downloaded "The Art of War" from Project Gutenberg (http://www.gutenberg.org/dirs/1/3/132/132a.txt) and tried both methods. Here is a program that times how long it takes to do the counts using two different methods:
######### WordCountTest.py ''' Count words two different ways '''
from string import punctuation from time import time
def countWithDict(words): ''' Word count by accumulating counts in a dictionary ''' counts = {} for word in words: counts[word] = counts.get(word, 0) + 1
return counts
def countWithCount(words): ''' Word count by calling count() for each word ''' counts = {} for word in words: counts[word] = words.count(word)
return counts
def timeOne(f, words): ''' Time how long it takes to do f(words) ''' startTime = time() f(words) endTime = time() print '%s: %f' % (f.__name__, endTime-startTime)
# Get the word list and strip off punctuation words = open(r'D:\Personal\Tutor\ArtOfWar.txt').read().split() words = [ word.strip(punctuation) for word in words ]
# How many words is it, anyway? print len(words), 'words'
# Time the word counts c1 = timeOne(countWithDict, words) c2 = timeOne(countWithCount, words)
# Check that both get the same result assert c1 == c2
####################
The results (times are in seconds): 14253 words countWithDict: 0.010000 countWithCount: 9.183000
It takes 900 times longer to count() each word individually!
Kent
_______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor