Thanks Alan, Peter and Steve. Instead of answering each of you separately, let me use my response to Steve's message as the basis for an answer to all of you.
It turns out that matters of efficiency are VERY important in this case. The example in my message used a very short string, but the file I'm actually trying to process is pretty big (20MB of text). I'm writing to you as my computer is about to burst into flames. I'm exaggerating a little: I'm monitoring the temperature and things so far seem to be under control, but I ran the script I put together following your recommendations (see below) on the real file for which I wanted word frequencies, and it has been running for over half an hour without generating the output file yet. I'm using a fairly powerful computer (Core i7 with 8GB of RAM), so I'm a little surprised (and a bit worried as well) that the process hasn't finished. I had tested the script with a much smaller file and the output was as desired.

When I look at the processes currently running on my computer, I see the Python process taking 100% of one CPU. Since my computer has a multi-core processor, I'm assuming this process is using only one of the cores, because another monitor tells me that overall CPU usage is under 20%. This doesn't make much sense to me: I bought a computer with a powerful CPU precisely to do these kinds of things as fast as possible. How can it be that Python uses only such a small share of the processing power? But I digress; I will start another thread to ask about this, because I'm curious to know whether it can be changed in any way. For now, I'm more interested in getting the right answer to my original question.

OK, I'll start with Steve's answer first.

> When you run that code, are you SURE that it merely results in the output
> file being blank? When I run it, I get an obvious error:
>
> Traceback (most recent call last):
>   File "<stdin>", line 4, in <module>
> TypeError: argument 1 must be string or read-only character buffer, not list
>
> Don't you get this error too?

Nope. I was surprised myself, but I did not get any errors.
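(A quick parenthesis on the CPU digression above, in case someone on the list wants to correct me: from what I have read, the standard Python interpreter executes ordinary Python code on a single core because of the Global Interpreter Lock, which would explain one process pegging a single core while overall usage stays under 20%. The standard multiprocessing module is apparently the usual way around this for CPU-bound work. The following is only my untested sketch of that idea -- the function names are mine, not from anyone's suggestion:)

```python
from multiprocessing import Pool

def count_chunk(words):
    # Tally one chunk of the word list in a plain dict.
    table = {}
    for word in words:
        table[word] = table.get(word, 0) + 1
    return table

def count_parallel(wordlist, nprocs=2):
    # Split the list into roughly equal chunks, tally each chunk in
    # its own worker process, then merge the partial tallies.
    size = len(wordlist) // nprocs + 1
    chunks = [wordlist[i:i + size] for i in range(0, len(wordlist), size)]
    pool = Pool(processes=nprocs)
    try:
        partials = pool.map(count_chunk, chunks)
    finally:
        pool.close()
        pool.join()
    totals = {}
    for part in partials:
        for word, count in part.items():
            totals[word] = totals.get(word, 0) + count
    return totals

if __name__ == '__main__':
    print(count_parallel("a b a c b a".split()))
```

I have no idea yet whether the overhead of shipping the chunks to the workers would eat the gains on my file; I'll ask in the other thread. End of parenthesis.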
But I suspect that this is because I don't have my IDE well configured. Although (see below) I do get many other error messages, I didn't get any in this case. See, I'm not only a newbie in Python but a newbie with IDEs as well. I'm using Eclipse (I probably should have started with something smaller and simpler) and I see the following error message:

--------------------
Pylint: Executing command line: '/Applications/eclipse/Eclipse.app/Contents/MacOS --include-ids=y /Volumes/DATA/Documents/workspace/GCA/src/prova.py'
Pylint: The stdout of the command line is:
Pylint: The stderr of the command line is:
/usr/bin/python: can't find '__main__.py' in '/Applications/eclipse/Eclipse.app/Contents/MacOS'
--------------------

Anyway, I tried the different alternatives all of you suggested with a small test file, and everything worked perfectly. With the big file, however, none of the alternatives seems to work. Well, I don't actually know whether they work or not, because each process takes so long that I have had to kill it out of desperation. The process I mention at the beginning of this message is the one running Peter's alternative; I think I'm going to kill it as well, because it has now been running for 45 minutes, and that seems way too long.

So, here is how I wrote the code. You'll see that there are two different functions that do the same thing: countWords(wordlist) and countWords2(wordlist). countWords2 is adapted from Peter Otten's suggestion; this was the one that, according to him, would be more efficient. However, none of the versions (including Alan's) finishes when the file being processed is large.
def countWords(wordlist):
    word_table = {}
    for word in wordlist:
        # list.count() re-scans the whole list on every iteration
        count = wordlist.count(word)
        word_table[word] = count
    return word_table.items()

def countWords2(wordlist):  # as proposed by Peter Otten
    word_table = {}
    for word in wordlist:
        if word in word_table:
            word_table[word] += 1
        else:
            word_table[word] = 1
        # note: these two lines still call wordlist.count() on
        # every iteration, overwriting the tally built above
        count = wordlist.count(word)
        word_table[word] = count
    return sorted(
        word_table.items(),
        key=lambda item: item[1],
        reverse=True
    )

def getWords(filename):
    with open(filename, 'r') as f:
        words = f.read().split()
    return words

def writeTable(filename, table):
    with open(filename, 'w') as f:
        for word, count in table:
            f.write("%s\t%s\n" % (word, count))

words = getWords('tokens_short.txt')
table = countWords(words)  # or table = countWords2(words)
writeTable('output.txt', table)

> For bonus points, you might want to think about why countWords will be so
> inefficient for large word lists, although you probably won't see any
> problems until you're dealing with thousands or tens of thousands of words.

Well, now it will be clear to you that I AM seeing big problems, because the files I need to process contain tens of thousands of words. The reason it is inefficient, I'm guessing, is that the counting of how many times a word appears in the list is repeated every time the same word is encountered in the loop. This is more or less what Peter said about the solution proposed by Alan, right? However, even with countWords2, which is supposed to overcome this problem, it feels as if I've entered an infinite loop.

Josep M.

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
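P.S. After writing the above I found collections.Counter in the standard library documentation (it exists from Python 2.7 on), which seems designed for exactly this kind of tallying. I haven't tried it on the big file yet, so take this as a sketch rather than a tested fix -- the name countWords3 is mine:

```python
from collections import Counter

def countWords3(wordlist):
    # Counter makes a single pass over the list, adding 1 to a
    # per-word tally, so the cost grows linearly with the number of
    # words instead of quadratically like list.count() in a loop.
    word_table = Counter(wordlist)
    # most_common() returns (word, count) pairs sorted by count,
    # highest first -- the same shape writeTable() expects.
    return word_table.most_common()

# Small smoke test on a toy list:
print(countWords3("the cat sat on the mat the end".split()))
```

If I understand the docs correctly, most_common() already returns the pairs sorted by descending count, so it would replace the sorted() call in countWords2 as well.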