On 06/09/2011 03:49 PM, B G wrote:
> I'm trying to analyze thousands of different cancer datasets and run the
> same python program on them. I use Windows XP, Python 2.7 and the IDLE
> interpreter. I already have the input files in a directory and I want to
> learn the syntax for the quickest way to execute the program over all these
> datasets.
>
> As an example, for the sample python program below, I don't want to have to
> go into the python program each time and change filename and countfile. A
> computer could do this much quicker than I ever could. Thanks in advance!
>
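
For the actual question (running the same program over every file in a
directory), here is a rough sketch using os.listdir(). It is only one way to
do it. The names process_file and process_directory, the '.txt' filter, and
the separate output directory are my assumptions, not something taken from
your program; the counting helpers follow the simplifications discussed in
the code review.

```python
import os
import string
from collections import defaultdict


def get_word(item):
    """Strip surrounding digits/punctuation and lower-case the token."""
    return item.strip(string.digits + string.punctuation).lower()


def count_words(text):
    """Return a dict mapping each word in text to its frequency."""
    counts = defaultdict(int)
    # Treat '--' as a word separator, then split on whitespace.
    for item in text.replace('--', ' ').split():
        word = get_word(item)
        if word:
            counts[word] += 1
    return counts


def process_file(in_path, out_path):
    """Count the words in in_path and write a report to out_path."""
    with open(in_path) as infile:
        counts = count_words(infile.read())
    with open(out_path, 'w') as outfile:
        outfile.write("%-18s%s\n" % ("Word", "Count"))
        outfile.write("=" * 23 + "\n")
        for word, count in sorted(counts.items()):
            outfile.write("%-18s%d\n" % (word, count))


def process_directory(in_dir, out_dir):
    """Run process_file on every .txt file directly inside in_dir.

    Assumption: the inputs are .txt files in a single, flat directory.
    Returns the list of input file names that were processed.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    processed = []
    for name in sorted(os.listdir(in_dir)):
        in_path = os.path.join(in_dir, name)
        # Skip subdirectories and anything that is not a .txt file.
        if not (os.path.isfile(in_path) and name.lower().endswith('.txt')):
            continue
        base = os.path.splitext(name)[0]
        process_file(in_path, os.path.join(out_dir, base + '_output.txt'))
        processed.append(name)
    return processed
```

You would call it once, for example
process_directory(r'C:\data\input', r'C:\data\output') (both paths are made
up). Writing the reports to a separate directory means a second run does not
pick up the previous run's output files as input.
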

I think os.listdir() would be better for you than os.walk(), as Walter
suggested, but if you have a directory tree, walk is better.

Your file code could be simplified a lot by using context managers, which
for a file looks like this:

    with open(filename, mode) as f:
        f.write("Stuff!")

f is closed automatically when the with block exits, even if an exception
is raised inside it.

Now for some code review!

>
> import string
>
> filename = 'draft1.txt'
> countfile = 'draft1_output.txt'
>
> def add_word(counts, word):
>     if counts.has_key(word):
>         counts[word] += 1
>     else:
>         counts[word] = 1
>

See the notes on this later.

> def get_word(item):
>     word = ''
>     item = item.strip(string.digits)
>     item = item.lstrip(string.punctuation)
>     item = item.rstrip(string.punctuation)
>     word = item.lower()
>     return word

This whole function could be simplified to:

    return item.strip(string.digits + string.punctuation).lower()

Note that foo.strip(bar) == foo.lstrip(bar).rstrip(bar). One small
difference from your version: the combined call strips digits and
punctuation in any order, so '1.2abc' becomes 'abc', where your three
separate calls give '2abc'.

>
>
> def count_words(text):
>     text = ' '.join(text.split('--')) #replace '--' with a space

How about

    text = text.replace('--', ' ')

>     items = text.split() #leaves in leading and trailing punctuation,
>                          #'--' not recognised by split() as a word separator

You can also pass split() a separator string, e.g. text.split('--'), but
then it splits only on '--' and no longer on whitespace, so here you still
want the replace() above followed by a plain split(). You should read the
docs on string methods:
http://docs.python.org/library/stdtypes.html#string-methods

>     counts = {}
>     for item in items:
>         word = get_word(item)
>         if not word == '':

That should be 'if word:', which just checks if it evaluates to True. Since
the only string that evaluates to False is '', it makes the code shorter
and more readable.

>             add_word(counts, word)
>     return counts

A better way would be using a defaultdict, like so:

    from collections import defaultdict
    [...]
    def count_words(text):
        counts = defaultdict(int) # Every key starts off at 0!
        items = text.replace('--', ' ').split()
        for item in items:
            word = get_word(item)
            if word:
                counts[word] += 1
        return counts

Apart from supplying a default value for missing keys, a defaultdict
behaves like any other dict.
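
As a small sketch of that behaviour outside the word counter (the sample
words are made up, and this is just an illustration, not part of your
program):

```python
from collections import defaultdict

counts = defaultdict(int)        # missing keys are created via int()
for word in ['spam', 'eggs', 'spam']:
    counts[word] += 1            # no has_key() check needed

plain = dict(counts)             # an ordinary dict with the same contents

# Caveat: merely reading a missing key also inserts it with the default.
missing_value = counts['ham']
```

Because of that caveat, use "'ham' in counts" or counts.get('ham') when you
only want to look a key up without adding it.
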
We pass 'int' as a parameter because defaultdict calls it, with no
arguments, to produce the default value for a missing key. It works out
because int() == 0.

>
> infile = open(filename, 'r')
> text = infile.read()
> infile.close()

This could be:

    text = open(filename).read()

When you're opening a file as 'r', the mode is optional! (This one-liner
leaves closing the file to the garbage collector; the with form shown
earlier closes it explicitly.)

>
> counts = count_words(text)
>
> outfile = open(countfile, 'w')
> outfile.write("%-18s%s\n" %("Word", "Count"))
> outfile.write("=======================\n")

It may just be me, but I think

    outfile.write(('=' * 23) + '\n')

looks better.

>
> counts_list = counts.items()
> counts_list.sort()
> for word in counts_list:
>     outfile.write("%-18s%d\n" %(word[0], word[1]))
>
> outfile.close

Parentheses are important! outfile.close is a method object,
outfile.close() is a method call. As written, the file is never explicitly
closed. Context managers make this easy, because you don't have to manually
close things.

Hope it helped,
-- 
Corey Richardson
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor