How do I skip over multiple words in a file?
Let's say that I have an article. What I want to do is read in this file and have the program skip over every instance of the words the, and, or, and but. What would be the general strategy for attacking a problem like this?
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I skip over multiple words in a file?
On 11/11/10 09:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over every instance of the words the,
> and, or, and but. What would be the general strategy for attacking a
> problem like this?

I'd keep a file of stop words and read them into a set (normalizing case in the process). Then, as I skim over each word in the target file, I'd check whether the case-normalized version of the word is in the stop words and skip it if it is. It might look something like this:

    def normalize_word(s):
        # Strip surrounding whitespace (including the newline) and fold case
        return s.strip().upper()

    # One stop word per line in stop_words.txt
    stop_words = set(
        normalize_word(word)
        for word in open('stop_words.txt')
        )

    for line in open('data.txt'):
        for word in line.split():
            if normalize_word(word) in stop_words:
                continue
            process(word)  # placeholder for whatever you do with each word

-tkc
Re: How do I skip over multiple words in a file?
On 11/11/10 15:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over every instance of the words the,
> and, or, and but. What would be the general strategy for attacking a
> problem like this?

If your files are not too big, I'd simply read them into a string and do a string replace for each word you want to skip. If you want case insensitivity, use re.sub() instead of the default str.replace() method. Neither is elegant or all that efficient, but both are very easy. If your use case requires something high-performance, then best keep looking :)

Roger.
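A minimal sketch of the case-insensitive replace approach described above, using re.sub() from the standard library. The names STOP_WORDS, STOP_RE, and strip_stop_words are illustrative assumptions, and the word list is taken from the original question. A plain substring replace would also hit "the" inside "other", so this version anchors each word with \b.

```python
import re

# Hypothetical stop-word list, taken from the original question.
STOP_WORDS = ("the", "and", "or", "but")

# \b anchors each alternative at word boundaries, so "the" does not
# match inside "other". re.IGNORECASE also catches "The", "AND", etc.
STOP_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(map(re.escape, STOP_WORDS)),
    re.IGNORECASE,
)

def strip_stop_words(text):
    """Return text with stop words removed and whitespace runs collapsed."""
    without = STOP_RE.sub("", text)
    return " ".join(without.split())

print(strip_stop_words("The cat and the other dog, but not the bird"))
# prints: cat other dog, not bird
```

Punctuation that was attached to a removed word (for example the comma in "or,") is left behind, so this is only suitable when that does not matter.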
Re: How do I skip over multiple words in a file?
On 2010-11-11 08:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over every instance of the words the,
> and, or, and but. What would be the general strategy for attacking a
> problem like this?

I realize that you may need or want to do this in Python. This would be trivial in an awk script.
Re: How do I skip over multiple words in a file?
chad cdal...@gmail.com writes:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over every instance of the words the,
> and, or, and but. What would be the general strategy for attacking a
> problem like this?

Something like (untested):

    stopwords = set(('and', 'or', 'but'))

    def goodwords(f):
        # f is an open file object (or any iterable of lines)
        for line in f:
            for w in line.split():
                if w.lower() not in stopwords:
                    yield w

Removing punctuation is left as an exercise.
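For the punctuation exercise, one possible sketch (not the poster's code) strips leading and trailing punctuation with str.strip(string.punctuation) before the stop-word check. The names good_words, STOP_WORDS, and sample are illustrative assumptions, and the word list is taken from the original question.

```python
import string

# Hypothetical stop-word set, taken from the original question.
STOP_WORDS = {"the", "and", "or", "but"}

def good_words(lines, stop_words=STOP_WORDS):
    """Yield words from lines, skipping stop words regardless of case
    or of punctuation attached to either end of the word."""
    for line in lines:
        for raw in line.split():
            # Only leading/trailing punctuation is removed, so an
            # internal apostrophe as in "Let's" survives.
            word = raw.strip(string.punctuation)
            if word and word.lower() not in stop_words:
                yield word

sample = ["The words the, and, or, and but.", "Let's skip them!"]
print(list(good_words(sample)))
# prints: ['words', "Let's", 'skip', 'them']
```

Any iterable of lines works here, so an open file object can be passed in place of the sample list.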
Re: How do I skip over multiple words in a file?
On 11.11.2010 21:33, Paul Watson wrote:
> On 2010-11-11 08:07, chad wrote:
> > Let's say that I have an article. What I want to do is read in this
> > file and have the program skip over every instance of the words the,
> > and, or, and but. What would be the general strategy for attacking a
> > problem like this?
>
> I realize that you may need or want to do this in Python. This would
> be trivial in an awk script.

There are several ways to do this.

    skip = ('and', 'or', 'but')
    # Collect every whitespace-separated word that is not in skip
    words = [w for l in open('some.txt') for w in l.split() if w not in skip]
    print(words)

If some.txt contains your original question, it returns this:

    ["Let's", 'say', 'that', 'I', 'have', 'an', 'article.', 'What', 'I',
    'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
    'program', 'skip', 'over', 'every', 'instance', 'of', 'the', 'words',
    'the,', 'and,', 'or,', 'but.', 'What', 'would', 'be', 'the', 'general',
    'strategy', 'for', 'attacking', 'a', 'problem', 'like', 'this?']

But this is just _one_ way to get there. Faster solutions could be based on a regex:

    import re
    skip = ('and', 'or', 'but')
    # \w+ matches runs of word characters, so punctuation is dropped
    word_re = re.compile(r'(\w+)')
    print([w for w in word_re.findall(open('some.txt').read()) if w not in skip])

This gives the following result (you lose some punctuation etc.):

    ['Let', 's', 'say', 'that', 'I', 'have', 'an', 'article', 'What', 'I',
    'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
    'program', 'skip', 'over', 'every', 'instance', 'of', 'the', 'words',
    'the', 'What', 'would', 'be', 'the', 'general', 'strategy', 'for',
    'attacking', 'a', 'problem', 'like', 'this']

But there are so many ways to do it ...

attachment: stefan_sonnenberg.vcf