How do I skip over multiple words in a file?

2010-11-11 Thread chad
Let's say that I have an article. What I want to do is read in this
file and have the program skip over ever instance of the words the,
and,  or, and but. What would be the general strategy for
attacking a problem like this?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I skip over multiple words in a file?

2010-11-11 Thread Tim Chase

On 11/11/10 09:07, chad wrote:

> Let's say that I have an article. What I want to do is read in
> this file and have the program skip over ever instance of the
> words the, and, or, and but. What would be the
> general strategy for attacking a problem like this?


I'd keep a file of stop words and read them into a set
(normalizing case in the process).  Then, as I skim over each
word in my target file, I'd check whether the case-normalized
version of the word is in the stop-word set and skip it if it is.
It might look something like this:


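  def normalize_word(s):
      # strip surrounding whitespace (including the newline) and fold case
      return s.strip().upper()

  # one stop word per line in stop_words.txt
  stop_words = set(
      normalize_word(word)
      for word in open('stop_words.txt')
  )
  for line in open('data.txt'):
      for word in line.split():
          if normalize_word(word) in stop_words:
              continue
          process(word)  # process() stands in for whatever you do with a word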

-tkc





Re: How do I skip over multiple words in a file?

2010-11-11 Thread r0g

On 11/11/10 15:07, chad wrote:

> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words the,
> and, or, and but. What would be the general strategy for
> attacking a problem like this?



If your files are not too big, I'd simply read them into a string and do
a string replace for each word you want to skip. If you want case
insensitivity, use re.sub() with the re.IGNORECASE flag instead of the
default str.replace() method. Neither is elegant or all that efficient,
but both are very easy. If your use case requires something
high-performance, then best keep looking :)
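
A rough sketch of the regex variant, in case it helps. The names STOP_WORDS, STOP_RE and remove_stop_words are made up for illustration, and the \b word boundaries are an addition of this sketch: a plain replace of "the" would also eat part of "other".

```python
import re

# Hypothetical stop-word list, taken from the original question.
STOP_WORDS = ("the", "and", "or", "but")

# \b anchors keep "the" from matching inside "other" or "there".
STOP_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(map(re.escape, STOP_WORDS)),
    re.IGNORECASE,
)

def remove_stop_words(text):
    """Delete every stop word, then collapse the leftover runs of spaces."""
    stripped = STOP_RE.sub("", text)
    return re.sub(r"[ \t]+", " ", stripped).strip()

print(remove_stop_words("The cat and the other dog"))
```

This still reads the whole file into memory, so it only suits files that are not too big.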


Roger.


Re: How do I skip over multiple words in a file?

2010-11-11 Thread Paul Watson

On 2010-11-11 08:07, chad wrote:

> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words the,
> and, or, and but. What would be the general strategy for
> attacking a problem like this?


I realize that you may need or want to do this in Python.  This would be 
trivial in an awk script.



Re: How do I skip over multiple words in a file?

2010-11-11 Thread Paul Rubin
chad cdal...@gmail.com writes:

> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words the,
> and, or, and but. What would be the general strategy for
> attacking a problem like this?

Something like (untested):

stopwords = set(('and', 'or', 'but'))

def goodwords(f):
    # f is an open file (or any other iterable of lines)
    for line in f:
        for w in line.split():
            if w.lower() not in stopwords:
                yield w

Removing punctuation is left as an exercise.
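
One possible take on that exercise, as a sketch only: trim punctuation from the edges of each word before the lookup. The names STOP_WORDS and good_words are made up for this example, and trimming only the edges is an assumption about what "removing punctuation" should mean here.

```python
import string

# Hypothetical stop-word set, taken from the original question.
STOP_WORDS = set(("the", "and", "or", "but"))

def good_words(lines, stop_words=STOP_WORDS):
    """Yield words from an iterable of lines, skipping stop words."""
    for line in lines:
        for raw in line.split():
            # Trim punctuation only at the edges, so "Let's" stays intact.
            word = raw.strip(string.punctuation)
            if word and word.lower() not in stop_words:
                yield word

sample = ["Skip the words the, and, or, and but."]
print(list(good_words(sample)))
```

An open file can be passed in place of the list, since iterating over a file yields its lines.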


Re: How do I skip over multiple words in a file?

2010-11-11 Thread Stefan Sonnenberg-Carstens

On 11.11.2010 21:33, Paul Watson wrote:

> On 2010-11-11 08:07, chad wrote:
>
>> Let's say that I have an article. What I want to do is read in this
>> file and have the program skip over ever instance of the words the,
>> and, or, and but. What would be the general strategy for
>> attacking a problem like this?
>
> I realize that you may need or want to do this in Python.  This would
> be trivial in an awk script.

There are several ways to do this.

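skip = ('and', 'or', 'but')
words = []  # renamed from "all" so the builtin all() is not shadowed
for l in open('some.txt'):
    # keep every whitespace-separated word that is not in skip
    words.extend(w for w in l.split() if w not in skip)

print(words)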

If some.txt contains your original question, it returns this:
["Let's", 'say', 'that', 'I', 'have', 'an', 'article.', 'What', 'I',
 'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
 'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
 'the,', 'and,', 'or,', 'but.', 'What', 'would', 'be', 'the', 'general',
 'strategy', 'for', 'attacking', 'a', 'problem', 'like', 'this?']

But this is just _one_ way to get there.
Faster solutions could be based on a regex:
import re
skip = ('and', 'or', 'but')
word_re = re.compile(r'\w+')  # raw string; matches runs of word characters
print([w for w in word_re.findall(open('some.txt').read()) if w not in skip])

This gives the following result (you lose some punctuation etc.):
['Let', 's', 'say', 'that', 'I', 'have', 'an', 'article', 'What', 'I',
 'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
 'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
 'the', 'What', 'would', 'be', 'the', 'general', 'strategy', 'for',
 'attacking', 'a', 'problem', 'like', 'this']
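
The \w+ pattern splits "Let's" into 'Let' and 's', and the comparison against skip is case-sensitive, so "And" would slip through. A possible variation, as a sketch only: the pattern and the names SKIP, WORD_RE and words_without_stops are assumptions for illustration.

```python
import re

# Hypothetical skip set, same words as in the examples above.
SKIP = set(("and", "or", "but"))

# A run of word characters, optionally continued by apostrophe groups
# ("Let's", "don't"), so such words stay in one piece.
WORD_RE = re.compile(r"\w+(?:'\w+)*")

def words_without_stops(text, skip=SKIP):
    """Return the words of text, minus stop words (compared in lower case)."""
    return [w for w in WORD_RE.findall(text) if w.lower() not in skip]

print(words_without_stops("Let's read this file And skip 'or', but not more."))
```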

But there are so many ways to do it ...
