On Thu, Dec 20, 2018 at 10:49:25AM -0500, Mary Sauerland wrote:

> I want to get rid of words that are less than three characters

> f1_name = "/Users/marysauerland/Documents/file1.txt"
> #the opinions
> f2_name = "/Users/marysauerland/Documents/file2.txt"
> #the constitution

Better than comments are meaningful file names:

    opinions_filename = "/Users/marysauerland/Documents/file1.txt"
    constitution_filename = "/Users/marysauerland/Documents/file2.txt"


> def read_words(words_file):
>     return [word.upper() for line in open(words_file, 'r') for word in line.split()]

Don't try to do too much in a single line of code. While technically
that should work (I haven't tried it to see that it actually does) it
would be better written as:

    def read_words(words_file):
        with open(words_file, 'r') as f:
            return [word.upper() for line in f for word in line.split()]

This also has the advantage of ensuring that the file is closed as soon
as the words are read. In your earlier version, nothing closes the file
explicitly, so it is possible for it to remain open for longer than you
expect.

Note that in this case Python's definition of "word" may not agree with
the human reader's definition of a word. For example, Python, being
rather simple-minded, will include punctuation in words, so that "HELLO"
and "HELLO." count as different words. Oh well. For now, let's just go
with the simple-minded definition of a word, and worry about adjusting
it to something more specialised later.


> read_words(f1_name)
> #performs the function on the file

The above line of code (and its comment) is pointless. The function is
called, the file is read, the words are generated, and then they are
immediately thrown away. To use the words, you need to assign them to a
variable, as you do below:


> set1 = set(read_words(f1_name))
> #makes each word into a set and removes duplicate words

A meaningful name is better. Also the comment is inaccurate: it is not
that *each individual* word is turned into a set, but that the *list* of
all the words is turned into a set.
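To see what that means in practice, here is a small sketch. The sample words are made up for illustration and are not taken from your files; they just stand in for whatever read_words() would return.

```python
# Made-up sample words, standing in for the output of read_words().
words = ["WE", "THE", "PEOPLE", "THE", "WE"]

# set() takes the whole list and keeps one copy of each distinct word.
unique_words = set(words)

print(len(words))         # 5 words in the list
print(len(unique_words))  # 3 distinct words in the set
```

The list as a whole goes in, and one set of distinct words comes out.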
So better would be:

    opinions_words = set(read_words(opinions_filename))
    constitution_words = set(read_words(constitution_filename))

This gives us the perfect opportunity to skip short words:

    opinions_words = set(
            word for word in read_words(opinions_filename) if len(word) >= 3)
    constitution_words = set(
            word for word in read_words(constitution_filename) if len(word) >= 3)

Now you have two sets of unique words, each word guaranteed to be at
least 3 characters long.

The next thing you try to do is count how many words appear in both
sets. You do it with an explicit loop:

> count_same_words = 0
> for word in set1:
>     if word in set2:
>         count_same_words += 1

but the brilliant thing about sets is that they already know how to do
this themselves! Let's see the sorts of operations sets understand:

    py> set1 = set("abcdefgh")
    py> set2 = set("defghijk")
    py> set1 & set2  # the intersection (overlap) of both sets
    {'h', 'd', 'f', 'g', 'e'}
    py> set1 | set2  # the union (combination) of both sets
    {'f', 'd', 'c', 'b', 'h', 'i', 'k', 'j', 'a', 'g', 'e'}
    py> set1 ^ set2  # items in one or the other but not both sets
    {'i', 'k', 'c', 'b', 'j', 'a'}
    py> set1 - set2  # items in set1 but not set2
    {'c', 'b', 'a'}

(In the above, "py>" is the Python prompt. On your computer, your prompt
is probably set to ">>>". The order in which a set displays its items is
arbitrary, so your output may be ordered differently.)

Can you see which set operation, one of & | ^ or - , you would use to
get the set of words which appear in both sets? Hint: it isn't the -
operation. If you wanted to know how many words appear in the
constitution but NOT in the opinions, you could write:

    word_count = len(constitution_words - opinions_words)

Does that give you a hint how to approach this?


Steve