Re: [Tutor] Most common words in a text file

2017-10-01 Thread Mats Wichmann
On 09/30/2017 11:12 AM, Sri G. wrote:
> I'm learning programming with Python.
> 
> I’ve written the code below for finding the most common words in a text
> file that has about 1.1 million words. It's working fine, but I believe
> there is always room for improvement.
> 
> When run, the function in the script gets a text file from the command-line
> argument sys.argv[1], opens the file in read mode, converts the text to
> lowercase, makes a list of words from the text after removing any
> whitespaces or empty strings, and stores the list elements as dictionary
> keys and values in a collections.Counter object. Finally, it returns a
> dictionary of the most common words and their counts. The
> words.most_common() method gets its argument from the optional top
>  parameter.
> 
> import sysimport collections
> def find_most_common_words(textfile, top=10):
> ''' Returns the most common words in the textfile.'''
> 
> textfile = open(textfile)
> text = textfile.read().lower()
> textfile.close()
> words = collections.Counter(text.split()) # how often each word appears
> 
> return dict(words.most_common(top))
> 
> filename = sys.argv[1]
> top_five_words = find_most_common_words(filename, 5)
> 
> I need your comments please.

Others have made some pertinent comments already.

How much you spend time to "improve" a bit of code depends on what
you're going to do with it.  If you've solved your problem, and it's a
one-shot: don't worry much about it. Nothing wrong with a bit of
throwaway code (although things do sometimes take on a life much longer
than you intended, I can say from experience!!!)

I'd ask a question or two to think about:  first off, if you know the
intended use of this function always will be to get a "top 10 list" -
then why convert it back to a dict (unsorted) to return after
Counter.most_common() has just given you a sorted list?  You're most
likely going to have to take your top_five_words dictionary and turn it
back into something sorted to report back out on the counts.

Second, if your function is likely to be called from many places in your
code, is it too specialized?  Can you design it as a more general API
that could be used for other things than this specific purpose?  For
example, perhaps you just want to return the Counter instance without
the further processing of "most_common", and let the client decide what
it wants to do with that object.

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Most common words in a text file

2017-10-01 Thread Alan Gauld via Tutor
On 30/09/17 18:12, Sri G. wrote:

> import sysimport collections

I assume that should be two lines?

But you can also import multiple modules on a single line.

import sys, collections

Although some folks don't like that style.

> def find_most_common_words(textfile, top=10):
> ''' Returns the most common words in the textfile.'''

The comment is slightly inaccurate since you really
return a dict of the most common words *with the counts* added.
It is good practice to specify the precise return
type (list, tuple, dict etc) since that tells the user
what they can do with it once they have it.

Also by using the parameter textfile it is not clear
whether I should pass a file object or a file name.
Again it helps users if the comment is as specific
as possible.

> textfile = open(textfile)
> text = textfile.read().lower()

potential memory hog, others have already suggested
reading line by line

> textfile.close()
> words = collections.Counter(text.split()) # how often each word appears
> 
> return dict(words.most_common(top))

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Most common words in a text file

2017-10-01 Thread Mark Lawrence via Tutor

On 30/09/2017 18:12, Sri G. wrote:

I'm learning programming with Python.

I’ve written the code below for finding the most common words in a text
file that has about 1.1 million words. It's working fine, but I believe
there is always room for improvement.

When run, the function in the script gets a text file from the command-line
argument sys.argv[1], opens the file in read mode, converts the text to
lowercase, makes a list of words from the text after removing any
whitespaces or empty strings, and stores the list elements as dictionary
keys and values in a collections.Counter object. Finally, it returns a
dictionary of the most common words and their counts. The
words.most_common() method gets its argument from the optional top
  parameter.

import sysimport collections
def find_most_common_words(textfile, top=10):
 ''' Returns the most common words in the textfile.'''

 textfile = open(textfile)
 text = textfile.read().lower()
 textfile.close()


The modern Pythonic way is:-

with open(textfile) as textfile:
text = textfile.read().lower()

The file close is handled automatically for you.  For those who don't 
know this construct using the "with" keyword is called a context 
manager, here's an article about them 
https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/



 words = collections.Counter(text.split()) # how often each word appears

 return dict(words.most_common(top))

filename = sys.argv[1]


How about some error handling if the user forgets the filename?  The 
Pythonic way is to use a try/except looking for an IndexError, but 
there's nothing wrong with checking the length of sys.argv.



top_five_words = find_most_common_words(filename, 5)

I need your comments please.

Sri


Pretty good all in all :)

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

---
This email has been checked for viruses by AVG.
http://www.avg.com


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Most common words in a text file

2017-09-30 Thread Cameron Simpson

On 30Sep2017 22:42, Sri G.  wrote:

I’ve written the code below for finding the most common words in a text
file that has about 1.1 million words. It's working fine, but I believe
there is always room for improvement.

When run, the function in the script gets a text file from the command-line
argument sys.argv[1], opens the file in read mode, converts the text to
lowercase, makes a list of words from the text after removing any
whitespaces or empty strings, and stores the list elements as dictionary
keys and values in a collections.Counter object. Finally, it returns a
dictionary of the most common words and their counts. The
words.most_common() method gets its argument from the optional top
parameter.


Thank you, this is a nice clear explaination.


import collections


It is common to import specific names from modules; whether you just import the 
module or specific names depends on your needs. However, in this case, instead 
of importing "collections" and creating a counter with:


 collections.Counter()

I would import the name "Counter" and use it like this:

 from collections import Counter
 ...
 ... Counter() ...


def find_most_common_words(textfile, top=10):


It is often good form to externalise the "10" as a constant (Python doesn't 
have constants per se, but it has a convention for value which would be 
constants in other languages):


 DEFAULT_TOP_SIZE = 10

 def find_most_common_words(textfile, top=DEFAULT_TOP_SIZE):

In a library of functions you can then put all the "constants" up the top where 
they can be seen or modified, and reference then lower down where the functions 
are defined.



   ''' Returns the most common words in the textfile.'''
   textfile = open(textfile)


You shouldn't reuse "textfile" as a file. Keep the filename value distinct. I 
tend to use names like "textpath" for the filename and "textfile" for the open 
file, eg:


 textfile = open(textpath)

Of course you change the parameter name to match.


   textfile = open(textfile)
   text = textfile.read().lower()
   textfile.close()


The preferred way of opening files looks like this:

 with open(textpath) as textfile:
   ... work on the data from textfile ...
   ...
   ...

This automatically closes textfile on exiting the "with" suite, even if an 
exception occurs (which is very handy in situations where the exception might 
be caught in outer code - you never miss closing the file). It is also shorter 
and more readable.



   text = textfile.read().lower()


This code reads the entire text into memory. It is more frugal to read the file 
progressively, which is good because it scales to files of any size. This 
concern is prominent in my mind because I started programming on systems with a 
16 bit address space, so 64K max including the program itself _and_ the OS. But 
it applies even today when many data files are very large.


So one might process file file as lines of text:

 C = Counter()
 for line in textfile:
   words = line.lower().split()
   C.update(words)

This avoids the need for more than a single line to be stored in memory at any 
given time.


Cheers,
Cameron Simpson  (formerly c...@zip.com.au)
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


[Tutor] Most common words in a text file

2017-09-30 Thread Sri G.
I'm learning programming with Python.

I’ve written the code below for finding the most common words in a text
file that has about 1.1 million words. It's working fine, but I believe
there is always room for improvement.

When run, the function in the script gets a text file from the command-line
argument sys.argv[1], opens the file in read mode, converts the text to
lowercase, makes a list of words from the text after removing any
whitespaces or empty strings, and stores the list elements as dictionary
keys and values in a collections.Counter object. Finally, it returns a
dictionary of the most common words and their counts. The
words.most_common() method gets its argument from the optional top
 parameter.

import sysimport collections
def find_most_common_words(textfile, top=10):
''' Returns the most common words in the textfile.'''

textfile = open(textfile)
text = textfile.read().lower()
textfile.close()
words = collections.Counter(text.split()) # how often each word appears

return dict(words.most_common(top))

filename = sys.argv[1]
top_five_words = find_most_common_words(filename, 5)

I need your comments please.

Sri
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor