Re: [Tutor] Please look at my wordFrequency.py

Dick Moores Wed, 12 Oct 2005 01:48:09 -0700

John Fouhy wrote at 15:09 10/11/2005:
>On 12/10/05, Dick Moores <[EMAIL PROTECTED]> wrote:
>
>Hi Dick,
>
>Glad you're making progress :-)
>
> > Yes, that's about the difference I was seeing. Thanks for taking the
> > trouble. I went from 30 to 27. With no regex use (don't understand it
> > yet).
>
>Regular expressions are a power tool of text processing.  They are a
>bit tricky to learn, but once you've got the hang of them you'll find
>they can save you a lot of effort.  I'm not sure of a good tutorial,
>but if you look through the archives of this list, there's been a lot
>of discussion of them in the past :-)


Yes, but I think I need to go at something complex systematically. In my
bookcase I have the 1st edition of Friedl's _Mastering Regular
Expressions_ (O'Reilly, 1997). Because there is now a 2nd edition (2002),
I'd been ignoring what I have, but in the Python 2.4 Python Library
Reference, section 4.2, "re -- Regular expression operations", I find:

"
See Also: Mastering Regular Expressions. Book on regular expressions by
Jeffrey Friedl, published by O'Reilly. The second edition of the book no
longer covers Python at all, but the first edition covered writing good
regular expression patterns in great detail.
"

So I'd like to ask if there is agreement about the value of this book,
and if the 1st is really better than the 2nd ed. for regex in Python.

In Python Library Reference 4.2.1 Regular Expression Syntax there is this:

"A brief explanation of the format of regular expressions follows. For 
further information and a gentler presentation, consult the Regular 
Expression HOWTO, accessible from 
<http://www.python.org/doc/howto/>."

So I thought I'd start with that. What do you think? I'm not a complete
regex beginner, but I've forgotten most of what I once knew. A long time 
ago (12 years) I had a well-equipped dial-up shell (tcsh) account with 
helpful users (Netcom) and enjoyed learning something of the power of 
grep, etc.

The rest of what you wrote below is very clear and enlightening, and just 
right for my level. Are you a teacher? Ever thought about writing a 
Python book?

> > WOW! I didn't implement John's change because I didn't understand it.
> > Haven't dealt with dictionaries yet.
>
>Sorry about that.
>
>I think, basically, your project is in two parts: a "tokenization"
>part and a "counting" part.
>
>By tokenization, I mean the act of splitting up your input text into
>words (or "tokens").  This is the bit where there's no obvious right
>way, and you will probably need some ad-hoc code.
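
A minimal sketch of the tokenization step described above, assuming a
"word" is a run of ASCII letters (optionally with internal apostrophes) and
that case should be ignored. The name tokenize and the pattern are
illustrative choices, not code from this thread; other texts may well need
different rules for hyphens, digits, and so on.

```python
import re

# Assumption: a word is a run of letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Return the lowercase words of text, in order of appearance."""
    return WORD_RE.findall(text.lower())

words = tokenize("The cat's hat -- the CAT sat.")
print(words)  # ['the', "cat's", 'hat', 'the', 'cat', 'sat']
```
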
>
>Once you've done that, you move to the counting step: You have a list
>of words, and you want to count occurrences of each word.
>
>Think about how you would do such a task by hand with pen and paper.
>One approach you could take is to find the first word, then look
>through the document counting how many times the first word occurs.
>Then move to the second word, and look through the document again.
>You will end up looking through the entire document many times.
>
>But then you might think to yourself --- "When I count the first word,
>I have to _look at_ every other word in the document.  Why can't I
>count them at the same time?"  So you grab a piece of paper, write the
>words along the top, and underneath each word you keep a tally.  You
>look through your document, and for each word, you increase the
>corresponding tally.  When you're done, you have a count of every
>word, and you've only gone through your document once.

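# My code (L is the list of words, F the list of (count, word) pairs)
for word in L:
    k = L.count(word)         # rescans all of L for every word
    if (k, word) not in F:    # scans F as well
        F.append((k, word))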

I wrote this quickly and it worked correctly. I was even proud of myself.
But I didn't think about how it did what it did: that it would repeatedly
and unnecessarily look at every element of L, and do too much work with F
as well (as Kent pointed out). Take L.count(word). I had this vague notion
that it would take a quick look at all of L at once and come up with a
count without actually counting, the way I can see a group of starlings in
my back yard and know there are 4 without actually counting them. (With 5
or more, I have to count.) But computers are magical, and not limited to
4, was my non-thinking thinking.
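
To make the hidden loop visible, here is a sketch of what L.count(word) has
to do, written out by hand. The helper name count_by_hand is hypothetical;
the real list.count is implemented in C, but it still has to visit every
element.

```python
def count_by_hand(items, target):
    """Count occurrences of target by examining each element in turn."""
    total = 0
    for item in items:        # every element is examined, no shortcuts
        if item == target:
            total += 1
    return total

L = ["a", "b", "a", "c", "a"]
print(count_by_hand(L, "a"), L.count("a"))  # 3 3
```

Calling this once for each word in L means len(L) passes over len(L)
elements, so the work grows roughly with the square of the number of words.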

>This is the basic idea of the dictionary-based approach.  A dictionary
>is an efficient data structure for keeping these tallies.  In Big-O
>notation, looking up or updating a single tally is O(1) on average.
>Any Python tutorial can tell you more about dictionaries.
>
>Hope this helps :-)
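
A minimal sketch of the pen-and-paper tally using a dictionary, assuming
the words are already in a list. The names count_words and tallies are
illustrative, not from the original script. The last step rebuilds the same
kind of (count, word) list as F in the code earlier in this message.

```python
def count_words(words):
    """Tally each word in a single pass over the list."""
    tallies = {}
    for word in words:
        # dict.get returns 0 the first time a word is seen
        tallies[word] = tallies.get(word, 0) + 1
    return tallies

L = ["the", "cat", "the", "hat", "the"]
counts = count_words(L)
# (count, word) pairs, highest count first
F = sorted(((k, word) for word, k in counts.items()), reverse=True)
print(F)  # [(3, 'the'), (1, 'hat'), (1, 'cat')]
```
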
>
> > Ah. But how can I know what is in C code and what isn't? For example, in
> > a previous post you say that L.extend(e) is in C, and imply that
> > L.append(e) isn't, and that therefore L.extend(e) should be used.
>
>The first rule about optimization is --- don't.  Unless you're really
>sure that you need to.
>
>I think a good general strategy is:
>  - If there is a builtin function that does what you want, use it.
>This does require you to be familiar with the standard library.  The
>most important sections are probably the sections on strings, lists,
>and dictionaries.
>  - Know where loops occur.  For example, if you say something like "x
>in A" where A is a list, that may require checking every element of A.
>  Then think about which loops are necessary, and which could be
>avoided or combined with others.
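
A sketch of the "know where loops occur" point, assuming the items are
hashable (strings are). Testing "x in A" on a list may compare x against
every element, while the same test on a set is a hash lookup. The name
first_occurrences is illustrative.

```python
def first_occurrences(words):
    """Return each distinct word once, in order of first appearance."""
    seen = set()              # membership test is O(1) on average
    result = []
    for word in words:
        if word not in seen:  # no scan of a growing list here
            seen.add(word)
            result.append(word)
    return result

print(first_occurrences(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```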

Thank you, John.

Dick

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
