Some things to think about, however. Can you choose the characters you want, instead of the (many, many) characters you don't want? This might simplify matters.
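To make that concrete, here is a rough sketch of the "keep what you want" idea. The set of wanted characters (ASCII letters, digits, apostrophe, hyphen) and the name keep_wanted are just my assumptions for illustration, and it is written for a current Python 3 rather than 2.4:

```python
import string

# Assumption: the characters worth keeping are ASCII letters, digits,
# the apostrophe, and the hyphen. Adjust this set to taste.
WANTED = set(string.ascii_letters + string.digits + "'-")

def keep_wanted(text):
    """Replace every character that is not in WANTED with a space."""
    return "".join(c if c in WANTED else " " for c in text)

sample = '"Well," said Mr. Dick, "it\'s half-past two!"'
print(keep_wanted(sample).split())
```

Everything not on the list becomes a space, so a plain split() then gives you the candidate words without you ever having to enumerate the unwanted characters.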
What do most words look like in English? When is a hyphen part of a word, and what about dashes? A dash in the middle of a sentence means something different from a hyphen at the end of a line.
As for other special/impossible cases: what is the difference between dogs' and 'the dogs' when you are counting words? What about acronyms written like S.P.E.C.T.R.E, or words that include a number, like 1st? You can add point 5 to the list. Which cases are more common, and which collide with other rules? What is the smallest number of rules you can define that takes the biggest chunk out of the problem?
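One possible way to express a small set of such rules is a single regular expression, tried in order. This is only a sketch under my own assumptions (two rules, ASCII text, lowercasing before counting); the names WORD_RE and tokenize are made up for the example, and it is written for Python 3:

```python
import re
from collections import Counter

# Two rules, tried in this order at each position:
#   1. dotted abbreviations such as U.N., e.g. or S.P.E.C.T.R.E
#   2. runs of letters/digits, optionally joined by single internal
#      apostrophes or hyphens (it's, half-past, 1st)
WORD_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}(?:[A-Za-z](?![A-Za-z0-9]))?
  | [A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*
""", re.VERBOSE)

def tokenize(text):
    """Return the lowercased words found in text."""
    return WORD_RE.findall(text.lower())

sample = ("The dogs' bowls, 'the dogs' said, e.g. the U.N. and "
          "S.P.E.C.T.R.E met on the 1st -- half-past two.")
counts = Counter(tokenize(sample))
print(counts.most_common(3))
```

With these rules dogs' and 'the dogs' both count as plain "dogs", a double dash separates words while a single internal hyphen joins them, and "e.g." keeps its final period while "two." loses it. It will still get some cases wrong (an initial followed by a period at the end of a sentence, for instance), which is where the question of which cases are common enough to matter comes in.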
That is enough of my random rambling.
A lot of it might fall into place once you try what Danny suggested. He is a smart guy. I find that writing clearly defined, self-contained functions that each do one specific task usually helps to reduce a problem to more manageable steps, and to visualize what is happening more clearly.
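As an illustration of the self-contained-functions idea only (this is not your script, and the function names and the punctuation list are my own invention), a word counter might be split up something like this in Python 3:

```python
from collections import Counter

def normalize(text):
    """Lowercase the text so 'The' and 'the' count as the same word."""
    return text.lower()

def split_words(text):
    """Split on whitespace, then strip punctuation from both ends of each piece."""
    pieces = (piece.strip(".,;:!?\"'()[]") for piece in text.split())
    return [piece for piece in pieces if piece]

def count_words(words):
    """Map each word to the number of times it occurs."""
    return Counter(words)

def format_report(counts, top=3):
    """Return one formatted line per word, most frequent first."""
    return ["%-10s %d" % (word, n) for word, n in counts.most_common(top)]

def word_frequency(text, top=3):
    """Run the whole pipeline, one small step at a time."""
    return format_report(count_words(split_words(normalize(text))), top)

for line in word_frequency("The cat saw the dog. The dog ran."):
    print(line)
```

Each function can be tried on its own at the interactive prompt with a short string, which also helps with debugging: when the output is wrong you can check one step at a time instead of sprinkling print statements through one long script.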
Good luck,
Andrew
On 10/10/05, Dick Moores <[EMAIL PROTECTED]> wrote:
Script is at:
<http://www.rcblue.com/Python/wordFrequency/wordFrequency.txt>
Example text file for input:
<http://www.rcblue.com/Python/wordFrequency/first3000linesOfDavidCopperfield.txt>
(142 kb)
(from <http://www.gutenberg.org/etext/766>)
Example output in file:
<http://www.rcblue.com/Python/wordFrequency/outputToFile.txt>
(40 kb)
(Execution took about 30 sec. with my computer.)
I worked on this a LONG time for something I expected to just be an easy
and possibly useful exercise. Three times I started completely over with
a new approach. Had a lot of trouble removing exactly the characters I
didn't want to appear in the output. Wished I knew how to debug other
than just by using a lot of print statements.
Specifically, I'm hoping for comments on or help with:
1) How to debug. I'm using v2.4, IDLE on Win XP.
2) I've tried to put in remarks that will help most anyone to understand
what the code is doing. Have I succeeded?
3) No modularization. Couldn't see a reason to do so. Is there one or two?
Specifically, what sections should become modules, if any?
4) Variable names. I gave up on making them self-explanatory. Instead, I
put in some remarks near the top of the script (lines 6-10) that I hope
do the job. Do they? In the code, does the "L to newL to L to newL to L"
kind of thing remain puzzling?
(lines 6-10)
# meaning of short variable names:
# S is a string
# c is a character of a string
# L, F are lists
# e is an element of a list
5) Ideally, abbreviations that end in a period, such as U.N., e.g., i.e.,
viz., op. cit., Mr. (Am. E.), etc., should not be stripped of their final
periods (whereas other words that end a sentence SHOULD be stripped). I
tried making and using a Python list of these, but it was too tough to
write the code to use it. Any ideas? (I can live very easily without a
solution to point 5, because if the output shows there are 10 "e.g"s,
I'll just assume, and I think safely, that there actually are 10 "e.g."s.
But I am curious, Pythonically.)
Thanks very much in advance, tutors.
Dick Moores
[EMAIL PROTECTED]
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor