Dick Moores wrote:
> Kent Johnson wrote at 03:24 10/11/2005:
>
>>Dick Moores wrote:
>>
>>>(Execution took about 30 sec. with my computer.)
>>
>>That's way too long
>
>
> How long would you expect? I've already made some changes but haven't
> seen the time change much.

A couple of seconds at most, unless you are running it on some dog
computer. It's just not that much text and you should be able to process
it in a couple of passes at most.

What changes have you made? Several changes already posted should have a
noticeable effect, I think. What is your current code?

>>>5) Ideally, abbreviations that end in a period, such as U.N., e.g., i.e.,
>>>viz. op. cit., Mr. (Am. E.), etc., should not be stripped of their final
>>>periods (whereas other words that end a sentence SHOULD be stripped). I
>>>tried making and using a Python list of these, but it was too tough to
>>>write the code to use it. Any ideas?
>>
>>You should be able to do this with regular expressions or searching in
>>the word. You want to test for a word that ends with a period but
>>doesn't include any other periods. Something like
>>if word.endswith('.') and '.' not in word[:-1]:
>>    word = word[:-1]
>
>
> Nice! That takes care of U.N., e.g., i.e., but not viz., op. cit., or Mr.

Ah, right. I don't know how you could handle that except with a
dictionary. At least they will only appear in the word list once, without
the trailing period.

>>Other notes:
>>Use re.split() to do all the splits at once. Something like
>>  L = re.split(r'\s+|--|/', textAsString)
>
>
> Don't understand this yet. I'll work on it.

OK, it's a regular expression that will match either
  \s+   one or more whitespace characters, e.g. space, tab, newline
  --    two hyphens in a row
  /     a slash
re.split() then splits the string on each match.

>
>
>>#remove empty elements in L
>>while "" in L:
>>    L.remove("")
>>The above iterates L twice for each empty word!
>
>
> I don't get the twice. Could you spell it out, please?

The test /"" in L/ searches the list for an empty string - that's one.
L.remove("") searches the list again for the empty string, then removes
it - that's two.

>
>
>>The remove() calls are expensive too because the remaining elements of L
>>must be shifted down.
>>Do the whole thing in one pass over L with
>>  L = [ w for w in L if w ]
>>You only need to remove empty elements once, when the rest of the
>>processing is done.
>
>
> Got it. But using this doesn't seem to make much difference in the time.
>
> Also, I'm puzzled that whether or not psyco is employed makes no
> difference in the time. Can you explain why?

My guess is it's because you have so many O(n^2) elements in the code.
psyco can speed up each step, but it can't reduce the number of steps.
You have to get your algorithm to be O(n).

>
>
>>for e in saveRemovedForLaterL:
>>    L.append(e)
>>could be
>>L.extend(saveRemovedForLaterL)
>
>
> Are you recommending L.extend(saveRemovedForLaterL), or is it just
> another way to do it?

Recommending. Look for ways to eliminate loops. If you can't eliminate
them, move them into C code in the runtime, which is what this one does.

>
> Thanks very much for your help, Kent.
>
> Dick

No problem!

Kent

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
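
P.S. Putting the suggestions from this thread together, here is a rough
sketch of what a single-pass version might look like. The names
ABBREVIATIONS, strip_final_period and split_words are made up for
illustration, and the abbreviation set holds only the examples mentioned
above; I used a set rather than a dictionary since only membership
matters. Treat it as a starting point, not as your finished program.

```python
import re

# Hypothetical lookup table: only the abbreviations named in this thread.
# A real program would need a longer list.
ABBREVIATIONS = set(["viz.", "op.", "cit.", "Mr.", "etc."])

def strip_final_period(word):
    """Drop a sentence-ending period, but keep abbreviations intact."""
    if word in ABBREVIATIONS:
        return word                      # e.g. "Mr." stays "Mr."
    if word.endswith('.') and '.' not in word[:-1]:
        return word[:-1]                 # "bus." -> "bus"
    return word                          # "U.N." has an inner period, keep it

def split_words(text):
    """Split on whitespace, '--' and '/' in one call, then clean up once."""
    words = re.split(r'\s+|--|/', text)
    words = [strip_final_period(w) for w in words]
    # Remove empty strings in a single pass, after all other processing.
    return [w for w in words if w]

if __name__ == "__main__":
    print(split_words("Mr. Smith went to the U.N. today--viz. by car/bus."))
```

Each step walks the word list once, so the total work grows in proportion
to the size of the text instead of with its square.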