Emad Nawfal (عمـ نوفل ـاد) <emadnaw...@gmail.com> wrote:

> Dear Tutors,
> The purpose of this script is to see how many vocalized forms map to a
> single consonantal form. For example, the form "fn" could be fan, fin, fun.
>
> The input is a large list (taken from a file) that has ordinary words. The
> script creates a devocalized list, then compares the two lists.
>
> The problem: It takes over an hour to process 1 file. The average file size
> is 400,000 words.
>
> Question: How can I make it run faster? I have a large number of files.
>
> Note: I'm not a programmer, so please avoid very technical terms.
>
> Thank you in anticipation.
>
>
> def devocalize(word):
>     vowels = "aiou"
>     return "".join([letter for letter in word if letter not in vowels])
>
>
> vowelled = ['him', 'ham', 'hum', 'fun', 'fan']  # input, usually a large list of around 500,000 items
>
> vowelled = set(vowelled)
>
> unvowelled = set([devocalize(word) for word in vowelled])
>

Your problem is algorithmic: the last part below is unnecessary, and it is the one that consumes most of the time (a loop over all words nested inside a loop over all words). Instead, as you produce the unvowelled version of each lexeme, just feed a dictionary whose keys are the unvowelled forms and whose values are lists of the original vowelled lexemes. So, to replace the list comprehension that builds unvowelled above (untested):

wordmap = {}
for lexem in vowelled:
    unvowelled = devocalize(lexem)
    # add lexem to the list if unvowelled is already registered
    if unvowelled in wordmap:
        wordmap[unvowelled].append(lexem)
    # else register unvowelled with its first lexem
    else:
        wordmap[unvowelled] = [lexem]
for (unvowelled, lexems) in wordmap.items():  # items = list of (key, value) pairs
    print unvowelled, " ".join(lexems)

> for lex in unvowelled:
>     d = {}
>     d[lex] = [word for word in vowelled if devocalize(word) == lex]
>
>     print lex, " ".join(d[lex])
>

Note: If you really had to double loop over a whole lexicon, trick #1 for a big speed-up is to first split the list into parts according to (at least) the first letter (or whatever script character you use), and then run the whole process on each part separately.

    1 list of 100   -->  100 x 100 = 10000 loops
    10 lists of 10  -->  10 x (10 x 10) = 10 x 100 = 1000 loops

(In your case, it would be cleverer to split the words on their first _consonant_ character!)

Denis
________________________________
la vita e estrany

http://spir.wikidot.com/
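
PS: For anyone on Python 3, the single-pass dictionary idea above might be sketched as follows. This is only an illustration, not tested against the real data: the function name group_by_skeleton is invented here, and collections.defaultdict is swapped in for the explicit if/else. The vowel set "aiou" is copied unchanged from the original devocalize().

```python
from collections import defaultdict

VOWELS = set("aiou")  # same vowels as the original devocalize(); adjust for your data


def devocalize(word):
    """Return word with every letter found in VOWELS removed."""
    return "".join(letter for letter in word if letter not in VOWELS)


def group_by_skeleton(words):
    """Map each consonantal form to the sorted vocalized forms that reduce to it.

    Hypothetical helper name. Each word is devocalized exactly once, so the
    work grows with the number of words, not with its square.
    """
    groups = defaultdict(set)  # a set drops duplicate words, like set(vowelled)
    for word in words:
        groups[devocalize(word)].add(word)
    return {skeleton: sorted(forms) for skeleton, forms in groups.items()}


if __name__ == "__main__":
    vowelled = ['him', 'ham', 'hum', 'fun', 'fan']
    for skeleton, forms in sorted(group_by_skeleton(vowelled).items()):
        print(skeleton, " ".join(forms))
```

With the five sample words this groups 'fan' and 'fun' under 'fn', and 'ham', 'him' and 'hum' under 'hm'.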