Emad Nawfal (عمـ نوفل ـاد) <emadnaw...@gmail.com> dixit:

> Dear Tutors,
> The purpose of this script is to see how many vocalized forms map to a
> single consonantal form. For example, the form "fn" could be fan, fin, fun.
> 
> The input is a large list (taken from a file) that has ordinary words. The
> script creates a devocalized list, then compares the two lists.
> 
> The problem: It takes over an hour to process 1 file. The average file size
> is 400,000 words.
> 
> Question: How can I make it run faster? I have a large number of files.
> 
> Note: I'm not a programmer, so please avoid very technical terms.
> 
> Thank you in anticipation.
> 
> 
> 
> 
> 
> def devocalize(word):
>     vowels = "aiou"
>     return "".join([letter for letter in word if letter not in vowels])
> 
> 
> vowelled = ['him', 'ham', 'hum', 'fun', 'fan']  # input, usually a large list of around 500,000 items
> 
> vowelled = set(vowelled)
> 
> unvowelled = set([devocalize(word) for word in vowelled])
> 
Your problem is algorithmic: the last part, quoted below, is unnecessary, and it is 
what consumes most of the time (a loop over all words nested inside a loop over all 
words). Instead, as you produce each unvowelled version, feed a dictionary whose keys 
are the unvowelled forms and whose values are lists of the original vowelled words. 
So, to replace the list comprehension above and the loop quoted below (untested):

wordmap = {}
for lexem in vowelled:
    unvowelled = devocalize(lexem)
    # add lexem to the list if unvowelled is already registered
    if unvowelled in wordmap:
        wordmap[unvowelled].append(lexem)
    # else register unvowelled with its first lexem
    else:
        wordmap[unvowelled] = [lexem]

for (unvowelled, lexems) in wordmap.items():    # items() yields the (key, value) pairs
    print unvowelled, " ".join(lexems)

> for lex in unvowelled:
>     d = {}
>     d[lex] = [word for word in vowelled if devocalize(word) == lex]
> 
>     print lex, " ".join(d[lex])
> 
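By the way, the same grouping as in my loop above can be written a bit more compactly 
with collections.defaultdict from the standard library; a minimal untested sketch, 
using the same names as above:

from collections import defaultdict

wordmap = defaultdict(list)          # missing keys start out as an empty list
for lexem in vowelled:
    wordmap[devocalize(lexem)].append(lexem)

for (unvowelled, lexems) in wordmap.items():
    print unvowelled, " ".join(lexems)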

Note: if you really did have to double-loop over a whole lexicon, the number one trick 
to speed things up is to first split the list into parts keyed on (at least) the first 
letter (or whatever character of your script applies), and then run the whole process 
on each part separately:
1 list of 100 words  --> 10,000 inner loops
10 lists of 10 words --> 10 x 100 = 1,000 inner loops
(In your case, it would be smarter to split the words on their first _consonant_!)
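
For illustration, here is one rough way to build such buckets, reusing devocalize() 
from above (an untested sketch; the dict-of-lists layout is just one possibility):

buckets = {}    # first consonant -> list of vowelled words
for word in vowelled:
    consonants = devocalize(word)
    key = consonants[0] if consonants else ""   # first consonant, "" if the word has none
    buckets.setdefault(key, []).append(word)
# any all-against-all comparison now only needs to run inside each (much smaller) bucket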

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/