Hello Tutors,

Arabic words are built around a root of 3 or 4 consonants, with other letters in between as well as prefixes and suffixes. The root ktb (write), for example, can be found in words like:

ktab: book
mktob: letter, written
wktabhm: and their book
yktb: to write
lyktbha: in order for him to write it
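
To make the pattern concrete, here is a minimal sketch that checks whether the root consonants occur in a word in their original order, with anything allowed in between. The helper name contains_root is only an illustration and is not part of any library; the word list is just the examples above.

```python
def contains_root(root, word):
    """Return True if the letters of root occur in word in order (gaps allowed)."""
    letters = iter(word)
    # Testing `letter in letters` advances the iterator up to the match,
    # so each root letter has to be found after the previous one.
    return all(letter in letters for letter in root)


# The example words listed above, transliterated.
examples = ["ktab", "mktob", "wktabhm", "yktb", "lyktbha"]
for word in examples:
    print(word, contains_root("ktb", word))
```

Every example word passes this check. It is a loose test, though: it also accepts any unrelated word that happens to contain k, t and b in that order.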

I need to find all the word forms built from a given root in a corpus. My idea, which is not completely right but works most of the time, is to find the words that contain the letters of the root in their respective order. For example, the words that contain k, followed by t, followed by b, no matter what comes in between.

I came up with the following, which works fine. For learning purposes, please let me know whether this is a good way to do it, and how else I could achieve it. I appreciate your help, as I always have.

def getRoot(root, word):
    # Keep only the letters of word that also occur in root, in order
    result = ""
    for letter in word:
        if letter not in root:
            continue
        result += letter
    return result

# main
infile = open("myCorpus.txt").read().split()
query = "ktb"
# A word matches when its root letters, taken in order, spell the query exactly
outcome = set([word for word in infile if query == getRoot(query, word)])
for word in outcome:
    print(word)

--
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
http://emnawfal.googlepages.com
--------------------------------------------------------
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor