Emad Nawfal wrote:
> Dear Tutors,
> I'm trying to get the most frequent words in an Arabic text. I wrote the
> following code and tried it on English and it works fine, but when I try
> it on Arabic, all I get is the slashes and x's.
> import codecs
> infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt', 'r', 'utf-8').read().split()
> num = {}
> for word in infile:
>     if word not in num:
>         num[word] = 1
>     num[word] +=1
> new = zip(num.values(), num.keys())

Note that new is a list of pairs of (count, word), *not* a list of words.

> new.sort()
> new.reverse()
> outfile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt', 'w', 'utf-8')
> for word in new:
>     print >> out, word

So here 'word' is a tuple, not a string. When you print a tuple, the output is the repr() of the elements of the tuple, not the str() of the elements. For strings, this means that non-ASCII characters are always printed using backslash escapes. For example:

In [19]: s='é'

In [21]: print s
é

In [25]: t=(s,s)

In [26]: print t
('\xc3\xa9', '\xc3\xa9')

I suggest you format the output yourself. If you want the tuple formatting, try this:

for count, word in new: # unpack the tuple to two values
    out.write('(%s, %s)\n' % (count, word))

Kent
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
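
For reference, the suggestion above (unpack each (count, word) pair and format the pieces yourself instead of printing the tuple) could be put together roughly as follows. This is only a sketch, not code from the thread: the helper names count_words, format_counts and write_counts are made up for illustration, and io.open is used here in place of codecs.open. It also starts each count at 1 (in the quoted loop a word's first occurrence ends up counted twice) and uses a single name for the output file (the quoted code opens outfile but prints to out).

```python
import io


def count_words(text):
    """Return a dict mapping each whitespace-separated word to its count."""
    counts = {}
    for word in text.split():
        # get() returns 0 for a new word, so the first occurrence counts as 1
        counts[word] = counts.get(word, 0) + 1
    return counts


def format_counts(counts):
    """Return one '(count, word)' string per word, most frequent first."""
    pairs = sorted(((n, w) for w, n in counts.items()), reverse=True)
    # Unpack each pair and format the parts directly, so the word goes
    # through %s (its str form) rather than the repr() a tuple would use.
    return [u'(%s, %s)' % (n, w) for n, w in pairs]


def write_counts(text, path):
    """Write the formatted counts to path as UTF-8, one per line."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for line in format_counts(count_words(text)):
            out.write(line + u'\n')
```

Assuming the input has already been decoded to text, as the codecs.open(...).read() call in the question does, something like write_counts(text, 'counts.txt') should then produce readable Arabic in the output file rather than backslash escapes.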