2008/3/9 Kent Johnson <[EMAIL PROTECTED]>:

> Emad Nawfal wrote:
> > Dear Tutors,
> > I'm trying to get the most frequent words in an Arabic text. I wrote the
> > following code and tried it on English and it works fine, but when I try
> > it on Arabic, all I get is the slashes and x's.
>
> > import codecs
> > infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt', 'r', 'utf-8').read().split()
> > num = {}
> > for word in infile:
> >     if word not in num:
> >         num[word] = 0
> >     num[word] +=1
> > new = zip(num.values(), num.keys())
>
> Note that new is a list of pairs of (count, word), *not* a list of words.
>
> > new.sort()
> > new.reverse()
> > out = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt', 'w', 'utf-8')
> > for word in new:
> >         print >> out, word
>
> So here 'word' is a tuple, not a string.
>
> When you print a tuple, the output uses the repr() of each element of the
> tuple, not the str(). For strings in Python 2, this means that non-ASCII
> characters are always printed using backslash escapes.
>
> For example:
> In [19]: s='é'
> In [21]: print s
> é
> In [25]: t=(s,s)
> In [26]: print t
> ('\xc3\xa9', '\xc3\xa9')
>
> I suggest you format the output yourself. If you want the tuple
> formatting, try this:
>
> for count, word in new: # unpack the tuple to two values
>   out.write('(%s, %s)\n' % (count, word))
>
> Kent
>
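
Kent's point about tuples can be checked directly. The following is a minimal sketch, not part of the original exchange. It assumes Python 2.7 or Python 3.3+, and the names s, t and line are only illustrative. It shows that str() of a tuple is built from the repr() of each element, while formatting the elements yourself uses the text as it is.

```python
# -*- coding: utf-8 -*-
# Assumption: runs on Python 2.7 or Python 3.3+ (the u'' prefix is valid in both).

s = u'\u00e9'  # the character 'é', written as an escape so the file encoding does not matter
t = (s, s)

# str() of a tuple is assembled from the repr() of each element, never the str().
assert str(t) == '(%s, %s)' % (repr(s), repr(s))

# Unpacking the tuple into a format string uses the text itself:
# no quotes and no backslash escapes are added.
line = u'(%s, %s)' % t
assert line == u'(\u00e9, \u00e9)'
```

On Python 2 the repr() of a non-ASCII string contains backslash escapes, which is why the printed tuples looked like "slashes and x's". Python 3 changed repr() to keep printable non-ASCII characters, so the symptom differs there, but the tuple still shows each element's repr().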

Thank you so much, Kent. It works. I have now realized the drawbacks of
self-learning.
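
For reference, the whole task discussed in this thread could be sketched as below. This is an illustration under stated assumptions, not the code that was actually run: it assumes Python 2.7 or later (collections.Counter did not exist in the Python versions current when this thread was written), and the function names count_words and write_counts are hypothetical.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 2.7+ or Python 3; count_words and write_counts are
# hypothetical helper names, not from the original thread.
import io
from collections import Counter


def count_words(text):
    """Return (count, word) pairs for whitespace-separated words, most frequent first."""
    counts = Counter(text.split())
    # Sorting (count, word) pairs in reverse mirrors new.sort(); new.reverse().
    return sorted(((n, w) for w, n in counts.items()), reverse=True)


def write_counts(pairs, path):
    """Write one '(count, word)' line per pair, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for count, word in pairs:  # unpack the tuple so the word is written as text
            out.write(u'(%s, %s)\n' % (count, word))


# Three Arabic words, one of them repeated, written as escapes.
sample = u'\u0643\u062a\u0627\u0628 \u0642\u0644\u0645 \u0643\u062a\u0627\u0628'
pairs = count_words(sample)
```

To use it on a file, read the text with io.open(path, encoding='utf-8').read(), pass it to count_words, and hand the result to write_counts together with the output path.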

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
http://emnawfal.googlepages.com
--------------------------------------------------------
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
