All I want to do is convert the &#x....; character references into Hebrew, 
without touching the rest. I don't care that the text has invalid HTML tags.
I don't understand why a decoding function raises an exception on invalid 
tags... how are the two issues related?
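
For reference, the link between the two is that the unescape function quoted below hands the whole text to expat, which is an XML parser and rejects anything that is not well-formed XML, such as an unclosed <br>. A minimal sketch that avoids the parser and only rewrites numeric character references with a regular expression could look like the following. It assumes Python 3 (on Python 2 the equivalent of chr would be unichr), and the names decode_numeric_refs and sample are made up for illustration:

```python
import re

# Matches numeric character references only: hex (&#x5DE;) or decimal (&#1502;).
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace numeric character references; leave tags and all other text alone."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the reference as written.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


# Hypothetical input with tags that are not well-formed XML.
sample = '<div dir="rtl">&#x5E9;&#x5DC;&#x5D5;&#x5DD;<br><img src="a.png"></div>'
print(decode_numeric_refs(sample))
```

Because nothing is parsed, tags like <br> and <img ...> pass through unchanged. Named references such as &amp; are also left alone by this sketch; on Python 3.4 and later, html.unescape from the standard library decodes those as well, again without parsing tags.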


On Saturday, June 2, 2012 1:46:18 AM UTC-4, mweissen wrote:
>
> Hi Udi,
>
> I have tried it once again.
> I had to change from  <br> to <br/> and from <img...> to <img...></img>.
> "unescape" works now!
>
> But it would be interesting to learn more about this problem. Could you 
> please answer Massimo's questions? 
>
> 2012/6/2 Massimo Di Pierro <[email protected]>
>
>> I am not sure there is an error here. Is the problem that the characters 
>> are not displayed properly? Are you using a custom layout? If so, is it 
>> setting the utf8 encoding or does it tell the browser it is latin1?
>>
>> On Friday, 1 June 2012 15:45:45 UTC-5, Udi Milo wrote:
>>>
>>> It does, but not completely.
>>>
>>> As it turns out, what I copy-pasted was only part of it, and your function 
>>> does work perfectly on that part. When I try to run it on the entire text, 
>>> I get errors that I can't figure out. Maybe you can help me once more?
>>> here is the complete text:
>>>
>>> <div><div class="post">
>>>         
>>>         <div dir="rtl">...
>>>
>>
>>>                     </div>
>>>
>>> Thanks!
>>>
>>>
>>> On Friday, June 1, 2012 1:45:32 AM UTC-4, mweissen wrote:
>>>>
>>>> I have found at http://wiki.python.org/moin/EscapingXml :
>>>>
>>>> import xml.parsers.expat
>>>>
>>>> def unescape(s):
>>>>     want_unicode = False
>>>>     if isinstance(s, unicode):
>>>>         s = s.encode("utf-8")
>>>>         want_unicode = True
>>>>
>>>>     # the rest of this assumes that `s` is UTF-8
>>>>     list = []
>>>>
>>>>     # create and initialize a parser object
>>>>     p = xml.parsers.expat.ParserCreate("utf-8")
>>>>     p.buffer_text = True
>>>>     p.returns_unicode = want_unicode
>>>>     p.CharacterDataHandler = list.append
>>>>
>>>>     # parse the data wrapped in a dummy element
>>>>     # (needed so the "document" is well-formed)
>>>>     p.Parse("<e>", 0)
>>>>     p.Parse(s, 0)
>>>>     p.Parse("</e>", 1)
>>>>
>>>>     # join the extracted strings and return
>>>>     es = ""
>>>>     if want_unicode:
>>>>         es = u""
>>>>     return es.join(list)
>>>>
>>>> With
>>>>
>>>> t="""&#x5DE;&#x5E4;&#x5EA;&#**x5D7;&#x5D9;&#x5DD; 
>>>> &#x5E8;&#x5D1;&#x5D9;&#x5DD; &#x5DE;&#x5D1;&#x5E7;&#x5E9;&#**x5D9;&#x5DD; 
>>>> &#x5D0;&#x5EA; &#x5E2;&#x5D6;&#x5E8;&#x5EA;&#**x5D9; 
>>>> &#x5D1;&#x5E4;&#x5EA;&#x5E8;&#**x5D5;&#x5DF; 
>>>> &#x5D1;&#x5E2;&#x5D9;&#x5D5;&#**x5EA; &#x5E9;&#x5DC; 
>>>> &#x5D1;&#x5D9;&#x5E6;&#x5D5;&#**x5E2;&#x5D9; Visual Studio. 
>>>> \n&#x5D1;&#x5D3;&#x201D;&#**x5DB; &#x5D0;&#x5EA; &#x5E8;&#x5D5;&#x5D1; 
>>>> &#x5D4;&#x5D1;&#x5E2;&#x5D9;&#**x5D5;&#x5EA; 
>>>> &#x5E0;&#x5D9;&#x5EA;&#x5DF; &#x5DC;&#x5E4;&#x5EA;&#x5D5;&#**x5E8; 
>>>> &#x5D9;&#x5D7;&#x5E1;&#x5D9;&#**x5EA; 
>>>> &#x5D1;&#x5E7;&#x5DC;&#x5D5;&#**x5EA;, 
>>>> \n&#x5D5;&#x5DB;&#x5DB;&#x5DC; &#x5E9;&#x5E2;&#x5D5;&#x5D1;&#**x5E8; 
>>>> &#x5D4;&#x5D6;&#x5DE;&#x5DF; &#x5D0;&#x5E0;&#x5D9; 
>>>> &#x5DE;&#x5D5;&#x5E6;&#x5D0; &#x5D0;&#x5EA; &#x5E2;&#x5E6;&#x5DE;&#x5D9; 
>>>> &#x5DE;&#x5E1;&#x5E4;&#x5E7; &#x5E4;&#x5D7;&#x5D5;&#x5EA; &#x5D0;&#x5D5; 
>>>> &#x5D9;&#x5D5;&#x5EA;&#x5E8; &#x5D0;&#x5EA; &#x5D0;&#x5D5;&#x5EA;&#x5DF; 
>>>> &#x5D4;&#x5EA;&#x5E9;&#x5D5;&#**x5D1;&#x5D5;&#x5EA;, \n&#x5DE;&#x5D4; 
>>>> &#x5E9;&#x5D2;&#x5E8;&#x5DD; &#x5DC;&#x5D9; &#x5DC;&#x5D7;&#x5E9;&#x5D5;&#
>>>> **x5D1; &#x5E9;&#x5DB;&#x5E0;&#x5E8;&#**x5D0;&#x5D4; 
>>>> &#x5D4;&#x5D2;&#x5D9;&#x5E2; &#x5D4;&#x5D6;&#x5DE;&#x5DF; 
>>>> &#x5DC;&#x5D4;&#x5E2;&#x5DC;&#**x5D5;&#x5EA; 
>>>> &#x5D0;&#x5D5;&#x5EA;&#x5DF; &#x5D1;&#x5E6;&#x5D5;&#x5E8;&#**x5D4; 
>>>> &#x5DE;&#x5E1;&#x5D5;&#x5D3;&#**x5E8;&#x5EA; 
>>>> &#x5DC;&#x5E4;&#x5D5;&#x5E1;&#**x5D8;."""
>>>> print unescape (t)
>>>>
>>>> the result is
>>>>
>>>> מפתחים רבים מבקשים את עזרתי בפתרון בעיות של ביצועי Visual Studio. 
>>>> בד”כ את רוב הבעיות ניתן לפתור יחסית בקלות, 
>>>> וככל שעובר הזמן אני מוצא את עצמי מספק פחות או יותר את אותן התשובות, 
>>>> מה שגרם לי לחשוב שכנראה הגיע הזמן להעלות אותן בצורה מסודרת לפוסט.
>>>>
>>>> I hope it helps.
>>>> Regards Martin
>>>>
>>>> 2012/6/1 Udi Milo <[email protected]>
>>>>
>>>>> Part of my product receives user text, saves it and shows it later.
>>>>>
>>>>> One of my users added the Hebrew text attached below, and I do not know 
>>>>> how to convert it into letters instead of hex codes.
>>>>> A simple text.encode('UTF-8') doesn't work, and I am far from being an 
>>>>> expert in the subject. Can someone help me out?
>>>>>
>>>>> see attached text:
>>>>>
>>>>> &#x5DE;&#x5E4;&#x5EA;&#x5D7;&#x5D9;&#x5DD; 
>>>>> &#x5E8;&#x5D1;&#x5D9;&#x5DD; &#x5DE;&#x5D1;&#x5E7;&#x5E9;&#x5D9;&#x5DD; 
>>>>> &#x5D0;&#x5EA; &#x5E2;&#x5D6;&#x5E8;&#x5EA;&#x5D9; 
>>>>> &#x5D1;&#x5E4;&#x5EA;&#x5E8;&#x5D5;&#x5DF; 
>>>>> &#x5D1;&#x5E2;&#x5D9;&#x5D5;&#x5EA; &#x5E9;&#x5DC; 
>>>>> &#x5D1;&#x5D9;&#x5E6;&#x5D5;&#x5E2;&#x5D9; Visual Studio. 
>>>>> &#x5D1;&#x5D3;&#x201D;&#x5DB; &#x5D0;&#x5EA; &#x5E8;&#x5D5;&#x5D1; 
>>>>> &#x5D4;&#x5D1;&#x5E2;&#x5D9;&#x5D5;&#x5EA; 
>>>>> &#x5E0;&#x5D9;&#x5EA;&#x5DF; &#x5DC;&#x5E4;&#x5EA;&#x5D5;&#x5E8; 
>>>>> &#x5D9;&#x5D7;&#x5E1;&#x5D9;&#x5EA; &#x5D1;&#x5E7;&#x5DC;&#x5D5;&#x5EA;, 
>>>>> &#x5D5;&#x5DB;&#x5DB;&#x5DC; &#x5E9;&#x5E2;&#x5D5;&#x5D1;&#x5E8; 
>>>>> &#x5D4;&#x5D6;&#x5DE;&#x5DF; &#x5D0;&#x5E0;&#x5D9; 
>>>>> &#x5DE;&#x5D5;&#x5E6;&#x5D0; &#x5D0;&#x5EA; &#x5E2;&#x5E6;&#x5DE;&#x5D9; 
>>>>> &#x5DE;&#x5E1;&#x5E4;&#x5E7; &#x5E4;&#x5D7;&#x5D5;&#x5EA; &#x5D0;&#x5D5; 
>>>>> &#x5D9;&#x5D5;&#x5EA;&#x5E8; &#x5D0;&#x5EA; &#x5D0;&#x5D5;&#x5EA;&#x5DF; 
>>>>> &#x5D4;&#x5EA;&#x5E9;&#x5D5;&#x5D1;&#x5D5;&#x5EA;, 
>>>>> &#x5DE;&#x5D4; &#x5E9;&#x5D2;&#x5E8;&#x5DD; &#x5DC;&#x5D9; 
>>>>> &#x5DC;&#x5D7;&#x5E9;&#x5D5;&#x5D1; 
>>>>> &#x5E9;&#x5DB;&#x5E0;&#x5E8;&#x5D0;&#x5D4; 
>>>>> &#x5D4;&#x5D2;&#x5D9;&#x5E2; &#x5D4;&#x5D6;&#x5DE;&#x5DF; 
>>>>> &#x5DC;&#x5D4;&#x5E2;&#x5DC;&#x5D5;&#x5EA; 
>>>>> &#x5D0;&#x5D5;&#x5EA;&#x5DF; &#x5D1;&#x5E6;&#x5D5;&#x5E8;&#x5D4; 
>>>>> &#x5DE;&#x5E1;&#x5D5;&#x5D3;&#x5E8;&#x5EA; 
>>>>> &#x5DC;&#x5E4;&#x5D5;&#x5E1;&#x5D8;.
>>>>>
>>>>
>>>>
>>>>
>>>> 
