All I want is to turn the &#xNNNN; character references into Hebrew letters, without touching the rest of the text. I don't care that it contains invalid HTML tags. I don't understand why a decoding function raises an exception on invalid tags. How are the two issues related?
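On how the two are related: the `unescape` function quoted below feeds the text to expat, which is an XML parser. Decoding the character references is a side effect of parsing, so the whole input has to be well-formed XML, and an unclosed `<br>` or `<img ...>` makes the parser raise before anything is returned. A minimal sketch that avoids the parser and rewrites only numeric character references could look like the following. The name `decode_numeric_refs` is made up for illustration, and the sketch assumes the input is a unicode string (`unicode` on Python 2, `str` on Python 3). Named entities such as `&amp;` are deliberately left alone.

```python
import re

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3

# Matches only numeric character references such as &#x5D9; or &#1497;
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace numeric character references; leave tags and all other text as-is.

    `text` is assumed to be a unicode string (hypothetical helper, not part
    of web2py or the standard library).
    """
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return unichr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: keep the reference exactly as written
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


# Invalid or unclosed tags pass through untouched because nothing parses them
sample = u'<br>&#x5E9;&#x5DC;&#x5D5;&#x5DD;<img src="a.png">'
print(decode_numeric_refs(sample))
```

Because this is a plain regular-expression substitution, it never looks at the markup at all, which is the behaviour asked for here. The trade-off is that it does not validate or clean the HTML in any way.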
On Saturday, June 2, 2012 1:46:18 AM UTC-4, mweissen wrote:
>
> Hi Udi,
>
> I have tried it once again.
> I had to change from <br> to <br/> and from <img...> to <img...></img>.
> "unescape" works now!
>
> But it would be interesting to learn more about this problem. Could you
> please answer Massimo's questions?
>
> 2012/6/2 Massimo Di Pierro <[email protected]>
>
>> I am not sure there is an error here. Is the problem that the characters
>> are not displayed properly? Are you using a custom layout? If so, is it
>> setting the utf8 encoding or does it tell the browser it is latin1?
>>
>> On Friday, 1 June 2012 15:45:45 UTC-5, Udi Milo wrote:
>>>
>>> It does, but not completely.
>>>
>>> As it turns out, what I copy-pasted was only part of it, and your function
>>> does work perfectly. When I try to run it on the entire text, I get errors
>>> that I can't figure out. Maybe you can help me once more?
>>> Here is the complete text:
>>>
>>> <div><div class="post">
>>>
>>> <div dir="rtl">...
>>>
>>> </div>
>>>
>>> Thanks!
>>>
>>> On Friday, June 1, 2012 1:45:32 AM UTC-4, mweissen wrote:
>>>>
>>>> I have found at
>>>> http://wiki.python.org/moin/EscapingXml
>>>> :
>>>>
>>>> import xml.parsers.expat
>>>>
>>>> def unescape(s):
>>>>     want_unicode = False
>>>>     if isinstance(s, unicode):
>>>>         s = s.encode("utf-8")
>>>>         want_unicode = True
>>>>
>>>>     # the rest of this assumes that `s` is UTF-8
>>>>     list = []
>>>>
>>>>     # create and initialize a parser object
>>>>     p = xml.parsers.expat.ParserCreate("utf-8")
>>>>     p.buffer_text = True
>>>>     p.returns_unicode = want_unicode
>>>>     p.CharacterDataHandler = list.append
>>>>
>>>>     # parse the data wrapped in a dummy element
>>>>     # (needed so the "document" is well-formed)
>>>>     p.Parse("<e>", 0)
>>>>     p.Parse(s, 0)
>>>>     p.Parse("</e>", 1)
>>>>
>>>>     # join the extracted strings and return
>>>>     es = ""
>>>>     if want_unicode:
>>>>         es = u""
>>>>     return es.join(list)
>>>>
>>>> With
>>>>
>>>> t="""מפת&#x5D7;ים
>>>> רבים מבקש&#x5D9;ם
>>>> את עזרת&#x5D9;
>>>> בפתר&#x5D5;ן
>>>> בעיו&#x5EA; של
>>>> ביצו&#x5E2;י Visual Studio.
>>>> \nבד”&#x5DB; את רוב
>>>> הבעי&#x5D5;ת
>>>> ניתן לפתו&#x5E8;
>>>> יחסי&#x5EA;
>>>> בקלו&#x5EA;,
>>>> \nוככל שעוב&#x5E8;
>>>> הזמן אני
>>>> מוצא את עצמי
>>>> מספק פחות או
>>>> יותר את אותן
>>>> התשו&#x5D1;ות, \nמה
>>>> שגרם לי לחשו&#x5D1;
>>>> שכנר&#x5D0;ה
>>>> הגיע הזמן
>>>> להעל&#x5D5;ת
>>>> אותן בצור&#x5D4;
>>>> מסוד&#x5E8;ת
>>>> לפוס&#x5D8;."""
>>>> print unescape (t)
>>>>
>>>> the result is
>>>>
>>>> מפתחים רבים מבקשים את עזרתי בפתרון בעיות של ביצועי Visual Studio.
>>>> בד”כ את רוב הבעיות ניתן לפתור יחסית בקלות,
>>>> וככל שעובר הזמן אני מוצא את עצמי מספק פחות או יותר את אותן התשובות,
>>>> מה שגרם לי לחשוב שכנראה הגיע הזמן להעלות אותן בצורה מסודרת לפוסט.
>>>>
>>>> I hope it helps.
>>>> Regards Martin
>>>>
>>>> 2012/6/1 Udi Milo <[email protected]>
>>>>
>>>>> part of my product receives user text, saves it and shows it later.
>>>>>
>>>>> One of my users added a Hebrew text (attached below) and I do not know
>>>>> how to translate it into letters instead of hex.
>>>>> A simple text.encode('UTF-8') doesn't work, and I am far from being an
>>>>> expert in the subject. Can someone help me out?
>>>>>
>>>>> See attached text:
>>>>>
>>>>> מפתח&#x5D9;ם
>>>>> רבים מבקש&#x5D9;ם
>>>>> את עזרת&#x5D9;
>>>>> בפתר&#x5D5;ן
>>>>> בעיו&#x5EA; של
>>>>> ביצו&#x5E2;י Visual Studio.
>>>>> בד”כ את רוב
>>>>> הבעי&#x5D5;ת
>>>>> ניתן לפתו&#x5E8;
>>>>> יחסי&#x5EA; בקלו&#x5EA;,
>>>>> וככל שעוב&#x5E8;
>>>>> הזמן אני
>>>>> מוצא את עצמי
>>>>> מספק פחות או
>>>>> יותר את אותן
>>>>> התשו&#x5D1;ות,
>>>>> מה שגרם לי
>>>>> לחשו&#x5D1;
>>>>> שכנר&#x5D0;ה
>>>>> הגיע הזמן
>>>>> להעל&#x5D5;ת
>>>>> אותן בצור&#x5D4;
>>>>> מסוד&#x5E8;ת
>>>>> לפוס&#x5D8;.
>>>>>

