It's customary to copy the list on answers, so that everyone who runs into the same issue can benefit too.

On 20-Nov-11 11:38, dave selby wrote:
It came from some automated HTML generation app ... I just had the
idea of looking at it with ghex .... every other character is \00
!!!!, that's mad. OK, will try and replace('\00', '') in the string
before splitting

Those bytes are there for a reason; it's not mad. The text is using wide (two-byte) characters, probably because it is stored in a Unicode encoding such as UTF-16. If any special characters are involved (multinational applications or whatever), killing the null bytes will destroy them, because that doesn't handle the case of the high-order byte being something other than zero.

Check out Python's Unicode handling, and character set encode/decode features for a robust way to translate the output you're getting.
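
As a rough sketch of what that could look like in Python 3: decode the bytes first, then split. The sample bytes, the variable names, and the choice of UTF-16 big-endian are assumptions for illustration (the null comes before each character in the output you posted, which would fit big-endian); the real file may well use a different encoding.

```python
import re

# Hypothetical stand-in for the raw bytes of the generated HTML file.
raw = "<p>Trial</p><p>Price: €5</p>".encode("utf-16-be")

# Stripping the nulls looks fine for plain ASCII, but the euro sign
# (U+20AC, bytes 0x20 0xAC) has a non-zero high byte and is left as garbage.
stripped = raw.replace(b"\x00", b"")

# Decoding with the right codec keeps every character intact.
# "utf-16-be" is an assumption; use whatever encoding the file really has.
html = raw.decode("utf-16-be")

# Same split as in the original post, dropping the empty strings that
# appear between adjacent tags.
text = [piece for piece in re.split(r"<.*?>", html) if piece]

print(text)
print(text.index("Trial"))
```

Once the data has been decoded, text.index(...) is comparing ordinary strings, so the search should find what you expect.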



Cheers

Dave

On 20 November 2011 19:15, Steve Willoughby<st...@alchemy.com>  wrote:
Where did the string come from?  It looks at first glance like you have two 
bytes for each character instead of the one you expect.  Is this perhaps a 
Unicode string instead of ASCII?
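
One way to check that guess, as a minimal sketch: look for a byte-order mark, and failing that, see whether the null bytes cluster at even or odd offsets. The function name guess_utf16 and the 80% threshold are made up for this example; it is a heuristic for mostly-ASCII text, not a general encoding detector.

```python
import codecs
from typing import Optional


def guess_utf16(raw: bytes) -> Optional[str]:
    """Return a likely codec name, or None if the data does not look like UTF-16.

    Hypothetical helper: the 0.8 threshold is an arbitrary assumption.
    """
    # A byte-order mark settles it; the generic "utf-16" codec consumes the BOM.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    half = len(raw) // 2
    if half == 0:
        return None

    # For mostly-ASCII text, one byte of every pair is zero.
    even_nulls = raw[0::2].count(0)  # zero first  -> big-endian
    odd_nulls = raw[1::2].count(0)   # zero second -> little-endian
    if even_nulls > 0.8 * half:
        return "utf-16-be"
    if odd_nulls > 0.8 * half:
        return "utf-16-le"
    return None


# Sample inputs are illustrative only.
print(guess_utf16("Trial".encode("utf-16-be")))
print(guess_utf16(b"Trial"))
```

If it returns a codec name, raw.decode(name) should give back readable text; if it returns None, the data is probably in some single-byte or UTF-8 encoding.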

On 2011/11/20, at 10:28, dave selby<dave6...@gmail.com>  wrote:

Hi All,

I have a long string which is an HTML file. I strip the HTML tags away
and make a list with

text = re.split('<.*?>', HTML)

I then tried to search for a string with text.index(...) but it was
not found. Printing HTML to a terminal, I get what I expect: a block of
tags and text. When I split the HTML and print text, I get loads of

\x00T\x00r\x00i\x00a\x00  i.e. I get \x00 breaking up every character.

Any idea what is happening and how to get back to a list of ASCII strings?

Cheers

Dave

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C