Re: [Tutor] Trouble in dealing with special characters.
On Sun, Dec 09, 2018 at 09:23:59AM +1100, Cameron Simpson wrote:
> On 07Dec2018 21:20, Steven D'Aprano wrote:
> > # Python 2
> > >>> txt = "abcπ"
> >
> > but it is a lie, because what we get isn't the string we typed, but the
> > interpreter's *bad guess* that we actually meant this:
> >
> > >>> txt
> > 'abc\xcf\x80'
>
> Wow. I did not know that! I imagined Python 2 would have simply rejected
> such a string (out of range characters -- ordinals >= 256 -- in a "byte"
> string).

Nope. Python 2 tries hard to make bytes and unicode text work together.
If your strings are pure ASCII, it "Just Works" and it seems great, but
on trickier cases it can lead to really confusing errors.

Behind the scenes, the interpreter uses some platform-specific codec
(ASCII, UTF-8, or similar) to automatically encode/decode from bytes to
text or vice versa.

This sort of "Do What I Mean" processing can work, up to the point that
it doesn't; then it all goes pear-shaped and you have silent failures and
hard-to-diagnose errors. That's why Python 3 takes a hard-line policy
that you cannot mix text and bytes except by explicitly converting from
one to the other.

-- 
Steve
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
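
For anyone following along on Python 3, here is a minimal sketch of the hard-line policy described above. The variable names are illustrative only and do not come from the thread; the point is simply that text and bytes have to be converted explicitly.

```python
# Python 3: text (str) and bytes are separate types that never mix implicitly.
text = "abc\u03c0"              # "abcπ" as real text, four characters
data = text.encode("utf-8")     # explicit text -> bytes conversion

try:
    text + data                 # mixing str and bytes is refused outright
    mixed_ok = True
except TypeError:
    mixed_ok = False

round_trip = data.decode("utf-8")   # explicit bytes -> text conversion

print(repr(data))            # b'abc\xcf\x80'
print(mixed_ok)              # False
print(round_trip == text)    # True
```

The bytes shown are the same ones Python 2 silently stored for the "abcπ" literal, assuming a UTF-8 terminal.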
Re: [Tutor] Trouble in dealing with special characters.
On 07Dec2018 21:20, Steven D'Aprano wrote:
> On Fri, Dec 07, 2018 at 02:06:16PM +0530, Sunil Tech wrote:
> > I am using Python 2.7.8
>
> That is important information.
>
> Python 2 unfortunately predates Unicode, and when Unicode support was
> added some bad decisions were made. For example, we can write this in
> Python 2:
>
> >>> txt = "abcπ"
>
> but it is a lie, because what we get isn't the string we typed, but the
> interpreter's *bad guess* that we actually meant this:
>
> >>> txt
> 'abc\xcf\x80'

Wow. I did not know that! I imagined Python 2 would have simply rejected
such a string (out of range characters -- ordinals >= 256 -- in a "byte"
string).

Cheers,
Cameron Simpson
Re: [Tutor] Trouble in dealing with special characters.
On 12/7/18 3:20 AM, Steven D'Aprano wrote:
>> How to know whether in a given string (sentence) there is any character
>> that is not ASCII, and how to replace it?
>
> That's usually the wrong solution. That's like saying, "My program can't
> add numbers greater than 100. How do I tell if a number is greater than
> 100, and turn it into a number smaller than 100?"

Yes, it's usually the wrong solution, but in the case of quote marks it
is *possible* it is the wanted solution: certain text editing products
(cough cough Microsoft Word) are really prone to putting in typographic
quote marks. Everyone knows not to use Word for editing code, but that
doesn't mean some stuff doesn't make it into a data set we are forced to
process, if someone exports some text from an editor, etc.

There are more quoting styles in the world than the English style, e.g.
this one is used in many languages: „quoted text“ (I don't know if that
will survive the email system, but it starts with a descended
double-quote mark).

It's completely up to what the application needs; it *might*, as I say,
be appropriate to normalize text so that only a single double-quote and
only a single single-quote (or apostrophe) style is used. Or it might
not.
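
If normalizing quote styles does turn out to be what the application needs, one way to do it in Python 3 is with a translation table. This is only a sketch: the name QUOTE_MAP, the helper normalize_quotes, and the particular set of marks being folded are assumptions made for illustration, and a real data set may need a different list.

```python
# Hypothetical mapping: which marks to fold, and into what, is an
# application decision. This table is only an example.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201a": "'",   # single low-9 quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u201e": '"',   # double low-9 quotation mark (the descended one)
})


def normalize_quotes(text):
    """Return text with the typographic quotes listed above made ASCII."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes("MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"))
print(normalize_quotes("\u201equoted text\u201c"))
```

Everything not listed in the table passes through untouched, so the rest of the text keeps its non-ASCII characters.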
Re: [Tutor] Trouble in dealing with special characters.
On Fri, Dec 07, 2018 at 02:06:16PM +0530, Sunil Tech wrote:
> Hi Alan,
>
> I am using Python 2.7.8

That is important information.

Python 2 unfortunately predates Unicode, and when Unicode support was
added some bad decisions were made. For example, we can write this in
Python 2:

    >>> txt = "abcπ"

but it is a lie, because what we get isn't the string we typed, but the
interpreter's *bad guess* that we actually meant this:

    >>> txt
    'abc\xcf\x80'

Depending on your operating system, sometimes you can work with these
not-really-text strings for a long time, but when it fails, it fails
HARD with confusing errors. Just as you have here:

> >>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
> >>> tx.decode()
> Traceback (most recent call last):
>   File "", line 1, in
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
> ordinal not in range(128)

Here the string already holds bytes in some platform-specific encoding
(UTF-8, Latin-1, CP-1252 or something even more exotic), and decode()
with no argument tried ASCII, which cannot handle them. That implicit
guess is the wrong thing to do. But if you can work out which encoding
was actually used, you can make it work:

    tx.decode("utf-8")    # the likely one here: 0xe2 starts the UTF-8 form of ’
    tx.decode("Latin1")
    tx.decode("cp1252")

But a better fix is to use actual text, by putting a "u" prefix outside
the quote marks:

    txt = u"MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"

If you need to write this to a file, you can do this:

    file.write(txt.encode('utf-8'))

To read it back again:

    # from a file using UTF-8
    txt = file.read().decode('utf-8')

(If you get a decoding error, it means your text file wasn't actually
UTF-8. Ask the supplier what it really is.)

> How to know whether in a given string (sentence) there is any character
> that is not ASCII, and how to replace it?

That's usually the wrong solution. That's like saying, "My program can't
add numbers greater than 100. How do I tell if a number is greater than
100, and turn it into a number smaller than 100?"
You can do this:

    mystring = "something"
    if any(ord(c) > 127 for c in mystring):
        print "Contains non-ASCII"

But what you do then is hard to decide. Delete non-ASCII characters?
Replace them with what?

If you are desperate, you can do this:

    bytestring = "something"
    text = bytestring.decode('ascii', errors='replace')
    bytestring = text.encode('ascii', errors='replace')

but that will replace every non-ASCII byte with a question mark "?" (so
a multi-byte character turns into several of them), which might not be
what you want.

-- 
Steve
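
The points in this message can be tried out on Python 3 with the sketch below. It assumes the original terminal was using UTF-8, which the 0xe2 byte at position 19 in the traceback strongly suggests but does not prove; the variable names are made up for the example.

```python
# Assumption: the Python 2 string held the UTF-8 bytes of the text.
raw = "MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC".encode("utf-8")

print(hex(raw[19]))                  # the byte the traceback complained about

right = raw.decode("utf-8")          # the correct codec gives the real text back
wrong = raw.decode("cp1252")         # the wrong codec raises no error, but yields mojibake

# Detection: any byte above 127 is outside ASCII.
has_non_ascii = any(byte > 127 for byte in raw)

# The "desperate" round trip: every offending byte becomes "?".
lossy = raw.decode("ascii", errors="replace").encode("ascii", errors="replace")

print(right)
print(wrong)
print(has_non_ascii)
print(lossy)
```

Note that decoding with the wrong codec can succeed silently, which is why guessing the encoding is risky, and that the single apostrophe turns into three question marks because it occupies three bytes in UTF-8.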
Re: [Tutor] Trouble in dealing with special characters.
On Fri, Dec 07, 2018 at 01:28:18PM +0530, Sunil Tech wrote:
> Hi Tutor,
>
> I have a trouble with dealing with special characters in Python

There are no special characters in Python. There are only Unicode
characters. All characters are Unicode, including those which are also
ASCII.

Start here:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

https://blog.codinghorror.com/there-aint-no-such-thing-as-plain-text/

https://www.youtube.com/watch?v=sgHbC6udIqc

https://nedbatchelder.com/text/unipain.html

https://docs.python.org/3/howto/unicode.html

https://docs.python.org/2/howto/unicode.html

It's less than a month away from 2019. It is sad and shameful to be
forced to use only ASCII, and nearly always unnecessary. Writing code
that only supports the 128 ASCII characters is like writing a calculator
that only supports numbers from 1 to 10.

But if you really must do so, keep reading.

> Below is
> the sentence with a special character(apostrophe) "MOUNTAIN VIEW WOMEN’S
> HEALTH CLINIC" which actually should be "MOUNTAIN VIEW WOMEN'S HEALTH CLINIC
> ".

Actually, no, it should be precisely what it is: "WOMEN’S" is correct,
since that is an apostrophe. Changing the ’ to an inch-mark ' is not
correct. But if you absolutely MUST change it:

    mystring = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
    mystring = mystring.replace("’", "'")

will do it in Python 3. In Python 2 you have to write this instead:

    # Python 2 only
    mystring = u"MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
    mystring = mystring.replace(u"’", u"'")

to ensure Python uses Unicode strings.

What version of Python are you using, and what are you doing that gives
you trouble?

It is very unlikely that the only way to solve the problem is to throw
away the precise meaning of the text you are dealing with by reducing it
to ASCII.
In Python 3, you can also do this:

    mystring = ascii(mystring)

but the result will probably not be what you want: it is the repr of the
string, quotes included, with every non-ASCII character turned into a
backslash escape such as \u2019.

> Please help, how to identify these kinds of special characters and replace
> them with appropriate ASCII?

For 99.99% of the characters, there is NO appropriate ASCII. What ASCII
character do you expect for these?

    § π Й খ ₪ ∀ ▶ 丕 ☃ ☺️

ASCII, even when it was invented in 1963, wasn't sufficient even for
American English (no cent sign, no proper quotes, missing punctuation
marks) let alone British English or international text.

Unless you are stuck communicating with an ancient program written in
the 1970s or 80s that cannot be upgraded, there are few good reasons to
cripple your program by only supporting ASCII text.

But if you really need to, this might help:

http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/

http://code.activestate.com/recipes/578243-repair-common-unicode-mistakes-after-theyve-been-m/

-- 
Steve
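
To see why ascii() and blunt conversions are rarely what you want, here is a small Python 3 sketch. The helper to_ascii_lossy is hypothetical and is not the code from the ActiveState recipes linked above; it uses a different, commonly seen approach (NFKD decomposition, then dropping whatever is still non-ASCII), shown here only to make the loss visible.

```python
import unicodedata

mystring = "MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"

# ascii() gives a repr-style string: quotes included, escapes spelled out.
escaped = ascii(mystring)
print(escaped)


def to_ascii_lossy(text):
    """Hypothetical helper: decompose accented letters, then drop the rest.

    This throws information away; it is an illustration, not a recommendation.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_lossy("caf\u00e9"))       # the accent is stripped, the letter survives
print(to_ascii_lossy("WOMEN\u2019S"))    # the apostrophe simply disappears
print(repr(to_ascii_lossy("\u03c0 \u2603")))   # pi and the snowman vanish entirely
```

Accented Latin letters degrade tolerably, but the apostrophe, π and ☃ have no ASCII counterpart at all and are silently lost, which is the problem described above.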
Re: [Tutor] Trouble in dealing with special characters.
On 07/12/2018 08:36, Sunil Tech wrote:
> I am using Python 2.7.8
>
> >>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
> >>> tx.decode()
> Traceback (most recent call last):
>   File "", line 1, in
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
> ordinal not in range(128)
>
> How to know whether in a given string (sentence) there is any character
> that is not ASCII, and how to replace it?

How to detect is to do what you just did but wrap a try/except around
it:

    try:
        tx.decode()
    except UnicodeError:
        print " There were non ASCII characters in the data"

Now, how you replace the characters is up to you. The location of the
offending character is given in the error. (Although there may be more,
once you deal with that one!) What would you like to replace it with
from the ASCII subset?

But are you really sure you want to replace it with an ASCII character?
Most display devices these days can cope with at least the UTF-8
encoding of Unicode. Maybe you really want to change your default
character set so it can handle those extra characters?

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos
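
The same try/except idea carries over to Python 3, where the test is an encode to ASCII instead of a decode. In the sketch below the function name first_non_ascii is made up for the example; the start attribute on the exception is what supplies the location mentioned above.

```python
def first_non_ascii(text):
    """Return (index, character) of the first non-ASCII character, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        # err.start is the index of the first character that failed to encode
        return err.start, text[err.start]
    return None


print(first_non_ascii("MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"))
print(first_non_ascii("plain ASCII"))
```

As noted, only the first offender is reported, so text with several non-ASCII characters needs the check repeated after each fix.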
Re: [Tutor] Trouble in dealing with special characters.
Hi Alan,

I am using Python 2.7.8

>>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
>>> tx.decode()
Traceback (most recent call last):
  File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
ordinal not in range(128)

How to know whether in a given string (sentence) there is any character
that is not ASCII, and how to replace it?

On Fri, Dec 7, 2018 at 2:01 PM Alan Gauld via Tutor wrote:
> On 07/12/2018 07:58, Sunil Tech wrote:
>
> > I have a trouble with dealing with special characters in Python Below is
> > the sentence with a special character(apostrophe) "MOUNTAIN VIEW WOMEN’S
> > HEALTH CLINIC" which actually should be "MOUNTAIN VIEW WOMEN'S HEALTH
> > CLINIC".
>
> How do you define "special characters"?
> There is nothing special about the apostrophe. It is just as
> valid a character as all the other characters.
> What makes it special to you?
>
> > Please help, how to identify these kinds of special characters and
> > replace them with appropriate ASCII?
>
> What is appropriate ASCII?
> ASCII only has 128 characters.
> Unicode has thousands of characters.
> How do you want to map a unicode character into ASCII?
> There are lots of options but we can't tell what you
> think is appropriate.
>
> Finally, character handling changed between Python 2 and 3
> (where unicode became the default), so the solution will
> likely depend on the Python version you are using.
> Please tell us which.
Re: [Tutor] Trouble in dealing with special characters.
On 07/12/2018 07:58, Sunil Tech wrote:
> I have a trouble with dealing with special characters in Python Below is
> the sentence with a special character(apostrophe) "MOUNTAIN VIEW WOMEN’S
> HEALTH CLINIC" which actually should be "MOUNTAIN VIEW WOMEN'S HEALTH CLINIC
> ".

How do you define "special characters"? There is nothing special about
the apostrophe. It is just as valid a character as all the other
characters. What makes it special to you?

> Please help, how to identify these kinds of special characters and replace
> them with appropriate ASCII?

What is appropriate ASCII? ASCII only has 128 characters. Unicode has
thousands of characters. How do you want to map a unicode character into
ASCII? There are lots of options but we can't tell what you think is
appropriate.

Finally, character handling changed between Python 2 and 3 (where
unicode became the default), so the solution will likely depend on the
Python version you are using. Please tell us which.

-- 
Alan G