Re: [Tutor] Trouble in dealing with special characters.

2018-12-08 Thread Steven D'Aprano
On Sun, Dec 09, 2018 at 09:23:59AM +1100, Cameron Simpson wrote:
> On 07Dec2018 21:20, Steven D'Aprano  wrote:


# Python 2
> >>>> txt = "abcπ"
> >
> >but it is a lie, because what we get isn't the string we typed, but the
> >interpreter's *bad guess* that we actually meant this:
> >
> >>>> txt
> >'abc\xcf\x80'
> 
> Wow. I did not know that! I imagined Python 2 would have simply rejected 
> such a string (out of range characters -- ordinals >= 256 -- in a "byte" 
> string).

Nope.

Python 2 tries hard to make bytes and unicode text work together. If 
your strings are pure ASCII, it "Just Works" and it seems great but on 
trickier cases it can lead to really confusing errors.

Behind the scenes, what the interpreter is doing is using some 
platform-specific codec (ASCII, UTF-8, or similar) to automatically 
encode/decode from bytes to text or vice versa. This sort of "Do What I 
Mean" processing can work, up to the point that it doesn't, then it all 
goes pear-shaped and you have silent failures and hard-to-diagnose errors.

That's why Python 3 takes a hard-line policy that you cannot mix text 
and bytes (not even when one of them is empty) except by 
explicitly converting from one to the other.
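
A minimal sketch of that policy, assuming Python 3 (the variable names 
are only for illustration): mixing the two types raises TypeError, and 
an explicit decode or encode is what makes the combination work.

```python
# Python 3 sketch: bytes and str do not mix implicitly.
data = b"abc\xcf\x80"      # UTF-8 bytes for "abcπ"
text = "abc"

try:
    data + text            # implicit mixing is refused
    mixed_ok = True
except TypeError:
    mixed_ok = False

empty_bytes, empty_text = b"", ""
try:
    empty_bytes + empty_text   # empty values are refused as well
    empty_ok = True
except TypeError:
    empty_ok = False

# Explicit conversion in either direction works.
as_text = data.decode("utf-8") + text      # str + str
as_bytes = data + text.encode("utf-8")     # bytes + bytes
```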

-- 
Steve
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Trouble in dealing with special characters.

2018-12-08 Thread Cameron Simpson

On 07Dec2018 21:20, Steven D'Aprano  wrote:

>On Fri, Dec 07, 2018 at 02:06:16PM +0530, Sunil Tech wrote:
>> I am using Python 2.7.8
>
>That is important information.
>
>Python 2 unfortunately predates Unicode, and when it was added some bad
>decisions were made. For example, we can write this in Python 2:
>
>>>> txt = "abcπ"
>
>but it is a lie, because what we get isn't the string we typed, but the
>interpreter's *bad guess* that we actually meant this:
>
>>>> txt
>'abc\xcf\x80'


Wow. I did not know that! I imagined Python 2 would have simply rejected 
such a string (out of range characters -- ordinals >= 256 -- in a "byte" 
string).


Cheers,
Cameron Simpson 


Re: [Tutor] Trouble in dealing with special characters.

2018-12-07 Thread Mats Wichmann
On 12/7/18 3:20 AM, Steven D'Aprano wrote:

>> How to know whether in a given string(sentence) is there any that is not
>> ASCII character and how to replace?
> 
> That's usually the wrong solution. That's like saying, "My program can't 
> add numbers greater than 100. How do I tell if a number is greater than 
> 100, and turn it into a number smaller than 100?"

yes, it's usually the wrong solution, but in the case of quote marks it
is *possible* it is the wanted solution: certain text editing products
(cough cough Microsoft Word) are really prone to putting in typographic
quote marks.  Everyone knows not to use Word for editing your code, but
that doesn't mean some stuff doesn't make it into a data set we are forced
to process, if someone exports some text from an editor, etc. There are
more quoting styles in the world than the English style, e.g. this one
is used in many languages: „quoted text“  (I don't know if that will
survive the email system, but starts with a descended double-quote mark).

It's completely up to what the application needs; it *might* as I say be
appropriate to normalize text so that only a single double-quote and
only a single single-quote (or apostrophe) style is used.  Or it might not.
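
If normalizing does turn out to be what the application needs, one 
possible sketch is a translation table. The table contents and the 
function name normalize_quotes are my own illustration, not an 
exhaustive list of quote styles:

```python
# -*- coding: utf-8 -*-
# Map a few typographic quote characters to their ASCII counterparts.
# This table is illustrative only; extend it for your own data.
QUOTE_MAP = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark / apostrophe
    0x201A: u"'",   # single low-9 quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
    0x201E: u'"',   # double low-9 quotation mark
}

def normalize_quotes(text):
    """Return text with the mapped quote characters replaced."""
    return text.translate(QUOTE_MAP)

sample = u"\u201equoted text\u201c and WOMEN\u2019S"
result = normalize_quotes(sample)
```

Everything that is not in the table passes through untouched, so other 
non-ASCII text is left alone.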




Re: [Tutor] Trouble in dealing with special characters.

2018-12-07 Thread Steven D'Aprano
On Fri, Dec 07, 2018 at 02:06:16PM +0530, Sunil Tech wrote:
> Hi Alan,
> 
> I am using Python 2.7.8

That is important information.

Python's original string type unfortunately predates its Unicode support, 
and when Unicode was added some bad decisions were made. For example, we 
can write this in Python 2:

>>> txt = "abcπ"

but it is a lie, because what we get isn't the string we typed, but the 
interpreter's *bad guess* that we actually meant this:

>>> txt
'abc\xcf\x80'
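
Where do those two extra bytes come from? A small sketch, assuming the 
terminal was using UTF-8 (other terminals would give different bytes):

```python
# -*- coding: utf-8 -*-
text = u"abc\u03c0"                 # the four characters a, b, c, π
encoded = text.encode("utf-8")      # the bytes a UTF-8 terminal would send

char_count = len(text)              # counts characters
byte_count = len(encoded)           # counts bytes; π needs two in UTF-8
```

Here encoded is the same byte sequence shown above, one byte longer 
than the text has characters.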

Depending on your operating system, sometimes you can work with these 
not-really-text strings for a long time, but when it fails, it fails 
HARD with confusing errors. Just as you have here:

> >>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
> >>> tx.decode()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
> ordinal not in range(128)


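Here, the bytes in tx were produced by whatever encoding your terminal 
uses, but decode() with no argument fell back on the default ASCII codec, 
which cannot handle any byte above 127. Guessing like that is the wrong 
thing to do.

But if you know which encoding the bytes are really in, you can make it 
work. The 0xe2 byte in the traceback suggests UTF-8, where ’ is the 
three bytes e2 80 99:

tx.decode("utf-8")

Decoding with the wrong codec, such as Latin-1 or CP-1252, will not 
raise an error but will give you mojibake instead of the apostrophe:

tx.decode("cp1252")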
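
To compare codecs on this particular string, here is a small sketch. It 
assumes Python 3 and assumes the bytes were UTF-8, which is consistent 
with the 0xe2 byte in the traceback; the variable names are made up.

```python
# The same bytes that Python 2 stored for tx, written out explicitly.
raw = b"MOUNTAIN VIEW WOMEN\xe2\x80\x99S HEALTH CLINIC"

right = raw.decode("utf-8")     # the apostrophe comes back as U+2019
wrong = raw.decode("cp1252")    # no error, but three junk characters

apostrophe_ok = u"\u2019" in right
mojibake = wrong[19:22]         # the characters where the apostrophe was
```

A codec that accepts the bytes is not necessarily the right one, so 
check the decoded text rather than just the absence of an error.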

But a better fix is to use actual text, by putting a "u" prefix outside 
the quote marks:

txt = u"MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"


If you need to write this to a file, you can do this:

file.write(txt.encode('utf-8'))

To read it back again:

# from a file using UTF-8
txt = file.read().decode('utf-8')

(If you get a decoding error, it means your text file wasn't actually 
UTF-8. Ask the supplier what it really is.)
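
An alternative sketch, which is not what the snippets above do: let the 
file object handle the encoding via io.open, which exists in both 
Python 2.7 and Python 3. The temporary file is only there to make the 
example self-contained.

```python
import io
import os
import tempfile

txt = u"MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"

# A throwaway file, used only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Text goes in, UTF-8 bytes land on disk.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(txt)
    # UTF-8 bytes come off disk, text comes out.
    with io.open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()
    # Reading in binary mode shows the raw bytes that were written.
    with io.open(path, "rb") as f:
        raw_bytes = f.read()
finally:
    os.remove(path)
```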


> How to know whether in a given string(sentence) is there any that is not
> ASCII character and how to replace?

That's usually the wrong solution. That's like saying, "My program can't 
add numbers greater than 100. How do I tell if a number is greater than 
100, and turn it into a number smaller than 100?"

You can do this:

mystring = "something"
if any(ord(c) > 127 for c in mystring):
    print "Contains non-ASCII"


But what you do then is hard to decide. Delete non-ASCII characters? 
Replace them with what?

If you are desperate, you can do this:

bytestring = "something"
text = bytestring.decode('ascii', errors='replace')
bytestring = text.encode('ascii', errors='replace')


but that will replace any non-ascii character with a question mark "?" 
which might not be what you want.
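
A sketch of what that looks like on the clinic name, assuming Python 3 
and UTF-8 input bytes (both are my assumptions). Note that the ASCII 
codec works byte by byte, so one multi-byte character turns into 
several question marks:

```python
bytestring = b"WOMEN\xe2\x80\x99S"      # UTF-8 bytes for WOMEN’S

# Decode as ASCII: each undecodable byte becomes U+FFFD.
text = bytestring.decode("ascii", errors="replace")
# Encode back to ASCII: each U+FFFD becomes "?".
cleaned = text.encode("ascii", errors="replace")

# Decoding with the real encoding first gives one "?" per character.
cleaned_once = bytestring.decode("utf-8").encode("ascii", errors="replace")
```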



-- 
Steve


Re: [Tutor] Trouble in dealing with special characters.

2018-12-07 Thread Steven D'Aprano
On Fri, Dec 07, 2018 at 01:28:18PM +0530, Sunil Tech wrote:
> Hi Tutor,
> 
> I have a trouble with dealing with special characters in Python 

There are no special characters in Python. There are only Unicode 
characters. All characters are Unicode, including those which are also 
ASCII.

Start here:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

https://blog.codinghorror.com/there-aint-no-such-thing-as-plain-text/

https://www.youtube.com/watch?v=sgHbC6udIqc

https://nedbatchelder.com/text/unipain.html


https://docs.python.org/3/howto/unicode.html

https://docs.python.org/2/howto/unicode.html


It's less than a month away from 2019. It is sad and shameful to be 
forced to use only ASCII, and nearly always unnecessary. Writing code 
that only supports the 128 ASCII characters is like writing a calculator 
that only supports numbers from 1 to 10.

But if you really must do so, keep reading.


> Below is
> the sentence with a special character(apostrophe) "MOUNTAIN VIEW WOMEN’S
> HEALTH CLINIC" with actually should be "MOUNTAIN VIEW WOMEN'S HEALTH CLINIC
> ".

Actually, no, it should be precisely what it is: "WOMEN’S" is correct, 
since that is an apostrophe. Changing the ’ to an inch-mark ' is not 
correct. But if you absolutely MUST change it:

mystring = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
mystring = mystring.replace("’", "'")

will do it in Python 3. In Python 2 you have to write this instead:

# Python 2 only
mystring = u"MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
mystring = mystring.replace(u"’", u"'")


to ensure Python uses Unicode strings.

What version of Python are you using, and what are you doing that gives 
you trouble?

It is very unlikely that the only way to solve the problem is to throw 
away the precise meaning of the text you are dealing with by reducing it 
to ASCII.

In Python 3, you can also do this:

mystring = ascii(mystring)

but the result will probably not be what you want.
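
For reference, a small Python 3 sketch of what ascii() actually 
produces: the repr of the string, quote marks included, with every 
non-ASCII character turned into an escape sequence.

```python
mystring = "MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"
escaped = ascii(mystring)

# The result is pure ASCII, but it now contains a literal backslash,
# the text "u2019", and the surrounding quote marks of the repr.
is_pure_ascii = all(ord(c) < 128 for c in escaped)
```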


> Please help, how to identify these kinds of special characters and replace
> them with appropriate ASCII?

For 99.99% of the characters, there is NO appropriate ASCII. What 
ASCII character do you expect for these?

§ π Й খ ₪ ∀ ▶ 丕 ☃ ☺️

ASCII, even when it was invented in 1963, wasn't sufficient even for 
American English (no cent sign, no proper quotes, missing punctuation 
marks) let alone British English or international text.

Unless you are stuck communicating with an ancient program written in 
the 1970s or 80s that cannot be upgraded, there are few good reasons to 
cripple your program by only supporting ASCII text.

But if you really need to, this might help:

http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/

http://code.activestate.com/recipes/578243-repair-common-unicode-mistakes-after-theyve-been-m/
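
One commonly used approach (this is my own sketch, not the code from 
either recipe, and strip_to_ascii is a made-up name) is to decompose 
accented letters with unicodedata and then drop whatever is still not 
ASCII. It shows how lossy the whole idea is:

```python
import unicodedata

def strip_to_ascii(text):
    """Decompose characters, then silently drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Accented Latin letters lose their accents...
accented = strip_to_ascii(u"caf\u00e9 na\u00efve")
# ...but pi, the curly apostrophe and the snowman simply vanish.
lossy = strip_to_ascii(u"\u03c0 \u2019 \u2603")
```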




-- 
Steve


Re: [Tutor] Trouble in dealing with special characters.

2018-12-07 Thread Alan Gauld via Tutor
On 07/12/2018 08:36, Sunil Tech wrote:

> I am using Python 2.7.8
> >>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
> >>> tx.decode()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
> ordinal not in range(128)
> 
> How to know whether in a given string(sentence) is there any that is not
> ASCII character and how to replace?

How to detect is to do what you just did but wrap a try/except around it:

try:
    tx.decode()
except UnicodeError:
    print "There were non-ASCII characters in the data"

Now, how you replace the characters is up to you.
The location of the offending character is given in the error.
(Although there may be more, once you deal with that one!)
What would you like to replace it with from the ASCII subset?
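
A sketch of pulling that location out of the exception, written for 
Python 3 with the bytes spelled out explicitly (the UTF-8 byte values 
and the helper name first_non_ascii are my assumptions):

```python
data = b"MOUNTAIN VIEW WOMEN\xe2\x80\x99S HEALTH CLINIC"

def first_non_ascii(raw):
    """Return the index of the first non-ASCII byte, or None."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as err:
        return err.start    # position reported in the error message
    return None

position = first_non_ascii(data)

# The exception only reports the first offender; this finds them all.
all_positions = [i for i, b in enumerate(bytearray(data)) if b > 127]
```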

But are you really sure you want to replace it with
an ASCII character? Most display devices these days
can cope with at least the UTF-8 encoding of Unicode.
Maybe you really want to change your default character
set so it can handle those extra characters?

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] Trouble in dealing with special characters.

2018-12-07 Thread Sunil Tech
Hi Alan,

I am using Python 2.7.8
>>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
>>> tx.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
ordinal not in range(128)

How to know whether in a given string(sentence) is there any that is not
ASCII character and how to replace?

On Fri, Dec 7, 2018 at 2:01 PM Alan Gauld via Tutor 
wrote:

> On 07/12/2018 07:58, Sunil Tech wrote:
>
> > I have a trouble with dealing with special characters in Python Below is
> > the sentence with a special character(apostrophe) "MOUNTAIN VIEW WOMEN’S
> > HEALTH CLINIC" with actually should be "MOUNTAIN VIEW WOMEN'S HEALTH
> CLINIC
> > ".
>
> How do you define "special characters"?
> There is nothing special about the apostrophe. It is just as
> valid a character as all the other characters.
> What makes it special to you?
>
> > Please help, how to identify these kinds of special characters and
> replace
> > them with appropriate ASCII?
>
> What is appropriate ASCII?
> ASCII only has 128 characters.
> Unicode has well over 100,000 characters.
> How do you want to map a unicode character into ASCII?
> There are lots of options but we can't tell what you
> think is appropriate.
>
> Finally, character handling changed between Python 2 and 3
> (where unicode became the default), so the solution will
> likely depend on the Python version you are using.
> Please tell us which.
>


Re: [Tutor] Trouble in dealing with special characters.

2018-12-07 Thread Alan Gauld via Tutor
On 07/12/2018 07:58, Sunil Tech wrote:

> I have a trouble with dealing with special characters in Python Below is
> the sentence with a special character(apostrophe) "MOUNTAIN VIEW WOMEN’S
> HEALTH CLINIC" with actually should be "MOUNTAIN VIEW WOMEN'S HEALTH CLINIC
> ".

How do you define "special characters"?
There is nothing special about the apostrophe. It is just as
valid a character as all the other characters.
What makes it special to you?

> Please help, how to identify these kinds of special characters and replace
> them with appropriate ASCII?

What is appropriate ASCII?
ASCII only has 128 characters.
Unicode has well over 100,000 characters.
How do you want to map a unicode character into ASCII?
There are lots of options but we can't tell what you
think is appropriate.

Finally, character handling changed between Python 2 and 3
(where unicode became the default), so the solution will
likely depend on the Python version you are using.
Please tell us which.


-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

