Re: UTF-8 and latin1

2022-10-25 Thread Chris Angelico
On Wed, 26 Oct 2022 at 05:09, Barry Scott  wrote:
>
>
>
> > On 25 Oct 2022, at 11:16, Stefan Ram  wrote:
> >
> > r...@zedat.fu-berlin.de (Stefan Ram) writes:
> >> You can let Python guess the encoding of a file.
> >> def encoding_of( name ):
> >>     path = pathlib.Path( name )
> >>     for encoding in( "utf_8", "cp1252", "latin_1" ):
> >>         try:
> >>             with path.open( encoding=encoding, errors="strict" )as file:
> >
> >  I also read a book which claimed that the tkinter.Text
> >  widget would accept bytes and guess whether these are
> >  encoded in UTF-8 or "ISO 8859-1" and decode them
> >  accordingly. However, today I found that here it does
> >  accept bytes but it always guesses "ISO 8859-1".
>
> The best you can do is assume that if the text cannot decode as utf-8 it may 
> be 8859-1.
>

Except when it's Windows-1252.
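
A small illustration of why that distinction matters (the sample bytes are made up for this note, not taken from anyone's data): latin-1 maps bytes 0x80-0x9F to invisible C1 control characters, while Windows-1252 puts curly quotes, the euro sign and similar characters in that range.

```python
# Hypothetical sample: "Montréal" wrapped in Windows-1252 curly quotes.
data = b"\x93Montr\xe9al\x94"

# latin-1 never fails, but 0x93/0x94 become C1 control characters.
as_latin1 = data.decode("latin-1")
# cp1252 turns 0x93/0x94 into U+201C / U+201D (curly double quotes).
as_cp1252 = data.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Both decodes succeed, so a "try one, fall back to the other" test cannot tell the two apart; only the result looks wrong.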

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: UTF-8 and latin1

2022-10-25 Thread Barry Scott


> On 25 Oct 2022, at 11:16, Stefan Ram  wrote:
> 
> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>> You can let Python guess the encoding of a file.
>> def encoding_of( name ):
>>     path = pathlib.Path( name )
>>     for encoding in( "utf_8", "cp1252", "latin_1" ):
>>         try:
>>             with path.open( encoding=encoding, errors="strict" )as file:
> 
>  I also read a book which claimed that the tkinter.Text
>  widget would accept bytes and guess whether these are
>  encoded in UTF-8 or "ISO 8859-1" and decode them 
>  accordingly. However, today I found that here it does 
>  accept bytes but it always guesses "ISO 8859-1".

The best you can do is assume that if the text cannot decode as utf-8 it may be 
8859-1.

Barry

> 
>  main.py
> 
> import tkinter
> 
> text = tkinter.Text()
> text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='ISO 8859-1' ))
> text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='UTF-8' ))
> text.pack()
> print( text.get( "1.0", "end" ))
> 
>  output
> 
> AÄäÖöÜüßAÄäÖöÜüß
> 
> 


Re: UTF-8 and latin1

2022-08-19 Thread Dennis Lee Bieber
On Thu, 18 Aug 2022 11:33:59 -0700, Tobiah  declaimed the
following:

>
>So how does this break down?  When a person enters
>Montréal, Quebéc into a form field, what are they
>doing on the keyboard to make that happen?  As the
>string sits there in the text box, is it latin1, or utf-8
>or something else?  How does the browser know what
>sort of data it has in that text box?
>

If this were my ancient Amiga -- most of the accented characters in
ISO-Latin-1 were entered by using one of the meta/alt keys simultaneously
with one of five or six designated "dead keys" (in days of typewriters, a
dead key was one that did not advance the carriage to the next character
space). The dead key indicated which accent mark was to be applied to the
subsequent "regular" character.

On Windows, many of the characters might be entered using Alt+<digits>
(where the digits are keys on the numeric pad!) (such as Alt+1254 => µ).

As for what the browser receives? Unless the browser is asking for raw
key codes and translating them internally to some encoding, it is likely
receiving characters in whatever encoding has been defined for the
computer/OS (Windows, most likely CP1252, which is a superset of latin-1 as
I recall). Whether the browser then re-encodes that to UTF-8 is something I
can't answer.



-- 
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/


Re: UTF-8 and latin1

2022-08-19 Thread Daniel Lee
Thanks!

From: Stefan Ram <r...@zedat.fu-berlin.de>
Sent: 19 August 2022 6:23
To: python-list@python.org
Subject: Re: UTF-8 and latin1

Tobiah  writes:
>  When a person enters
>Montréal, Quebéc into a form field, what are they
>doing on the keyboard to make that happen?

  Depends on the OS and its configuration. Some devices might
  not even have a keyboard as hardware.

>As the
>string sits there in the text box, is it latin1, or utf-8
>or something else?

  This is an internal implementation detail of the browser.

>How does the browser know what
>sort of data it has in that text box?

  This is an internal implementation detail of the browser.

  You usually do not need to know this internal information
  about the browser in order to use it.




Re: UTF-8 and latin1

2022-08-18 Thread Chris Angelico
On Fri, 19 Aug 2022 at 08:15, Tobiah  wrote:
>
> > You configure the web server to send:
> >
> >  Content-Type: text/html; charset=...
> >
> > in the HTTP header when it serves HTML files.
>
> So how does this break down?  When a person enters
> Montréal, Quebéc into a form field, what are they
> doing on the keyboard to make that happen?  As the
> string sits there in the text box, is it latin1, or utf-8
> or something else?  How does the browser know what
> sort of data it has in that text box?
>

As it sits there in the text box, it is *a text string*.

When it gets sent to the server, the encoding is defined by the
browser (with reference to the server's specifications) and identified
in a request header.

The server should then receive that and interpret it as a text string.

Encodings should ONLY be relevant when data is stored in files or
transmitted across a network etc, and the rest of the time, just think
in Unicode.
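
A minimal sketch of that "decode at the edge, think in Unicode, encode at the edge" pattern. The function name and the sample value are made up for illustration, and the source encoding is assumed to be known from somewhere (a header, a file format, a config setting).

```python
def reencode_to_utf8(raw, source_encoding):
    """Decode incoming bytes once, work on str, encode once on the way out."""
    text = raw.decode(source_encoding)   # bytes -> str at the input boundary
    text = text.strip().title()          # all processing happens on text
    return text.encode("utf-8")          # str -> bytes at the output boundary

incoming = b"  montr\xe9al  "            # hypothetical latin-1 form value
outgoing = reencode_to_utf8(incoming, "latin-1")
print(outgoing)
```

Between the two boundaries the code never needs to know which encoding the data arrived in.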

Also - migrate to Python 3, your life will become a lot easier.

ChrisA


Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
On 2022-08-18, Tobiah  wrote:
>> You configure the web server to send:
>> 
>>  Content-Type: text/html; charset=...
>> 
>> in the HTTP header when it serves HTML files.
>
> So how does this break down?  When a person enters
> Montréal, Quebéc into a form field, what are they
> doing on the keyboard to make that happen?

It depends on what keyboard they have. Using a standard UK or US
("qwerty") keyboard and Windows you should be able to type "é" by
holding down the 'Alt' key to the right of the spacebar, and typing
'e'.  If they're using a French ("azerty") keyboard then I think they
can enter it by holding 'shift' and typing '2'.

> As the string sits there in the text box, is it latin1, or utf-8
> or something else?

That depends on which browser you're using. I think it's quite likely
it will use UTF-32 (i.e. fixed-width 32 bits per character).
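
For a sense of what "fixed-width 32 bits per character" means in practice, here is a quick sketch comparing the encoded size of the same text in a few encodings. It only shows byte counts in Python and says nothing about what any particular browser really stores internally.

```python
text = "Montr\u00e9al"   # "Montréal", 8 characters

# Encoded size in bytes; the "-le" variants avoid counting a byte order mark.
sizes = {
    encoding: len(text.encode(encoding))
    for encoding in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}

for encoding, size in sizes.items():
    print(encoding, size)
```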

> How does the browser know what sort of data it has in that text box?

It's a text box, so it knows it's text.


Re: UTF-8 and latin1

2022-08-18 Thread Tobiah

You configure the web server to send:

 Content-Type: text/html; charset=...

in the HTTP header when it serves HTML files.


So how does this break down?  When a person enters
Montréal, Quebéc into a form field, what are they
doing on the keyboard to make that happen?  As the
string sits there in the text box, is it latin1, or utf-8
or something else?  How does the browser know what
sort of data it has in that text box?




Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
On 2022-08-18, Tobiah  wrote:
>> Generally speaking browser submissions were/are supposed to be sent
>> using the same encoding as the page, so if you're sending the page
>> as "latin1" then you'll see that a fair amount I should think. If you
>> send it as "utf-8" then you'll get 100% utf-8 back.
>
> The only trick I know is to use a <meta> charset tag.  Would
> that 'send' the post as utf-8?  I always expected it had more
> to do with the way the user entered the characters.  How do
> they by the way, enter things like Montréal, Quebéc.  When they
> enter that into a text box on a web page can we say it's in
> a particular encoding at that time?  At submit time?

You configure the web server to send:

Content-Type: text/html; charset=...

in the HTTP header when it serves HTML files. Another way is to put:

    <meta http-equiv="Content-Type" content="text/html; charset=...">

or:

    <meta charset="...">

in the <head> section of your HTML document. The HTML "standard"
nowadays says that you are only allowed to use the "utf-8" encoding,
but if you use another encoding then browsers will generally use that
as both the encoding to use when reading the HTML file and the encoding
to use when submitting form data.
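
On the receiving side, the charset can be pulled back out of such a header with the standard library. This is only a sketch: the helper name and the fallback default are assumptions for illustration, not anything a particular server or framework does.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```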


Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
On 2022-08-17, Barry  wrote:
>> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list 
>>  wrote:
>> On 2022-08-17, Tobiah  wrote:
>>> I get data from various sources; client emails, spreadsheets, and
>>> data from web applications.  I find that I can do 
>>> some_string.decode('latin1')
>>> to get unicode that I can use with xlsxwriter,
>>> or put a latin1 <meta> charset tag in the header of a web page to display
>>> European characters correctly.  But normally UTF-8 is recommended as
>>> the encoding to use today.  latin1 works correctly more often when I
>>> am using data from the wild.  It's frustrating that I have to play
>>> a guessing game to figure out how to use incoming text.   I'm just wondering
>>> if there are any thoughts.  What if we just globally decided to use utf-8?
>>> Could that ever happen?
>> 
>> That has already been decided, as much as it ever can be. UTF-8 is
>> essentially always the correct encoding to use on output, and almost
>> always the correct encoding to assume on input absent any explicit
>> indication of another encoding. (e.g. the HTML "standard" says that
>> all HTML files must be UTF-8.)
>> 
>> If you are finding that your specific sources are often encoded with
>> latin-1 instead then you could always try something like:
>> 
>>     try:
>>         text = data.decode('utf-8')
>>     except UnicodeDecodeError:
>>         text = data.decode('latin-1')
>> 
>> (I think latin-1 text will almost always fail to be decoded as utf-8,
>> so this would work fairly reliably assuming those are the only two
>> encodings you see.)
>
> Only if a reserved byte is used in the string.
> It will often work in either.

Because it's actually ASCII and hence there's no difference between
interpreting it as utf-8 or iso-8859-1? In which case, who cares?

> For web pages it cannot be assumed that markup saying it’s utf-8 is
> correct. Many pages are in fact cp1252. Usually you find out because
> of a smart quote, which is 0x91-0x94 in cp1252 and illegal on its own in utf-8.

Hence what I said above. But if a source explicitly states an encoding
and it's false then these days I see little need for sympathy.
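
To make the ASCII point concrete, here is a short sketch with made-up sample bytes: pure ASCII decodes identically either way, a typical latin-1 byte is rejected by the utf-8 decoder, and the rare exception is latin-1 text whose bytes happen to form a valid utf-8 sequence.

```python
# Pure ASCII: the choice of decoder makes no difference.
ascii_only = b"Montreal, Quebec"
same_result = ascii_only.decode("utf-8") == ascii_only.decode("latin-1")

# A lone latin-1 0xE9 ("é") is not valid utf-8, so the decode fails loudly.
latin1_bytes = b"Montr\xe9al"
try:
    latin1_bytes.decode("utf-8")
    rejected_by_utf8 = False
except UnicodeDecodeError:
    rejected_by_utf8 = True

# The "almost always" caveat: latin-1 "Ã©" is byte-for-byte utf-8 "é".
ambiguous = "\u00c3\u00a9".encode("latin-1")
decoded_ambiguous = ambiguous.decode("utf-8")

print(same_result, rejected_by_utf8, repr(decoded_ambiguous))
```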


Re: UTF-8 and latin1

2022-08-18 Thread Tobiah

Generally speaking browser submissions were/are supposed to be sent
using the same encoding as the page, so if you're sending the page
as "latin1" then you'll see that a fair amount I should think. If you
send it as "utf-8" then you'll get 100% utf-8 back.


The only trick I know is to use a <meta> charset tag.  Would
that 'send' the post as utf-8?  I always expected it had more
to do with the way the user entered the characters.  How do
they by the way, enter things like Montréal, Quebéc.  When they
enter that into a text box on a web page can we say it's in
a particular encoding at that time?  At submit time?



Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
On 2022-08-17, Tobiah  wrote:
>> That has already been decided, as much as it ever can be. UTF-8 is
>> essentially always the correct encoding to use on output, and almost
>> always the correct encoding to assume on input absent any explicit
>> indication of another encoding. (e.g. the HTML "standard" says that
>> all HTML files must be UTF-8.)

> I got an email from a client with blast text that
> was in French with stuff like: Montréal, Quebéc.
> latin1 did the trick.

There's no accounting for the Québécois. They think they speak French.

> Also, whenever I get a spreadsheet from a client and save as .csv,
> or take browser data through PHP, it always seems to work with latin1,
> but not UTF-8.

That depends on how you "saved as .csv" and what you did with PHP.
Generally speaking browser submissions were/are supposed to be sent
using the same encoding as the page, so if you're sending the page
as "latin1" then you'll see that a fair amount I should think. If you
send it as "utf-8" then you'll get 100% utf-8 back.
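
As a rough illustration in Python rather than PHP (the query string is a made-up example of what a page served as latin-1 might post back), the same percent-encoded form data decodes cleanly or not depending on which encoding the receiver assumes:

```python
from urllib.parse import parse_qs

# Hypothetical form body: "é" sent as the single latin-1 byte 0xE9.
query = "city=Montr%E9al"

as_latin1 = parse_qs(query, encoding="latin-1")
# parse_qs defaults to encoding="utf-8", errors="replace", so the
# invalid byte turns into U+FFFD instead of raising.
as_utf8 = parse_qs(query)

print(as_latin1)
print(as_utf8)
```

Note that the utf-8 attempt does not raise here; it silently yields a replacement character, which is one way such data "seems to work with latin1, but not UTF-8".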


Re: UTF-8 and latin1

2022-08-17 Thread dn
On 18/08/2022 03.33, Stefan Ram wrote:
> Tobiah  writes:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications.  I find that I can do 
>> some_string.decode('latin1')
> 
>   Strings have no "decode" method. ("bytes" objects do.)
> 
>> to get unicode that I can use with xlsxwriter,
>> or put a latin1 <meta> charset tag in the header of a web page to display
>> European characters correctly.
> 
> |You should always use the UTF-8 character encoding. (Remember
> |that this means you also need to save your content as UTF-8.)
> World Wide Web Consortium (W3C) (2014)
> 
>> am using data from the wild.  It's frustrating that I have to play
>> a guessing game to figure out how to use incoming text.   I'm just wondering
> 
>   You can let Python guess the encoding of a file.
> 
> import pathlib
>
> def encoding_of( name ):
>     path = pathlib.Path( name )
>     for encoding in( "utf_8", "cp1252", "latin_1" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" )as file:
>                 text = file.read()
>             return encoding
>         except UnicodeDecodeError:
>             pass
>     return None
> 
>> if there are any thoughts.  What if we just globally decided to use utf-8?
>> Could that ever happen?
> 
>   That decision was made long ago.

Unfortunately, much of our data was collected long before then - and as
we've discovered, the OP is still living in Python 2 times.

What about if the path "name" (above) is not in utf-8?
eg the OP's Montréal in Latin1, as Montréal.txt or Montréal.rpt
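
For what it's worth, on POSIX systems Python decodes file names with the "surrogateescape" error handler (PEP 383), so an undecodable name still round-trips. A sketch of that mechanism on a made-up name, using the handler directly rather than the real filesystem (Windows handles names differently):

```python
raw_name = b"Montr\xe9al.txt"   # hypothetical latin-1 encoded file name

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw_name.decode("utf-8", errors="surrogateescape")
# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", errors="surrogateescape")

print(repr(name))
print(roundtrip == raw_name)
```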
-- 
Regards,
=dn


Re: UTF-8 and latin1

2022-08-17 Thread Barry


> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list 
>  wrote:
> 
> On 2022-08-17, Tobiah  wrote:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications.  I find that I can do 
>> some_string.decode('latin1')
>> to get unicode that I can use with xlsxwriter,
>> or put a latin1 <meta> charset tag in the header of a web page to display
>> European characters correctly.  But normally UTF-8 is recommended as
>> the encoding to use today.  latin1 works correctly more often when I
>> am using data from the wild.  It's frustrating that I have to play
>> a guessing game to figure out how to use incoming text.   I'm just wondering
>> if there are any thoughts.  What if we just globally decided to use utf-8?
>> Could that ever happen?
> 
> That has already been decided, as much as it ever can be. UTF-8 is
> essentially always the correct encoding to use on output, and almost
> always the correct encoding to assume on input absent any explicit
> indication of another encoding. (e.g. the HTML "standard" says that
> all HTML files must be UTF-8.)
> 
> If you are finding that your specific sources are often encoded with
> latin-1 instead then you could always try something like:
> 
>     try:
>         text = data.decode('utf-8')
>     except UnicodeDecodeError:
>         text = data.decode('latin-1')
> 
> (I think latin-1 text will almost always fail to be decoded as utf-8,
> so this would work fairly reliably assuming those are the only two
> encodings you see.)

Only if a reserved byte is used in the string.
It will often work in either.

For web pages it cannot be assumed that markup saying it’s utf-8 is
correct. Many pages are in fact cp1252. Usually you find out because
of a smart quote, which is 0x91-0x94 in cp1252 and illegal on its own in utf-8.

Barry


> 
> Or you could use something fancy like https://pypi.org/project/chardet/
> 


Re: UTF-8 and latin1

2022-08-17 Thread Tobiah

That has already been decided, as much as it ever can be. UTF-8 is
essentially always the correct encoding to use on output, and almost
always the correct encoding to assume on input absent any explicit
indication of another encoding. (e.g. the HTML "standard" says that
all HTML files must be UTF-8.)


I got an email from a client with blast text that
was in French with stuff like: Montréal, Quebéc.
latin1 did the trick.
Also, whenever I get a spreadsheet from a client and save as .csv,
or take browser data through PHP, it always seems
to work with latin1, but not UTF-8.




Re: UTF-8 and latin1

2022-08-17 Thread Tobiah

On 8/17/22 08:33, Stefan Ram wrote:

Tobiah  writes:

I get data from various sources; client emails, spreadsheets, and
data from web applications.  I find that I can do some_string.decode('latin1')


   Strings have no "decode" method. ("bytes" objects do.)


I'm using 2.7.  Maybe that's why.
 


Toby


UTF-8 and latin1

2022-08-17 Thread Tobiah

I get data from various sources; client emails, spreadsheets, and
data from web applications.  I find that I can do some_string.decode('latin1')
to get unicode that I can use with xlsxwriter,
or put a latin1 <meta> charset tag in the header of a web page to display
European characters correctly.  But normally UTF-8 is recommended as
the encoding to use today.  latin1 works correctly more often when I
am using data from the wild.  It's frustrating that I have to play
a guessing game to figure out how to use incoming text.   I'm just wondering
if there are any thoughts.  What if we just globally decided to use utf-8?
Could that ever happen?



Re: UTF-8 and latin1

2022-08-17 Thread Jon Ribbens via Python-list
On 2022-08-17, Tobiah  wrote:
> I get data from various sources; client emails, spreadsheets, and
> data from web applications.  I find that I can do some_string.decode('latin1')
> to get unicode that I can use with xlsxwriter,
> or put a latin1 <meta> charset tag in the header of a web page to display
> European characters correctly.  But normally UTF-8 is recommended as
> the encoding to use today.  latin1 works correctly more often when I
> am using data from the wild.  It's frustrating that I have to play
> a guessing game to figure out how to use incoming text.   I'm just wondering
> if there are any thoughts.  What if we just globally decided to use utf-8?
> Could that ever happen?

That has already been decided, as much as it ever can be. UTF-8 is
essentially always the correct encoding to use on output, and almost
always the correct encoding to assume on input absent any explicit
indication of another encoding. (e.g. the HTML "standard" says that
all HTML files must be UTF-8.)

If you are finding that your specific sources are often encoded with
latin-1 instead then you could always try something like:

    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        text = data.decode('latin-1')

(I think latin-1 text will almost always fail to be decoded as utf-8,
so this would work fairly reliably assuming those are the only two
encodings you see.)
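
The same idea wrapped in a reusable function, as a sketch only. The function name is made up, and trying cp1252 before latin-1 is a variation on the snippet above rather than part of it; it assumes Windows-sourced text is more likely than raw C1 control characters.

```python
def decode_guess(data):
    """Return (text, encoding_used), trying utf-8, then cp1252, then latin-1."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last step cannot fail.
    return data.decode("latin-1"), "latin-1"

print(decode_guess(b"Montr\xc3\xa9al"))   # valid utf-8
print(decode_guess(b"Montr\xe9al"))       # not utf-8, falls through to cp1252
print(decode_guess(b"\x81"))              # undefined in cp1252, ends at latin-1
```

Because the last step always succeeds, the function always returns something; whether it is the right text is still a guess.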

Or you could use something fancy like https://pypi.org/project/chardet/
