Re: UTF_16 question

2024-04-29 Thread Richard Damon via Python-list
> On Apr 29, 2024, at 12:23 PM, jak via Python-list  
> wrote:
> 
> Hi everyone,
> one thing that I do not understand is happening to me: I have some text
> files with different characteristics, among these there are that they
> have an UTF_32_le coding, utf_32be, utf_16_le, utf_16_be all of them
> without BOM. With those utf_32_xx I have no problem but with the
> UTF_16_xx I have. If I have an utf_16_le coded file and I read it with
> encoding='utf_16_le' I have no problem I read it, with
> encoding='utf_16_be' I can read it without any error even if the data I
> receive have the inverted bytes. The same thing happens with the
> utf_16_be codified file, I read it, both with encoding='utf_16_be' and
> with 'utf_16_le' without errors but in the last case the bytes are
> inverted. What did I not understand? What am I doing wrong?
> 
> thanks in advance
> 
> --
> https://mail.python.org/mailman/listinfo/python-list

That is why the BOM was created. A lot of files can be “correctly” read as 
either UTF-16-LE or UTF-1-BE encoded, as most of the 16 bit codes are valid, so 
unless the wrong encoding happens to hit something that is invalid (most likely 
something looking like a Surrogage Pair without a match), there isn’t an error 
in reading the file. The BOM character was specifically designed to be an 
invalid code if read by the wrong encoding (if you ignore the possibility of 
the file having a NUL right after the BOM)

If you know the files likely contains a lot of “ASCII” characters, then you 
might be able to detect that you got it wrong, due to seeing a lot of 0xXX00 
characters and few 0x00XX characters, but that doesn’t create an “error” 
normally.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: UTF_16 question

2024-04-29 Thread jak via Python-list

Stefan Ram ha scritto:

jak  wrote or quoted:

 I read it, both with encoding='utf_16_be' and
with 'utf_16_le' without errors but in the last case the bytes are
inverted.


   I think the order of the octets (bytes) is exactly the difference
   between these two encodings, so your observation isn't really
   surprising. The computer can't report an error here since it
   can't infer the correct encoding from the file data. It's like
   that koan: "A bit has the value 1. What does that mean?".



Understood. They are just 2 bytes and there is no difference between
them.

Thank you.


--
https://mail.python.org/mailman/listinfo/python-list


UTF_16 question

2024-04-29 Thread jak via Python-list

Hi everyone,
one thing that I do not understand is happening to me: I have some text
files with different characteristics, among these there are that they
have an UTF_32_le coding, utf_32be, utf_16_le, utf_16_be all of them
without BOM. With those utf_32_xx I have no problem but with the
UTF_16_xx I have. If I have an utf_16_le coded file and I read it with
encoding='utf_16_le' I have no problem I read it, with
encoding='utf_16_be' I can read it without any error even if the data I
receive have the inverted bytes. The same thing happens with the
utf_16_be codified file, I read it, both with encoding='utf_16_be' and
with 'utf_16_le' without errors but in the last case the bytes are
inverted. What did I not understand? What am I doing wrong?

thanks in advance

--
https://mail.python.org/mailman/listinfo/python-list