Re: [Avocado-devel] [RFC] Text/binary data and encodings: how they relate to Avocado, extensions and tests
On 18.4.2018 at 03:18, Cleber Rosa wrote:
> Recently, Avocado has seen a lot of changes brought by the Python 3
> port. One fundamental difference between Python 2 and 3 is under
> the spotlight: how to deal with "text" and "binary" data[1].
>
> It's then important to make it clear where Avocado stands (or is
> headed) when it comes to handling text, binary data and encodings,
> which is the goal of this document.
>
> First, let's review some very basic concepts.
>
> Bytes, the unassuming arrays
> ============================
>
> On both Python 2 and 3, there's "bytes". On Python 2, it's nothing
> but an alias to "str"[2]::
>
>    >>> import sys; sys.version[0]
>    '2'
>    >>> bytes is str
>    True
>
> One of the striking characteristics of "bytes" is that every byte
> counts, that is::
>
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute)
>    2
>
> This is as simple as it gets. The "bytes" type is an "array" of
> bytes.
>
> Also, if it's not clear enough, this sequence of two bytes happens to
> be **one way** to **represent** the "LATIN SMALL LETTER A WITH
> ACUTE"[3] character, as defined by the Unicode standard, in a given
> encoding. Please pause for a moment and let that information settle.
>
> Old habits die hard
> ===================
>
> We, human beings, are used to dealing with text. Developers, being a
> special kind of human being, are used to dealing with *character arrays*
> instead. Those are, or have been for a long time, sequences of
> one-byte characters with specific (but somewhat implicit) meaning.
>
> Many developers will still assume that each byte contains a value that
> maps to the ascii(7) table::
>
>    Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
>
>    000   0     00    NUL '\0' (null character)   100   64    40    @
>    001   1     01    SOH (start of heading)      101   65    41    A
>    002   2     02    STX (start of text)         102   66    42    B
>    ...
>    076   62    3E    >                           176   126   7E    ~
>    077   63    3F    ?                           177   127   7F    DEL
>
> Some other developers will assume that ASCII is a thing of the past,
> and each one-byte character means something according to the latin1(7)
> mapping::
>
>    ISO 8859-1 characters
>    The following table displays the characters in ISO 8859-1, which
>    are printable and unlisted in the ascii(7) manual page.
>
>    Oct   Dec   Hex   Char   Description
>
>    240   160   A0           NO-BREAK SPACE
>    241   161   A1    ¡      INVERTED EXCLAMATION MARK
>    242   162   A2    ¢      CENT SIGN
>    ...
>    376   254   FE    þ      LATIN SMALL LETTER THORN
>    377   255   FF    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
>
> Then, there's yet another group of developers who believe that a byte
> in an array of bytes may be either a character, or part of a
> character. They believe that because Unicode and "UTF-8" are the
> new standard and can be assumed to be everywhere.
>
> The fact is, all those developers are wrong. Not because an array of
> bytes cannot contain what they believe, but because one can only
> guess that an array of bytes maps to a character set (an encoding).
>
> Data itself carries no intrinsic meaning
> ========================================
>
> Pure data doesn't have any meaning. Its meaning depends on the
> interpretation given, that is, some kind of context around it.
>
> When dealing with text, the meaning of data is usually determined by a
> character set, a mapping table or some more advanced encoding and
> decoding mechanism.
>
> For instance, the following sequence of numbers expressed in
> decimal format and separated by spaces::
>
>    65 66 67 68 69
>
> Will only mean the first letters of the western alphabet, ``ABCDE``,
> **if** we determine that its meaning is based on the ASCII character
> set (besides other details such as ordering, separator used, etc).
>
> Turning arrays of bytes into text
> =================================
>
> On many occasions, usually when data is destined for humans, it is
> necessary to present it, and to deal with it, in a different way.
> Here, we use the abstract term *text* to refer to data that is more
> meaningful to humans, and would usually be found in documents (such as
> this one) intended to be distributed and read by us, the non-machine
> beings.
>
> Reusing the example given earlier, one can do on a Python interpreter::
>
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute.decode('utf-8'))
>    1
>
> The process of turning bytes into "text" is called "decoding" by
> Python. It helps to think of bytes as something that humans cannot
> understand and consequently needs deciphering (or decoding) to then
> become something readable by humans.
>
> In this process, the encoding is of the
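
To illustrate the quoted point that bytes carry no meaning until an encoding is chosen, here is a minimal sketch (not part of the original message) that interprets the same two bytes under the three assumptions described above: ASCII, Latin-1 and UTF-8. The variable names are made up for this example, and it assumes a Python 3 interpreter.

```python
# The same two bytes, interpreted under three different assumptions.
data = b'\xc3\xa1'

# Assumption 1: UTF-8. The two bytes form a single character.
as_utf8 = data.decode('utf-8')

# Assumption 2: Latin-1 (ISO 8859-1). Every byte is its own character,
# so the result is two characters and never fails to decode.
as_latin1 = data.decode('latin-1')

# Assumption 3: ASCII. Both bytes are above 0x7F, so decoding fails.
try:
    data.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(data), len(as_utf8), len(as_latin1), ascii_ok)
# prints: 2 1 2 False
```

All three decodings start from identical data. Only the assumed encoding differs, so a successful decode says nothing about whether the guess was the right one.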