Re: [Avocado-devel] [RFC] Text/binary data and encodings: how they relate to Avocado, extensions and tests

2018-04-19 Thread Lukáš Doktor
On 18.4.2018 at 03:18, Cleber Rosa wrote:
> Recently, Avocado has seen a lot of changes brought by the Python 3
> port.  One fundamental difference between Python 2 and 3 is under
> the spotlight: how to deal with "text" and "binary" data[1].
> 
> It's then important to make it clear where Avocado stands (or is
> headed) when it comes to handling text, binary data and encodings,
> which is the goal of this document.
> 
> First, let's review some very basic concepts.
> 
> Bytes, the unassuming arrays
> ============================
> 
> On both Python 2 and 3, there's "bytes".  On Python 2, it's nothing
> but an alias to "str"[2]::
> 
>    >>> import sys; sys.version[0]
>    '2'
>    >>> bytes is str
>    True
> 
> One of the striking characteristics of "bytes" is that every byte
> counts, that is::
> 
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute)
>    2
> 
> This is as simple as it gets.  The "bytes" type is an "array" of
> bytes.
> 
> Also, if it's not clear enough, this sequence of two bytes happens to
> be **one way** to **represent** the "LATIN SMALL LETTER A WITH
> ACUTE"[3] character, as defined by the Unicode standard, in a given
> encoding.  Please pause for a moment and let that information settle.
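To make the "one way to represent" point concrete, here is a minimal sketch that encodes the same character under three encodings. The choice of encodings and the variable names are illustrative assumptions, not something taken from Avocado:

```python
# LATIN SMALL LETTER A WITH ACUTE, written as a Unicode escape so that
# this snippet does not depend on the encoding of the source file
aacute_text = u'\u00e1'

# The same character, represented as bytes under different encodings
# (the three encodings below are picked only for illustration)
representations = {
    'utf-8': aacute_text.encode('utf-8'),
    'latin-1': aacute_text.encode('latin-1'),
    'utf-16-le': aacute_text.encode('utf-16-le'),
}

for encoding in sorted(representations):
    data = representations[encoding]
    print(encoding, repr(data), len(data))
```

Only the UTF-8 result matches the two bytes shown in the example above; the other encodings produce different byte sequences for the very same character.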
> 
> Old habits die hard
> ===================
> 
> We, human beings, are used to dealing with text.  Developers, being a
> special kind of human being, are used to dealing with *character arrays*
> instead.  Those are, or have been for a long time, sequences of
> one-byte characters with specific (but somewhat implicit) meaning.
> 
> Many developers will still assume that each byte contains a value that
> maps to the ascii(7) table::
> 
>    Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
>
>    000   0     00    NUL '\0' (null character)   100   64    40    @
>    001   1     01    SOH (start of heading)      101   65    41    A
>    002   2     02    STX (start of text)         102   66    42    B
>    ...
>    076   62    3E    >                           176   126   7E    ~
>    077   63    3F    ?                           177   127   7F    DEL
> 
> Some other developers will assume that ASCII is a thing of the past,
> and each one-byte character means something according to the latin1(7)
> mapping::
> 
>    ISO 8859-1 characters
>    The following table displays the characters in ISO 8859-1, which
>    are printable and unlisted in the ascii(7) manual page.
>
>    Oct   Dec   Hex   Char   Description
>
>    240   160   A0           NO-BREAK SPACE
>    241   161   A1    ¡      INVERTED EXCLAMATION MARK
>    242   162   A2    ¢      CENT SIGN
>    ...
>    376   254   FE    þ      LATIN SMALL LETTER THORN
>    377   255   FF    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
> 
> Then, there's yet another group of developers who believe that a byte
> in an array of bytes may be either a character, or part of a
> character.  They believe that because Unicode and "UTF-8" are the
> new standard and can be assumed to be everywhere.
> 
> The fact is, all those developers are wrong.  Not because an array of
> bytes cannot contain what they believe, but because one can only
> guess that an array of bytes maps to a character set (an encoding).
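The guessing problem can be sketched by decoding the same two bytes under each of the three assumptions described above. This is an illustrative snippet with made-up variable names, not code from Avocado:

```python
# The two bytes from the earlier example
data = b'\xc3\xa1'

# Assumption 1: the bytes are UTF-8 (one two-byte character)
as_utf8 = data.decode('utf-8')

# Assumption 2: the bytes are ISO 8859-1 (two one-byte characters)
as_latin1 = data.decode('latin-1')

# Assumption 3: the bytes are ASCII (fails, both values are above 127)
try:
    as_ascii = data.decode('ascii')
except UnicodeDecodeError:
    as_ascii = None

print(len(as_utf8), len(as_latin1), as_ascii)
```

Both successful decodes are "valid"; nothing in the bytes themselves says which one the producer of the data had in mind.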
> 
> Data itself carries no intrinsic meaning
> =========================================
> 
> Pure data doesn't have any meaning.  Its meaning depends on the
> interpretation given, that is, some kind of context around it.
> 
> When dealing with text, the meaning of data is usually determined by a
> character set, a mapping table or some more advanced encoding and
> decoding mechanism.
> 
> For instance, the following sequence of numbers expressed in
> decimal format and separated by spaces::
> 
>   65 66 67 68 69
> 
> Will only mean the first letters of the western alphabet, ``ABCDE``,
> **if** we determine that its meaning is based on the ASCII character
> set (besides other details such as ordering, separator used, etc).
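As a rough sketch of that dependency on context, the snippet below takes the decimal values 65 to 69, which ASCII assigns to ``A`` through ``E``, and interprets them under two different character sets. The second codec, ``cp037`` (an EBCDIC variant shipped with Python), is an arbitrary choice made only for contrast:

```python
# Decimal values separated by spaces; the separator and the ordering are
# themselves part of the assumed context
numbers = "65 66 67 68 69"
values = [int(number) for number in numbers.split(" ")]

# Build an array of bytes out of the values (works on Python 2 and 3)
data = bytes(bytearray(values))

# Interpreted as ASCII, the bytes mean the first letters of the alphabet
as_ascii = data.decode('ascii')

# Interpreted as cp037, the very same bytes mean something else
as_ebcdic = data.decode('cp037')

print(repr(as_ascii), repr(as_ebcdic))
```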
> 
> Turning arrays of bytes into text
> =================================
> 
> On many occasions, usually when data is destined for humans, it is
> necessary to present it, and to deal with it, in a different way.
> Here, we use the abstract term *text* to refer to data that is more
> meaningful to humans, and would usually be found in documents (such as
> this one) intended to be distributed and read by us, the non-machine
> beings.
> 
> Reusing the example given earlier, one can do on a Python interpreter::
> 
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute.decode('utf-8'))
>    1
> 
> The process of turning bytes into "text" is called "decoding" by
> Python.  It helps to think of bytes as something that humans cannot
> understand and consequently needs deciphering (or decoding) to then
> become something readable by humans.
> 
> In this process, the encoding is of the 

Re: [Avocado-devel] [RFC] Text/binary data and encodings: how they relate to Avocado, extensions and tests

2018-04-18 Thread Cleber Rosa


On 04/17/2018 09:18 PM, Cleber Rosa wrote:
> Recently, Avocado has seen a lot of changes brought by the Python 3
> port.  One fundamental difference between Python 2 and 3 is under
> the spotlight: how to deal with "text" and "binary" data[1].
> 
> It's then important to make it clear where Avocado stands (or is
> headed) when it comes to handling text, binary data and encodings,
> which is the goal of this document.
> 
> First, let's review some very basic concepts.
> 
> Bytes, the unassuming arrays
> ============================
> 
> On both Python 2 and 3, there's "bytes".  On Python 2, it's nothing
> but an alias to "str"[2]::
> 
>    >>> import sys; sys.version[0]
>    '2'
>    >>> bytes is str
>    True
> 
> One of the striking characteristics of "bytes" is that every byte
> counts, that is::
> 
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute)
>    2
> 
> This is as simple as it gets.  The "bytes" type is an "array" of
> bytes.
> 
> Also, if it's not clear enough, this sequence of two bytes happens to
> be **one way** to **represent** the "LATIN SMALL LETTER A WITH
> ACUTE"[3] character, as defined by the Unicode standard, in a given
> encoding.  Please pause for a moment and let that information settle.
> 
> Old habits die hard
> ===================
> 
> We, human beings, are used to dealing with text.  Developers, being a
> special kind of human being, are used to dealing with *character arrays*
> instead.  Those are, or have been for a long time, sequences of
> one-byte characters with specific (but somewhat implicit) meaning.
> 
> Many developers will still assume that each byte contains a value that
> maps to the ascii(7) table::
> 
>    Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
>
>    000   0     00    NUL '\0' (null character)   100   64    40    @
>    001   1     01    SOH (start of heading)      101   65    41    A
>    002   2     02    STX (start of text)         102   66    42    B
>    ...
>    076   62    3E    >                           176   126   7E    ~
>    077   63    3F    ?                           177   127   7F    DEL
> 
> Some other developers will assume that ASCII is a thing of the past,
> and each one-byte character means something according to the latin1(7)
> mapping::
> 
>    ISO 8859-1 characters
>    The following table displays the characters in ISO 8859-1, which
>    are printable and unlisted in the ascii(7) manual page.
>
>    Oct   Dec   Hex   Char   Description
>
>    240   160   A0           NO-BREAK SPACE
>    241   161   A1    ¡      INVERTED EXCLAMATION MARK
>    242   162   A2    ¢      CENT SIGN
>    ...
>    376   254   FE    þ      LATIN SMALL LETTER THORN
>    377   255   FF    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
> 
> Then, there's yet another group of developers who believe that a byte
> in an array of bytes may be either a character, or part of a
> character.  They believe that because Unicode and "UTF-8" are the
> new standard and can be assumed to be everywhere.
> 
> The fact is, all those developers are wrong.  Not because an array of
> bytes cannot contain what they believe, but because one can only
> guess that an array of bytes maps to a character set (an encoding).
> 
> Data itself carries no intrinsic meaning
> =========================================
> 
> Pure data doesn't have any meaning.  Its meaning depends on the
> interpretation given, that is, some kind of context around it.
> 
> When dealing with text, the meaning of data is usually determined by a
> character set, a mapping table or some more advanced encoding and
> decoding mechanism.
> 
> For instance, the following sequence of numbers expressed in
> decimal format and separated by spaces::
> 
>   65 66 67 68 69
> 
> Will only mean the first letters of the western alphabet, ``ABCDE``,
> **if** we determine that its meaning is based on the ASCII character
> set (besides other details such as ordering, separator used, etc).
> 
> Turning arrays of bytes into text
> =================================
> 
> On many occasions, usually when data is destined for humans, it is
> necessary to present it, and to deal with it, in a different way.
> Here, we use the abstract term *text* to refer to data that is more
> meaningful to humans, and would usually be found in documents (such as
> this one) intended to be distributed and read by us, the non-machine
> beings.
> 
> Reusing the example given earlier, one can do on a Python interpreter::
> 
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute.decode('utf-8'))
>    1
> 
> The process of turning bytes into "text" is called "decoding" by
> Python.  It helps to think of bytes as something that humans cannot
> understand and consequently needs deciphering (or decoding) to then
> become something readable by humans.
> 
> In this process, the encoding is of the