Re: PEP 249 Compliant error handling

2017-10-18 Thread Cameron Simpson

On 17Oct2017 21:38, Karsten Hilbert  wrote:

That's certainly a possibility, if that behavior conforms to the DB API 
"standards". My concern in this front is that in my experience working with 
other PEP 249 modules (specifically psycopg2), I'm pretty sure that columns designated as 
type VARCHAR or TEXT are returned as strings (unicode in python 2, although that may have 
been a setting I used), not bytes. The other complication here is that the 4D database 
doesn't use the UTF-8 encoding typically found, but rather UTF-16LE, and I don't know how 
well this is documented. So not only is the bytes representation completely 
unintelligible for human consumption, I'm not sure the average end-user would know what 
decoding to use.

In the end though, the main thing in my mind is to maintain "standards" 
compatibility - I don't want to be returning bytes if all other DB API modules return 
strings, or vice versa for that matter. There may be some flexibility there, but as much 
as possible I want to conform to the majority/standard/whatever



The thing here is that you don't want to return data AS IF it was correct 
despite it having been "corrected" by some driver logic.


I just want to say that I think this is correct and extremely important.


What might be interesting to users is to set an attribute on the cursor, say,
  cursor.faulty_data = unicode(faulty_data, errors='replace')
or some such in order to improve error messages to the end user.


Or perhaps more conveniently for the end user, possibly an option supplied at 
db connect time, though I'd entirely understand wanting a cursor option so that 
one can pick and choose in a fine grained fashion.
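One way that combination could look, sketched as plain Python. Everything here is hypothetical — PEP 249 specifies none of it; the connect-time default `decode_errors` and the `faulty_data` cursor attribute (following Karsten's suggestion) are illustrative names only:

```python
# Hypothetical sketch: a per-connection decoding policy with a per-cursor
# override. None of these names come from PEP 249.

class Cursor:
    def __init__(self, connection, decode_errors=None):
        # Fall back to the connection-wide setting when not overridden.
        self.decode_errors = decode_errors or connection.decode_errors
        self.faulty_data = None  # populated when corrupt text is seen

    def _decode(self, raw):
        try:
            return raw.decode('utf-16-le')  # strict by default
        except UnicodeDecodeError:
            # Remember the offending bytes so error messages can show them,
            # then apply whatever policy the user chose.
            self.faulty_data = raw
            if self.decode_errors == 'strict':
                raise
            return raw.decode('utf-16-le', errors=self.decode_errors)

class Connection:
    def __init__(self, decode_errors='strict'):
        self.decode_errors = decode_errors

    def cursor(self, decode_errors=None):
        return Cursor(self, decode_errors)

conn = Connection(decode_errors='replace')
cur = conn.cursor()
print(cur._decode(b'h\x00i\x00\x00\xd8'))  # 'hi\ufffd' (lone surrogate replaced)
```

The strict default keeps the "don't hand over bad data silently" property; the user opts in to lossy decoding per connection or per cursor.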


Cheers,
Cameron Simpson  (formerly c...@zip.com.au)
--
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-18 Thread Karsten Hilbert
On Wed, Oct 18, 2017 at 08:32:48AM -0800, Israel Brewster wrote:

> actual question, which is "how does the STANDARD (PEP 249 in
> this case) say to handle this, or, barring that (since the
> standard probably doesn't explicitly say), how do the
> MAJORITY of PEP 249 compliant modules handle this?" Not what
> is the *best* way to handle it, but rather what is the
> normal, expected behavior for a Python DB API module when
> presented with bad data? That is, how does psycopg2 behave?
> pyodbc? pymssql (I think)? Etc. Or is that portion of the
> behavior completely arbitrary and different for every module?

For what it is worth, psycopg2 does not give you bad data to
the best of my knowledge. In fact, given PostgreSQL's quite
tight checking of text data to be "good", psycopg2 hardly has
a chance to give you bad data. Most times the database itself
will detect the corruption and not even hand the data to
psycopg2.

IMHO a driver should not hand over to the client any bad data
unless explicitly told to do so, a case which does not
seem to be covered by the DB-API specs, regardless of what
the majority of drivers might do these days.

2 cent...

Karsten
-- 
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-18 Thread Thomas Jollans
On 17/10/17 19:26, Israel Brewster wrote:
> I have written and maintain a PEP 249 compliant (hopefully) DB API for the 4D 
> database, and I've run into a situation where corrupted string data from the 
> database can cause the module to error out. Specifically, when decoding the 
> string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in 
> position 86-87: illegal UTF-16 surrogate" error. This makes sense, given that 
> the string data got corrupted somehow, but the question is "what is the 
> proper way to deal with this in the module?" Should I just throw an error on 
> bad data? Or would it be better to set the errors parameter to something like 
> "replace"? The former feels a bit more "proper" to me (there's an error here, 
> so we throw an error), but leaves the end user dead in the water, with no way 
> to retrieve *any* of the data (from that row at least, and perhaps any rows 
> after it as well). The latter option sort of feels like sweeping the problem 
> under the rug, but does at least leave an error character in the string to 
> let them know there was an error, and will allow retrieval of any good data.
> 
> Of course, if this was in my own code I could decide on a case-by-case basis 
> what the proper action is, but since this is a module that has to work in any 
> situation, it's a bit more complicated.


The sqlite3 module falls back to returning bytes if there's a decoding
error. I don't know what the other modules do. It should be easy enough
for you to test this, though!

Python 3.5.3 (default, Jan 19 2017, 14:11:04)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import sqlite3

In [2]: db = sqlite3.connect("malformed-test1.sqlite3")

In [3]: db.execute("CREATE TABLE test (txt TEXT)")
Out[3]: <sqlite3.Cursor object at 0x...>

In [4]: db.execute("INSERT INTO test VALUES(?)", ("utf-8: é",))
Out[4]: <sqlite3.Cursor object at 0x...>

In [5]: db.execute("INSERT INTO test VALUES(?)", ("latin1:
é".encode('latin1'),))
Out[5]: 

In [6]: db.execute("SELECT * FROM test").fetchall()
Out[6]: [('utf-8: é',), (b'latin1: \xe9',)]

In [7]: db.text_factory = bytes # sqlite3 extension to the API

In [8]: db.execute("SELECT * FROM test").fetchall()
Out[8]: [(b'utf-8: \xc3\xa9',), (b'latin1: \xe9',)]


For what it's worth, this is also what os.listdir() does when it
encounters filenames in the wrong encoding on operating systems where
this is possible (e.g. Linux, but not Windows).
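A related escaping mechanism worth knowing here is the 'surrogateescape' error handler (the one Python itself uses for filesystem names on POSIX when returning str): each undecodable byte is smuggled through as a lone surrogate, and the original bytes can be recovered losslessly:

```python
raw = b'latin1: \xe9'            # not valid UTF-8

# Decoding with surrogateescape never fails: each bad byte becomes
# a lone surrogate in the U+DC80..U+DCFF range.
text = raw.decode('utf-8', errors='surrogateescape')
print(text)                      # 'latin1: \udce9'

# Encoding back with the same handler restores the original bytes exactly.
assert text.encode('utf-8', errors='surrogateescape') == raw
```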

If the encoding could be anything, I think you should give the user some
kind of choice between using bytes, raising errors, and escaping.

In the particular case of UTF-16 (especially if the encoding is always
UTF-16), the best solution is almost certainly to use
errors='surrogatepass' in both en- and decoding. I believe this is
fairly common practice when full interoperability with software that
predates UTF-16 (and previously used UCS-2) is required. This should
solve all your problems as long as you don't get strings with an odd
number of bytes.

See: https://en.wikipedia.org/wiki/UTF-16#U.2BD800_to_U.2BDFFF
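Concretely, a lone surrogate that fails strict UTF-16-LE decoding (as in the error from the original post) round-trips cleanly with surrogatepass. A minimal demonstration, independent of any particular driver:

```python
# U+D800 (a lone high surrogate) followed by 'A', encoded as UTF-16-LE.
# Strict decoding rejects the unpaired surrogate.
corrupt = b'\x00\xd8A\x00'

try:
    corrupt.decode('utf-16-le')            # strict: raises UnicodeDecodeError
except UnicodeDecodeError:
    print("strict decode failed")

# surrogatepass lets the lone surrogate through unchanged...
s = corrupt.decode('utf-16-le', errors='surrogatepass')
assert s == '\ud800A'

# ...and re-encoding with surrogatepass restores the original bytes.
assert s.encode('utf-16-le', errors='surrogatepass') == corrupt
```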

-- Thomas

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-18 Thread Israel Brewster
On Oct 17, 2017, at 12:02 PM, MRAB  wrote:
> 
> On 2017-10-17 20:25, Israel Brewster wrote:
>> 
>>> On Oct 17, 2017, at 10:35 AM, MRAB wrote:
>>> 
>>> On 2017-10-17 18:26, Israel Brewster wrote:
 I have written and maintain a PEP 249 compliant (hopefully) DB API for the 
 4D database, and I've run into a situation where corrupted string data 
 from the database can cause the module to error out. Specifically, when 
 decoding the string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't 
 decode bytes in position 86-87: illegal UTF-16 surrogate" error. This 
 makes sense, given that the string data got corrupted somehow, but the 
 question is "what is the proper way to deal with this in the module?" 
 Should I just throw an error on bad data? Or would it be better to set the 
 errors parameter to something like "replace"? The former feels a bit more 
 "proper" to me (there's an error here, so we throw an error), but leaves 
 the end user dead in the water, with no way to retrieve *any* of the data 
 (from that row at least, and perhaps any rows after it as well). The 
 latter option sort of feels like sweeping the problem under the rug, but 
 does at least leave an error character in the string to let them know there 
 was an error, and will allow retrieval of any good data.
 Of course, if this was in my own code I could decide on a case-by-case 
 basis what the proper action is, but since this is a module that has to work 
 in any situation, it's a bit more complicated.
>>> If a particular text field is corrupted, then raising UnicodeDecodeError 
>>> when trying to get the contents of that field as a Unicode string seems 
>>> reasonable to me.
>>> 
>>> Is there a way to get the contents as a bytestring, or to get the contents 
>>> with a different errors parameter, so that the user has the means to fix it 
>>> (if it's fixable)?
>> 
>> That's certainly a possibility, if that behavior conforms to the DB API 
>> "standards". My concern in this front is that in my experience working with 
>> other PEP 249 modules (specifically psycopg2), I'm pretty sure that columns 
>> designated as type VARCHAR or TEXT are returned as strings (unicode in 
>> python 2, although that may have been a setting I used), not bytes. The 
>> other complication here is that the 4D database doesn't use the UTF-8 
>> encoding typically found, but rather UTF-16LE, and I don't know how well 
>> this is documented. So not only is the bytes representation completely 
>> unintelligible for human consumption, I'm not sure the average end-user 
>> would know what decoding to use.
>> 
>> In the end though, the main thing in my mind is to maintain "standards" 
>> compatibility - I don't want to be returning bytes if all other DB API 
>> modules return strings, or vice versa for that matter. There may be some 
>> flexibility there, but as much as possible I want to conform to the 
>> majority/standard/whatever
>> 
> The average end-user might not know which encoding is being used, but 
> providing a way to read the underlying bytes will give a more experienced 
> user the means to investigate and possibly fix it: get the bytes, figure out 
> what the string should be, update the field with the correctly decoded string 
> using normal DB instructions.

I agree, and if I was just writing some random module I'd probably go with it, 
or perhaps with the suggestion offered by Karsten Hilbert. However, neither 
answer addresses my actual question, which is "how does the STANDARD (PEP 249 
in this case) say to handle this, or, barring that (since the standard probably 
doesn't explicitly say), how do the MAJORITY of PEP 249 compliant modules 
handle this?" Not what is the *best* way to handle it, but rather what is the 
normal, expected behavior for a Python DB API module when presented with bad 
data? That is, how does psycopg2 behave? pyodbc? pymssql (I think)? Etc. Or is 
that portion of the behavior completely arbitrary and different for every 
module?

It may well be that one of the suggestions *IS* the normal, expected behavior, 
but it sounds more like you are suggesting how you think would be best to 
handle it, which is appreciated but not actually what I'm asking :-) Sorry if I 
am being difficult.

> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-18 Thread Israel Brewster


> On Oct 18, 2017, at 1:46 AM, Abdur-Rahmaan Janhangeer wrote:
> 
> all corruption systematically ignored but data piece logged in for analysis

Thanks. Can you expound a bit on what you mean by "data piece logged in" in 
this context? I'm not aware of any logging specifications in the PEP 249, and 
would think that would be more end-user configured rather than module level.

---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---

> 
> Abdur-Rahmaan Janhangeer,
> Mauritius
> abdurrahmaanjanhangeer.wordpress.com 
> 
> 
> On 17 Oct 2017 21:43, "Israel Brewster" wrote:
> I have written and maintain a PEP 249 compliant (hopefully) DB API for the 4D 
> database, and I've run into a situation where corrupted string data from the 
> database can cause the module to error out. Specifically, when decoding the 
> string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in 
> position 86-87: illegal UTF-16 surrogate" error. This makes sense, given that 
> the string data got corrupted somehow, but the question is "what is the 
> proper way to deal with this in the module?" Should I just throw an error on 
> bad data? Or would it be better to set the errors parameter to something like 
> "replace"? The former feels a bit more "proper" to me (there's an error here, 
> so we throw an error), but leaves the end user dead in the water, with no way 
> to retrieve *any* of the data (from that row at least, and perhaps any rows 
> after it as well). The latter option sort of feels like sweeping the problem 
> under the rug, but does at least leave an error character in the string to 
> let them know there was an error, and will allow retrieval of any good data.
> 
> Of course, if this was in my own code I could decide on a case-by-case basis 
> what the proper action is, but since this is a module that has to work in any 
> situation, it's a bit more complicated.
> ---
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293 
> ---
> 
> 
> 
> 
> --
> https://mail.python.org/mailman/listinfo/python-list 
> 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-18 Thread Abdur-Rahmaan Janhangeer
all corruption systematically ignored but data piece logged in for analysis

Abdur-Rahmaan Janhangeer,
Mauritius
abdurrahmaanjanhangeer.wordpress.com

On 17 Oct 2017 21:43, "Israel Brewster"  wrote:

> I have written and maintain a PEP 249 compliant (hopefully) DB API for the
> 4D database, and I've run into a situation where corrupted string data from
> the database can cause the module to error out. Specifically, when decoding
> the string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode
> bytes in position 86-87: illegal UTF-16 surrogate" error. This makes sense,
> given that the string data got corrupted somehow, but the question is "what
> is the proper way to deal with this in the module?" Should I just throw an
> error on bad data? Or would it be better to set the errors parameter to
> something like "replace"? The former feels a bit more "proper" to me
> (there's an error here, so we throw an error), but leaves the end user dead
> in the water, with no way to retrieve *any* of the data (from that row at
> least, and perhaps any rows after it as well). The latter option sort of
> feels like sweeping the problem under the rug, but does at least leave an
> error character in the string to let them know there was an error, and will
> allow retrieval of any good data.
>
> Of course, if this was in my own code I could decide on a case-by-case
> basis what the proper action is, but since this is a module that has to work
> in any situation, it's a bit more complicated.
> ---
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> ---
>
>
>
>
> --
> https://mail.python.org/mailman/listinfo/python-list
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-17 Thread MRAB

On 2017-10-17 20:25, Israel Brewster wrote:


On Oct 17, 2017, at 10:35 AM, MRAB wrote:


On 2017-10-17 18:26, Israel Brewster wrote:
I have written and maintain a PEP 249 compliant (hopefully) DB API 
for the 4D database, and I've run into a situation where corrupted 
string data from the database can cause the module to error out. 
Specifically, when decoding the string, I get a "UnicodeDecodeError: 
'utf-16-le' codec can't decode bytes in position 86-87: illegal 
UTF-16 surrogate" error. This makes sense, given that the string 
data got corrupted somehow, but the question is "what is the proper 
way to deal with this in the module?" Should I just throw an error 
on bad data? Or would it be better to set the errors parameter to 
something like "replace"? The former feels a bit more "proper" to me 
(there's an error here, so we throw an error), but leaves the end 
user dead in the water, with no way to retrieve *any* of the data 
(from that row at least, and perhaps any rows after it as well). The 
latter option sort of feels like sweeping the problem under the rug, 
but does at least leave an error character in the string to let them know 
there was an error, and will allow retrieval of any good data.
Of course, if this was in my own code I could decide on a 
case-by-case basis what the proper action is, but since this is a 
module that has to work in any situation, it's a bit more complicated.
If a particular text field is corrupted, then raising 
UnicodeDecodeError when trying to get the contents of that field as a 
Unicode string seems reasonable to me.


Is there a way to get the contents as a bytestring, or to get the 
contents with a different errors parameter, so that the user has the 
means to fix it (if it's fixable)?


That's certainly a possibility, if that behavior conforms to the DB 
API "standards". My concern in this front is that in my experience 
working with other PEP 249 modules (specifically psycopg2), I'm pretty 
sure that columns designated as type VARCHAR or TEXT are returned as 
strings (unicode in python 2, although that may have been a setting I 
used), not bytes. The other complication here is that the 4D database 
doesn't use the UTF-8 encoding typically found, but rather UTF-16LE, 
and I don't know how well this is documented. So not only is the bytes 
representation completely unintelligible for human consumption, I'm 
not sure the average end-user would know what decoding to use.


In the end though, the main thing in my mind is to maintain 
"standards" compatibility - I don't want to be returning bytes if all 
other DB API modules return strings, or vice versa for that matter. 
There may be some flexibility there, but as much as possible I want to 
conform to the majority/standard/whatever


The average end-user might not know which encoding is being used, but 
providing a way to read the underlying bytes will give a more 
experienced user the means to investigate and possibly fix it: get the 
bytes, figure out what the string should be, update the field with the 
correctly decoded string using normal DB instructions.
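That recovery workflow can be sketched in plain Python. Only the bytes inspection is real here; the UPDATE at the end uses a hypothetical cursor and is shown for shape only:

```python
# Suppose the driver exposed the raw bytes of the corrupt field.
raw = b'caf\xe9'   # stored as Latin-1 rather than the expected encoding

# Step 1: inspect candidate decodings to figure out what the text should be.
for enc in ('utf-8', 'latin1', 'cp1252'):
    try:
        print(enc, '->', raw.decode(enc))
    except UnicodeDecodeError:
        print(enc, '-> undecodable')

# Step 2: once the real encoding is identified, write the fixed value back
# with a normal parameterised UPDATE (hypothetical cursor, illustration only):
# cursor.execute("UPDATE t SET txt = ? WHERE id = ?",
#                (raw.decode('latin1'), row_id))
```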

--
https://mail.python.org/mailman/listinfo/python-list


Aw: Re: PEP 249 Compliant error handling

2017-10-17 Thread Karsten Hilbert
> That's certainly a possibility, if that behavior conforms to the DB API 
> "standards". My concern in this front is that in my experience working with 
> other PEP 249 modules (specifically psycopg2), I'm pretty sure that columns 
> designated as type VARCHAR or TEXT are returned as strings (unicode in python 
> 2, although that may have been a setting I used), not bytes. The other 
> complication here is that the 4D database doesn't use the UTF-8 encoding 
> typically found, but rather UTF-16LE, and I don't know how well this is 
> documented. So not only is the bytes representation completely unintelligible 
> for human consumption, I'm not sure the average end-user would know what 
> decoding to use.
> 
> In the end though, the main thing in my mind is to maintain "standards" 
> compatibility - I don't want to be returning bytes if all other DB API 
> modules return strings, or vice versa for that matter. There may be some 
> flexibility there, but as much as possible I want to conform to the 
> majority/standard/whatever


The thing here is that you don't want to return data AS IF it was correct 
despite it having been "corrected" by some driver logic.

What might be interesting to users is to set an attribute on the cursor, say,

   cursor.faulty_data = unicode(faulty_data, errors='replace')

or some such in order to improve error messages to the end user.

Karsten
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-17 Thread Israel Brewster

> On Oct 17, 2017, at 10:35 AM, MRAB  wrote:
> 
> On 2017-10-17 18:26, Israel Brewster wrote:
>> I have written and maintain a PEP 249 compliant (hopefully) DB API for the 
>> 4D database, and I've run into a situation where corrupted string data from 
>> the database can cause the module to error out. Specifically, when decoding 
>> the string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode 
>> bytes in position 86-87: illegal UTF-16 surrogate" error. This makes sense, 
>> given that the string data got corrupted somehow, but the question is "what 
>> is the proper way to deal with this in the module?" Should I just throw an 
>> error on bad data? Or would it be better to set the errors parameter to 
>> something like "replace"? The former feels a bit more "proper" to me 
>> (there's an error here, so we throw an error), but leaves the end user dead 
>> in the water, with no way to retrieve *any* of the data (from that row at 
>> least, and perhaps any rows after it as well). The latter option sort of 
>> feels like sweeping the problem under the rug, but does at least leave an 
>> error character in the string to let them know there was an error, and 
>> will allow retrieval of any good data.
>> Of course, if this was in my own code I could decide on a case-by-case basis 
>> what the proper action is, but since this is a module that has to work in any 
>> situation, it's a bit more complicated.
> If a particular text field is corrupted, then raising UnicodeDecodeError when 
> trying to get the contents of that field as a Unicode string seems reasonable 
> to me.
> 
> Is there a way to get the contents as a bytestring, or to get the contents 
> with a different errors parameter, so that the user has the means to fix it 
> (if it's fixable)?

That's certainly a possibility, if that behavior conforms to the DB API 
"standards". My concern in this front is that in my experience working with 
other PEP 249 modules (specifically psycopg2), I'm pretty sure that columns 
designated as type VARCHAR or TEXT are returned as strings (unicode in python 
2, although that may have been a setting I used), not bytes. The other 
complication here is that the 4D database doesn't use the UTF-8 encoding 
typically found, but rather UTF-16LE, and I don't know how well this is 
documented. So not only is the bytes representation completely unintelligible 
for human consumption, I'm not sure the average end-user would know what 
decoding to use.

In the end though, the main thing in my mind is to maintain "standards" 
compatibility - I don't want to be returning bytes if all other DB API modules 
return strings, or vice versa for that matter. There may be some flexibility 
there, but as much as possible I want to conform to the 
majority/standard/whatever

---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---
> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 249 Compliant error handling

2017-10-17 Thread MRAB

On 2017-10-17 18:26, Israel Brewster wrote:

I have written and maintain a PEP 249 compliant (hopefully) DB API for the 4D database, and I've run into a situation 
where corrupted string data from the database can cause the module to error out. Specifically, when decoding the 
string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 86-87: illegal UTF-16 
surrogate" error. This makes sense, given that the string data got corrupted somehow, but the question is 
"what is the proper way to deal with this in the module?" Should I just throw an error on bad data? Or would 
it be better to set the errors parameter to something like "replace"? The former feels a bit more 
"proper" to me (there's an error here, so we throw an error), but leaves the end user dead in the water, with 
no way to retrieve *any* of the data (from that row at least, and perhaps any rows after it as well). The latter option 
sort of feels like sweeping the problem under the rug, but does at least leave an error character in the 
string to let them know there was an error, and will allow retrieval of any good data.

Of course, if this was in my own code I could decide on a case-by-case basis 
what the proper action is, but since this is a module that has to work in any 
situation, it's a bit more complicated.

If a particular text field is corrupted, then raising UnicodeDecodeError 
when trying to get the contents of that field as a Unicode string seems 
reasonable to me.


Is there a way to get the contents as a bytestring, or to get the 
contents with a different errors parameter, so that the user has the 
means to fix it (if it's fixable)?

--
https://mail.python.org/mailman/listinfo/python-list


PEP 249 Compliant error handling

2017-10-17 Thread Israel Brewster
I have written and maintain a PEP 249 compliant (hopefully) DB API for the 4D 
database, and I've run into a situation where corrupted string data from the 
database can cause the module to error out. Specifically, when decoding the 
string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in 
position 86-87: illegal UTF-16 surrogate" error. This makes sense, given that 
the string data got corrupted somehow, but the question is "what is the proper 
way to deal with this in the module?" Should I just throw an error on bad data? 
Or would it be better to set the errors parameter to something like "replace"? 
The former feels a bit more "proper" to me (there's an error here, so we throw 
an error), but leaves the end user dead in the water, with no way to retrieve 
*any* of the data (from that row at least, and perhaps any rows after it as 
well). The latter option sort of feels like sweeping the problem under the rug, 
but does at least leave an error character in the string to let them know 
there was an error, and will allow retrieval of any good data.

Of course, if this was in my own code I could decide on a case-by-case basis 
what the proper action is, but since this is a module that has to work in any 
situation, it's a bit more complicated.
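The two options can be seen side by side on a small corrupt UTF-16-LE payload. This is a generic illustration of the standard codec behavior, independent of the 4D driver:

```python
# 'hi' followed by a lone high surrogate: invalid UTF-16-LE text data.
corrupt = b'h\x00i\x00\x00\xd8'

# Option 1: strict decoding raises, and the whole row is lost.
try:
    corrupt.decode('utf-16-le')
except UnicodeDecodeError as e:
    print('strict:', e)

# Option 2: errors='replace' keeps the good data and marks the bad spot
# with U+FFFD, the official replacement character.
print('replace:', corrupt.decode('utf-16-le', errors='replace'))  # 'hi\ufffd'
```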
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---




-- 
https://mail.python.org/mailman/listinfo/python-list