Re: [HACKERS] Bad Data back Door

2012-10-09 Thread Albe Laurenz
David E. Wheeler wrote:
 As the author I agree that this is a bug in oracle_fdw.

 Thanks. Should I file a report somewhere?

That's not necessary.  Thanks for reporting the problem.
It may be a few days until I get around to fix that.

 Oracle does not care much about correct encoding.

 Yeah, same here. I've been looking into write a function to try
 to fix poorly-encoded data, though, but haven't got far,
 because CONVERT() does not indicate failure. If you have
 any insight on this, I'd appreciate your thoughts on this
 Stack Overflow question:
 
   http://stackoverflow.com/q/12717363/79202

The only thing I can think of is a stored procedure in Java.
You could use java.nio.charset.CharsetEncoder and
java.nio.charset.CharsetDecoder which should throw exceptions
if they encounter illegal bytes.

Yours,
Laurenz Albe


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-09 Thread Albe Laurenz
I wrote:
 Tom Lane wrote:
 Now, having said that, I think it has to be the reponsibility of
the FDW
 to apply any required check ... which makes this a bug report
against
 oracle_fdw, not the core system.

 As the author I agree that this is a bug in oracle_fdw.

Ok, fixed.
David, could you try the latest version from CVS to see
if it works for you?

Yours,
Laurenz Albe


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-08 Thread Albe Laurenz
Tom Lane wrote:
 David E. Wheeler da...@justatheory.com writes:
 On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Now, having said that, I think it has to be the reponsibility of the
FDW
 to apply any required check ... which makes this a bug report
against
 oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on
the
 COPY code, which will check encoding.)

 I agree that this is a bug in oracle_fdw (well, potentially;
ultimately,
 it's Oracle that's lying about the encoding of those text values).
 But I think that it would be much more useful overall -- not
 to mention more database-like -- for PostgreSQL to provide a way to
 enforce it. That is, to consider foreign tables to be an input like
 COPY or SQL, and to validate values before displaying them.

 It is the FDW's responsibility to deal with this.  We expect it to
hand
 back valid tuples; it is not reasonable to disassemble them looking
for
 mistakes (and we couldn't catch most mistakes, anyway).  If the
 interface were defined in terms of text, we could do checking above
the
 FDW level ... but it isn't.

As the author I agree that this is a bug in oracle_fdw.

This was caused by ignorance on my part:  I had assumed that the
type input functions would perform the necessary checks, but it
seems like that is not the case.  I'll look into it.

Oracle does not care much about correct encoding.
If client character set and database character set are the same,
Oracle does not bother to check the data.  This is probably how
WINDOWS-1252 characters slipped into the UTF-8 database in question.
I consider this a bug in Oracle, but never reported it, because
I don't have much hope that Oracle would see it as a problem
given their habitually sloppy handling of encoding issues.

Yours,
Laurenz Albe


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-08 Thread David E. Wheeler
On Oct 8, 2012, at 12:25 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:

 As the author I agree that this is a bug in oracle_fdw.

Thanks. Should I file a report somewhere?

 This was caused by ignorance on my part:  I had assumed that the
 type input functions would perform the necessary checks, but it
 seems like that is not the case.  I'll look into it.

Thank you!

 Oracle does not care much about correct encoding.
 If client character set and database character set are the same,
 Oracle does not bother to check the data.  This is probably how
 WINDOWS-1252 characters slipped into the UTF-8 database in question.
 I consider this a bug in Oracle, but never reported it, because
 I don't have much hope that Oracle would see it as a problem
 given their habitually sloppy handling of encoding issues.

Yeah, same here. I've been looking into write a function to try to fix 
poorly-encoded data, though, but haven't got far, because CONVERT() does not 
indicate failure. If you have any insight on this, I'd appreciate your thoughts 
on this Stack Overflow question:

  http://stackoverflow.com/q/12717363/79202

Thanks,

David



smime.p7s
Description: S/MIME cryptographic signature


Re: [HACKERS] Bad Data back Door

2012-10-08 Thread David E. Wheeler
On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Now, having said that, I think it has to be the reponsibility of the FDW
 to apply any required check ... which makes this a bug report against
 oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
 COPY code, which will check encoding.)

FWIW, I believe that dblink does not check encoding.

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-08 Thread Tom Lane
David E. Wheeler da...@justatheory.com writes:
 On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Now, having said that, I think it has to be the reponsibility of the FDW
 to apply any required check ... which makes this a bug report against
 oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
 COPY code, which will check encoding.)

 FWIW, I believe that dblink does not check encoding.

In dblink's case, that boils down to trusting a remote instance of
Postgres to get this right, which doesn't seem totally unreasonable.
But I wouldn't object to adding checks there if someone wanted to submit
a patch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-08 Thread David E. Wheeler
On Oct 8, 2012, at 11:13 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 FWIW, I believe that dblink does not check encoding.
 
 In dblink's case, that boils down to trusting a remote instance of
 Postgres to get this right, which doesn't seem totally unreasonable.
 But I wouldn't object to adding checks there if someone wanted to submit
 a patch.

Yeah, I found this because we had a dblink to another PostgreSQL server's table 
with data populated from oracle_fdw. I guess trusting is reasonable, though.

I wonder about dbi-link, though…

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-08 Thread Andrew Dunstan


On 10/08/2012 02:13 PM, Tom Lane wrote:

David E. Wheeler da...@justatheory.com writes:

On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Now, having said that, I think it has to be the reponsibility of the FDW
to apply any required check ... which makes this a bug report against
oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
COPY code, which will check encoding.)

FWIW, I believe that dblink does not check encoding.

In dblink's case, that boils down to trusting a remote instance of
Postgres to get this right, which doesn't seem totally unreasonable.
But I wouldn't object to adding checks there if someone wanted to submit
a patch.


It does do:

PQsetClientEncoding(conn, GetDatabaseEncodingName());


I'd be mildly reluctant to do anything more except possibly as an 
option, unless it could be shown to have minimal performance impact.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-06 Thread Heikki Linnakangas

On 06.10.2012 05:14, John R Pierce wrote:

I'd like to see some encoding validation and substitution functions in
postgres. for instance, one that can take any supported encoding and
convert it to the database encoding and generate an error on any invalid
character. this translation could be identity (eg, UTF8-UTF8) whereupon
it would just validate.


See pg_any_to_server() in mbutils.c. At the SQL level, there's the 
convert(bytea, name, name) function.



a 2nd function would do the same, but replace
errors with the substitution character in the target charset and not error.


Hmm, I don't think we have that.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-06 Thread David E. Wheeler
On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Probably not so much assumed as nobody thought about it.  In
 e.g. plperl we expend the cycles to do encoding validity checking on
 *every* string entering the system from Perl.  I'm not sure why foreign
 tables ought to get a pass on that, especially when you consider the
 communication overhead that the encoding check would be amortized
 against.

Yes, that’s what I was thinking.

 Now, having said that, I think it has to be the reponsibility of the FDW
 to apply any required check ... which makes this a bug report against
 oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
 COPY code, which will check encoding.)

I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it’s 
Oracle that’s lying about the encoding of those text values). But I think that 
it would be much more useful overall -- not to mention more database-like -- 
for PostgreSQL to provide a way to enforce it. That is, to consider foreign 
tables to be an input like COPY or SQL, and to validate values before 
displaying them.

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-06 Thread Atri Sharma
On Sat, Oct 6, 2012 at 1:34 PM, David E. Wheeler da...@justatheory.com wrote:
 On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Probably not so much assumed as nobody thought about it.  In
 e.g. plperl we expend the cycles to do encoding validity checking on
 *every* string entering the system from Perl.  I'm not sure why foreign
 tables ought to get a pass on that, especially when you consider the
 communication overhead that the encoding check would be amortized
 against.

 Yes, that’s what I was thinking.

 Now, having said that, I think it has to be the reponsibility of the FDW
 to apply any required check ... which makes this a bug report against
 oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
 COPY code, which will check encoding.)

 I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it’s 
 Oracle that’s lying about the encoding of those text values). But I think 
 that it would be much more useful overall -- not to mention more 
 database-like -- for PostgreSQL to provide a way to enforce it. That is, to 
 consider foreign tables to be an input like COPY or SQL, and to validate 
 values before displaying them.

 Best,

 David



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


+1
-- 
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-06 Thread John R Pierce

On 10/06/12 3:45 AM, Heikki Linnakangas wrote:

At the SQL level, there's the convert(bytea, name, name) function.


ahhh, right.  (forehead slap)




a 2nd function would do the same, but replace
errors with the substitution character in the target charset and not 
error.


Hmm, I don't think we have that.


me thinks this would be extremely useful for importing 'dirty' data.   
that or a per-connection flag (or option on the COPY command?)  that 
said substitute-on-error for the likes of UTF-8 imports from CSV.




--
john r pierceN 37, W 122
santa cruz ca mid-left coast



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-06 Thread Tom Lane
David E. Wheeler da...@justatheory.com writes:
 On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Now, having said that, I think it has to be the reponsibility of the FDW
 to apply any required check ... which makes this a bug report against
 oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
 COPY code, which will check encoding.)

 I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it’s 
 Oracle that’s lying about the encoding of those text values). But I think 
 that it would be much more useful overall -- not to mention more 
 database-like -- for PostgreSQL to provide a way to enforce it. That is, to 
 consider foreign tables to be an input like COPY or SQL, and to validate 
 values before displaying them.

It is the FDW's responsibility to deal with this.  We expect it to hand
back valid tuples; it is not reasonable to disassemble them looking for
mistakes (and we couldn't catch most mistakes, anyway).  If the
interface were defined in terms of text, we could do checking above the
FDW level ... but it isn't.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-06 Thread Andrew Dunstan


On 10/06/2012 03:35 PM, Tom Lane wrote:

David E. Wheeler da...@justatheory.com writes:

On Oct 5, 2012, at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Now, having said that, I think it has to be the reponsibility of the FDW
to apply any required check ... which makes this a bug report against
oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
COPY code, which will check encoding.)

I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it’s 
Oracle that’s lying about the encoding of those text values). But I think that 
it would be much more useful overall -- not to mention more database-like -- 
for PostgreSQL to provide a way to enforce it. That is, to consider foreign 
tables to be an input like COPY or SQL, and to validate values before 
displaying them.

It is the FDW's responsibility to deal with this.  We expect it to hand
back valid tuples; it is not reasonable to disassemble them looking for
mistakes (and we couldn't catch most mistakes, anyway).  If the
interface were defined in terms of text, we could do checking above the
FDW level ... but it isn't.





Exactly.

We've done quite a lot of work tightening the ways that badly encoded 
data can enter the database over the years. It's a never ending game of 
whack-a-mole. There aren't any easy answers.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-05 Thread Tom Lane
David E. Wheeler da...@justatheory.com writes:
 I’ve discovered something a bit disturbing at $work. We’re migrating (slowly) 
 from Oracle to PostgreSQL, and in some cases are using oracle_fdw to copy 
 data over. Alas, there are a fair number of text values in the Oracle 
 database that, although the database is UTF-8, are actually something else 
 (CP1252 or Latin1). When we copy from an oracle_fdw foreign table into a 
 PostgreSQL table, PostgreSQL does not complain, but ends up storing the 
 mis-encoded strings, even though the database is UTF-8.

 I assume that this is because the foreign table, as a table, is assumed by 
 the system to have valid data, and therefor additional character encoding 
 validation is skipped, yes?

Probably not so much assumed as nobody thought about it.  In
e.g. plperl we expend the cycles to do encoding validity checking on
*every* string entering the system from Perl.  I'm not sure why foreign
tables ought to get a pass on that, especially when you consider the
communication overhead that the encoding check would be amortized
against.

Now, having said that, I think it has to be the reponsibility of the FDW
to apply any required check ... which makes this a bug report against
oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
COPY code, which will check encoding.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bad Data back Door

2012-10-05 Thread John R Pierce

On 10/05/12 6:12 PM, Tom Lane wrote:

Now, having said that, I think it has to be the reponsibility of the FDW
to apply any required check ... which makes this a bug report against
oracle_fdw, not the core system.  (FWIW, contrib/file_fdw depends on the
COPY code, which will check encoding.)


I'm not sure of that.  what if the FDW is used to connect to (say) a 
postgres database that is in POSIX/C ?  is that checked for?


I'd like to see some encoding validation and substitution functions in 
postgres.   for instance, one that can take any supported encoding and 
convert it to the database encoding and generate an error on any invalid 
character.   this translation could be identity (eg, UTF8-UTF8) 
whereupon it would just validate.a 2nd function would do the same, 
but replace errors with the substitution character in the target charset 
and not error.


--
john r pierceN 37, W 122
santa cruz ca mid-left coast



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers