Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Bruce Momjian
Andrew Dunstan wrote:
> Bruce Momjian wrote:
> 
> >
> >Do we need control for each column?  What if we go with preferring NULL
> >for comma-comma, and then print warnings for NOT NULL columns and try
> >the promote.  If you want comma-comma to be zero-length string, you can
> >create the column with NOT NULL, load the file, then ALTER TABLE to
> >allow NULL's again.  Basically, the NOT NULL specification on the column
> >is the COPY CSV control method, rather than having it be in COPY.
> >
> >  
> >
> 
> 
> If we can't do type specific stuff then we need to be able to have 
> column-specific controls on export, at least.
> 
> Consider a text column containing US 5-digit ZIP codes. If they are not 
> quoted, a spreadsheet will almost certainly not preserve the leading 
> zero some of them have, producing very undesirable results. However, a 
> genuine numeric-type field must not be quoted, or the same spreadsheet 
> won't see that value as a number. Unless we do stuff based on type, we 
> have no way of knowing from the text representation of the data what we 
> really need. Thus my proposal from this morning for column-specific user 
> control over this aspect. And if we are going to have per column user 
> control on export, why not on import too, to handle the NOT NULL 
> problem? It might make life easier for us code-wise than chasing down 
> nullability (e.g. in domains).

Wow, that is certainly an excellent point.  When we import, we know the
resulting data type, but spreadsheets don't, and rely on the quoting to
know what to do with the value.

The zipcode is an excellent example.  You can't even test for leading
zeros because then some spreadsheet values in the column are text and
some numeric.

We do have a column list capability with COPY already:

   COPY tablename [ ( column [, ...] ) ]
   FROM { 'filename' | STDIN }

Maybe we should extend that to control quoting on export and NULL
handling on import.  Does that solve our problems?

FYI, do you have IM?  I am:

AIM bmomjian
ICQ 151255111
Yahoo   bmomjian
MSN [EMAIL PROTECTED]
IRC bmomjian via FreeNode or EFNet

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Andrew Dunstan
Bruce Momjian wrote:

Do we need control for each column?  What if we go with preferring NULL
for comma-comma, and then print warnings for NOT NULL columns and try
the promote.  If you want comma-comma to be zero-length string, you can
create the column with NOT NULL, load the file, then ALTER TABLE to
allow NULL's again.  Basically, the NOT NULL specification on the column
is the COPY CSV control method, rather than having it be in COPY.
 



If we can't do type specific stuff then we need to be able to have 
column-specific controls on export, at least.

Consider a text column containing US 5-digit ZIP codes. If they are not 
quoted, a spreadsheet will almost certainly not preserve the leading 
zero some of them have, producing very undesirable results. However, a 
genuine numeric-type field must not be quoted, or the same spreadsheet 
won't see that value as a number. Unless we do stuff based on type, we 
have no way of knowing from the text representation of the data what we 
really need. Thus my proposal from this morning for column-specific user 
control over this aspect. And if we are going to have per column user 
control on export, why not on import too, to handle the NOT NULL 
problem? It might make life easier for us code-wise than chasing down 
nullability (e.g. in domains).

cheers

andrew







---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Bruce Momjian
Andrew Dunstan wrote:
> Again, I think this will break that property. But if that's what it 
> takes to be able to import to a table with NOT NULL in at least some 
> cases I could live with it. Just. But in the general case it won't work. 
> Say you are importing into a table with the following defn: (a text, b 
> text not null, c int). then the line 'x,,' will fail on b if '' is null, 
> and will fail on c if '' is empty string. And yet this sort of output is 
> exactly what is to be expected from a spreadsheet.

I thought about this case.  I think the big objection to the original
idea was trying to tie the CSV behavior to specific types.  With a
type-agnostic system like PostgreSQL, doing that will certainly fail to
catch all cases and be a maintenance nightmare.

However, your idea of testing pg_attribute.attnotnull is acceptable, I
think.  It isn't type-specific.

One way of solving the above issue is to allow comma-comma to promote to
a zero-length string if the column is NOT NULL.  One idea would be to
print a warning the first time a NOT NULL column sees a comma-comma, and
state that it will be changed to a zero-length string.  You could use
STRICT to disable that and throw an error.

I don't see why this has to be type-specific.  If a zero-length string
doesn't match an int column, you throw an error. If you see comma-comma
for an INT NOT NULL column, you have to throw an error.  There is no
proper value for the column.

You might also want to do that anyway in some cases even if the column
isn't NOT NULL.  This is where your FORCE col1, col2 came from, where
you can specify which columns should be NULL and which zero-length
strings.

I guess the issue is can we get a single specification for the entire
table, or do we need per-column specifications?

Basically, we are trying to over-load the CSV system to store null
information by using the distinction of whether a column has quote-quote
or nothing.

Let's look at your example again:

CREATE TABLE test (col1 TEXT NOT NULL, col2 INT)

and you have a CSV input file of:

,

If we say both are NULL, it fails, and if we say neither is NULL, it
fails.

Do we need control for each column?  What if we go with preferring NULL
for comma-comma, and then print warnings for NOT NULL columns and try
the promote.  If you want comma-comma to be zero-length string, you can
create the column with NOT NULL, load the file, then ALTER TABLE to
allow NULL's again.  Basically, the NOT NULL specification on the column
is the COPY CSV control method, rather than having it be in COPY.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Bruce Momjian
Peter Eisentraut wrote:
> Bruce Momjian wrote:
> > I don't agree about moving this to a client.  It is too important for
> > that.  Many admin apps and psql need this capability and having it in
> > a client just requires everyone to reinvent the wheel.
> 
> So far the only need I've heard of is dumping the data to an 
> appropriately formatted text file and reading it into some external 
> application.  That surely sounds like something a specialized client 
> could handle.

But too many folks need this functionality, and if you create a client,
how do you access it from other computers.  It is a logical extension
for COPY, I think.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Peter Eisentraut
Bruce Momjian wrote:
> I don't agree about moving this to a client.  It is too important for
> that.  Many admin apps and psql need this capability and having it in
> a client just requires everyone to reinvent the wheel.

So far the only need I've heard of is dumping the data to an 
appropriately formatted text file and reading it into some external 
application.  That surely sounds like something a specialized client 
could handle.


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Andrew Dunstan
Karel Zak said:
> On Tue, Apr 13, 2004 at 10:43:35AM -0400, Tom Lane wrote:
>> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> > Thinking about this some more  maybe the right rule would be
>> > "quote  all non-numeric non-null values".
>>
>> And how would you define "numeric"?
>
> And don't  forget that  number format depend  on locale. Now  we ignore
> LC_NUMERIC, but  maybe at  some time  in future  we will  support it. I
> think the best solution is quote all values including numerics too.
>
> 123,456  (in Czech this is one number with decimal "point" ;-)
>

I am assuming that such users would use a delimiter other than a comma.
Any string that contains the delimiter *must* be quoted, regardless of
type. That's just the way CSVs work - it's not an artifact of ours.

cheers

andrew





---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Andrew Dunstan
Bruce Momjian said:
> Andrew Dunstan wrote:
>> I don't believe '' should be special, any more than 'fred' should be.
>> As  it stands now, NULL 'fred' does say that ,, and '"", are empty
>> strings.
>>
>> >Again, I can assist in making these modifications to the patch.
>> >
>> >
>>
>> I appreciate your efforts. But as indicated elsewhere, right now I'm
>> leaning towards reworking this into a client, because the road seems
>> to  be blocked on doing what I regard as necessary in the backend.
>
> I don't agree about moving this to a client.  It is too important for
> that.  Many admin apps and psql need this capability and having it in a
> client just requires everyone to reinvent the wheel.
>
> Let me think over you issues and see what I can come up with.
>
> The patch isn't that big and already does 90% of what we need.  We just
> need ot get that extra 10%.
>

Yeah. Yesterday was a bad day - sorry if I was a bit liverish.

Regrouping ...

The likely failures I see are these:

. on import, a null value is attempted to be inserted into a NOT NULL
field, and
. on export, a column that should have not-null values quoted does not,
confusing the foreign program.

I was trying to be clever and handle these automatically, causing a small
explosion somewhere near Pittsburgh. How about if we are dumb and simply
give finer grained control to the user?

Let's say instead of your per table/file ALL|NONE|STRICT proposal, we have
something like a "FORCE list-of-columns" option. On import, this would
force isnull to be false and on export it would force quoting of non-null
values just for those columns. The problems are not symmetric, but the
solution is the same - make the user decide on a per column basis.

This would have the following happy results:
. what we write would remain unambiguously null-preserving
. you would never need to use (and shouldn't) this option for a
postgresql<->postgresql round trip, assuming the table definitions were
the same.
. even if you wrote a file using this option you could still reimport it
correctly without using the option.
. (fingers crossed) Tom is kept happy.

cheers

andrew



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] Updated COPY CSV patch

2004-04-14 Thread Karel Zak
On Tue, Apr 13, 2004 at 10:43:35AM -0400, Tom Lane wrote:
> Andrew Dunstan <[EMAIL PROTECTED]> writes:
> > Thinking about this some more  maybe the right rule would be "quote 
> > all non-numeric non-null values".
> 
> And how would you define "numeric"?

 And don't  forget that  number format depend  on locale. Now  we ignore
 LC_NUMERIC, but  maybe at  some time  in future  we will  support it. I
 think the best solution is quote all values including numerics too.

 123,456  (in Czech this is one number with decimal "point" ;-)

Karel

-- 
 Karel Zak  <[EMAIL PROTECTED]>
 http://home.zf.jcu.cz/~zakkr/

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Bruce Momjian
Andrew Dunstan wrote:
> I don't believe '' should be special, any more than 'fred' should be. As 
> it stands now, NULL 'fred' does say that ,, and '"", are empty strings.
> 
> >Again, I can assist in making these modifications to the patch.
> >  
> >
> 
> I appreciate your efforts. But as indicated elsewhere, right now I'm 
> leaning towards reworking this into a client, because the road seems to 
> be blocked on doing what I regard as necessary in the backend.

I don't agree about moving this to a client.  It is too important for
that.  Many admin apps and psql need this capability and having it in a
client just requires everyone to reinvent the wheel.

Let me think over you issues and see what I can come up with.

The patch isn't that big and already does 90% of what we need.  We just
need ot get that extra 10%.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Bruce Momjian wrote:

I see that the default NULL for CSV mode is ''.  I was hoping the
default was something more special.  Right now, by default, comma-comma
is a null and comma-double-quote-double-quote-comma is a zero-length
string.   I am thinking there should be a way to set NULL to be either
of those, or neither of those, in which case comma-comma is a
zero-length string too.
To me, these characteristics are a property of the file, not of the
individual fields.
For example, WITH NULL BOTH would allow ,, and ,"", to both be null,
 

I can't see a real world use for this setting. And I think it would 
break the property of the patch as it currently stands, that we can 
unambiguously import what we exported, no matter what the settings. I 
don't think we should abandon that lightly. Quite apart from any other 
reason because it makes testing easier (just compare what you wrote with 
what you read back).


while using WITH NULL NONE, both ,, and ,"", are zero-length strings. 
 

Again, I think this will break that property. But if that's what it 
takes to be able to import to a table with NOT NULL in at least some 
cases I could live with it. Just. But in the general case it won't work. 
Say you are importing into a table with the following defn: (a text, b 
text not null, c int). then the line 'x,,' will fail on b if '' is null, 
and will fail on c if '' is empty string. And yet this sort of output is 
exactly what is to be expected from a spreadsheet.

And, finally, the default is WITH NULL STRICT (or SOME) where ,, is NULL
and ,"", is the zero-length string.
 

That's what happens now with the default.

Those are all existing keywords, and those special NULL values would
only be available in CSV mode.
I am not sure what NULL '' should so in these cases. I am thinking we
would actually disable it for CSV mode because you would need to define
which '' you are talking about.
If you specify an actual string for NULL like WITH NULL 'fred', then
both ,, and ,"", are zero-length strings, I think.
 

I don't believe '' should be special, any more than 'fred' should be. As 
it stands now, NULL 'fred' does say that ,, and '"", are empty strings.

Again, I can assist in making these modifications to the patch.
 

I appreciate your efforts. But as indicated elsewhere, right now I'm 
leaning towards reworking this into a client, because the road seems to 
be blocked on doing what I regard as necessary in the backend.

cheers

andrew









---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Andrew Dunstan wrote:
>> For all those reasons I disallowed use of WITH OIDS for CSV mode.

> While doing OIDs seems atypical, it seems like a reasonable thing that
> CSV should be able to do.  Basically, I see no reason to disable it.

I agree.  It's an orthogonal thing, and we shouldn't make CSV mode
work differently from normal mode where we don't have to.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Bruce Momjian
Andrew Dunstan wrote:
> Yes. What *I* meant was that allowing it was OK with me, and not worth 
> arguing over.

OK, thanks.

> Incidentally, the patch looks OK at first glance, and seems to work 
> fine, modulo today's little controversies, with this exception:
> 
> if (csv_mode)
> {
> if (!quote)
> quote = "\"";
> if (!escape)
> escape = "\""; /* should be "escape = quote;" */
> }

Oh, excellent point.

On my other WITH NULL ALL|STRICT|NONE email, we could have '' be ALL and
have it be WITH NULL ''|STRICT|NONE.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Bruce Momjian wrote:

Andrew Dunstan wrote:
 

Bruce Momjian wrote:

   

While doing OIDs seems atypical, it seems like a reasonable thing that
CSV should be able to do.  Basically, I see no reason to disable it.


 

OK. We have bigger fish to fry ;-)
   

Uh, sorry, what I meant was that we should have it working in this
patch.  No reason to leave it for later.  We have to get this 100% right
the first time because people will start relying on it.
 

Yes. What *I* meant was that allowing it was OK with me, and not worth 
arguing over.

Incidentally, the patch looks OK at first glance, and seems to work 
fine, modulo today's little controversies, with this exception:

   if (csv_mode)
   {
   if (!quote)
   quote = "\"";
   if (!escape)
   escape = "\""; /* should be "escape = quote;" */
   }
cheers

andrew

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Bruce Momjian
Andrew Dunstan wrote:
> >As for setting default values, I think that is a good idea. I suggested
> >a while back. There could be another keyword, DEFAULT, on the COPY FROM
> >command that is used to define a code that will be replaced by the
> >default value (or NULL if there is no default for a column) similar to
> >how the NULL code is replaced by NULL.
> >  
> >
> 
> Well, as I indicated we can deal with this in a subsequent round, I 
> think. However, here's an idea. We know (or can easily discover) if 
> there is a NOT NULL constraint that can apply to the attribute (or 
> domain if it is a domain type). If isnull is set on the read-in value in 
> such a case, instead of trying to insert null, and knowing we would 
> fail, try to insert the value we actually read (usually ''), even though 
> we think we found a null. This will succeed with text fields, and fail 
> with, for example, int fields. That strikes me as quite reasonable 
> behavior, although perhaps qualifying for a warning. Or perhaps we 
> could enable such behavior with a flag.
> 
> Of course, this would be for CSV mode only - standard TEXT mode should 
> work as now.

I see that the default NULL for CSV mode is ''.  I was hoping the
default was something more special.  Right now, by default, comma-comma
is a null and comma-double-quote-double-quote-comma is a zero-length
string.   I am thinking there should be a way to set NULL to be either
of those, or neither of those, in which case comma-comma is a
zero-length string too.

To me, these characteristics are a property of the file, not of the
individual fields.

For example, WITH NULL BOTH would allow ,, and ,"", to both be null,
while using WITH NULL NONE, both ,, and ,"", are zero-length strings. 
And, finally, the default is WITH NULL STRICT (or SOME) where ,, is NULL
and ,"", is the zero-length string.

Those are all existing keywords, and those special NULL values would
only be available in CSV mode.

I am not sure what NULL '' should so in these cases. I am thinking we
would actually disable it for CSV mode because you would need to define
which '' you are talking about.

If you specify an actual string for NULL like WITH NULL 'fred', then
both ,, and ,"", are zero-length strings, I think.

Again, I can assist in making these modifications to the patch.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Tom Lane wrote:

Andrew Dunstan <[EMAIL PROTECTED]> writes:
 

Tom Lane wrote:
   

And how would you define "numeric"?
 

 

At least the following:
   

 

int8
int2
int4
float4
float8
numeric
money
   

 

and domains based on them.
   

Wrong answer, as this excludes user-defined types.  COPY should not
discriminate on the basis of recognizing particular data types.
 

I'm trying to keep this as simple as possible. But we have to be a bit 
smart if we want to be able to export nicely. Here's the problem: say 
you have a text field that stores something that has numeric form (phone 
number, SSN, whatever). You want that exported as text (i.e. quoted). 
Otherwise, things like leading zeros will get lost by the importing 
program. However, you *must* not quote genuine number values, or they 
will not be imported correctly either.
   

Again, you are trying to make COPY into something it isn't and shouldn't
be.
 

Exporting nicely has a lot more wrinkles than importing nicely, because 
predicting the behaviour of the program we might be exporting to is 
difficult.
   

s/difficult/impossible/.  I might be willing to accept this sort of
cruft if it were well-defined cruft, but in point of fact you are trying
to set up expectations that will be impossible to satisfy.  We will be
forever making more little tweaks to COPY that render its behavior ever
less predictable, in the vain hope of reclosing the can of worms you
want to open.  It would be a lot wiser to implement this sort of
behavior outside the backend, in code that is easily hacked by users.
 

I have neither the time nor the inclination for a fight over this. I 
hope I never have to use this anyway.

Plan B would be to provide a libpq client for importing/exporting CSVs. 
This is such a basic function that I'd like to see it in the core 
distribution. Most of the required logic is in what has already been 
done.  It would mean that it had to be done via a client connection, 
which might have speed implications for very large imports that could 
run faster direct from the server's file system.

I thought there was an agreement on how to do this, but the straight 
jacket into which it is being forced will significantly limit its 
utility, IMNSHO.

cheers

andrew



---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Bruce Momjian
Andrew Dunstan wrote:
> Bruce Momjian wrote:
> 
> >While doing OIDs seems atypical, it seems like a reasonable thing that
> >CSV should be able to do.  Basically, I see no reason to disable it.
> >
> >  
> >
> 
> OK. We have bigger fish to fry ;-)

Uh, sorry, what I meant was that we should have it working in this
patch.  No reason to leave it for later.  We have to get this 100% right
the first time because people will start relying on it.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> And how would you define "numeric"?

> At least the following:

>  int8
>  int2
>  int4
>  float4
>  float8
>  numeric
>  money

> and domains based on them.

Wrong answer, as this excludes user-defined types.  COPY should not
discriminate on the basis of recognizing particular data types.

> I'm trying to keep this as simple as possible. But we have to be a bit 
> smart if we want to be able to export nicely. Here's the problem: say 
> you have a text field that stores something that has numeric form (phone 
> number, SSN, whatever). You want that exported as text (i.e. quoted). 
> Otherwise, things like leading zeros will get lost by the importing 
> program. However, you *must* not quote genuine number values, or they 
> will not be imported correctly either.

Again, you are trying to make COPY into something it isn't and shouldn't
be.

> Exporting nicely has a lot more wrinkles than importing nicely, because 
> predicting the behaviour of the program we might be exporting to is 
> difficult.

s/difficult/impossible/.  I might be willing to accept this sort of
cruft if it were well-defined cruft, but in point of fact you are trying
to set up expectations that will be impossible to satisfy.  We will be
forever making more little tweaks to COPY that render its behavior ever
less predictable, in the vain hope of reclosing the can of worms you
want to open.  It would be a lot wiser to implement this sort of
behavior outside the backend, in code that is easily hacked by users.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Well, as I indicated we can deal with this in a subsequent round, I 
> think. However, here's an idea. We know (or can easily discover) if 
> there is a NOT NULL constraint that can apply to the attribute (or 
> domain if it is a domain type). If isnull is set on the read-in value in 
> such a case, instead of trying to insert null, and knowing we would 
> fail, try to insert the value we actually read (usually ''), even though 
> we think we found a null. This will succeed with text fields, and fail 
> with, for example, int fields. That strikes me as quite reasonable 
> behaviour, although perhaps qualifying for a warning. Or perhaps we 
> could enable such behaviour with a flag.

I think this is all a *really really* bad idea.  COPY is not supposed
to be an "intuit what the user meant" facility.  It is supposed to give
reliable, predictable results, and in particular it *is* supposed to
raise errors if the data is wrong.  If you want intuition in
interpreting bogus CSV files then an external data conversion program is
the place to do it.

> Of course, this would be for CSV mode only - standard TEXT mode should 
> work as now.

The CSV flag should not produce any fundamental changes in COPY's
behavior, only change the external format it handles.  Guessing has
no place here.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Bruce Momjian wrote:

While doing OIDs seems atypical, it seems like a reasonable thing that
CSV should be able to do.  Basically, I see no reason to disable it.
 

OK. We have bigger fish to fry ;-)

cheers

andrew

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Bruce Momjian
Andrew Dunstan wrote:
> >
> > I have two open issues.  First, CSV should support WITH OIDS, no?
> >
> 
> Why on earth would you want to? OIDs only have emaning to postgresql.
> Dumping to/from CSVs should not be seen as an alternative to Postgresql's
> normal text or binary file formats. Rather, it is a way of exchanging
> data  with other programs that can't conveniently read or write those
> formats,  but can read/write CSVs. I expect the data to be making a one
> way trip.  OIDs make no sense to those other programs.
> 
> 
> For all those reasons I disallowed use of WITH OIDS for CSV mode.
> 
> If you think that we should have it, it will be simple enough to do. It
> just strikes me as a very odd thing to do.

While doing OIDs seems atypical, it seems like a reasonable thing that
CSV should be able to do.  Basically, I see no reason to disable it.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Bruno Wolff III wrote:

On Tue, Apr 13, 2004 at 06:58:24 -0400,
 Andrew Dunstan <[EMAIL PROTECTED]> wrote:
 

One area that we should think about as an enhancement is NOT NULL fields.
As it stands now, we will get what we normally get when we try to insert
a  null into a NOT NULL field, namely an error. If the field has a simple
literal default we could force that. And in the special case of
text/varchar fields, it would be reasonable to force an empty string even
if no default is set. There isn't a nice easy answer, I'm afraid. We
shouldn't hold up putting this in on that account, but handling this
better is certainly a TODO.
   

If you try to insert NULLs into a nonnull field you should get an error.
If you have unquoted empty strings, and are not using the empty string as 
the NULL marker, then you can just not set the NULL code to be the empty
string. If you need to turn this on and off by column, then perhaps it
would be better to do that externally.

As for setting default values, I think that is a good idea. I suggested
a while back. There could be another keyword, DEFAULT, on the COPY FROM
command that is used to define a code that will be replaced by the
default value (or NULL if there is no default for a column) similar to
how the NULL code is replaced by NULL.
 

Well, as I indicated we can deal with this in a subsequent round, I 
think. However, here's an idea. We know (or can easily discover) if 
there is a NOT NULL constraint that can apply to the attribute (or 
domain if it is a domain type). If isnull is set on the read-in value in 
such a case, instead of trying to insert null, and knowing we would 
fail, try to insert the value we actually read (usually ''), even though 
we think we found a null. This will succeed with text fields, and fail 
with, for example, int fields. That strikes me as quite reasonable 
behaviour, although perhaps qualifying for a warning. Or perhaps we 
could enable such behaviour with a flag.

Of course, this would be for CSV mode only - standard TEXT mode should 
work as now.

cheers

andrew (trying to be creative here)

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Tom Lane wrote:

Andrew Dunstan <[EMAIL PROTECTED]> writes:
 

Thinking about this some more  maybe the right rule would be "quote 
all non-numeric non-null values".
   

And how would you define "numeric"?
 

At least the following:

int8
int2
int4
float4
float8
numeric
money
and domains based on them.

I do *not* like putting data-type-specific knowledge into COPY.
 

I'm trying to keep this as simple as possible. But we have to be a bit 
smart if we want to be able to export nicely. Here's the problem: say 
you have a text field that stores something that has numeric form (phone 
number, SSN, whatever). You want that exported as text (i.e. quoted). 
Otherwise, things like leading zeros will get lost by the importing 
program. However, you *must* not quote genuine number values, or they 
will not be imported correctly either. So, either we get a little bit 
smart about the types we are exporting, or our export is going to be 
broken in some cases. We need to be able to use more information than is 
contained in the text representation of the value.

I would certainly hope to keep all this confined to the CSV code path, 
and not intrude on any existing functionality in any way.

Exporting nicely has a lot more wrinkles than importing nicely, because 
predicting the behaviour of the program we might be exporting to is 
difficult.

If you can suggest a better way I'm all ears.

cheers

andrew

---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Thinking about this some more  maybe the right rule would be "quote 
> all non-numeric non-null values".

And how would you define "numeric"?

I do *not* like putting data-type-specific knowledge into COPY.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Andrew Dunstan wrote:

Bruce Momjian said:
 

Second, I found a problem with NULLs.  If I do:
.
  test=> create table test (x text, y text);
  CREATE TABLE
  test=> insert into test values ('', NULL);
  INSERT 17221 1
  test=>
then this:

  test=> copy test to '/tmp/b' with csv;

creates:

  "",

and this:

  test=> copy test to '/tmp/b' with csv NULL 'fred';

creates:

  ,fred

Is that logical?  A non-null field went from "" to nothing.

   

One more point about this - we can't force quoting of every non-null
value, which would remove the "inconsistency" you see here, because
spreadsheets especially infer information from whether or not a CSV value
is quoted. In particular, they will not usually treat a quoted numeric
value as numeric, which would be a very undesirable effect.
 

Thinking about this some more  maybe the right rule would be "quote 
all non-numeric non-null values". That would fix the odd effect that 
Bruce saw, and it would also stop a spreadsheet from interpreting a date 
like 2002-03-04 as an arithmetic expression. 

Note that we don't care if a value is quoted or not - the only inference 
we draw from it is that if it is quoted it can't be null. We don't need 
to draw type inferences from it because we know the type we are trying 
to import into from the table defn. This fits nicely with the rule about 
being liberal with what you accept.

Thoughts?

cheers

andrew

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Bruno Wolff III
On Tue, Apr 13, 2004 at 06:58:24 -0400,
  Andrew Dunstan <[EMAIL PROTECTED]> wrote:
> 
> One area that we should think about as an enhancement is NOT NULL fields.
> As it stands now, we will get what we normally get when we try to insert
> a  null into a NOT NULL field, namely an error. If the field has a simple
> literal default we could force that. And in the special case of
> text/varchar fields, it would be reasonable to force an empty string even
> if no default is set. There isn't a nice easy answer, I'm afraid. We
> shouldn't hold up putting this in on that account, but handling this
> better is certainly a TODO.

If you try to insert NULLs into a nonnull field you should get an error.
If you have unquoted empty strings, and are not using the empty string as 
the NULL marker, then you can just not set the NULL code to be the empty
string. If you need to turn this on and off by column, then perhaps it
would be better to do that externally.

As for setting default values, I think that is a good idea. I suggested
a while back. There could be another keyword, DEFAULT, on the COPY FROM
command that is used to define a code that will be replaced by the
default value (or NULL if there is no default for a column) similar to
how the NULL code is replaced by NULL.

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Bruce Momjian said:
> Second, I found a problem with NULLs.  If I do:
> .
>test=> create table test (x text, y text);
>CREATE TABLE
>test=> insert into test values ('', NULL);
>INSERT 17221 1
>test=>
>
> then this:
>
>test=> copy test to '/tmp/b' with csv;
>
> creates:
>
>"",
>
> and this:
>
>test=> copy test to '/tmp/b' with csv NULL 'fred';
>
> creates:
>
>,fred
>
> Is that logical?  A non-null field went from "" to nothing.
>

One more point about this - we can't force quoting of every non-null
value, which would remove the "inconsistency" you see here, because
spreadsheets especially infer information from whether or not a CSV value
is quoted. In particular, they will not usually treat a quoted numeric
value as numeric, which would be a very undesirable effect.

cheers

andrew



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Updated COPY CSV patch

2004-04-13 Thread Andrew Dunstan
Bruce Momjian said:
>
> OK, here is a new version of the patch that includes the grammar changes
we agreed upon, SGML changes, and \copy support.  I will not make any more
changes without contacting you so feel free to make adjustments and repost.


Excellent. Quick work :-) I will test later today.

Meanwhile, quick response to your comments.


>
> I have two open issues.  First, CSV should support WITH OIDS, no?
>

Why on earth would you want to? OIDs only have emaning to postgresql.
Dumping to/from CSVs should not be seen as an alternative to Postgresql's
normal text or binary file formats. Rather, it is a way of exchanging
data  with other programs that can't conveniently read or write those
formats,  but can read/write CSVs. I expect the data to be making a one
way trip.  OIDs make no sense to those other programs.


For all those reasons I disallowed use of WITH OIDS for CSV mode.

If you think that we should have it, it will be simple enough to do. It
just strikes me as a very odd thing to do.

> Second, I found a problem with NULLs.  If I do:
> .
>test=> create table test (x text, y text);
>CREATE TABLE
>test=> insert into test values ('', NULL);
>INSERT 17221 1
>test=>
>
> then this:
>
>test=> copy test to '/tmp/b' with csv;
>
> creates:
>
>"",
>
> and this:
>
>test=> copy test to '/tmp/b' with csv NULL 'fred';
>
> creates:
>
>,fred
>
> Is that logical?  A non-null field went from "" to nothing.
>

Yes, I think it is logical, although it is strange, I agree. See below.


> I think it is caused by this code:
>
> bool force_quote = (strcmp(string, null_print) == 0);
> CopyAttributeOutCSV(string, delim, quote, escape,
> force_quote);
>
> The reason it happens is that when the null string is '', it matches a
zero-length string, so the value is quoted.  When the null stirng isn't
blank, a zero-length string doesn't match the null string so it isn't
quoted.I think we need to add special logic for zero-length strings so
they are always quoted, even if there is a special null string.  This will
make our dumps more consistent, I think, or maybe the current behavior is
OK.  It just struck me as strange.
>

I agree it seems strange, but that's because you probably aren't that
used  to dealing with CSVs.

The code you quote is there so that if *we* reload a CSV *we* created the
null property is preserved. That's actually quite a neat property, and
one  I would have said is "beyond contract requirements" :-). The effect
you  have seen seems perverse because you have chosen a perverse null
marker -  I expect everyone (well, nearly everyone) who uses this will do
what seems  to me natural, and use '' as the null marker, and then, as you
saw, the  results look quite intuitive.

The real test of this is how well it plays with other programs, rather
than how well data makes a round trip from postgresql to postgresql. In
that sense, we need to let it out into the wild a bit and see what happens.

There is a bit of an impedance mismatch between the way databases look at
data and the way spreadsheets look at data. And CSV is a horribly brain-
dead data format. Its one saving grace is that it is a lowest common
denominator format that practically every data handling proggram under
the  sun understands. So there is going to be a bit of oddness. And we may
have  to make the odd tweak here and there.

One area that we should think about as an enhancement is NOT NULL fields.
As it stands now, we will get what we normally get when we try to insert
a  null into a NOT NULL field, namely an error. If the field has a simple
literal default we could force that. And in the special case of
text/varchar fields, it would be reasonable to force an empty string even
if no default is set. There isn't a nice easy answer, I'm afraid. We
shouldn't hold up putting this in on that account, but handling this
better is certainly a TODO.


(MySQL solves this problem by requiring every NOT NULL field to have a
default, and silently supplying one if it is not specified, and the
inserting a null is the sane as inserting DEFAULT. I don't think we want
to go down that road  this bit me badly a few weeks back when I
applied some schema changes someone had carelessly designed relying on
this behaviour. Of course I was running PostgrSQL and the whole thing
blew  up on me.)



> I did a dump/reload test with a null string and null, and it worked fine.
>
> Is there any data that can not be dumped/reloaded via CSV?
>

I know it works fine for geometric types. I have seen it dump ACLs quite
happily, although I didn't try to reload them. I will do some testing
soon  with arrays and byteas.

cheers

andrew




---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


[PATCHES] Updated COPY CSV patch

2004-04-12 Thread Bruce Momjian
Andrew Dunstan wrote:
> If the null marker is not an empty string, it gets an error, of
> course - if it is it gets a null:
> 
> [EMAIL PROTECTED] pginst]$ echo ',,' | bin/psql -c "create temp
> table foo (a int, b text, c text); copy foo from stdin delimiter
> ',\"' null 'N';" ERROR:  invalid input syntax for integer:
> "" CONTEXT:  COPY foo, line 1, column a: "" [EMAIL PROTECTED]
> pginst]$ echo ',,' | bin/psql -c "create temp table foo (a int,
> b text, c text); copy foo from stdin delimiter ',\"' ;"
> [EMAIL PROTECTED] pginst]$
> 
> 
> I hope that is expected behaviour - it's what *I* expect, at
> least.
> >
> 
> Attached patch has these additions to previously posted patch:
> . quote character may not appear in NULL marker
> . any non-null value that matches the NULL marker is forced to be quoted
> when written.

OK, here is a new version of the patch that includes the grammar
changes we agreed upon, SGML changes, and \copy support.  I will not
make any more changes without contacting you so feel free to make
adjustments and repost.

I have two open issues.  First, CSV should support WITH OIDS, no?

Second, I found a problem with NULLs.  If I do:
.
test=> create table test (x text, y text);
CREATE TABLE
test=> insert into test values ('', NULL);
INSERT 17221 1
test=>

then this:

test=> copy test to '/tmp/b' with csv;

creates:

"",

and this:

test=> copy test to '/tmp/b' with csv NULL 'fred';

creates:

,fred

Is that logical?  A non-null field went from "" to nothing.

I think it is caused by this code:

 bool force_quote = (strcmp(string, null_print) == 0);
 CopyAttributeOutCSV(string, delim, quote, escape,
 force_quote);

The reason it happens is that when the null string is '', it matches a
zero-length string, so the value is quoted.  When the null stirng isn't
blank, a zero-length string doesn't match the null string so it isn't
quoted.I think we need to add special logic for zero-length strings
so they are always quoted, even if there is a special null string.  This
will make our dumps more consistent, I think, or maybe the current
behavior is OK.  It just struck me as strange.

I did a dump/reload test with a null string and null, and it worked
fine.

Is there any data that can not be dumped/reloaded via CSV?

--
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
Index: doc/src/sgml/ref/copy.sgml
===
RCS file: /cvsroot/pgsql-server/doc/src/sgml/ref/copy.sgml,v
retrieving revision 1.55
diff -c -c -r1.55 copy.sgml
*** doc/src/sgml/ref/copy.sgml  13 Dec 2003 23:59:07 -  1.55
--- doc/src/sgml/ref/copy.sgml  13 Apr 2004 04:18:22 -
***
*** 26,32 
[ BINARY ] 
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
!   [ NULL [ AS ] 'null string' ] 
]
  
  COPY tablename [ ( column [, ...] ) ]
  TO { 'filename' | STDOUT }
--- 26,34 
[ BINARY ] 
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
!   [ NULL [ AS ] 'null string' ]
!   [ CSV [ QUOTE [ AS ] 'quote' 
] 
! [ ESCAPE [ AS ] 'escape' ] ]
  
  COPY tablename [ ( column [, ...] ) ]
  TO { 'filename' | STDOUT }
***
*** 34,40 
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
!   [ NULL [ AS ] 'null string' ] 
]
  
   
   
--- 36,44 
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
!   [ NULL [ AS ] 'null string' ]
!   [ CSV [ QUOTE [ AS ] 'quote' 
] 
! [ ESCAPE [ AS ] 'escape' ] ]
  
   
   
***
*** 136,142 
   
Specifies copying the OID for each row.  (An error is raised if
OIDS is specified for a table that does not
!   have OIDs.)
   
  
 
--- 140,146 
   
Specifies copying the OID for each row.  (An error is raised if
OIDS is specified for a table that does not
!   have OIDs.)  FIX CSV FOR OIDS!
   
  
 
***
*** 146,152 
  
   
The single character that separates columns within each row
!   (line) of the file.  The default is a tab character.
   
  
 
--- 150,157 
  
   
The single character that separates columns within each row
!   (line) of the file.  The default is a tab character in normal mode,
!   a comma in CSV mode.
   
  
 
***
*** 156,175 
  
   
The string that represents a null value. The default is
!   \N (backslash-N). You might prefer an empty
!   string, for exampl