Re: [HACKERS] COPY formatting

2004-03-22 Thread Tom Lane
Karel Zak [EMAIL PROTECTED] writes:
 On Fri, Mar 19, 2004 at 09:54:37AM -0500, Tom Lane wrote:
 Why?  The client-side code doesn't have any real say over the meaning of
 the data, at least not in psql-class clients.  I suppose a client app
 that tries to interpret the data could get confused, but psql sure
 doesn't do that.

  libpq, fe-protocol2.c, pqGetCopyData2() and pqGetline2() cut string
  behind '\n'.

Sure, but that doesn't invalidate the data stream as a whole, it's just
a bufferload boundary choice that won't be very helpful for clients not
using a newline-based data layout.  In any case, v2 protocol is
obsolete and needn't limit our thoughts about what to do in future.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] COPY formatting

2004-03-21 Thread Karel Zak
On Fri, Mar 19, 2004 at 09:54:37AM -0500, Tom Lane wrote:
 Karel Zak [EMAIL PROTECTED] writes:
   The problem with  CSV is that it will correctly  work with new protocol
   only. Because old  versions of  clients are newline  sensitive.
 
 Why?  The client-side code doesn't have any real say over the meaning of
 the data, at least not in psql-class clients.  I suppose a client app
 that tries to interpret the data could get confused, but psql sure
 doesn't do that.

 libpq, fe-protocol2.c, pqGetCopyData2() and pqGetline2() cut string
 behind '\n'.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] COPY formatting

2004-03-19 Thread Karel Zak
On Thu, Mar 18, 2004 at 10:16:36AM -0500, Tom Lane wrote:
 Passing in a relation OID is probably a bad idea anyway, as it ties this
 API to the assumption that COPY is only for complete relations.  There's
 been talk before of allowing a SELECT result to be presented via the
 COPY protocol, for instance.  What might be a more usable API is
 
 COPY OUT:
   function formatter_out(text[]) returns text
 COPY IN:
   function formatter_in(text) returns text[]
 
 where the text array is either the results of or the input to the
 per-column datatype I/O routines.  This makes it explicit that the
 formatter's job is solely to determine the column-level wrapping and
 unwrapping of the data.  I'm assuming here that there is no good reason
 for the formatter to care about the specific datatypes involved; can you
 give a counterexample?

 The idea was put maximum information about tuple to formatter, and what
 will formatter do with this information is a formatter problem.

   It's pity  that main idea of  current COPY is based  on separated lines
   and it is not more common interface for streaming data between FE and BE.
 
 Yeah, that was another concern I had.  This API would let the formatter
 control line-level layout but it would not eliminate the hard-wired
 significance of newline.  What's worse, there isn't any clean way to
 deal with reading quoted newlines --- the formatter can't really replace
 the default quoting rules if the low-level code is going to decide
 whether a newline is quoted or not.

 I think latest  protocol version works with blocks of  data and no with
 lines and client PQputCopyData() returns a block -- only docs says that
 it is row of table.

 We could possibly solve that by specifying that the text output or input
 (respectively) is the complete line sent to or from the client,
 including newline or whatever other line-level formatting you are using.
 This still leaves the problem of how the low-level COPY IN code knows
 what is a complete line to pass off to the formatter_in routine.  We
 could possibly fix this by adding a second input-control routine
 
   function formatter_linelength(text) returns integer
 
 which is defined to return -1 if the input isn't a complete line yet

 But  formatter_linelength()  will  need   some  context  information  I
 think. The others  words some  struct with formatter  specific internal
 data. And  for more  difficult formats  like XML  you need  some others
 context data (parser data) too.

 Maybe there can be some global  exported struct (like for triggers) and
 functions that is written in C  can use it. It means for simple formats
 like CSV you can  use non-C functions and for formats  like XML you can
 use C functions. And  if it will intereting for PL  developers they can
 add support for access to this structs to their languages.

 (i.e., read some more data, append to the buffer, and try again), or
 = 0 to indicate that the first N bytes of the buffer represent a
 complete line to be passed off to formatter_in.  I don't see a way to
 combine formatter_in and formatter_linelength into a single function
 without relying on out parameters, which would again confine the
 feature to format functions written in C.

 It's a tad annoying that we need two functions for input.  One way that
 we could still keep the COPY option syntax to be just
   FORMAT csv
 is to create an arbitrary difference in the signatures of the input
 functions.  Then we could have coexisting functions
   csv(text[]) returns text
   csv(text) returns text[]
   csv(text, ...) returns int
 that are referenced by FORMAT csv.

 It sounds good, but I think we  both not full sure about it now, right?
 CSV support will probably better add by DELIMITER extension.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] COPY formatting

2004-03-19 Thread Chris Browne
[EMAIL PROTECTED] (Andrew Dunstan) writes:
 Karel Zak wrote:

 Hi,

 in TODO is item: * Allow dump/load of CSV format. I don't think
 it's clean idea. Why CSV and why not something other? :-)

 A why not allow to users full control of the format by they own
 function. It means something like:
 COPY tablename [ ( column [, ...] ) ]
 TO { 'filename' | STDOUT }
 [ [ WITH ]  [ BINARY ]
  [ OIDS ]
  [ DELIMITER [ AS ] 'delimiter' ]
  [ NULL [ AS ] 'null string' ]
  [ FORMAT funcname ] ]
   
 The formatting
 function API can be pretty simple:

 text *my_copy_format(text *attrdata, int direction, int
 nattrs, int attr, oid attrtype, oid relation)

 -- it's pseudocode of course, it should be use standard fmgr
 interface.
 It's probably interesting for non-binary COPY version.

 Interesting ... The alternative might be an external program to munge
 CSVs and whatever other format people want to support and then call
 the exisiting COPY- either in bin or contrib. I have seen lots of
 people wanting to import CSVs, and that's even before we get a Windows
 port.

I know Jan Wieck has been working on something like this, with a bit
of further smarts...

 - By having, alongside, a table definition, the table can be created 
   concurrently;

 - A set of mapping functions can be used, so that if, for instance,
   the program generating the data was Excel, and you have a field with
   values like 37985, 38045, or 38061, they can respectively be mapped
   to '2004-01-01', '2004-03-01', and '2004-03-17';

 - It can load whatever data is loadable, and use Ethernet-like
   backoffs when it encounters bad records so that it loads all the data
   that is good, and leaves a bundle of 'crud' that is left over.

He had been prototyping it in Tcl; I'm not sure how far a port to C
has gotten.  It looked pretty neat; it sure seems better to put the
cleverness in userspace than to try to increase the complexity of
the postmaster...
-- 
output = (cbbrowne @ cbbrowne.com)
http://cbbrowne.com/info/linuxxian.html
Have you heard of the new Macsyma processor?  It has three instructions --
LOAD, STORE, and SKIP IF INTEGRABLE.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] COPY formatting

2004-03-19 Thread Andrew Dunstan


Karel Zak wrote:

The problem with  CSV is that it will correctly  work with new protocol
only. Because old  versions of  clients are newline  sensitive. And CSV
can contains newline in by quotation marks defined attributes:
John, Smith, The White House
1600 Pennsylvania Avenue NW
Washington, DC 20500, male, open source software office
It is one record. 

(Long Live President Smith!)

I have never seen such a beast, and if I had been asked off the cuff 
would have said that it was probably illegal, except that I know of no 
standard actually defining the format. Perhaps others (Joshua Drake or 
Josh Berkus?) have wider experience. I think in any case we should 
ignore those for now and handle the straightforward case.

I *have* seen monstrosities like fields that do not begin with the quote 
character but then break into a quote, e.g.:

1,2,a,123abcdef,6,7,8

 
It's difficult to say it :-), but your DELIMITER idea is better than my
suggested API. Andrew,  go ahead. I thought about  some data streaming,
but COPY is probably bad place for it.

 



Just to be clear, I have no current plans for doing this. I have about 3 
pg related things I am working on, in addition to doing paid work so I 
can eat! If I do it it will not be for quite a while.

There are some wrinkles, though, concerning the interaction of CSV's 
notion of escaping and  COPY's notion of escaping. If someone want to 
undertake this I can flesh those out in a further email.

Someone was asking the other day about a newbie or student project - 
this might be a good one (impact relatively isolated, very easy to test) 
although possibly not as sexy as some might like.

cheers

andrew

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] COPY formatting

2004-03-19 Thread Tom Lane
Karel Zak [EMAIL PROTECTED] writes:
 It's pity  that main idea of  current COPY is based  on separated lines
 and it is not more common interface for streaming data between FE and BE.
 
 Yeah, that was another concern I had.  This API would let the formatter
 control line-level layout but it would not eliminate the hard-wired
 significance of newline.  What's worse, there isn't any clean way to
 deal with reading quoted newlines --- the formatter can't really replace
 the default quoting rules if the low-level code is going to decide
 whether a newline is quoted or not.

  I think latest  protocol version works with blocks of  data and no with
  lines and client PQputCopyData() returns a block -- only docs says that
  it is row of table.

But you can't assume that the client will send blocks that are
semantically significant.  For instance, if psql is reading a file to
send with \copy, how's it going to know how the file is formatted?
It's just gonna send disk-block-sized messages, and the backend has to
discover the semantic boundaries for itself.

  It sounds good, but I think we  both not full sure about it now, right?
  CSV support will probably better add by DELIMITER extension.

Yeah, without people beating on our door for such a hook, it seems like
Andrew's DELIMITER idea is the best thing to do for now.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] COPY formatting

2004-03-19 Thread Fernando Nasser
Andrew Dunstan wrote:


Karel Zak wrote:

The problem with  CSV is that it will correctly  work with new protocol
only. Because old  versions of  clients are newline  sensitive. And CSV
can contains newline in by quotation marks defined attributes:
John, Smith, The White House
1600 Pennsylvania Avenue NW
Washington, DC 20500, male, open source software office
It is one record.


(Long Live President Smith!)

I have never seen such a beast,
Export from a spreadsheet where people have formated the cell with the 
address on it.

Regards,
Fernando
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] COPY formatting

2004-03-19 Thread Karel Zak
On Fri, Mar 19, 2004 at 09:39:58AM -0500, Tom Lane wrote:
 Karel Zak [EMAIL PROTECTED] writes:
  It's pity  that main idea of  current COPY is based  on separated lines
  and it is not more common interface for streaming data between FE and BE.
  
  Yeah, that was another concern I had.  This API would let the formatter
  control line-level layout but it would not eliminate the hard-wired
  significance of newline.  What's worse, there isn't any clean way to
  deal with reading quoted newlines --- the formatter can't really replace
  the default quoting rules if the low-level code is going to decide
  whether a newline is quoted or not.
 
   I think latest  protocol version works with blocks of  data and no with
   lines and client PQputCopyData() returns a block -- only docs says that
   it is row of table.
 
 But you can't assume that the client will send blocks that are
 semantically significant.  For instance, if psql is reading a file to
 send with \copy, how's it going to know how the file is formatted?

 And what  \n in attibutes data  in CSV? I think CSV  format doesn't use
 some escape for  newline char. It means psql with \copy  cannot be sure
 with CSV.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] COPY formatting

2004-03-19 Thread Tom Lane
Karel Zak [EMAIL PROTECTED] writes:
  The problem with  CSV is that it will correctly  work with new protocol
  only. Because old  versions of  clients are newline  sensitive.

Why?  The client-side code doesn't have any real say over the meaning of
the data, at least not in psql-class clients.  I suppose a client app
that tries to interpret the data could get confused, but psql sure
doesn't do that.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] COPY formatting

2004-03-19 Thread Andrew Dunstan
Fernando Nasser wrote:

Andrew Dunstan wrote:



Karel Zak wrote:

The problem with CSV is that it will correctly work with new protocol
only. Because old versions of clients are newline sensitive. And CSV
can contains newline in by quotation marks defined attributes:
John, Smith, The White House
1600 Pennsylvania Avenue NW
Washington, DC 20500, male, open source software office
It is one record.


(Long Live President Smith!)

I have never seen such a beast,


Export from a spreadsheet where people have formated the cell with the 
address on it.

Well, I just tried with OpenOffice on my RH9 box, and it translated the 
line breaks in the cell into spaces in the CSV. When I replaced them 
with line breaks in a text editor and reloaded it treated them as 
separate rows.

I don't have a Windows box handy on which I can test Excel's behaviour.

cheers

andrew

---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] COPY formatting

2004-03-19 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 There are some wrinkles, though, concerning the interaction of CSV's 
 notion of escaping and  COPY's notion of escaping. If someone want to 
 undertake this I can flesh those out in a further email.

Please do that, so that the info is in the archives in case someone else
wants to tackle the project.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] COPY formatting

2004-03-19 Thread Andrew Dunstan
I wrote:

Fernando Nasser wrote:

[snip re multiline CSV fields]

I have never seen such a beast,


Export from a spreadsheet where people have formated the cell with 
the address on it.

Well, I just tried with OpenOffice on my RH9 box, and it translated 
the line breaks in the cell into spaces in the CSV. When I replaced 
them with line breaks in a text editor and reloaded it treated them as 
separate rows.

I don't have a Windows box handy on which I can test Excel's behaviour.

Clearly my experience is out of date - I just found a web ref to Excel 
doing just this. Oh, well, it doesn't seem to me an insurmountable problem.

cheers

andrew

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [HACKERS] COPY formatting

2004-03-19 Thread Tom Lane
Karel Zak [EMAIL PROTECTED] writes:
  And what  \n in attibutes data  in CSV? I think CSV  format doesn't use
  some escape for  newline char. It means psql with \copy  cannot be sure
  with CSV.

I think CSV only allows embedded newlines that are either escaped, or
inside quotes.  COPY doesn't currently have the notion of a quote
character, but that was part of Andrew's proposal ...

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] COPY formatting

2004-03-19 Thread Thomas Swan
quote who=Andrew Dunstan


 Karel Zak wrote:

 The problem with  CSV is that it will correctly  work with new protocol
 only. Because old  versions of  clients are newline  sensitive. And CSV
 can contains newline in by quotation marks defined attributes:

 John, Smith, The White House
 1600 Pennsylvania Avenue NW
 Washington, DC 20500, male, open source software office

 It is one record.


 (Long Live President Smith!)

 I have never seen such a beast, and if I had been asked off the cuff
 would have said that it was probably illegal, except that I know of no
 standard actually defining the format. Perhaps others (Joshua Drake or
 Josh Berkus?) have wider experience. I think in any case we should
 ignore those for now and handle the straightforward case.

 I *have* seen monstrosities like fields that do not begin with the quote
 character but then break into a quote, e.g.:

 1,2,a,123abcdef,6,7,8


I have dealt with both, frequently.   The CSV format allows you to begin a
block of text with the quote.  Newlines are included in that quoted space.
  If qoutes are included in the field then the quotes are double quotes to
denote they are not part of the quoted space.

Also, the following is valid.

1,2,,

 is empty.

1,2,3,,

The 4 quotes denote a single double quote.

Writing simple CSV converts that just explode on commas and newlines miss
these oddities.

Try exporting an Access table with a Memo field (containg multiple lines)
to CSV.



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] COPY formatting

2004-03-19 Thread Andrew Dunstan
Bruce Momjian wrote:

OK, so for the separator, quote, and escape options:

separator is ,
quote is 
escape is 
so if the quote and escape are the same, then a double denotes a
single?
 

Yes. i.e. with the above settings abcdef - abcdef

cheers

andrew

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [HACKERS] COPY formatting

2004-03-19 Thread Andrew Dunstan
Josh Berkus wrote:

Overall, I assert again that approaching this issue through COPY enhancements 
is really not the way to go.We should be looking at a client utility, 
like pg_import and pg_export. The primary purpose of COPY is bulk loads 
for backup/restore, and I'm against doing a lot of tinkering which might make 
it less efficient or introduce new issues into what's currently very 
reliable.

 

That's not unreasonable. I floated my idea as an alternative to a much 
more radical proposal. If we decided against it we should remove the 
TODO item.

As against that, if we don't do this then I think we should embrace 
these utility programs more, possibly bringing them into the distribution.

cheers

andrew

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] COPY formatting

2004-03-19 Thread Bruce Momjian
Andrew Dunstan wrote:
 Josh Berkus wrote:
 
 
 Overall, I assert again that approaching this issue through COPY enhancements 
 is really not the way to go.We should be looking at a client utility, 
 like pg_import and pg_export. The primary purpose of COPY is bulk loads 
 for backup/restore, and I'm against doing a lot of tinkering which might make 
 it less efficient or introduce new issues into what's currently very 
 reliable.
 
   
 
 
 That's not unreasonable. I floated my idea as an alternative to a much 
 more radical proposal. If we decided against it we should remove the 
 TODO item.
 
 As against that, if we don't do this then I think we should embrace 
 these utility programs more, possibly bringing them into the distribution.

CSV seems to be the most widely requested conversion format.  Anything
else is probably a one-off job that should be done in perl or sed.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] COPY formatting

2004-03-18 Thread Karel Zak
On Thu, Mar 18, 2004 at 07:48:40AM +0100, Karel Zak wrote:
 On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
  Karel Zak [EMAIL PROTECTED] writes:
The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction, 
int nattrs, int attr, oid attrtype, oid relation)
  
  This seems like it could only reasonably be implemented as a C function.
 
  Why? I said it's pseudo code. It should use standard fmgr API like
  every other PostgreSQL function or is it problem and I overlook
  something? It must to support arbitrary programming language and not
  C only.

 Well, I  look over  the COPY  code and best  will start  with hardcoded
 version, but make it modular in code and if all will right we can think
 about some interface for others formats definition. OK?

 It's pity  that main idea of  current COPY is based  on separated lines
 and it is not more common interface for streaming data between FE and BE.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] COPY formatting

2004-03-18 Thread Karel Zak
On Thu, Mar 18, 2004 at 09:29:03AM +, Lee Kindness wrote:
 To be honest this idea strikes me as overkill - over
 engineering. 

 It was suggestion, maybe you're right :-)

 While i have done a lot of messing around reading/writing the binary
 format (and been stung by changes in that format) if you are using
 this format then you're 99% likely to be in control of the
 incoming/outgoing data and thus able to format to your wishes outwith
 COPY.

 I partly agree.
 
 But.. there  is possible write  directly final file by  backend without
 data transfer  to client. If we  want to  support this feature  we need
 control output format by server...

 And.. I can image  format that is use for BE/FE  data transfer only and
 not for some  final data presentation. For example  compression of data
 stream from/to BE without PostgreSQL protocol change.

 Something else in the TODO regarding COPY is XML import/export, and
 for this to be supported in your proposed implementation the function
 would need to be passed in a heap more information.

 Yes, very  probably some  struct with all  COPY information  and format
 specific stuff. Tom  was right that  in this  case it will  C functions
 only. As I said  I will try implement it without  user defined function
 call for format conversion,  but I will do it modular  and in future we
 can create some interface for user defined formats.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] COPY formatting

2004-03-18 Thread Andrew Dunstan
Lee Kindness wrote:

To be honest this idea strikes me as overkill - over
engineering. While there is a clear need for proper CSV import
(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
strings) I cannot see how this would prove useful, or who would use
it?
 

I agree. My modest proposal for handling CSVs would be to extend the 
DELIMITER parameter to allow up to 3 characters - separator, quote and 
escape. Escape would default to the quote char and the quote char would 
default to unspecified. This would involve no grammar changes and fairly 
isolated and small code changes, I believe. In the most common CSV cases 
you would just use $$,$$ or $$,'$$. :-)

COPY is basically line/tuple oriented, and that alone would exclude many 
file formats (e.g. imagine wanting to import a spreadsheet where each 
worksheet was the table name and the first row on each worksheet was the 
field names - I have seen such beasts more than once). If we want a 
general facility for loading and exporting foreign file formats, I 
really believe that is the province of a utility program rather than a 
database engine function.

The reason in my mind for making CSV a special case is that it is very 
easy to do and so often asked for.

(I used to set parsing CSVs as a basic programming exercise - it is 
amazing how many way people find to get it wrong).

cheers

andrew

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] COPY formatting

2004-03-18 Thread Bruce Momjian
Andrew Dunstan wrote:
 Lee Kindness wrote:
 
 To be honest this idea strikes me as overkill - over
 engineering. While there is a clear need for proper CSV import
 (i.e. just setting DELIMITER to ',' doesn't work due to ','s in
 strings) I cannot see how this would prove useful, or who would use
 it?
 
   
 
 I agree. My modest proposal for handling CSVs would be to extend the 
 DELIMITER parameter to allow up to 3 characters - separator, quote and 
 escape. Escape would default to the quote char and the quote char would 
 default to unspecified. This would involve no grammar changes and fairly 
 isolated and small code changes, I believe. In the most common CSV cases 
 you would just use $$,$$ or $$,'$$. :-)
 
 COPY is basically line/tuple oriented, and that alone would exclude many 
 file formats (e.g. imagine wanting to import a spreadsheet where each 
 worksheet was the table name and the first row on each worksheet was the 
 field names - I have seen such beasts more than once). If we want a 
 general facility for loading and exporting foreign file formats, I 
 really believe that is the province of a utility program rather than a 
 database engine function.
 
 The reason in my mind for making CSV a special case is that it is very 
 easy to do and so often asked for.
 
 (I used to set parsing CSVs as a basic programming exercise - it is 
 amazing how many way people find to get it wrong).

I like the separator, quote, and escape idea.  It allows variety without
requiring folks to code in C.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] COPY formatting

2004-03-18 Thread Tom Lane
Karel Zak [EMAIL PROTECTED] writes:
 On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
 Karel Zak [EMAIL PROTECTED] writes:
 This seems like it could only reasonably be implemented as a C function.
 
 Why? I said it's pseudo code. It should use standard fmgr API like
 every other PostgreSQL function or is it problem and I overlook
 something? It must to support arbitrary programming language and not
 C only.

Sure, but the question is whether the *stuff it has to do* can
reasonably be coded in anything but C.  Why are you passing in a
relation OID, if not for lookups in relcache entries that are simply
not accessible above the C level?  (Don't tell me you want the function
to do a bunch of actual SELECTs from system catalogs for every line
of the copy...)

Passing in a relation OID is probably a bad idea anyway, as it ties this
API to the assumption that COPY is only for complete relations.  There's
been talk before of allowing a SELECT result to be presented via the
COPY protocol, for instance.  What might be a more usable API is

COPY OUT:
function formatter_out(text[]) returns text
COPY IN:
function formatter_in(text) returns text[]

where the text array is either the results of or the input to the
per-column datatype I/O routines.  This makes it explicit that the
formatter's job is solely to determine the column-level wrapping and
unwrapping of the data.  I'm assuming here that there is no good reason
for the formatter to care about the specific datatypes involved; can you
give a counterexample?

  It's pity  that main idea of  current COPY is based  on separated lines
  and it is not more common interface for streaming data between FE and BE.

Yeah, that was another concern I had.  This API would let the formatter
control line-level layout but it would not eliminate the hard-wired
significance of newline.  What's worse, there isn't any clean way to
deal with reading quoted newlines --- the formatter can't really replace
the default quoting rules if the low-level code is going to decide
whether a newline is quoted or not.

We could possibly solve that by specifying that the text output or input
(respectively) is the complete line sent to or from the client,
including newline or whatever other line-level formatting you are using.
This still leaves the problem of how the low-level COPY IN code knows
what is a complete line to pass off to the formatter_in routine.  We
could possibly fix this by adding a second input-control routine

function formatter_linelength(text) returns integer

which is defined to return -1 if the input isn't a complete line yet
(i.e., read some more data, append to the buffer, and try again), or
= 0 to indicate that the first N bytes of the buffer represent a
complete line to be passed off to formatter_in.  I don't see a way to
combine formatter_in and formatter_linelength into a single function
without relying on out parameters, which would again confine the
feature to format functions written in C.

It's a tad annoying that we need two functions for input.  One way that
we could still keep the COPY option syntax to be just
FORMAT csv
is to create an arbitrary difference in the signatures of the input
functions.  Then we could have coexisting functions
csv(text[]) returns text
csv(text) returns text[]
csv(text, ...) returns int
that are referenced by FORMAT csv.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] COPY formatting

2004-03-18 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Lee Kindness wrote:
 To be honest this idea strikes me as overkill - over
 engineering.
 
 I agree. My modest proposal for handling CSVs would be to extend the 
 DELIMITER parameter to allow up to 3 characters - separator, quote and 
 escape. Escape would default to the quote char and the quote char would 
 default to unspecified.

I could go with that too.  The question here is do we have any popular
use-cases that aren't solved by that extension, but could be solved by
simple user-level data formatting functions?  I'm not real eager to add
such a feature as an if we build it they will come speculation, but
if anyone can point to solid use-cases besides handling CSV, then it
probably is worth doing.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] COPY formatting

2004-03-18 Thread Bruce Momjian
Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
  Lee Kindness wrote:
  To be honest this idea strikes me as overkill - over
  engineering.
  
  I agree. My modest proposal for handling CSVs would be to extend the 
  DELIMITER parameter to allow up to 3 characters - separator, quote and 
  escape. Escape would default to the quote char and the quote char would 
  default to unspecified.
 
 I could go with that too.  The question here is do we have any popular
 use-cases that aren't solved by that extension, but could be solved by
 simple user-level data formatting functions?  I'm not real eager to add
 such a feature as an if we build it they will come speculation, but
 if anyone can point to solid use-cases besides handling CSV, then it
 probably is worth doing.

The thing I liked about Andrew's idea is that it even covers escape
quoting for CVS, which might change from implementation to implentation,
and it is flexible without requiring C coding.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] COPY formatting

2004-03-18 Thread Josh Berkus
Karel, Andrew, Fernando:

 On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
  Karel Zak [EMAIL PROTECTED] writes:
    The formatting function API can be pretty simple:
    text *my_copy_format(text *attrdata, int direction, 
                int nattrs, int attr, oid attrtype, oid relation)

No offense, but isn't this whole thing more appropriate for a client program?   
Like the pg_import and pg_export projects on GBorg?   Has anyone looked at 
those projects?

I can see making a special provision for CSV in COPY, just because it's such a 
universal format.   But I personally don't see that a complex, sophisticated 
import/export formatter belongs on the SQL command line.   Particularly since 
most users will want a GUI to handle it.

And, BTW, I deal with CSV *all the time* for my insurance clients, and I can 
tell you that that format hasn't changed in 20 years.   We can hard-code it 
if it's easier.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] COPY formatting

2004-03-18 Thread Joshua D. Drake
And, BTW, I deal with CSV *all the time* for my insurance clients, and I can 
tell you that that format hasn't changed in 20 years.   We can hard-code it 
if it's easier.
Well many of my clients consider CSV Character Separated Value not 
Comma... Thus I get data like this:

Hello,Good Bye
Hello   Good Bye
Hello,Good Bye
This, They're
ThisThey're
ThisIs  A   1
Dealing with all of these different nuances is may or may not be beyond 
the scope of copy but it seems that it could be something that it can 
handle.

Python has a csv module that allows you to assign dialects to any 
specific type of import you are performing.

Sincerely,

Joshua D. Drake






--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL
begin:vcard
fn:Joshua D. Drake
n:Drake;Joshua D.
org:Command Prompt, Inc.
adr:;;PO Box 215;Cascade Locks;Oregon;97014;USA
email;internet:[EMAIL PROTECTED]
title:Consultant
tel;work:503-667-4564
tel;fax:503-210-0034
note:Command Prompt, Inc. is the largest and oldest US based commercial PostgreSQL support provider. We  provide the only commercially viable integrated PostgreSQL replication solution, but also custom programming, and support. We authored  the book Practical PostgreSQL, the procedural language plPHP, and adding trigger capability to plPerl.
x-mozilla-html:FALSE
url:http://www.commandprompt.com/
version:2.1
end:vcard


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] COPY formatting

2004-03-18 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 I could go with that too.  The question here is do we have any popular
 use-cases that aren't solved by that extension, but could be solved by
 simple user-level data formatting functions?  I'm not real eager to add
 such a feature as an if we build it they will come speculation, but
 if anyone can point to solid use-cases besides handling CSV, then it
 probably is worth doing.

(I can't believe I'm saying this, but) It seems like xml output would be the
use case you're looking for.

-- 
greg


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] COPY formatting

2004-03-18 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 I could go with that too.  The question here is do we have any popular
 use-cases that aren't solved by that extension, but could be solved by
 simple user-level data formatting functions?

 (I can't believe I'm saying this, but) It seems like xml output would be the
 use case you're looking for.

Does that fall into the category of stuff that could be supported by the
kind of API we're talking about?  I should think that XML would need a
much more global view of the data, not just line-by-line reformatting.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] COPY formatting

2004-03-17 Thread Hans-Jürgen Schönig
Karel Zak wrote:
 Hi,

 in TODO is item: * Allow dump/load of CSV format. I don't think
 it's clean idea. Why CSV and why not something other? :-) 

 A why not allow to users full control of the format by they own
 function. It means something like:
 
 COPY tablename [ ( column [, ...] ) ]
 TO { 'filename' | STDOUT }
 [ [ WITH ] 
  [ BINARY ]
  [ OIDS ]
  [ DELIMITER [ AS ] 'delimiter' ]
  [ NULL [ AS ] 'null string' ]
  [ FORMAT funcname ] ]
   
 
 The formatting function API can be pretty simple:

 text *my_copy_format(text *attrdata, int direction, 
 int nattrs, int attr, oid attrtype, oid relation)

 -- it's pseudocode of course, it should be use standard fmgr
 interface.
 
 It's probably interesting for non-binary COPY version.
 
 Comments?

Karel



Karel,

This seems to be an excellent idea.
People have already asked for many different formats.
Usually I recommend them to use psql -c COPY ... dbname | awk 
Since Windows will be supported soon, it will be hard to pipe data to a 
useful program (awk, sed, ...). Maybe this feature would help a lot in 
this case.

	Regards,

		Hans



--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] COPY formatting

2004-03-17 Thread Fernando Nasser
Hans-Jürgen Schönig wrote:
Karel Zak wrote:

 Hi,

 in TODO is item: * Allow dump/load of CSV format. I don't think
 it's clean idea. Why CSV and why not something other? :-)
 A why not allow to users full control of the format by they own
 function. It means something like:
 
 COPY tablename [ ( column [, ...] ) ]
 TO { 'filename' | STDOUT }
 [ [ WITH ]   [ BINARY ]
  [ OIDS ]
  [ DELIMITER [ AS ] 'delimiter' ]
  [ NULL [ AS ] 'null string' ]
  [ FORMAT funcname ] ]
   
  The formatting 
function API can be pretty simple:

 text *my_copy_format(text *attrdata, int direction,  int 
nattrs, int attr, oid attrtype, oid relation)

 -- it's pseudocode of course, it should be use standard fmgr
 interface.
 
 It's probably interesting for non-binary COPY version.
 
 Comments?

Karel



Karel,

This seems to be an excellent idea.
People have already asked for many different formats.
Usually I recommend them to use psql -c COPY ... dbname | awk 
Since Windows will be supported soon, it will be hard to pipe data to a 
useful program (awk, sed, ...). Maybe this feature would help a lot in 
this case.

Could a CSV-generating function be provided with the distribution then?

Regards,
Fernando
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] COPY formatting

2004-03-17 Thread Tom Lane
Karel Zak [EMAIL PROTECTED] writes:
  The formatting function API can be pretty simple:
  text *my_copy_format(text *attrdata, int direction, 
  int nattrs, int attr, oid attrtype, oid relation)

This seems like it could only reasonably be implemented as a C function.
I can't really imagine the average user of COPY wanting to write C in
preference to, say, an external perl script.  What's the real use-case
for the feature?

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] COPY formatting

2004-03-17 Thread Andrew Dunstan
Karel Zak wrote:

Hi,

in TODO is item: * Allow dump/load of CSV format. I don't think
it's clean idea. Why CSV and why not something other? :-) 

A why not allow to users full control of the format by they own
function. It means something like:
COPY tablename [ ( column [, ...] ) ]
TO { 'filename' | STDOUT }
[ [ WITH ] 
 [ BINARY ]
 [ OIDS ]
 [ DELIMITER [ AS ] 'delimiter' ]
 [ NULL [ AS ] 'null string' ]
 [ FORMAT funcname ] ]
  

The formatting function API can be pretty simple:

text *my_copy_format(text *attrdata, int direction, 
int nattrs, int attr, oid attrtype, oid relation)

-- it's pseudocode of course, it should be use standard fmgr
interface.
It's probably interesting for non-binary COPY version.

 

Interesting ... The alternative might be an external program to munge 
CSVs and whatever other format people want to support and then call the 
exisiting COPY- either in bin or contrib. I have seen lots of people 
wanting to import CSVs, and that's even before we get a Windows port.

cheers

andrew

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [HACKERS] COPY formatting

2004-03-17 Thread Fernando Nasser
Tom Lane wrote:
Karel Zak [EMAIL PROTECTED] writes:

The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction, 
int nattrs, int attr, oid attrtype, oid relation)


This seems like it could only reasonably be implemented as a C function.
I can't really imagine the average user of COPY wanting to write C in
preference to, say, an external perl script.  What's the real use-case
for the feature?
That is why I suggested providing a pre-written/pre-compiled/installed 
function for CSV (call it CSV?).  Advanced users could still write their 
own as people can write many other things if they know their ways.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] COPY formatting

2004-03-17 Thread Joshua D. Drake



That is why I suggested providing a pre-written/pre-compiled/installed 
function for CSV (call it CSV?).  Advanced users could still write 
their own as people can write many other things if they know their ways.

As someone who just went through a whole truckload of crap getting 
delimited files parsed from MSSQL to PostgreSQL. I believe yes this 
would be  great thing. We ended up using plPython with the CSV module.

Sincerely,

Joshua Drake



---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] COPY formatting

2004-03-17 Thread mike g
I deal with this daily in a cygwin environment.  I wrote a simple c++
program where I hardcoded the input file name/location and output file
name/location.  I strip the quotation marks out where they are used for
identifying text fields and change the comma's used as CSV's to pipes.  

I use a combination of bash scripting to execute the c++ program and
then Perl to execute a stored procedure.  I am new to Perl so I have not
yet considered migrating it all into Perl.

The dos2unix tools in cygwin always messed up the first character of the
first line. 

I thought the real issue with the copy function and CSVs was that it did
not like the use of quotations around the fields to identify text
fields.

For a true Windows port handling MS Excel files in their native format
would be a goal I would hope. If your api could handle that then I would
agree with your method.  



On Wed, 2004-03-17 at 22:14, Joshua D. Drake wrote:
 
 
 
 
  That is why I suggested providing a pre-written/pre-compiled/installed 
  function for CSV (call it CSV?).  Advanced users could still write 
  their own as people can write many other things if they know their ways.
 
 
 As someone who just went through a whole truckload of crap getting 
 delimited files parsed from MSSQL to PostgreSQL. I believe yes this 
 would be  great thing. We ended up using plPython with the CSV module.
 
 Sincerely,
 
 Joshua Drake
 
 
 
  ---(end of broadcast)---
  TIP 7: don't forget to increase your free space map settings
 
 


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] COPY formatting

2004-03-17 Thread Karel Zak
On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
 Karel Zak [EMAIL PROTECTED] writes:
   The formatting function API can be pretty simple:
   text *my_copy_format(text *attrdata, int direction, 
   int nattrs, int attr, oid attrtype, oid relation)
 
 This seems like it could only reasonably be implemented as a C function.

 Why? I said it's pseudo code. It should use standard fmgr API like
 every other PostgreSQL function or is it problem and I overlook
 something? It must to support arbitrary programming language and not
 C only.

 I can't really imagine the average user of COPY wanting to write C in
 preference to, say, an external perl script.  What's the real use-case
 for the feature?

 Don't hardcode any format to PostgreSQL, be open for others formats.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly