Re: [HACKERS] pg_get_INDEXdef - opclass

2006-06-13 Thread Dave Page
 

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: 12 June 2006 18:32
 To: Dave Page
 Cc: pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] pg_get_INDEXdef - opclass 
 
 Dave Page dpage@vale-housing.co.uk writes:
  Following a pgAdmin bug report, I noticed that 
 pg_get_viewdef doesn't
  return the opclass when called for a specific column (in 8.1 
  at least) -
 
  Bah, I mean pg_get_indexdef of course :-) 
 
 This is intentional --- whoever asked for the per-column variant of
 the indexdef function wanted it that way.  It seems reasonable to me:
 you can extract the opclass name with a simple join against
 pg_index.indclass[N], when you need it, whereas if the function
 sometimes included an opclass name that would tend to break apps that
 weren't expecting it.  OTOH, getting the expression for an expression
 column would be seriously painful if there were no function to do it.

Right, but how can I conditionally join with pg_opclass based on whether
or not the opclass specified for the column is the default for that
type? For a base type index column I can probably do that with some SQL,
but what about cases where the index column is an expression?

Regards, Dave.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-13 Thread Simon Riggs
On Mon, 2006-06-12 at 19:15 -0400, Tom Lane wrote:
 [ moving to -hackers to get some more eyeballs on the question ]
 
 Simon Riggs [EMAIL PROTECTED] writes:
  On Sun, 2006-06-11 at 17:53 -0400, Alvaro Herrera wrote:
  Here I repost the patch to implement non-transactional catalogs, the
  first of which is pg_ntclass, intended to hold the non-transactional
  info about pg_class (reltuples, relpages).
 
  Will a user be able to update reltuples and relpages manually?
 
 No, which is a tad annoying now that you mention it.  I'm not sure that
 there's any very good reason for users to want to do that, though.  Once
 or twice I've hacked those fields manually to set up test cases for the
 planner, which is why I'd be annoyed to lose the ability --- but does it
 really matter to users?  (Especially in view of the fact that the
 planner no longer trusts relpages anyway.)

No need to have an SQL route. A special function call would suffice.

I'd like to be able to set up a test database that has the statistics
copied from the live system. A schema only pg_dump with mods is all I
need, but it sounds like we're moving away from that. We can then
perform various what-ifs on the design.
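
For illustration, the kind of manual tweak being discussed is already possible
as a superuser; a rough sketch (table name and numbers are made up, and the
next VACUUM/ANALYZE will overwrite the values again):

UPDATE pg_class
   SET relpages  = 12500,      -- pretend the table occupies 12500 blocks
       reltuples = 1000000     -- pretend it holds a million rows
 WHERE relname = 'my_table';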

Elsewhere, it has been discussed that we might hold the number of blocks
in a relation in shared memory. Does that idea now fall down, or is it
complementary to this? i.e. would we replace ANALYZE's relpages with an
accurate relpages for the planner?

 It does seem like rather a lot of mechanism and overhead though,
 especially in view of Alvaro's worries about the non-cacheability of
 pg_class_nt rows.  I wonder whether we shouldn't take two steps back
 and rethink.

Review, yes. Could still be the best way.

 The main thing we are trying to accomplish here is to decouple
 transactional and nontransactional updates to a pg_class row.

With the goal being avoiding table bloat??

 Is there another way to do that?  Do we need complete decoupling?

 It strikes me that the only case where we absolutely must not lose a
 nontransactional update is where we are un-freezing a frozen rel.

Not sure why you'd want to do that, assuming I've understood you.

For me, freezing is last step before writing to WORM media, so there is
never an unfreeze step.

 If we could guarantee that un-freezing happens before any transactional
 update within a particular transaction, then maybe we could have that.
 Manual updates to pg_class seem like they'd risk breaking such a
 guarantee, but maybe there's a way around that.  Personally I'd be
 willing to live with commands that try to modify a frozen rel erroring
 out if they see the current pg_class row is uncommitted.

Sounds OK. It's a major state change after all.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] longjmp in psql considered harmful

2006-06-13 Thread Martijn van Oosterhout
On Mon, Jun 12, 2006 at 08:14:01PM -0400, Tom Lane wrote:
 I had interpreted the readline documentation to mean that readline would
 discard a partially typed line upon catching SIGINT.  Experimentation
 shows that this is not so, at least not with the version of readline I
 use here.  It does catch the signal and reset some internal state, but
 the partially typed line is NOT discarded.  Grumble.

Yeah, the documentation in readline there was pretty obtuse, but since
it didn't explicitly include typed characters as state, I figured it
didn't clear the line.

 I'll work on reviewing and applying the patch.  I don't much like the
 side-effects on the /scripts directory though ... there must be a better
 way than that.  Is it sane to declare the flag variable in print.c?

The problem is basically that some of those files are symlinked across
the tree and included from various different places. Some places
include print.c but don't include the signal handler stuff, which left
me with linker errors about undefined symbols.

In psql the symbol comes from common.c and so I also added it to
scripts/common.c, which cleared up all the errors. This would allow
psql to work and all other programs that included print.c would only
ever see zero in that variable.

Maybe some other arrangement is possible. Maybe like you suggest,
declare the symbol in print.c and make the declaration in common.c an
extern. The end result is the same though.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Kris Kennaway
On Mon, Jun 12, 2006 at 10:08:22AM -0500, Jim C. Nasby wrote:
 On Mon, Jun 12, 2006 at 12:24:36AM -0400, Kris Kennaway wrote:
  On Sun, Jun 11, 2006 at 10:07:13PM -0500, Jim C. Nasby wrote:
   On Sun, Jun 11, 2006 at 09:58:33PM -0400, Tom Lane wrote:
Kris Kennaway [EMAIL PROTECTED] writes:
 On Sun, Jun 11, 2006 at 07:43:03PM -0400, Tom Lane wrote:
 Let's see the evidence.

 The calls to setproctitle() (it looks like 4 setproctitle syscalls per
 DB query) are causing contention on the Giant lock 25% of the time on
 a dual p4 + HTT.  Disabling process title setting completely gives an
 8% peak performance boost to the super-smack select benchmark.

I think you misunderstood me: I asked for evidence, not interpretation.
What are you measuring, and with what tool, and what are the numbers?
On what benchmark case?  And what did you do to disable process title
setting completely?

The reason I'm being doubting Thomas here is that I've never seen any
   
   Ba-da-bum!
   
indication on any other platform that ps_status is a major bottleneck.
Now maybe FreeBSD really sucks, or maybe you're onto something of
interest, but let's see the proof in a form that someone else can
check and reproduce.
   
   It's also important to find out what version of FreeBSD this is. A lot
   of things have been pulled out of GIANT in 5.x and 6.x, so it's entirely
   possible this isn't an issue in newer versions.
  
 Can you provide the actual commands you used to setup and run the test?

I actually forget all the steps I needed to do to get super-smack
working with postgresql since it required a lot of trial and error for
a database newbie like me (compiling it from the
benchmarks/super-smack port was trivial, but unlike mysql it required
configuring the database by hand - this should hopefully be more
obvious to someone familiar with pgsql though).

It would be great if someone on your end could make this easier, BTW -
e.g. at least document the steps.  Also super-smack should be changed
to allow use via a local socket with pgsql (this is the default with
mysql) - this avoids measuring network stack overhead.

The only thing I had to change on FreeBSD was to edit the
select-key.smack and change localhost to 127.0.0.1 in two
locations to avoid possibly using IPv6 transport.

 This would allow others to duplicate your results and debug the
 situation on their own. This is also important because we've generally
 found HTT to be a loss (except on Windows), so it'd be good to see what
 impact this has on AMD hardware. It would also be very useful to have
 the raw test result numbers you obtained.

They were posted previously.  This is Intel hardware (AMD doesn't do
HTT), and I didn't retest without HTT.  I'll try to do so if I have
the time (however previously when profiling mysql, HTT did give a small
positive change).

  It's still true in 6.x and 7.x.  I have a patch that removes Giant
  from the sysctl in question, and I have also removed it from another
  relevant part of the kernel (semop() is bogusly acquiring Giant when
  it is supposed to be mpsafe).
  
 What affect did that patch have on the numbers? And where is it, in case
 anyone here wants to try it?

I didn't yet retest with the patch.  It's in my perforce branch:

  
http://perforce.freebsd.org/changeList.cgi?FSPC=//depot/user/kris/contention/...

although you probably need a combination of the changes in order for
it to be usable.

  When it's possible to commit that patch (should be in time for 7.0,
  but not sure if it will make it into 6.2) it will eliminate the worst
  part of the problem, but it still leaves postgresql making thousands
  of syscalls per second to flip its process titles back and forth,
  which needs to be looked at carefully for a performance impact.  For
  now, users of FreeBSD who want that 8% should turn it off though (or
  maybe one of the alternative methods is usable).
 
 We have a similar issue internally with stats_command_string. The issue
 is that it's very helpful to be able to see what a 'long running' (more
 than a second or so) statement is doing, but there's very little value
 in doing all that work for a statement that's only going to run for a
 few ms. If there's a very cheap way to set some kind of a timer that
 would update this information once a statement's been around long enough
 that might be a way to handle this (I don't know if we're already using
 ALRM in the backend, or if that's a cheap enough solution).

I don't know what the best way to implement it would be, but limiting
the frequency of these updates does seem to be the way to go.

  FYI, the biggest source of contention is via semop() - it might be
  possible to optimize that some more in FreeBSD, I don't know.
 
 Yeah, I've seen PostgreSQL on FreeBSD fall over at high load with a lot
 of procs in either semwait or semlock. :(

Part of that is Giant contention again; for example on 6.x 

Re: [HACKERS] Running a query twice to ensure cached results.

2006-06-13 Thread Luke Lonergan
Ron,

On 6/8/06 11:49 AM, Ron Mayer [EMAIL PROTECTED] wrote:

 Experimental results here suggest that for larger tables Linux seems
 to detect a seq-scan and not bother caching.   It's very reproducible
 for me here to do a reboot and not see the full speedup on a seq_scan
 until the third time I run a query.

What you are seeing is the now infamous "Postgres writes a table one more
time after loading" behavior.

Simon Riggs once dug into it to find the root cause, and I no longer recall
exactly why, but after you've loaded data, the first seq scan will re-write
some large portion of the data while doing the initial scan. This wreaks
havoc on normal benchmarking practices.

Tom - can you explain what's going on with this?  It seems to write more
than just the contents of the WAL, so it's not a flush of the WAL writes
AFAICT.

- Luke 



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Matthew T. O'Connor

Bill Bartlett wrote:

Can't -- the main production database is over at a CoLo site with access
only available via SSH, and tightly-restricted SSH at that. Generally
one of the developers will SSH over to the server, pull out whatever
data is needed into a text file via psql or pg_dump, scp the file(s)
back here and send them to the user.


ODBC over an SSH tunnel?

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] timezones to own config file

2006-06-13 Thread Joachim Wieland
I looked into the timezone specifications and basically extracted a list of
existing offsets from the zic database.

My proposed format for the timezone files is something like this:

HADT   -32400 D  # Hawaii-Aleutian Daylight Time
                 # (America/Adak)
HAST   -36000    # Hawaii-Aleutian Standard Time
                 # (America/Adak)

That is, the abbreviation, the offset in seconds, optionally a D to mark
daylight saving times (goes into tm->is_dst), the name of the timezone and
the full zic names that use this timezone.

I also made the extracting script find all conflicts and commented them
manually as shown here. Most of the conflicts are between America and Asia.

# CONFLICT! ADT is not unique
# Other timezones:
#  - ADT: Arabic Daylight Time (Asia)
ADT    -10800 D  # Atlantic Daylight Time
                 # (America/Glace_Bay)
                 # (America/Goose_Bay)
                 # (America/Halifax)
                 # (America/Thule)
                 # (Atlantic/Bermuda)

However, even within all America/... names, there are conflicts. For
example CST is used as US Central Time and as Cuba Central Standard Time.
While US Central time is UTC-6h, Cuba Central Standard Time is UTC-5h.

Another problem is that lots of the timezone names that are hardcoded into
the backend seem to be way outdated or just dubious; many of them do not
show up in the zic database.

For example NT (Nome Time) seems to have existed only until 1967; America/Nome
is listed in the zic database as AKDT/AKST, which is Alaska Daylight/Standard
Time. Other examples:

JAYT, Jayapura Time: Asia/Jayapura is listed as EIT (East Indonesia Time) in
the zic database.

JAVT, Java Time (07:00? see JT): zic database says that it is outdated and
was used until 1932.
JT, Java Time (07:30? see JAVT): I did not find proof that this is really
+7.5 hours; some sources say it's just 7 hours.

HMT is the strangest of the bunch; I have found the name Heard and
McDonald Time but with a different offset. I could not find a reference to
some Hellas-Time as indicated in the comment.

So could we remove some of those on the grounds that they do not seem to
be used any more (please correct me here if someone knows more) and that
you can easily add offsets for those if you need them?

With the same argument we could even remove timezones like BDST (British
Double Summer Time), DNT (Dansk Normal Tid), FST (French Summer Time), NOR
(Norway Standard Time), SWT (Swedish Winter Time). Could anybody from those
countries comment on whether or not those are still used or just outdated? I
figure that most of those countries have moved since long to the more common
timezone names...

Ok, after all this has been sorted out I propose to make different files for
the different continents and let the user specify with a guc which ones he
wants to use.

I could think of three possible ways:

1) (See Tom's idea in
http://archives.postgresql.org/pgsql-hackers/2006-05/msg01048.php )

Conflicts within one set can just be commented - we would try to include
whatever will probably be used by the majority of users and comment the
other one(s). Conflicts between two sets would show up when postmaster gets
started, it would complain about different definitions for the same
timezone. An American who wants to use some Asian timezones would have to
work through both files and comment conflicting timezones on one side or the
other to make postmaster start up without errors.

2) Find out which timezones do not conflict, put them in a set and load this
by default. Create other sets that are conflicting but that have some
override capability with regard to previous timezone definitions. Decide
on the default value for the guc (could point to American timezones for
example). An Australian could either select only the Australian file or
could specify America, Australia and the Australian set overrides the
American timezones in case of conflicts. This way, most people do not have
to make changes and those who have to can specify their override-file and
keep all the rest, including non-conflicting timezones from a conflicting
timezone set.

3) Combine both: let the user specify the guc variable as A, B, C and look
into C first, then in B and then in A. *thinking* Right now I actually
think that the overriding idea is not that intuitive; most people would
probably expect that this is a list of priorities, so A overrides B which
overrides C.

What do you think?

Having a larger token table in datetime.c does not seem to affect
performance all that much. I did parsing tests with 2 million timestamps
equally distributed over all timezone abbreviations that I had loaded
previously and the difference of 154 timezones in comparison to other runs
with just 35 was at about ~120ms (on my quite slow laptop computer).

The timezone definition files should be read at server start but should they
also be read at SIGHUP? If so, should they be read only by the 

Re: [HACKERS] Running a query twice to ensure cached results.

2006-06-13 Thread Martijn van Oosterhout
On Tue, Jun 13, 2006 at 04:54:05AM -0700, Luke Lonergan wrote:
  Experimental results here suggest that for larger tables Linux seems
  to detect a seq-scan and not bother caching.   It's very reproducible
  for me here to do a reboot and not see the full speedup on a seq_scan
  until the third time I run a query.
 
 What you are seeing is the now infamous "Postgres writes a table one more
 time after loading" behavior.
 
 Simon Riggs once dug into it to find the root cause, and I no longer recall
 exactly why, but after you've loaded data, the first seq scan will re-write
 some large portion of the data while doing the initial scan. This wreaks
 havoc on normal benchmarking practices.

Is it possible it may have something to do with the hint bits? There's
are a bunch of bits in the header to deal with speeding up of MVCC
tests. Maybe changing those bits marks the page dirty and forces a
write?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Running a query twice to ensure cached results.

2006-06-13 Thread Luke Lonergan
Martin,

On 6/13/06 5:42 AM, Martijn van Oosterhout kleptog@svana.org wrote:

 Is it possible it may have something to do with the hint bits? There's
 are a bunch of bits in the header to deal with speeding up of MVCC
 tests. Maybe changing those bits marks the page dirty and forces a
 write?

Yes, that's it.  Writing data twice (three times including WAL) is a Very
Bad Thing for large data work - how should we fix it?

- Luke



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] timezones to own config file

2006-06-13 Thread Martijn van Oosterhout
On Tue, Jun 13, 2006 at 02:20:09PM +0200, Joachim Wieland wrote:
 I looked into the timezone specifications and basically extracted a list of
 existing offsets from the zic database.
 
 My proposed format for the timezone files is something like this:

snip

Any particular reason this can't be a normal table in pg_catalog which
you can select/update?
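
(Purely as illustration, a sketch of what such a catalog might look like; this
table does not exist and the names are invented:)

CREATE TABLE pg_timezone_abbrevs (
    abbrev     name    PRIMARY KEY,           -- e.g. 'HAST'
    utc_offset integer NOT NULL,              -- seconds east of UTC, e.g. -36000
    is_dst     boolean NOT NULL DEFAULT false
);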

 Another problem is that lots of the timezone names that are hardcoded into
 the backend seem to be way outdated or just dubious; many of them do not
 show up in the zic database.

snip lots of dodgy timezones

I've been trying to convince people for a while now that the
appropriate tz string for Australia is AEST/ACST/AWST but no-one seems
convinced yet. Hence, I never actually specify timezones and all my
timestamps are inserted as GMT.

IMHO, you should simply set up the table so that it is backward
compatible and let people edit it themselves. You're never going to be
able to convince anyone that people aren't relying on it exactly the way
it is now. The most important thing is to get rid of the
australian_timezones hack, everything else is bonus.

 The timezone definition files should be read at server start but should they
 also be read at SIGHUP? If so, should they be read only by the postmaster or
 by all backends?

Good question...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Running a query twice to ensure cached results.

2006-06-13 Thread Martijn van Oosterhout
On Tue, Jun 13, 2006 at 05:46:23AM -0700, Luke Lonergan wrote:
 Martin,
 
 On 6/13/06 5:42 AM, Martijn van Oosterhout kleptog@svana.org wrote:
 
  Is it possible it may have something to do with the hint bits? There's
  are a bunch of bits in the header to deal with speeding up of MVCC
  tests. Maybe changing those bits marks the page dirty and forces a
  write?
 
 Yes, that's it.  Writing data twice (three times including WAL) is a Very
 Bad Thing for large data work - how should we fix it?

Well, I don't think you want to get rid of it entirely because
otherwise forevermore, every lookup in that table will require a check
to see if the transaction is committed. So at some point the hint bit
needs to be set and/or the xmin frozen (vacuum freeze does that for
example).

What you might be able to do is to reduce its effect. The thing that
occurs to me is to make hint bit changes only mark a page half-dirty.
If the page is evicted because the space is needed in the buffer cache,
it can be just dropped. However, the bgwriter will write it as normal.
Hence, setting hint bits will become a sort of background operation,
done when there's time.

It seems to me that if a large table is loaded in a single transaction,
the check for if the transaction is committed should be cheap because
it's checking the same transaction id over and over.
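
A minimal sketch of absorbing that extra write up front before benchmarking
(table and file names are made up; plain VACUUM also sets the hint bits, and
FREEZE additionally freezes xmin):

COPY my_table FROM '/tmp/load.csv' WITH CSV;   -- bulk load
VACUUM FREEZE my_table;                        -- set hint bits / freeze now
SELECT count(*) FROM my_table;                 -- first scan no longer rewrites pages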

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Andrew Dunstan
Matthew T. O'Connor said:
 Bill Bartlett wrote:
 Can't -- the main production database is over at a CoLo site with
 access only available via SSH, and tightly-restricted SSH at that.
 Generally one of the developers will SSH over to the server, pull out
 whatever data is needed into a text file via psql or pg_dump, scp the
 file(s) back here and send them to the user.

 ODBC over an SSH tunnel?


I wish I could understand why people are so keen to make other people turn
handsprings in order to avoid a feature which, as Bruce points out, is
already on the TODO list, and which, by my 10 minute analysis, would involve
almost trivial code impact and risk. If this involved major impact I might
understand, but it really doesn't.

I know many people work in a Postgres only world. I wish everybody did, and
then we could just forget about things like CSV. They don't, so we can't.

I think I have said this before, but I'll say it again. From time to time
people thank me for things I have done for Postgres. The two things that
stand out BY FAR on the list of these are CSV import/export and dollar
quoting. This is a widely used feature.

cheers

andrew




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Bruce Momjian
Andrew Dunstan wrote:
 Matthew T. O'Connor said:
  Bill Bartlett wrote:
  Can't -- the main production database is over at a CoLo site with
  access only available via SSH, and tightly-restricted SSH at that.
  Generally one of the developers will SSH over to the server, pull out
  whatever data is needed into a text file via psql or pg_dump, scp the
  file(s) back here and send them to the user.
 
  ODBC over an SSH tunnel?
 
 
 I wish I could understand why people are so keen to make other people turn
 handsprings in order to avoid a feature which, as Bruce points out, is
 already on the TODO list, and which, by my 10 minute analysis, would involve
 almost trivial code impact and risk. If this involved major impact I might
 understand, but it really doesn't.
 
 I know many people work in a Postgres only world. I wish everybody did, and
 then we could just forget about things like CSV. They don't, so we can't.
 
 I think I have said this before, but I'll say it again. From time to time
 people thank me for things I have done for Postgres. The two things that
 stand out BY FAR on the list of these are CSV import/export and dollar
 quoting. This is a widely used feature.

I think the bottom line is that ease of use isn't as high on
the project's priority list as you (and others) think it should be.

I personally feel as you do that we should value ease of use more.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Running a query twice to ensure cached results.

2006-06-13 Thread Luke Lonergan
Martin,

On 6/13/06 6:00 AM, Martijn van Oosterhout kleptog@svana.org wrote:

 What you might be able to do is to reduce its effect. The thing that
 occurs to me is to make hint bit changes only mark a page half-dirty.
 If the page is evicted because the space is needed in the buffer cache,
 it can be just dropped. However, the bgwriter will write it as normal.
 Hence, setting hint bits will become a sort of background operation,
 done when there's time.
 
 It seems to me that if a large table is loaded in a single transaction,
 the check for if the transaction is committed should be cheap because
 it's checking the same transaction id over and over.

Thanks for the tip - it's now in the mill for mulling over and finding
someone with the time to work something in...

- Luke



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] pg_get_INDEXdef - opclass

2006-06-13 Thread Tom Lane
Dave Page dpage@vale-housing.co.uk writes:
 Right, but how can I conditionally join with pg_opclass based on whether
 or not the opclass specified for the column is the default for that
 type?

Check pg_opclass.opcdefault and compare pg_opclass.opcintype to the
index column's datatype (which you'd get from its pg_attribute row
... whether it's an expression is irrelevant).
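
For illustration, a sketch of that check for the first column of a hypothetical
index named 'my_index' (indclass is an oidvector, subscripted from zero):

SELECT a.attname, op.opcname
  FROM pg_index i
  JOIN pg_class ic    ON ic.oid = i.indexrelid
  JOIN pg_attribute a ON a.attrelid = i.indexrelid AND a.attnum = 1
  JOIN pg_opclass op  ON op.oid = i.indclass[0]
 WHERE ic.relname = 'my_index'
   AND NOT (op.opcdefault AND op.opcintype = a.atttypid);
-- returns a row only when the column's opclass is not the default for its type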

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-13 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Elsewhere, it has been discussed that we might hold the number of blocks
 in a relation in shared memory. Does that idea now fall down, or is it
 complementary to this?

It's been the case for some time that the planner uses
RelationGetNumberOfBlocks() to determine true rel size.  The only reason
relpages is still stored at all is that it's used to approximate true
number of tuples via
true_ntuples = (reltuples/relpages) * true_npages
ie, assuming that the tuple density is still what it was at the last
VACUUM or ANALYZE.  So you can't fool the system with a totally made-up
relation size anyway.  (This too is moderately annoying for planner
testing, but it seems the only way to get the planner to react when a
table's been filled without an immediate vacuum/analyze.)
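
(A small sketch of that density arithmetic in SQL, with a hypothetical table name:)

SELECT relname, reltuples, relpages,
       reltuples / NULLIF(relpages, 0) AS tuples_per_page  -- density as of last VACUUM/ANALYZE
  FROM pg_class
 WHERE relname = 'my_table';
-- the planner multiplies that density by the table's current size in blocks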

The only point of tracking rel size in shared memory would be to avoid
the costs of lseek() kernel calls in RelationGetNumberOfBlocks.

 The main thing we are trying to accomplish here is to decouple
 transactional and nontransactional updates to a pg_class row.

 With the goal being avoiding table bloat??

No, with the goal being correctness.  If you have a freeze/unfreeze
mechanism then unfreezing a relation is an action that must NOT be
rolled back if your transaction (or any other one for that matter) later
aborts.  The tuples you put into it meanwhile need to be vacuumed anyway.
So you can't mark it unfrozen in an uncommitted pg_class entry that
might never become committed.

 For me, freezing is last step before writing to WORM media, so there is
 never an unfreeze step.

That is not what Alvaro is after.  Nor anyone else here.  I have not
heard anyone mention WORM media for Postgres in *years*.

It strikes me though that automatic UNFREEZE isn't necessarily the
requirement.  What if VACUUM FREEZE causes the table to become
effectively read-only, and you need an explicit UNFREEZE command to
put it back into a read-write state?  Then UNFREEZE could be a
transactional operation, and most of these issues go away.  The case
where this doesn't work conveniently is copying a frozen database
(viz template0), but maybe biting the bullet and finding a way to do
prep work in a freshly made database is the answer for that.  We've
certainly seen plenty of other possible uses for post-CREATE processing
in a new database.
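
(A purely hypothetical sketch of the proposed usage; VACUUM FREEZE exists today,
UNFREEZE does not:)

VACUUM FREEZE archive_2005;   -- existing command; under this proposal the table becomes read-only
-- UNFREEZE archive_2005;     -- proposed, transactional; required before writing to the table again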

Another reason for not doing unfreeze automatically is that as the patch
stands, any database user can force unfreezing of any table, whether he
has any access rights on it or not (because the LockTable will happen
before we check access rights, I believe).  This is probably Not Good.
Ideally I think FREEZE/UNFREEZE would be owner-permission-required.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 I wish I could understand why people are so keen to make other people turn
 handsprings in order to avoid a feature which, as Bruce points out, is
 already on the TODO list, and which, by my 10 minute analysis, would involve
 almost trivial code impact and risk. If this involved major impact I might
 understand, but it really doesn't.

Supporting all of the CSV options in pg_dump would involve major bloat
in its option set, and it already has far too many options.  If it were
just a matter of adding a --csv switch I wouldn't be complaining, but
there are half a dozen more sub-options, and it seems like every time we
turn around someone is finding a reason for another one.  Propagating
all that cruft through pg_dump would be a PITA, and updating it to track
future additions would be too.

Furthermore, the entire rationale for the feature is predicated on the
claim that programs other than pg_restore might find it useful.  But
this conveniently ignores the fact that if there are any such programs
in existence, what this will really do is BREAK them, because they won't
be able to cope with all the variants that pass for CSV.

My opinions would be less negative if I thought that CSV were a
well-defined format that would never change.  I don't believe that it
has either property, however, and so I'm against letting it get into our
dump file format.  I think we'll just live to regret it if we do.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Bruce Momjian

Good point.  The number of CSV options would be hard to support for
pg_dump.  Any thoughts from anyone on how to do that cleanly?  Could we
just support the default behavior?

---

Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
  I wish I could understand why people are so keen to make other people turn
  handsprings in order to avoid a feature which, as Bruce points out, is
  already on the TODO list, and which, by my 10 minute analysis, would involve
  almost trivial code impact and risk. If this involved major impact I might
  understand, but it really doesn't.
 
 Supporting all of the CSV options in pg_dump would involve major bloat
 in its option set, and it already has far too many options.  If it were
 just a matter of adding a --csv switch I wouldn't be complaining, but
 there are half a dozen more sub-options, and it seems like every time we
 turn around someone is finding a reason for another one.  Propagating
 all that cruft through pg_dump would be a PITA, and updating it to track
 future additions would be too.
 
 Furthermore, the entire rationale for the feature is predicated on the
 claim that programs other than pg_restore might find it useful.  But
 this conveniently ignores the fact that if there are any such programs
 in existence, what this will really do is BREAK them, because they won't
 be able to cope with all the variants that pass for CSV.
 
 My opinions would be less negative if I thought that CSV were a
 well-defined format that would never change.  I don't believe that it
 has either property, however, and so I'm against letting it get into our
 dump file format.  I think we'll just live to regret it if we do.
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
 

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Martijn van Oosterhout
On Tue, Jun 13, 2006 at 10:20:53AM -0400, Bruce Momjian wrote:
 
 Good point.  The number of CSV options would be hard to support for
 pg_dump.  Any thoughts from anyone on how to do that cleanly?  Could we
 just support the default behavior?

What this tells me is that we need a tool somewhere between psql and
pg_dump, say, pgquery. Its sole purpose in life is to generate output
from various queries. Because it's a separate tool there's no question
of psql or pg_dump being able to parse them.

While you're at it, you could add modules to support many different
output styles, like CSV, XML, Excel format, HTML, etc.

This I believe would take the load off psql to provide many different
output styles, as well as the load off pg_dump to produce
parsable-by-third-party output.

Thoughts?

Side note: In my experience Excel happily slurps up tab delimited
output, so I'm not sure why all of this is an issue in the first place.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] pg_get_INDEXdef - opclass

2006-06-13 Thread Dave Page
 

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: 13 June 2006 14:42
 To: Dave Page
 Cc: pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] pg_get_INDEXdef - opclass 
 
 Dave Page dpage@vale-housing.co.uk writes:
  Right, but how can I conditionally join with pg_opclass 
 based on whether
  or not the opclass specified for the column is the default for that
  type?
 
 Check pg_opclass.opcdefault and compare pg_opclass.opcintype to the
 index column's datatype (which you'd get from its pg_attribute row
 ... whether it's an expression is irrelevant).

Ahh right - thanks. I got it into my head that there was no pg_attribute
row for an expression. Must be the sun...

Regards, Dave.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Volkan YAZICI
On Jun 13 10:20, Bruce Momjian wrote:
 
 Good point.  The number of CSV options would be hard to support for
 pg_dump.  Any thoughts from anyone on how to do that cleanly?  Could we
 just support the default behavior?

IMHO, it might be better if we'd support a syntax like

  pg_dump --csv=opt0,para0:opt2,opt3

This can save us from the pg_dump parameter pollution a little bit.

Furthermore, I think the CSV format for dump files could be maintained
better under an external project (pgFoundry?). That way, the main
developers can keep working on their own core problems while other
users/developers contribute to the CSV code easily. And if a user ever
wants CSV functionality in pg_dump, he/she just passes a --csv
parameter (with the above syntax) and pg_dump makes a suitable
dlopen() call for the related (CSV) module. Anyway, this is just an
idea for modularity; the main thing I'm trying to underline is giving
pg_dump a module facility for similar problems.


Regards.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Andrew Dunstan



Tom Lane wrote:


there are half a dozen more sub-options, and it seems like every time we
turn around someone is finding a reason for another one.  
 



This is a bit unfair. The feature was introduced in 8.0, and slightly 
enhanced in 8.1. There have not been any additional CSV features this 
release cycle unless my memory is worse than I thought, and I at least 
have said previously that I will be resistant to the addition of further 
CSV options.


My thoughts regarding options for pg_dump were actually to provide a much 
smaller set than the full set available with COPY, specifically to 
provide for using a single rather than a double quote char, and optional 
header lines - no alternate escape or delimiter, and no FORCE QUOTE  
(FORCE NOT NULL isn't relevant as it is only useful for non-postgres 
derived data). At least that would be a reasonable starting point, and 
would I believe cater for the vast majority of uses.
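
The corresponding COPY options already exist; a rough sketch of what such a
reduced option set maps to on the backend (table name hypothetical):

COPY my_table TO STDOUT WITH CSV HEADER QUOTE AS '''';
-- CSV output with a header line and single-quote quoting, server defaults otherwise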


cheers

andrew

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Running a query twice to ensure cached results.

2006-06-13 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 What you are seeing is the now infamous "Postgres writes a table one more
 time after loading" behavior.

 Tom - can you explain what's going on with this?

Setting XMIN_COMMITTED hint bits.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Joshua D. Drake

Bruce Momjian wrote:

Good point.  The number of CSV options would be hard to support for
pg_dump.  Any thoughts from anyone on how to do that cleanly?  Could we
just support the default behavior?


Although I don't see a real need for the feature, I don't think that if we 
were to support 1 (well, two if you include the already-supported tab 
delimited output) CSV format it would be a large amount of bloat.


Perhaps we could pick 1 output, say comma delimited with quoted fields?

"foo","bar   ","baz"

Joshua D. Drake



---

Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:

I wish I could understand why people are so keen to make other people turn
handsprings in order to avoid a feature which, as Bruce points out, is
already on the TODO list, and which, by my 10 minute analysis, would involve
almost trivial code impact and risk. If this involved major impact I might
understand, but it really doesn't.

Supporting all of the CSV options in pg_dump would involve major bloat
in its option set, and it already has far too many options.  If it were
just a matter of adding a --csv switch I wouldn't be complaining, but
there are half a dozen more sub-options, and it seems like every time we
turn around someone is finding a reason for another one.  Propagating
all that cruft through pg_dump would be a PITA, and updating it to track
future additions would be too.

Furthermore, the entire rationale for the feature is predicated on the
claim that programs other than pg_restore might find it useful.  But
this conveniently ignores the fact that if there are any such programs
in existence, what this will really do is BREAK them, because they won't
be able to cope with all the variants that pass for CSV.

My opinions would be less negative if I thought that CSV were a
well-defined format that would never change.  I don't believe that it
has either property, however, and so I'm against letting it get into our
dump file format.  I think we'll just live to regret it if we do.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match






--

=== The PostgreSQL Company: Command Prompt, Inc. ===
  Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
  Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] timezones to own config file

2006-06-13 Thread Tom Lane
Joachim Wieland [EMAIL PROTECTED] writes:
 The timezone definition files should be read at server start but should they
 also be read at SIGHUP? If so, should they be read only by the postmaster or
 by all backends?

Presumably the name of the definition file to use will be a GUC
variable.  I would expect the code to re-read the file any time the
variable's value is changed.  In the case of a change via postgresql.conf
this would automatically happen in all backends as well as the
postmaster.  You'll need to make it follow the semantics already in use
for errors in postgresql.conf, viz:
  * error detected during postmaster startup - report error and quit
  * error detected during postmaster reload - log message, ignore new setting
  * error detected during backend reload - debug message, ignore new setting

As far as the appropriate contents of the files go, I'd suggest *not*
trying to account for every abbreviation mentioned in the zic database;
lots of them are surely uninteresting, and anyone who does want Nome Time
will now be able to add it for himself.  The more abbreviations you try
to understand, the less chance you have of detecting plain old errors.

IIRC, the conflicts we've actually heard about in practice are IST
(Israel vs India) and Aussie vs. USA zone names.  So it might work to
have two base definition files, one for Europe/Americas (with USA and
Israel names) and one for Far East (with Aussie and Indian names).

I am not sure where Tom Lockhart got the list of timezone names that's
currently hardwired in datetime.c, but for sure you needn't treat it as
being graven on stone tablets.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Andrew Dunstan



Volkan YAZICI wrote:


On Jun 13 10:20, Bruce Momjian wrote:
 


Good point.  The number of CSV options would be hard to support for
pg_dump.  Any thoughts from anyone on how to do that cleanly?  Could we
just support the default behavior?
   



IMHO, it might be better if we'd support a syntax like

 pg_dump --csv=opt0,para0:opt2,opt3

This can save us from the pg_dump parameter pollution a little bit.

Furthermore, I think the CSV format for dump files could be maintained
better under an external project (pgFoundry?). That way, the main
developers can keep working on their own core problems while other
users/developers contribute to the CSV code easily. And if a user ever
wants CSV functionality in pg_dump, he/she just passes a --csv
parameter (with the above syntax) and pg_dump makes a suitable
dlopen() call for the related (CSV) module. Anyway, this is just an
idea for modularity; the main thing I'm trying to underline is giving
pg_dump a module facility for similar problems.

 



There are some problems with this, though:

. FORCE QUOTE is table specific, and COPY will barf if you name a column 
that isn't in the table (see the sketch after these points). Providing for 
this option would involve lots more code in pg_dump, as we'd have to filter 
the list according to the column names in each table.


. specifying arbitrary chars for quote, escape and delimiter could be 
tricky from the command line, especially if you want to specify a tab 
delimiter or backslash escape.
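
(Sketch of the per-table COPY that pg_dump would have to emit if FORCE QUOTE
were passed through; table and column names are hypothetical:)

COPY my_table (id, note) TO STDOUT WITH CSV FORCE QUOTE note;
-- FORCE QUOTE must name columns of my_table; an unknown column makes COPY error out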


cheers

andrew

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-13 Thread Simon Riggs
On Tue, 2006-06-13 at 10:02 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Elsewhere, it has been discussed that we might hold the number of blocks
  in a relation in shared memory. Does that idea now fall down, or is it
  complementary to this?
 
 It's been the case for some time that the planner uses
 RelationGetNumberOfBlocks() to determine true rel size.  The only reason
 relpages is still stored at all is that it's used to approximate true
 number of tuples via
   true_ntuples = (reltuples/relpages) * true_npages
 ie, assuming that the tuple density is still what it was at the last
 VACUUM or ANALYZE.  So you can't fool the system with a totally made-up
 relation size anyway.  (This too is moderately annoying for planner
 testing, but it seems the only way to get the planner to react when a
 table's been filled without an immediate vacuum/analyze.)
 
 The only point of tracking rel size in shared memory would be to avoid
 the costs of lseek() kernel calls in RelationGetNumberOfBlocks.

Yes, understood. With the second point to allow them to be separately
set for PGSQL developer testing of optimizer, and application dev
testing of SQL and/or what/if scenarios.

  The main thing we are trying to accomplish here is to decouple
  transactional and nontransactional updates to a pg_class row.
 
  With the goal being avoiding table bloat??
 
 No, with the goal being correctness.  If you have a freeze/unfreeze
 mechanism then unfreezing a relation is an action that must NOT be
 rolled back if your transaction (or any other one for that matter) later
 aborts.  The tuples you put into it meanwhile need to be vacuumed anyway.
 So you can't mark it unfrozen in an uncommitted pg_class entry that
 might never become committed.
 
  For me, freezing is last step before writing to WORM media, so there is
  never an unfreeze step.
 
 That is not what Alvaro is after.  Nor anyone else here. 

So what is unfreeze for again?

 I have not
 heard anyone mention WORM media for Postgres in *years*.

Oh? Big requirements for archive these days, much more so than before.
This will allow years of data in a seamless on-line/near-line
partitioned table set. Lots of people want that: .gov, .mil, .com

More modern equivalent: a MAID archive system for WORO data

 It strikes me though that automatic UNFREEZE isn't necessarily the
 requirement.  What if VACUUM FREEZE causes the table to become
 effectively read-only, and you need an explicit UNFREEZE command to
 put it back into a read-write state?  Then UNFREEZE could be a
 transactional operation, and most of these issues go away.  

That works for me. Very much preferred.

 The case
 where this doesn't work conveniently is copying a frozen database
 (viz template0), but maybe biting the bullet and finding a way to do
 prep work in a freshly made database is the answer for that.  We've
 certainly seen plenty of other possible uses for post-CREATE processing
 in a new database.
 
 Another reason for not doing unfreeze automatically is that as the patch
 stands, any database user can force unfreezing of any table, whether he
 has any access rights on it or not (because the LockTable will happen
 before we check access rights, I believe).  This is probably Not Good.
 Ideally I think FREEZE/UNFREEZE would be owner-permission-required.

Seems like a plan.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Andrew Dunstan



Martijn van Oosterhout wrote:



Side note: In my experience Excel happily slurps up tab delimited
output, so I'm not sure why all of this is an issue in the first place.

 



I guess your experience doesn't run to data that has embedded tabs, for 
example.


There really is a reason we did this in the first place, and it wasn't 
for fun.


cheers

andrew

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Rod Taylor
On Mon, 2006-06-12 at 16:28 -0400, Bill Bartlett wrote:
 Can't -- the main production database is over at a CoLo site with access
 only available via SSH, and tightly-restricted SSH at that. Generally
 one of the developers will SSH over to the server, pull out whatever
 data is needed into a text file via psql or pg_dump, scp the file(s)
 back here and send them to the user.

I don't get it. If you can use psql then you already have csv support.

psql -c 'COPY pg_class TO STDOUT WITH CSV' postgres > pg_class.csv

  -Original Message-
  From: Joshua D. Drake [mailto:[EMAIL PROTECTED] 
  Sent: Monday, June 12, 2006 4:15 PM
  To: Bill Bartlett
  Cc: 'Andrew Dunstan'; 'Tom Lane'; 'PG Hackers'
  Subject: Re: [HACKERS] CSV mode option for pg_dump
  
  
  Bill Bartlett wrote:
   Here's me speaking up -- I'd definitely use it!   As a 
  quick way to pull
   data into Excel to do basic reports or analysis, a CSV 
  format would be 
   great.
  
  Why not just use ODBC?
  
  Joshua D. Drake
  -- 
  
   === The PostgreSQL Company: Command Prompt, Inc. ===
 Sales/Support: +1.503.667.4564 || 24x7/Emergency: 
  +1.800.492.2240
 Providing the most comprehensive  PostgreSQL solutions 
  since 1997
http://www.commandprompt.com/
  
  
 
 
 ---(end of broadcast)---
 TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
 
-- 


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] timezones to own config file

2006-06-13 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Any particular reason this can't be a normal table in pg_catalog which
 you can select/update.

That doesn't do anything to help with one of the main problems: that
we have at least two (maybe more) alternative sets of names that people
might want as default.  Getting rid of australian_timezones is fine,
but we can't do it by saying all you aussies have to hack the standard
list according to your own ideas.  I don't expect that very many people
will actually need to make custom timezone name lists --- if we find
they do, we'll need to work harder on the default lists.  So the design
center should be select one of a few predefined lists, not hack away
on system catalog until you like it.  Especially not if they have to do
it in template0, template1, postgres, etc.  Basically, a GUC variable is
just about the right paradigm for this, a system catalog isn't.

I'd also be a bit worried about performance issues, eg, whether VACUUM
FULL on such a table would bring datetime operations to a halt.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Steve Atkins


On Jun 13, 2006, at 7:34 AM, Martijn van Oosterhout wrote:


What this tells me is that we need a tool somewhere between psql and
pg_dump, say, pgquery. Its sole purpose in life is to generate output
from various queries. Because it's a separate tool there's no question
of psql or pg_dump being able to parse them.

While you're at it, you could add modules to support many different
output styles, like CSV, XML, Excel format, HTML, etc.

This I believe would take the load off psql to provide many different
output styles, as well as the load off pg_dump to produce
parsable-by-third-party output.

Thoughts?


Perl+DBD::Pg+CPAN does almost all of this already. Lots of support
for countless different output formats, and mostly fairly well
battle-tested.

I suspect that a perl script to do all that would be dominated by
commandline option parsing, as all the hard work is in existing modules.

Would that be adequate, or do we really want to reimplement and
maintain all the output format complexity in our own code, in C?

Cheers,
  Steve

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Joshua D. Drake

Rod Taylor wrote:

On Mon, 2006-06-12 at 16:28 -0400, Bill Bartlett wrote:

Can't -- the main production database is over at a CoLo site with access
only available via SSH, and tightly-restricted SSH at that. Generally
one of the developers will SSH over to the server, pull out whatever
data is needed into a text file via psql or pg_dump, scp the file(s)
back here and send them to the user.


I don't get it. If you can use psql then you already have csv support.

psql -c 'COPY pg_class TO STDOUT WITH CSV' postgres > pg_class.csv


If you data looks like this:

foo	barbaz	bing

You are o.k. You have three columns, tab delimited.

However if you data looks like this:

foo	bar	baz	bing

You have a problem.

foo is one column
bar and baz are a single column
bing is a single column

How does excel know that bar	baz is a single column? It doesn't because 
you told it to delimit on tabs and thus you have four columns as far as 
Excel is concerned.


An alternative, although I don't know what kind of headaches it would 
cause, is to have a text delimiter as well as a field delimiter, e.g.:


foo	"bar	baz"	bing
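
That is essentially what CSV mode already does; a minimal sketch with a made-up
table, showing an embedded delimiter surviving because the field gets quoted:

CREATE TEMP TABLE t (a text, b text, c text);
INSERT INTO t VALUES ('foo', 'bar, baz', 'bing');
COPY t TO STDOUT WITH CSV;
-- emits: foo,"bar, baz",bing   (the embedded comma forces quoting of that field)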

Sincerely,

Joshua D. Drake







-Original Message-
From: Joshua D. Drake [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 12, 2006 4:15 PM

To: Bill Bartlett
Cc: 'Andrew Dunstan'; 'Tom Lane'; 'PG Hackers'
Subject: Re: [HACKERS] CSV mode option for pg_dump


Bill Bartlett wrote:
Here's me speaking up -- I'd definitely use it!   As a quick way to pull
data into Excel to do basic reports or analysis, a CSV format would be
great.

Why not just use ODBC?

Joshua D. Drake
--

 === The PostgreSQL Company: Command Prompt, Inc. ===
   Sales/Support: +1.503.667.4564 || 24x7/Emergency: 
+1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions 
since 1997

  http://www.commandprompt.com/




---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly




--

=== The PostgreSQL Company: Command Prompt, Inc. ===
  Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
  Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread PFC


From what I gather, the CSV format dump would only contain data.
	I think pg_dump is the friend of pg_restore. It dumps everything  
including user defined functions, types, schemas etc. CSV does not fit  
with this.


	Besides, people will probably want to dump into CSV the result of any  
query, to load it into excel, not just the full contents of a table.


	So, why not create a separate tool? Someone suggested pg_query for that;  
I second it.
	This tool would take a query and format options, and would output a file  
in whatever format is chosen by the user (CSV, COPY format, XML, whatever).


	A script language (Python) can be used, which will significantly shorten  
development times and allow easy modularity, as it is easier to add a  
module to a Python program than to a C program.
	I would vote for Python because I love it and it has a very good postgres  
adapter (psycopg2) which knows how to convert every postgres type to a  
native language type (yes, even multidimensional arrays of BOX get  
converted). And it's really fast at retrieving large volumes of data.


	So you have a stable, fast tool for backup and restore (pg_dump) and a  
rapidly evolving, user-friendly and extendable tool for exporting data,  
and everyone is happy.


	Mr. Momjian talks about adding modular functionality to pg_dump. Is it  
really necessary? What is the objective? Is it to reuse code in pg_dump?  
I guess not; if a user wants to dump, for instance, all the tables in a  
schema, implementing this logic in Python is only a few lines of code  
(select from information_schema...).


	To be realistic, output format modules should be written in script  
languages. No one sane is eager to do string manipulation in C. Thus these  
modules would have to somehow fit with pg_dump, maybe with a pipe or  
something. This means designing another protocol. Reimplementing in a  
scripting language the parts of pg_dump which will be reused by this  
project (mainly, enumerating tables and stuff) will be far easier.


Just look.

Python 2.4.2 (#1, Mar 30 2006, 14:34:35)
[GCC 3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0, pie-8.7.8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

...opens a db connection...

>>> c.execute( "SELECT * FROM test.csv" )
>>> data = c.fetchall()
>>> data
[[1, datetime.date(2006, 6, 13), 'this\tcontains\ttabulations'], [2,  
datetime.date(2006, 6, 13), "this'contains'quotes"], [3,  
datetime.date(2006, 6, 13), 'this"contains"double quotes']]

>>> import csv, sys
>>> c = csv.writer( sys.stdout, dialect = csv.excel )
>>> c.writerows( data )
1,2006-06-13,this	contains	tabulations
2,2006-06-13,this'contains'quotes
3,2006-06-13,"this""contains""double quotes"

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Andrew Dunstan



PFC wrote:



From what I gather, the CSV format dump would only contain data.
I think pg_dump is the friend of pg_restore. It dumps everything  
including user defined functions, types, schemas etc. CSV does not 
fit  with this.


   



This is just nonsense. There is not the slightest reason that CSV data 
cannot be embedded in a text dump nor exist as the data members of a 
custom or tar dump with the corresponding COPY statements modified 
accordingly.


Really, let's get the facts straight, please.

cheers

andrew

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Joshua D. Drake


Would that be adequate, or do we really want to reimplement and maintain
all the output format complexity in our own code, in C?


I think the point is that we should provide a native implementation 
because not everyone is crazy enough to use perl (blatant jab ;)). I 
would never expect a customer to write a perl or python script just to 
get their data in what is widely considered a standard business format 
that can be imported by their userland application.


The people on the hackers list, are NOT the target for this feature. The 
people on general, admin and novice are.


Sincerely,

Joshua D. Drake






Cheers,
  Steve

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org




--

=== The PostgreSQL Company: Command Prompt, Inc. ===
  Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
  Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Bill Bartlett
 From: Rod Taylor [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, June 13, 2006 11:31 AM
 
 
 On Mon, 2006-06-12 at 16:28 -0400, Bill Bartlett wrote:
  Can't -- the main production database is over at a CoLo site with 
  access only available via SSH, and tightly-restricted SSH at that. 
  Generally one of the developers will SSH over to the 
 server, pull out 
  whatever data is needed into a text file via psql or 
 pg_dump, scp the 
  file(s) back here and send them to the user.
 
 I don't get it. If you can use psql then you already have csv support.
 
 psql -c 'COPY pg_class TO STDOUT WITH CSV' postgres > pg_class.csv

Ah - RTF-UPDATED-M on my part.  Most of my systems are still on PG 7.4.x
databases, so I tend to stick with the 7.x docs, and I didn't notice the
WITH CSV option added in 8.0.  That, plus temp tables, will be very
useful.

However, I also agree with the need for a new pg_query / pg_export
program. A program geared solely towards exporting the results of a
query would allow many of the options that are needed for the
ever-growing variety of output formats (XML, CSV, HTML, XHTML, etc.) and
the details for each format, without needing to clutter up pg_dump with
things that really have nothing to do with backing up and restoring
data.  It could also allow a large range of options related to getting
data out (where, order by), many of which have also been discussed for
pg_dump recently.

- Bill

 
   -Original Message-
   From: Joshua D. Drake [mailto:[EMAIL PROTECTED]
   Sent: Monday, June 12, 2006 4:15 PM
   To: Bill Bartlett
   Cc: 'Andrew Dunstan'; 'Tom Lane'; 'PG Hackers'
   Subject: Re: [HACKERS] CSV mode option for pg_dump
   
   
   Bill Bartlett wrote:
Here's me speaking up -- I'd definitely use it!   As a 
   quick way to pull
data into Excel to do basic reports or analysis, a CSV
   format would be
great.
   
   Why not just use ODBC?
   
   Joshua D. Drake
   --
   
=== The PostgreSQL Company: Command Prompt, Inc. ===
  Sales/Support: +1.503.667.4564 || 24x7/Emergency:
   +1.800.492.2240
  Providing the most comprehensive  PostgreSQL solutions
   since 1997
 http://www.commandprompt.com/
   
   
 -- 
 


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Steve Atkins


On Jun 13, 2006, at 9:47 AM, Joshua D. Drake wrote:

Would that be adequate, or do we really want to reimplement and
maintain all the output format complexity in our own code, in C?


I think the point is that we should provide a native implementation  
because not everyone is crazy enough to use perl (blatant jab ;)).  
I would never expect a customer to write a perl or python script  
just to get their data in what is widely considered a standard  
business format that can be imported by their userland application.


That wasn't what I was discussing, on two levels. Firstly, I wasn't  
suggesting that the end user write anything; secondly, I was talking  
about the other output formats discussed (Excel, HTML...) rather than  
just CSV.




The people on the hackers list, are NOT the target for this  
feature. The people on general, admin and novice are.


I was referring to the other export formats mentioned (Excel,  
HTML...). We already support CSV export of single tables with the  
bundled software, don't we?


Cheers,
  Steve

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Andrew Dunstan



Steve Atkins wrote:



Would that be adequate, or do we really want to reimplement and
maintain all the output format complexity in our own code, in C?




Code to produce CSVs is there, now, today, and has been since 8.0.  That 
is *not* what is at issue here. If you want to debate whether or not it 
should be in the backend you are 2 years too late. The ONLY question 
here is about whether or not to have it enabled as an option in pg_dump.


try these and see the interesting results:
 COPY pg_class TO '/tmp/pg_class.csv' CSV;
or in psql
 \copy pg_class to '/tmp/pg_class.csv' csv

As for XML which was also mentioned, you should be aware that there is a 
Google Summer of Code project to implement SQL/XML, so watch this space. 
(As for why that needs to be in the backend, see previous discussions)


cheers

andrew

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Jim C. Nasby
On Mon, Jun 12, 2006 at 11:38:01AM -0400, Kris Kennaway wrote:
 On Mon, Jun 12, 2006 at 10:08:22AM -0500, Jim C. Nasby wrote:
  On Mon, Jun 12, 2006 at 12:24:36AM -0400, Kris Kennaway wrote:
   On Sun, Jun 11, 2006 at 10:07:13PM -0500, Jim C. Nasby wrote:
On Sun, Jun 11, 2006 at 09:58:33PM -0400, Tom Lane wrote:
 Kris Kennaway [EMAIL PROTECTED] writes:
  On Sun, Jun 11, 2006 at 07:43:03PM -0400, Tom Lane wrote:
  Let's see the evidence.
 
  The calls to setproctitle() (it looks like 4 setproctitle syscalls 
  per
  DB query) are causing contention on the Giant lock 25% of the time 
  on
  a dual p4 + HTT.  Disabling process title setting completely gives 
  an
  8% peak performance boost to the super-smack select benchmark.
 
 I think you misunderstood me: I asked for evidence, not 
 interpretation.
 What are you measuring, and with what tool, and what are the numbers?
 On what benchmark case?  And what did you do to disable process title
 setting completely?
 
 The reason I'm being doubting Thomas here is that I've never seen any

Ba-da-bum!

 indication on any other platform that ps_status is a major bottleneck.
 Now maybe FreeBSD really sucks, or maybe you're onto something of
 interest, but let's see the proof in a form that someone else can
 check and reproduce.

It's also important to find out what version of FreeBSD this is. A lot
of things have been pulled out of GIANT in 5.x and 6.x, so it's entirely
possible this isn't an issue in newer versions.
   
  Can you provide the actual commands you used to setup and run the test?
 
 I actually forget all the steps I needed to do to get super-smack
 working with postgresql since it required a lot of trial and error for
 a database newbie like me (compiling it from the
 benchmarks/super-smack port was trivial, but unlike mysql it required
 configuring the database by hand - this should hopefully be more
 obvious to someone familiar with pgsql though).
 
 It would be great if someone on your end could make this easier, BTW -
 e.g. at least document the steps.  Also super-smack should be changed
 to allow use via a local socket with pgsql (this is the default with
 mysql) - this avoids measuring network stack overhead.
 
Unless supersmack has improved substantially, you're unlikely to find
much interest. Last I heard it was a pretty brain-dead benchmark. DBT2/3
(http://sourceforge.net/projects/osdldbt) is much more realistic (based
on TPC-C and TPC-H).
   FYI, the biggest source of contention is via semop() - it might be
   possible to optimize that some more in FreeBSD, I don't know.
  
  Yeah, I've seen PostgreSQL on FreeBSD fall over at high load with a lot
  of procs in either semwait or semlock. :(
 
 Part of that is Giant contention again; for example on 6.x semop() and
 setproctitle() both want to acquire it, so they'll fight with each
 other and with anything else on the system that wants Giant
 (e.g. IPv6, or the USB stack, etc).  As I mentioned Giant is not an
 issue here going forward, but there is still as much lock contention
 just between semop() calls running on different CPUs.  It may be
 possible for someone to implement more fine-grained locking here, but
 I don't know if there is available interest.

FWIW, turning off stats_command_string substantially reduced
this contention, so it appears the issue is somewhere in the stats code.
This code sends stats messages to a different process via a socket (or
is it UDP?), with the intention that if the system gets heavily loaded
we'll lose some stats in the interest of not bogging down all the
backends. It seems that doesn't work so hot on FreeBSD. :(
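
For reference, the mechanism being described has roughly the following
shape; this is only a minimal sketch with simplified names and a made-up
message struct, not the actual pgstat.c code:

/* Lossy stats send: a non-blocking datagram socket to the collector,
 * where a full socket buffer simply drops the message instead of
 * blocking the backend.  Illustrative sketch only. */
#include <errno.h>
#include <sys/socket.h>

typedef struct StatsMsg
{
	int		backend_pid;
	char	activity[256];
} StatsMsg;

/* assumed: a connected, non-blocking UDP socket to the collector process */
static int	pgStatSock = -1;

static void
stats_send(const StatsMsg *msg)
{
	if (pgStatSock < 0)
		return;

	if (send(pgStatSock, msg, sizeof(*msg), 0) < 0 &&
		(errno == EAGAIN || errno == EWOULDBLOCK))
	{
		/* collector is behind: the message is simply lost, by design */
	}
}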
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Running a query twice to ensure cached results.

2006-06-13 Thread Simon Riggs
On Tue, 2006-06-13 at 15:00 +0200, Martijn van Oosterhout wrote:

 What you might be able to do is to reduce its effect. The thing that
 occurs to me is to make hint bit changes only mark a page half-dirty.
 If the page is evicted because the space is needed in the buffer cache,
 it can be just dropped. However, the bgwriter will write it as normal.
 Hence, setting hint bits will become a sort of background operation,
 done when there's time.

Yes, I think that's a very good idea. What that introduces is the
concept of dirty priority - i.e. some pages are more important to
write out quickly than others. If the bgwriter cleans the higher
priority ones first it should do a better job of keeping the bufferpool
clean. That needs some more work before it's a fully fledged proposal.
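
A rough sketch of the flag-level idea, using hypothetical flag and
function names rather than anything in the real buffer manager:

/* Hypothetical sketch only: a separate "half-dirty" flag for pages whose
 * only changes are hint bits.  Eviction for space can drop such a page
 * without writing it, while the bgwriter still writes it when it has
 * time.  None of these names exist in the actual bufmgr code. */
#define BM_DIRTY		(1 << 0)
#define BM_HINT_DIRTY	(1 << 1)	/* assumed new flag */

typedef struct BufDesc
{
	int		flags;
	/* ... */
} BufDesc;

/* setting a hint bit marks the page half-dirty only */
static void
MarkBufferHintDirty(BufDesc *buf)
{
	buf->flags |= BM_HINT_DIRTY;
}

/* a backend evicting for space only has to write a really dirty page */
static int
MustWriteBeforeEvict(const BufDesc *buf)
{
	return (buf->flags & BM_DIRTY) != 0;
}

/* the bgwriter treats half-dirty pages as writable work, at lower priority */
static int
BgwriterWantsToWrite(const BufDesc *buf)
{
	return (buf->flags & (BM_DIRTY | BM_HINT_DIRTY)) != 0;
}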

 It seems to me that if a large table is loaded in a single transaction,
 the check for if the transaction is committed should be cheap because
 it's checking the same transaction id over and over.

I was considering this the other day. My original idea was to set the
xmin to be FrozenTransaction when loading a table with COPY in the same
transaction as the one that created it, but that didn't work because of
the MVCC violation. If we just set the hint bit to show XMIN_COMMITTED
then we need not worry about MVCC violations, since the xmin will still
be higher than any pre-existing snapshots.

I think Tom had a problem with that because it relied on file truncation
on recovery, but I'm not clear on exactly why.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Bruce Momjian
Jim C. Nasby wrote:
 FWIW, getting turning off stats_command_string substantially reduced
 this contention, so it appears the issue is somewhere in the stats code.
 This code sends stats messages to a different process via a socket (or
 is it UDP?), with the intention that if the system gets heavily loaded
 we'll lose some stats in the interest of not bogging down all the
 backends. It seems that doesn't work so hot on FreeBSD. :(

I am working on a patch for 8.2 to fix that for all platforms.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Jim C. Nasby
On Tue, Jun 13, 2006 at 02:10:15PM -0400, Bruce Momjian wrote:
 Jim C. Nasby wrote:
  FWIW, getting turning off stats_command_string substantially reduced
  this contention, so it appears the issue is somewhere in the stats code.
  This code sends stats messages to a different process via a socket (or
  is it UDP?), with the intention that if the system gets heavily loaded
  we'll lose some stats in the interest of not bogging down all the
  backends. It seems that doesn't work so hot on FreeBSD. :(
 
 I am working on a patch for 8.2 to fix that for all platforms.

Excellent. Did I miss discussion of that or have you not mentioned it?
I'm curious as to how you're fixing it...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Bruce Momjian
Jim C. Nasby wrote:
 On Tue, Jun 13, 2006 at 02:10:15PM -0400, Bruce Momjian wrote:
  Jim C. Nasby wrote:
   FWIW, getting turning off stats_command_string substantially reduced
   this contention, so it appears the issue is somewhere in the stats code.
   This code sends stats messages to a different process via a socket (or
   is it UDP?), with the intention that if the system gets heavily loaded
   we'll lose some stats in the interest of not bogging down all the
   backends. It seems that doesn't work so hot on FreeBSD. :(
  
  I am working on a patch for 8.2 to fix that for all platforms.
 
 Excellent. Did I miss discussion of that or have you not mentioned it?
 I'm curious as to how you're fixing it...

The patches are in the hold queue:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

The title is "Stats collector performance improvement".  I need someone to
test my patches on a non-BSD platform to move forward.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Proposal for debugging of server-side stored procedures

2006-06-13 Thread Lukas Smith

Hi,

I was just talking to Derick, the author of DBGp, and I realized this 
might be a topic for a joint effort among all the open source RDBMSs. I 
think it would be awesome if we could get a common protocol set up for 
stored procedure debugging.


regards,
Lukas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Jim C. Nasby wrote:
 Excellent. Did I miss discussion of that or have you not mentioned it?
 I'm curious as to how you're fixing it...

 The patches are in the hold queue:
   http://momjian.postgresql.org/cgi-bin/pgpatches_hold

That's talking about the stats code, which has approximately zip to do
with setproctitle (ps_status.c).  But IIRC that patch is on hold because
nobody particularly liked the approach it's taking.  I think we should
investigate rewriting the stats communication architecture entirely ---
in particular, do we really need the stats buffer process at all?  It'd
be interesting to see what happens if we just make the collector process
read the UDP socket directly.  Or alternatively drop the UDP socket in
favor of having the backends write directly to the collector process'
input pipe (not sure if this would port to Windows though).

As far as Kris' complaint goes, one thing that might be interesting is
to delay both the setproctitle call and the stats command message send
until the current command has been running a little while (say 100ms
or so).  The main objection I see to this is that it replaces a kernel
call that actually does some work with a kernel call to start a timer,
plus possibly a later kernel call to actually do the work.  Not clear
that there's a win there.  (If you're using statement_timeout it might
not matter, but if you aren't...)

Also not clear how you make the necessary actions happen when the timer
expires --- I seriously doubt it'd be safe to try to do either action
directly in a signal handler.
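
One possible shape, purely as a sketch with invented names and glossing
over the real backend's signal-safety machinery: the handler only sets a
flag, and the setproctitle()/stats work happens later at a safe point in
the command loop.

/* Illustrative only: arm a one-shot timer when a command starts, have the
 * handler merely set a flag, and do the actual reporting work later. */
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t report_pending = 0;

static void
report_timeout_handler(int signo)
{
	/* the only safe thing to do here is set a flag */
	report_pending = 1;
}

static void
start_report_timer(void)
{
	struct itimerval t;

	memset(&t, 0, sizeof(t));
	t.it_value.tv_usec = 100 * 1000;	/* ~100ms, per the suggestion above */
	signal(SIGALRM, report_timeout_handler);
	setitimer(ITIMER_REAL, &t, NULL);	/* the extra kernel call mentioned */
}

/* called from safe points in the main loop, not from the handler */
static void
maybe_report(const char *activity)
{
	if (report_pending)
	{
		report_pending = 0;
		/* update the ps title and send the stats message for activity here */
	}
}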

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Martijn van Oosterhout
On Tue, Jun 13, 2006 at 04:35:24PM -0400, Tom Lane wrote:
 ...  The main objection I see to this is that it replaces a kernel
 call that actually does some work with a kernel call to start a timer,
 plus possibly a later kernel call to actually do the work.  Not clear
 that there's a win there.
 
And of course it's an almost guaranteed loss on systems that don't
require a syscall to set the proc title.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Magnus Hagander
 That's talking about the stats code, which has approximately 
 zip to do with setproctitle (ps_status.c).  But IIRC that 
 patch is on hold because nobody particularly liked the 
 approach it's taking.  I think we should investigate 
 rewriting the stats communication architecture entirely --- 
 in particular, do we really need the stats buffer process at 
 all?  It'd be interesting to see what happens if we just make 
 the collector process read the UDP socket directly.  Or 
 alternatively drop the UDP socket in favor of having the 
 backends write directly to the collector process'
 input pipe (not sure if this would port to Windows though).

(Yes, I remember saying I was planning to look at this. As is probably
obvious by now, I haven't had the time to do that (yet).)

As for your question, it will be a bit painful to port to windows. We
did have a lot of problems with the pgstat pipe in the initial porting
work, and I'm not convinced that there aren't some small issues still
lurking there under heavy load. The point is that the whole concept of
sharing socket descriptors doesn't really play well between processes on
Windows.

Using UDP would make that a whole lot better. Without knowing anything,
I would assume the overhead of a localhost UDP packet isn't very large
on a reasonably modern platform.


//Magnus

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  Jim C. Nasby wrote:
  Excellent. Did I miss discussion of that or have you not mentioned it?
  I'm curious as to how you're fixing it...
 
  The patches are in the hold queue:
  http://momjian.postgresql.org/cgi-bin/pgpatches_hold
 
 That's talking about the stats code, which has approximately zip to do
 with setproctitle (ps_status.c).  But IIRC that patch is on hold because

I thought the bug reporter was asking about the stats code as well.

 nobody particularly liked the approach it's taking.  I think we should
 investigate rewriting the stats communication architecture entirely ---
 in particular, do we really need the stats buffer process at all?  It'd
 be interesting to see what happens if we just make the collector process
 read the UDP socket directly.  Or alternatively drop the UDP socket in

Agreed, that's what I would prefer, and I tested something like that, but
even pulling the packets into the buffer and throwing them away had
significant overhead, so I think the timeout trick has to be employed as
well as going to a single process.

 favor of having the backends write directly to the collector process'
 input pipe (not sure if this would port to Windows though).
 
 As far as Kris' complaint goes, one thing that might be interesting is
 to delay both the setproctitle call and the stats command message send
 until the current command has been running a little while (say 100ms
 or so).  The main objection I see to this is that it replaces a kernel
 call that actually does some work with a kernel call to start a timer,
 plus possibly a later kernel call to actually do the work.  Not clear
 that there's a win there.  (If you're using statement_timeout it might
 not matter, but if you aren't...)
 
 Also not clear how you make the necessary actions happen when the timer
 expires --- I seriously doubt it'd be safe to try to do either action
 directly in a signal handler.

Right.  What if the postmaster signals the backends once a second to do
their reporting?  I am not sure what the final solution will be, but we
_need_ one based on the performance numbers I and others have seen.
Could we have PGPROC contain a reporting boolean that is set every second
and somehow checked by each backend?
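
Something of this shape, perhaps; the field and functions below are
invented purely for illustration, since PGPROC has no such flag today:

/* Illustrative only: a per-backend flag that gets set about once a second
 * and is polled by the backend at cheap points in its main loop. */
typedef struct FakeProc
{
	int				pid;
	volatile int	report_now;		/* hypothetical "please report" flag */
	/* ... the rest of the real PGPROC would go here ... */
} FakeProc;

/* run by whoever owns the once-a-second tick (postmaster, or a timer) */
static void
request_reports(FakeProc *procs, int nprocs)
{
	int		i;

	for (i = 0; i < nprocs; i++)
		procs[i].report_now = 1;
}

/* checked by each backend between statements */
static void
check_reporting_flag(FakeProc *me, const char *current_query)
{
	if (me->report_now)
	{
		me->report_now = 0;
		/* update the ps title and send the stats message for current_query */
	}
}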

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Bruce Momjian

pg_dump CSV TODO item removed until we come up with something everyone
can agree on.

---

Joshua D. Drake wrote:
  
  Would that be adequate, or do we really want to reimplement and maintain 
  all
  the output format complexity in our own code, in C?
 
 I think the point is that we should provide a native implementation 
 because not everyone is crazy enough to use perl (blatant jab ;)). I 
 would never expect a customer to write a perl or python script just to 
 get their data in what is widely considered a standard business format 
 that can be imported by their userland application.
 
 The people on the hackers list, are NOT the target for this feature. The 
 people on general, admin and novice are.
 
 Sincerely,
 
 Joshua D. Drake
 
 
 
 
  
  Cheers,
Steve
  
  ---(end of broadcast)---
  TIP 4: Have you searched our list archives?
  
http://archives.postgresql.org
  
 
 
 -- 
 
  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
   http://www.commandprompt.com/
 
 
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] timezones to own config file

2006-06-13 Thread Martijn van Oosterhout
On Tue, Jun 13, 2006 at 11:51:25AM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  Any particular reason this can't be a normal table in pg_catalog which
  you can select/update.
 
 That doesn't do anything to help with one of the main problems: that
 we have at least two (maybe more) alternative sets of names that people
 might want as default.

snip

I think my actual point was something else. We currently get complaints
from people trying to administer machines that it's annoying that various
configuration information is stored in files, beyond the easy reach of
SQL.

What I was thinking is why we couldn't just store the information in a
global shared system table that is only read on config reload. You
could have a few columns, maybe the first being a list name, which is
referenced from a GUC.

If you issue a config reload during a VACUUM FULL, I guess that might
be an issue, yes. I was just thinking people might appreciate being
able to configure the timezones without opening a shell... Similarly,
it would also give a way for user-interfaces to get a list of available
valid timezones and their actual meanings, which is currently
impossible.

Just a thought really...
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Jim C. Nasby
On Tue, Jun 13, 2006 at 05:05:31PM -0400, Bruce Momjian wrote:
 Tom Lane wrote:
  Bruce Momjian pgman@candle.pha.pa.us writes:
   Jim C. Nasby wrote:
   Excellent. Did I miss discussion of that or have you not mentioned it?
   I'm curious as to how you're fixing it...
  
   The patches are in the hold queue:
 http://momjian.postgresql.org/cgi-bin/pgpatches_hold
  
  That's talking about the stats code, which has approximately zip to do
  with setproctitle (ps_status.c).  But IIRC that patch is on hold because
 
 I thought the bug reporter was asking about the stats code was well.
 
It did get brought up...

  As far as Kris' complaint goes, one thing that might be interesting is
  to delay both the setproctitle call and the stats command message send
  until the current command has been running a little while (say 100ms
  or so).  The main objection I see to this is that it replaces a kernel
  call that actually does some work with a kernel call to start a timer,
  plus possibly a later kernel call to actually do the work.  Not clear
  that there's a win there.  (If you're using statement_timeout it might
  not matter, but if you aren't...)
  
  Also not clear how you make the necessary actions happen when the timer
  expires --- I seriously doubt it'd be safe to try to do either action
  directly in a signal handler.
 
 Right.  What if the postmaster signals the backend once a second to do
 their reporting.  I am not sure what the final solution will be, but we
 _need_ one based on the performance numbers I and others have seen. 
 Could we have PGPROC have a reporting boolean that is set every second
 and somehow checked by each backend?

One second might be a bit more delay than some folks want... it would be
nice if this was tuneable. Also, what would the overhead on this look
like if there's a large number of idle backends?

It does sound more appealing than setting a timer every time you start a
transaction, though...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] timezones to own config file

2006-06-13 Thread Jim C. Nasby
On Tue, Jun 13, 2006 at 11:11:26PM +0200, Martijn van Oosterhout wrote:
 On Tue, Jun 13, 2006 at 11:51:25AM -0400, Tom Lane wrote:
  Martijn van Oosterhout kleptog@svana.org writes:
   Any particular reason this can't be a normal table in pg_catalog which
   you can select/update.
  
  That doesn't do anything to help with one of the main problems: that
  we have at least two (maybe more) alternative sets of names that people
  might want as default.
 
 snip
 
 I think my actual point was something else. We currently get calls from
 people trying to administer machines that it's annoying that various
 configuration information is stored in files, beyond the easy reach of
 SQL.
 
 What I was thinking is why we couldn't just store the information in a
 global shared system table that is only read on config reload. You
 could have a few columns, maybe the first being a list name, which is
 referenced from a GUC.
 
 If you issue a config reload during a VACUUM FULL, I guess that might
 be an issue, yes. I was just thinking people might appreciate being
 able to configure the timezones without opening a shell... Similarly,
 it would also give a way for user-interfaces to get a list of available
 valid timezones and their actual meanings, which is currently
 impossible.

ISTM that's an issue that affects all configuration stuff, not just the
timezones; if we're going to come up with a way to manage settings
without touching a file, it should work for everything.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] timezones to own config file

2006-06-13 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 What I was thinking is why we couldn't just store the information in a
 global shared system table that is only read on config reload. You
 could have a few columns, maybe the first being a list name, which is
 referenced from a GUC.

Hmmm ... if we keep the notion of a GUC that identifies a set of
compatible timezone names, then a table with a primary key of
(tz_set_name, tz_name) doesn't seem quite so awful.  The main
remaining objection I can see is that the postmaster couldn't use
it, only backends.  Now this doesn't matter much as far as timestamp
operations go because I don't think the postmaster does any operations
that need TZ data --- but what of verifying that the GUC variable has
a valid value in postgresql.conf at startup?  If you're willing to
abandon sanity checking on that string, it might work.

One interesting thought about a system table is that it could be
referenced through a syscache, which'd have the nice property that only
the (probably few) values actually referenced in a given session need to
get loaded.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Andrew Dunstan



Bruce Momjian wrote:


pg_dump CSV TODO item removed until we come up with something everyone
can agree on.

 



That's a pity.

Just to show you how little is involved in what I was suggesting, a 
prototype patch is attached - it's 182 lines of context diff, which is 
pretty small for a new feature. It took me about an hour to write and I 
have tested it against the regression db in both text and binary dump 
modes, where it works without a hitch.


cheers

andrew


Index: src/bin/pg_dump/pg_dump.c
===================================================================
RCS file: /cvsroot/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.438
diff -c -r1.438 pg_dump.c
*** src/bin/pg_dump/pg_dump.c	9 Jun 2006 19:46:09 -0000	1.438
--- src/bin/pg_dump/pg_dump.c	14 Jun 2006 00:32:03 -0000
***************
*** 113,118 ****
--- 113,123 ----
  /* flag to turn on/off dollar quoting */
  static int	disable_dollar_quoting = 0;
  
+ /* flag to control if using CSV */
+ static bool use_csv = false;
+ 
+ /* holder for CSV options */
+ static PQExpBuffer csv_opts;
  
  static void help(const char *progname);
  static NamespaceInfo *findNamespace(Oid nsoid, Oid objoid);
***************
*** 251,256 ****
--- 256,265 ----
  		{"disable-triggers", no_argument, &disable_triggers, 1},
  		{"use-set-session-authorization", no_argument, &use_setsessauth, 1},
  
+ 		/* long options with no short version */
+ 		{"csv", no_argument, NULL, 2},
+ 		{"csv-option", required_argument, NULL, 3},
+ 
  		{NULL, 0, NULL, 0}
  	};
  	int			optindex;
***************
*** 285,290 ****
--- 294,301 ----
  		}
  	}
  
+ 	csv_opts = createPQExpBuffer();
+ 
  	while ((c = getopt_long(argc, argv, "abcCdDE:f:F:h:in:oOp:RsS:t:uU:vWxX:Z:",
  							long_options, &optindex)) != -1)
  	{
***************
*** 419,424 ****
--- 430,462 ----
  				break;
  				/* This covers the long options equivalent to -X xxx. */
  
+ 			case 2:				/* csv */
+ 				use_csv = true;
+ 				break;
+ 
+ 			case 3:				/* csv-option */
+ 				if (strcmp(optarg, "singlequote") == 0)
+ 				{
+ 					appendPQExpBuffer(csv_opts, "QUOTE AS '''' ");
+ 				}
+ 				else if (strcmp(optarg, "tabdelimiter") == 0)
+ 				{
+ 					appendPQExpBuffer(csv_opts, "DELIMITER AS E'\\t' ");
+ 				}
+ 				else if (strcmp(optarg, "header") == 0)
+ 				{
+ 					appendPQExpBuffer(csv_opts, "HEADER ");
+ 				}
+ 				else
+ 				{
+ 					fprintf(stderr,
+ 							_("%s: invalid csv option -- %s\n"),
+ 							progname, optarg);
+ 					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ 					exit(1);
+ 				}
+ 	
+ 
  			case 0:
  				break;
  
***************
*** 463,468 ****
--- 501,518 ----
  		exit(1);
  	}
  
+ 	if (use_csv == true && dumpInserts == true)
+ 	{
+ 		write_msg(NULL, "INSERT (-d, -D) and CSV (--csv) options cannot be used together\n");
+ 		exit(1);
+ 	}
+ 
+ 	if (use_csv == false && strlen(csv_opts->data) > 0)
+ 	{
+ 		write_msg(NULL, "You must specify --csv to use --csv-option\n");
+ 		exit(1);
+ 	}
+ 
  	/* open the output file */
  	switch (format[0])
  	{
***************
*** 714,719 ****
--- 764,771 ----
  			 "                           use SESSION AUTHORIZATION commands instead of\n"
  			 "                           OWNER TO commands\n"));
  
+ 	printf(_("  --csv                    use CSV mode\n"));
+ 	printf(_("  --csv-option=opt         one of header, tabdelimiter or singlequote\n"));
  	printf(_("\nConnection options:\n"));
  	printf(_("  -h, --host=HOSTNAME      database server host or socket directory\n"));
  	printf(_("  -p, --port=PORT          database server port number\n"));
***************
*** 881,898 ****
  
  	if (oids && hasoids)
  	{
! 		appendPQExpBuffer(q, "COPY %s %s WITH OIDS TO stdout;",
  						  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
  										 classname),
  						  column_list);
  	}
  	else
  	{
! 		appendPQExpBuffer(q, "COPY %s %s TO stdout;",
  						  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
  										 classname),
  						  column_list);
  	}
  	res = PQexec(g_conn, q->data);
  	check_sql_result(res, g_conn, q->data, PGRES_COPY_OUT);
  	PQclear(res);
--- 933,955 ----
  
  	if (oids && hasoids)
  	{
! 		appendPQExpBuffer(q, "COPY %s %s WITH OIDS TO stdout",
  						  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
  										 classname),
  						  column_list);
  	}
  	else
  	{
! 		appendPQExpBuffer(q, "COPY %s %s TO stdout",
  						  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
  										 classname),
  						  column_list);
  	}
+ 	if (use_csv)
+ 	{
+ 		appendPQExpBuffer(q, " CSV %s", csv_opts->data);
+ 	}
+ 	appendPQExpBuffer(q, ";");
  	res = PQexec(g_conn, q->data);
  	check_sql_result(res, g_conn, q->data, PGRES_COPY_OUT);
  	PQclear(res);
***************
*** 1139,1147 ****
  		/* must use 2 steps here 'cause fmtId is nonreentrant */
  		appendPQExpBuffer(copyBuf, "COPY %s ",
  						  fmtId(tbinfo->dobj.name));
! 		appendPQExpBuffer(copyBuf, "%s %sFROM stdin;\n",
  						  fmtCopyColumnList(tbinfo),
  						  (tdinfo->oids && tbinfo->hasoids) ? "WITH OIDS " : "");
  		copyStmt = copyBuf->data;
 

Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Jim Nasby


On Jun 12, 2006, at 10:38 AM, Kris Kennaway wrote:

FYI, the biggest source of contention is via semop() - it might be
possible to optimize that some more in FreeBSD, I don't know.


Yeah, I've seen PostgreSQL on FreeBSD fall over at high load with  
a lot

of procs in either semwait or semlock. :(


Part of that is Giant contention again; for example on 6.x semop() and
setproctitle() both want to acquire it, so they'll fight with each
other and with anything else on the system that wants Giant
(e.g. IPv6, or the USB stack, etc).  As I mentioned Giant is not an
issue here going forward, but there is still as much lock contention
just between semop() calls running on different CPUs.  It may be
possible for someone to implement more fine-grained locking here, but
I don't know if there is available interest.


BTW, there's another FBSD performance oddity I've run across. Running

pg_dump -t email_contrib -COx stats | bzip2 > ec.sql.bz2 &

which dumps the email_contrib table to bzip2 then to disk, the OS  
won't use more than 1 CPU on an SMP system... unless the data is  
cached. According to both gstat and systat -v, the system isn't I/O  
bound; both are reporting the RAID10 with that table on it as only  
about 10% busy. If I let that command run for a bit then cancel it  
and re-start it so that the beginning of that table is in cache, it  
will use one entire CPU for bzip2, which is what I'd expect to happen.

--
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Marc G. Fournier

On Tue, 13 Jun 2006, Jim Nasby wrote:



On Jun 12, 2006, at 10:38 AM, Kris Kennaway wrote:

FYI, the biggest source of contention is via semop() - it might be
possible to optimize that some more in FreeBSD, I don't know.


Yeah, I've seen PostgreSQL on FreeBSD fall over at high load with a lot
of procs in either semwait or semlock. :(


Part of that is Giant contention again; for example on 6.x semop() and
setproctitle() both want to acquire it, so they'll fight with each
other and with anything else on the system that wants Giant
(e.g. IPv6, or the USB stack, etc).  As I mentioned Giant is not an
issue here going forward, but there is still as much lock contention
just between semop() calls running on different CPUs.  It may be
possible for someone to implement more fine-grained locking here, but
I don't know if there is available interest.


BTW, there's another FBSD performance odditiy I've run across. Running

pg_dump -t email_contrib -COx stats | bzip2 > ec.sql.bz2 &

which dumps the email_contrib table to bzip2 then to disk, the OS won't use 
more than 1 CPU on an SMP system... unless the data is cached. According to 
both gstat and systat -v, the system isn't I/O bound; both are reporting the 
RAID10 with that table on it as only about 10% busy. If I let that command 
run for a bit then cancel it and re-start it so that the beginning of that 
table is in cache, it will use one entire CPU for bzip2, which is what I'd 
expect to happen.


What version of FreeBSD are you dealing with here?  I'm guessing at least 
6.x, but just figured I'd clarify ...



Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email . [EMAIL PROTECTED]  MSN . [EMAIL PROTECTED]
Yahoo . yscrappy   Skype: hub.orgICQ . 7615664

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Right.  What if the postmaster signals the backend once a second to do
 their reporting.  I am not sure what the final solution will be, but we
 _need_ one based on the performance numbers I and others have seen. 
 Could we have PGPROC have a reporting boolean that is set every second
 and somehow checked by each backend?

I don't see any point in involving the postmaster in it.  What might be
interesting is to replace the current backend timer-interrupt handling
by a free-running cyclic interrupt every N milliseconds (the resolution
of the statement_timeout and deadlock_check delays would then be no
better than N milliseconds, since those actions would occur at the next
cyclic interrupt after the desired time expires).  We could possibly
drive stats reports and ps_status updates from this, as well as sampling
EXPLAIN ANALYZE if anyone cares to pursue that.

Complaints I can foresee:

* lots of cycles wasted in idle backends.  Possibly a backend that's not
received any command for a second or two could shut down its interrupt
until it next gets a command.

* not clear whether the interrupt happens when waiting for I/O.  I
already mentioned that this would be a problem for EXPLAIN ANALYZE,
but it might be no big deal for the other uses.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] postgresql and process titles

2006-06-13 Thread Jim C. Nasby
On Tue, Jun 13, 2006 at 11:13:55PM -0300, Marc G. Fournier wrote:
 BTW, there's another FBSD performance odditiy I've run across. Running
 
  pg_dump -t email_contrib -COx stats | bzip2 > ec.sql.bz2 &
 
 which dumps the email_contrib table to bzip2 then to disk, the OS won't 
 use more than 1 CPU on an SMP system... unless the data is cached. 
 According to both gstat and systat -v, the system isn't I/O bound; both 
 are reporting the RAID10 with that table on it as only about 10% busy. If 
 I let that command run for a bit then cancel it and re-start it so that 
 the beginning of that table is in cache, it will use one entire CPU for 
 bzip2, which is what I'd expect to happen.
 
 What version of FreeBSD are you dealing with here?  I'm guessing at least 
 6.x, but just figured I'd clarify ...

FreeBSD 6.0-STABLE #6: Fri Dec  9 19:14:19 UTC 2005
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] CSV mode option for pg_dump

2006-06-13 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 This is just nonsense. There is not the slightest reason that CSV data 
 cannot be embedded in a text dump nor exist as the data members of a 
 custom or tar dump with the corresponding COPY statements modified 
 accordingly.

Well, the really *core* question here is whether we trust the stability
of the CSV format definition (and code) enough to want to rely on it for
data dump/restore purposes.  I'm still a few years away from that,
myself.  AFAICT the raison d'etre of the CSV code is to emit whatever it
takes to satisfy this, that, and the other broken Microsoft application.
That's fine as an export tool, but as a dump/reload tool, nyet.  If you
put it in pg_dump you're just handing neophytes another foot-gun.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq