Re: [HACKERS] Vertical Partitioning with TOAST

2005-12-08 Thread Junji TERAMOTO
Hello all,

Thank you for your interest.

Jim C. Nasby wrote:
 Valid point. I do think there's a lot of benefit to being able to set
 the limit much lower than what it currently defaults to today. We have a
 client that has a queue-type table that is updated very frequently. One
 of the fields is text, that is not updated as frequently. Keeping this
 table vacuumed well enough has proven to be problematic, because any
 delay to vacuuming quickly results in a very large amount of bloat.
 Moving that text field into a separate table would most likely be a win.

Yes, our team thinks this patch is effective because it allows a tuple to
be partially updated.
For instance, DBT-2 frequently updates columns other than c_data in the
customer table. Because c_data (about 400 bytes, of an entire tuple size
of about 500 bytes) is copied on every update, we think this has been
hurting performance.
That is more important to us than the vertical partitioning function.

Of course, changing the table's DDL is the usual answer. However, I think
this might be useful when the DDL cannot be changed.
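
To illustrate the manual alternative: a hand-made split might look like
the sketch below (column names invented; the real DBT-2 customer table
has many more columns).

    -- frequently updated fields stay in the narrow table
    CREATE TABLE customer_hot (
        c_id      integer PRIMARY KEY,
        c_balance numeric(12,2)
    );

    -- the wide, rarely updated field moves out
    CREATE TABLE customer_data (
        c_id   integer PRIMARY KEY REFERENCES customer_hot (c_id),
        c_data text
    );

An UPDATE of c_balance then rewrites only the small customer_hot tuple
instead of copying the ~400-byte c_data every time.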

As pointed out by Tom, this is a patch to verify the idea.
I want to know how interested the community is in partial updates.
Of course, I am also interested in whether people want the vertical
partitioning function in PostgreSQL. :-)


By the way, should I send the patch to -patches again?

-- 
Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing relation locking overhead

2005-12-08 Thread Simon Riggs
On Thu, 2005-12-01 at 21:37 -0500, Tom Lane wrote:
 In looking at the current pgbench results, I notice that one
 considerable contribution to LWLock contention is access to the
 heavyweight-lock manager. 

 4. The only reason we need to take relation-level locks on indexes
 at all is to make the world safe for REINDEX being done concurrently
 with read-only accesses to the table (that don't use the index being
 reindexed).  If we went back to requiring exclusive lock for reindex we
 could forget all about both #2 and #3.  Particularly for updates of
 relations with lots of indexes, this could be a pretty significant win.
 However we'd definitely be giving up something that was seen as a
 feature at one point, so I'm not sold on this idea ... unless someone
 can see a way to reduce the overhead without giving up concurrent
 REINDEX.

In a high txn rate workload the majority of lock requests will
definitely be for indexes. This is a major overhead and contention
point and, as you say, exists only to allow REINDEX (and online CREATE
INDEX, when it exists) to operate.

There is a considerable price to be paid for this flexibility, so
although I think that flexibility is important, I do think we need to
look at ways of reducing the cost for the 95%+ of the time we pay the
price for no benefit. It seems clear that if we do not, then we are
going to have to completely redesign the lock manager, so any way
forward from here is a reasonably sized task.

Putting extra restrictions on REINDEX-like operations would be
acceptable in these circumstances. 

Various ideas follow, many of which are quite possibly bogus; the main
point is that we need a way, even if it isn't one of these:

One such idea might be to require that they occur outside of a
transaction block, just like VACUUM.

Further thoughts:
1. Normally, we do not lock indexes via the LockMgrLock

2. When a REINDEX-like operation comes along, it first of all sets a
MaintIntentLock flag on the index relation, which causes a relcache
invalidation. It then waits until all backends have updated their
relcache.

3. While the MaintIntentLock is set, all operations acquire and release
LockMgrLocks on the index.

4. The REINDEX-like operation proceeds, having accepted a possibly
extensive delay in acquiring the MaintIntentLock, in the name of
concurrent access.

5. When the REINDEX completes, we unset the MaintIntentLock and other
backends return to normal operation.
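
A minimal sketch of the intended fast path (the flag and field names here
are hypothetical, not existing code):

    /* hypothetical: rd_maintIntentLock is set by step 2 above */
    if (index->rd_maintIntentLock)
        LockRelation(index, lockmode);  /* slow path while REINDEX pending */
    else
        /* fast path: no LockMgrLock traffic at all */ ;

The point is only that the lock manager is bypassed in the common case.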

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-08 Thread Simon Riggs
On Wed, 2005-12-07 at 22:46 -0500, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Is hashtable overhead all that large?  Each table could be made
  initially size-of-current-table/N entries.  One problem is that
  currently the memory freed from a hashtable is not put back into shmem
  freespace, is it?
 
 Yeah; the problem is mainly that we'd have to allocate extra space to
 allow for unevenness of usage across the multiple hashtables.  It's hard
 to judge how large the effect would be without testing, but I think that
 this problem would inhibit us from having dozens or hundreds of separate
 partitions.

The imbalance across partitions would be a major issue because of the
difficulty of selecting a well-distributed partitioning key. If you use
the LOCKTAG, then operations on the heaviest hit tables would go to the
same partitions continually for LockRelation requests. The frequency of
access per table usually drops off dramatically in rank order: look at
TPC-B (pgbench) and TPC-C; my own experience would be that you seldom
have even as many as 16 heavily hit tables. My guess would be that doing
all of that would do little more than reduce contention to ~50%, and
that this would show quickly diminishing returns for N > 4. Also, the
more sharply defined your application profile, the worse this effect
will be.
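
For concreteness, the partitioning under discussion amounts to something
like the following sketch (NUM_LOCK_PARTITIONS and FirstLockMgrLock are
assumed names, not existing code):

    /* sketch: pick a lock-table partition by hashing the LOCKTAG */
    partition = tag_hash(&locktag, sizeof(LOCKTAG)) % NUM_LOCK_PARTITIONS;
    LWLockAcquire(FirstLockMgrLock + partition, LW_EXCLUSIVE);

With only a handful of hot tables, most requests hash to the same few
partitions, which is the imbalance described above.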

Having said that, I think this *is* the best way forward *if* we
continue to accept the barrage of lock requests. So I've reopened the
debate on the earlier thread: 
[HACKERS] Reducing relation locking overhead
and am reviewing thoughts/requirements on that thread to avoid the
necessity of altering the lock manager in this way.

pgbench is the right workload to expose this effect and measure worst
case contention, so at least performance gains are easy to measure.

 A possible response is to try to improve dynahash.c to make its memory
 management more flexible, but I'd prefer not to get into that unless
 it becomes really necessary.  A shared freespace pool would create a
 contention bottleneck of its own...

...but a less frequently accessed one.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-08 Thread Simon Riggs
On Wed, 2005-12-07 at 22:53 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  My view would be that the LockMgrLock is not relevant for all workloads,
  but I want even more to be able to discuss whether it is, or is not, on
  an accepted basis before discussions begin.
 
 Certainly.  I showed the evidence ...

The output you gave wasn't anything I recognize in the code. Assuming
it's not already there, can you please share the code you are using to
find the evidence, even if it's just privately in some form?

You're looking at the number of spins to acquire each lock? Or some
other measure of wait time on a lock?

I want to be in a position to run tests and then share the output with
the project in an agreed form, then quickly move to action. You're right
to put the burden of proof onto test results; I want to agree the
measurements before we test.

Manfred's earlier patch provides very clear output for observing
contention, including full summaries. Could we commit that, so we can
all use this for analysis? Updated with the wait info.

Best Regards, Simon Riggs





---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing relation locking overhead

2005-12-08 Thread Hannu Krosing
On Thu, 2005-12-08 at 01:08, Jim C. Nasby wrote:
 On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
 On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
   On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
Tom Lane [EMAIL PROTECTED] writes:
 What's worse, once you have excluded writes you have to rescan the entire
 table to be sure you haven't missed anything. So in the scenarios where
 this whole thing is actually interesting, ie enormous tables, you're still
 talking about a fairly long interval with writes locked out. Maybe not as
 long as a complete REINDEX, but long.

I was thinking you would set a flag to disable use of the FSM for
inserts/updates while the reindex was running. So you would know where to
find the new tuples, at the end of the table after the last tuple you read.
   
   What about keeping a separate list of new tuples? Obviously we'd only do
   this when an index was being built on a table. 
  
  The problem with a separate list is that it can be huge. For example, on a
  table with 200 inserts/updates per second, an index build lasting 6 hours
  would accumulate a total of 6*3600*200 = 4,320,000 new tuples.
 
 Sure, but it's unlikely that such a table would be very wide, so 4.3M
 tuples would probably only amount to a few hundred MB of data. It's also
 possible that this list could be vacuumed by whatever the regular vacuum
 process is for the table.

I think that keeping such a list as part of the table at a well-defined
location (like pages from N to M) is the best strategy, as it will
automatically make all new tuples available to parallel processes and
avoids both duplicate storage and the need for changing insert/update
code.

---
Hannu


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing

2005-12-08 Thread Csaba Nagy
On Wed, 2005-12-07 at 19:36, Greg Stark wrote:
[snip]
 We periodically ran into problems with load spikes or other performance
 problems causing things to get very slow and stay slow for a while. Letting
 things settle out usually worked but occasionally we had to restart the whole
 system to clear out the queue of requests.

Just as a personal opinion: I would love a REINDEX which does not block
reads/writes, even if writes will be more expensive while it's running.
There's always a period of time I can schedule the REINDEX so there's
very low write activity, but it is impossible to find a time slot when
there's NO write activity... and I think most of the real world
applications are like this. I think it's very rare that an application
is constantly getting high load, but most of them are constantly getting
SOME important activity which makes downtime hard to schedule. Now if
the slowdown of writes is not more than the acceptable service level,
then it is a very workable solution to schedule the REINDEX on a not so
busy time slot.

Cheers,
Csaba.



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] Inherited Constraints

2005-12-08 Thread Simon Riggs
On Wed, 2005-12-07 at 21:24 +0000, Simon Riggs wrote:
 The following patch implements a record of whether a constraint is
 inherited or not, and prevents dropping of inherited constraints.

Patch posted to -patches list.

 What it doesn't do:
 It doesn't yet prevent dropping the parent constraint, which is wrong,
 clearly, but what to do about it?
 1. make dropping a constraint drop all constraints dependent upon it
 (without any explicit cascade)
 2. add a new clause: ALTER TABLE ... DROP CONSTRAINT ... CASCADE 
 
 I prefer (1), since it is SQL Standard compliant, easier to remember and
 automatic de-inheritance is the natural opposite of the automatic
 inheritance process.

Comments, please -hackers?

Which implementation should I pick (or another)?
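
In SQL terms the two candidates would look like this (sketch; constraint
name invented):

    -- option (1): a plain DROP also drops the children's inherited copies
    ALTER TABLE parent DROP CONSTRAINT parent_i_check;

    -- option (2): a plain DROP errors; an explicit cascade is required
    ALTER TABLE parent DROP CONSTRAINT parent_i_check CASCADE;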

 Further patch will utilise this new knowledge to reduce the number of
 tests made during constraint_exclusion.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread pmagnoli
I can only add that patched code did not build on windows, contacted author
about that and never got an answer back.
Regards

paolo

Tom Lane [EMAIL PROTECTED] wrote:

 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Anyone remember this patch?
  http://gorda.di.uminho.pt/community/pgsqlhooks/
  The discussion seems to be pretty minimal:
  http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php
  Does anyone see a need to investigate it further?
 
 I had hoped to see some comments from the Slony people about it.
 I'd feel better about the validity of a set of hooks if more than
 one project agreed that it was useful/appropriate ...
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
 




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] forced to restart postgresql service yesterday

2005-12-08 Thread Merlin Moncure
Yesterday one of our clients called up and complained about lousy
performance and being unable to log in to our postgresql 8.0 backed ERP
running on Windows 2000 Server.  The server had been running for several
months without being restarted or rebooted.

The login was hanging in a simple plpgsql login script which basically
did an insert/update on a small table.  It would hang when called from
within psql, and once hung the query would not respond to cancel
requests (such as they are implemented on win32).  Investigating
further, trying to select from this small table at all would also hang.
Unfortunately, this table tells me which pids are safe to kill and which
are not, so I had no choice but to do an emergency restart of the
postgresql service, which went without complaint; afterwards everything
worked normally, with nothing extraordinary in the server log.

I'm not ruling out an obscure win32 problem here, but this is fairly
untraceable; this is just an FYI type of post.  This particular client
with about 50 users has been running over a year on win32/pg and this is
the first time I had to restart the service :(.  I am really pushing a
move to Linux, although there is no reason to believe that would prevent
this from happening again.

Merlin




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 The output you gave wasn't anything I recognize in the code. Assuming
 it's not already there, can you please share the code you are using to
 find the evidence, even if it's just privately in some form?

See below.  Also, the message I previously mentioned shows a different
tack on the same theme:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php
although in the light of later events I think that keeping the counts
in shared memory like that is a bad idea --- too likely to skew the
results.

 You're looking at the number of spins to acquire each lock?

Number of semop waits.

 Manfred's earlier patch provides very clear output for observing
 contention, including full summaries. Could we commit that, so we can
 all use this for analysis? Updated with the wait info.

What patch would that be?

regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig  Tue Nov 22 16:06:33 2005
--- src/backend/storage/ipc/ipc.c   Tue Nov 29 12:27:13 2005
***
*** 125,130 
--- 125,132 
  {
elog(DEBUG3, "shmem_exit(%d)", code);
  
+   LWLockStats();
+ 
/*
 * call all the registered callbacks.
 *
*** src/backend/storage/lmgr/lwlock.c.orig  Tue Dec  6 18:08:33 2005
--- src/backend/storage/lmgr/lwlock.c   Tue Dec  6 18:16:05 2005
***
*** 21,26 
--- 21,28 
   */
  #include "postgres.h"
  
+ #include <unistd.h>
+ 
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
***
*** 32,37 
--- 34,43 
  /* We use the ShmemLock spinlock to protect LWLockAssign */
  extern slock_t *ShmemLock;
  
+ static int num_counts;
+ static int *sh_acquire_counts;
+ static int *ex_acquire_counts;
+ static int *block_counts;
  
  typedef struct LWLock
  {
***
*** 209,214 
--- 215,226 
LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
LWLockCounter[0] = (int) NumFixedLWLocks;
LWLockCounter[1] = numLocks;
+ 
+   /* local counter space */
+   num_counts = numLocks;
+   sh_acquire_counts = calloc(numLocks, sizeof(int));
+   ex_acquire_counts = calloc(numLocks, sizeof(int));
+   block_counts = calloc(numLocks, sizeof(int));
  }
  
  
***
*** 257,262 
--- 269,278 
int extraWaits = 0;
  
PRINT_LWDEBUG(LWLockAcquire, lockid, lock);
+   if (mode == LW_EXCLUSIVE)
+   ex_acquire_counts[lockid]++;
+   else
+   sh_acquire_counts[lockid]++;
  
/*
 * We can't wait if we haven't got a PGPROC.  This should only occur
***
*** 328,333 
--- 344,351 
if (!mustwait)
break;  /* got the lock */
  
+   block_counts[lockid]++;
+ 
/*
 * Add myself to wait queue.
 *
***
*** 598,601 
--- 616,640 
return true;
}
return false;
+ }
+ 
+ void
+ LWLockStats(void)
+ {
+   int pid = getpid();
+   int i;
+ 
+   LWLockAcquire(0, LW_EXCLUSIVE);
+ 
+   for (i = 0; i < num_counts; i++)
+   {
+   if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i])
+   {
+   fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u\n",
+   pid, i, sh_acquire_counts[i], ex_acquire_counts[i],
+   block_counts[i]);
+   }
+   }
+ 
+   LWLockRelease(0);
  }
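
(One way to collect the numbers, assuming the server's stderr is captured
to a file:

    pg_ctl -D $PGDATA -l server_stderr.log start
    pgbench -c 10 -t 1000 bench

The per-backend "PID n lwlock m: shacq ... exacq ... blk ..." lines
accumulate in server_stderr.log as backends exit; sum them per lwlock id.)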

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] forced to restart postgresql service yesterday

2005-12-08 Thread Tom Lane
Merlin Moncure [EMAIL PROTECTED] writes:
 The login was hanging in a simple plpgsql login script which basically
 did an insert/update on a small table.  It would hang when called from
 within psql, and once hung the query would not respond to cancel
 requests (such as they are implemented on win32).  Investigating
 further, trying to select from this small table at all would also hang.
 Unfortunately, this table tells me which pids are safe to kill and which
 are not,

Did you look at pg_locks or pg_stat_activity?

There is pretty much nothing we can do with this report given the lack
of detail.
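
For next time, something along these lines would show who is stuck behind
whom (8.0-era column names; assumes stats_command_string is on):

    SELECT l.pid, l.relation, l.mode, l.granted, a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
     ORDER BY l.granted, l.pid;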

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Vertical Partitioning with TOAST

2005-12-08 Thread Bruce Momjian
Jim C. Nasby wrote:
 On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote:
  Jim C. Nasby [EMAIL PROTECTED] writes:
   This seems like a useful feature to add, allowing for easy built-in
   vertical partitioning. Are there issues with the patch as-is?
  
  Other than the ones mentioned by the poster?
  
  It seemed to me more like a not-too-successful experiment than something
  ready for application.  If you take the viewpoint that this is just
  another TOAST storage strategy, I think it's pretty useless.  A large
  field value is going to get toasted anyway with the regular strategy,
  and if your column happens to contain some values that are not large,
  forcing them out-of-line anyway is simply silly.  (You could make a case
  for making the threshold size user-controllable, but I don't see the
  case for setting the threshold to zero, which is what this amounts to.)
 
 Valid point. I do think there's a lot of benefit to being able to set
 the limit much lower than what it currently defaults to today. We have a
 client that has a queue-type table that is updated very frequently. One
 of the fields is text, that is not updated as frequently. Keeping this
 table vacuumed well enough has proven to be problematic, because any
 delay to vacuuming quickly results in a very large amount of bloat.
 Moving that text field into a separate table would most likely be a win.
 
 Presumably this would need to be settable on at least a per-table basis.
 
 Would adding such a variable be a good beginner TODO, or is it too
 invasive?

Well, we have now:

   ALTER TABLE ALTER [ COLUMN ] column 
SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }

What else is needed?
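
For example (table and column names invented):

    -- force out-of-line storage, without compression, for one column
    ALTER TABLE queue ALTER COLUMN payload SET STORAGE EXTERNAL;

What this does not give you is control over the size threshold at which
out-of-line storage kicks in, which seems to be what is being asked for.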

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Reducing relation locking overhead

2005-12-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Further thoughts:
 1. Normally, we do not lock indexes via the LockMgrLock

 2. When a REINDEX-like operation comes along, it first of all updates an
 MaintIntentLock flag on the index relation, which causes a relcache
 invalidation. It then waits until all backends have updated their
 relcache.

There isn't any way for it to do that (ie, be sure everyone else has
adjusted to the new state of affairs), short of acquiring some sort of
short-term exclusive lock on the table, which is a really bad idea.
The pending lock would block other incoming requests on the table until
all the current users exited their transactions.

I think the idea of not doing locking on indexes was pretty much shot
down in this thread, and we have to go looking for other solutions ...
thus my other proposal.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 The imbalance across partitions would be a major issue because of the
 difficulty of selecting a well-distributed partitioning key. If you use
 the LOCKTAG, then operations on the heaviest hit tables would go to the
 same partitions continually for LockRelation requests. The frequency of
 access per table usually drops off dramatically in rank order: look at
 TPC-B (pgbench) and TPC-C; my own experience would be that you seldom
 have even as many as 16 heavily hit tables. My guess would be that doing
 all of that would do little more than reduce contention to ~50%, and
 that this would show quickly diminishing returns for N > 4. Also, the
 more sharply defined your application profile, the worse this effect
 will be.

That's a fair point, and reinforces my instinct that having a large
number of partitions would be a losing game.  But you are mistaken to
think that the number of hot-spot tables is the only limit on the number
of usable partitions.  It's the number of their indexes that matters most.
(The pgbench test is if anything probably understating the problem,
because it has only a single index on each table.)  In any case, even a
factor of 2 or so reduction in direct conflicts should have a useful
impact on the number of semop waits, because it's a nonlinear effect...

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing relation locking overhead

2005-12-08 Thread Csaba Nagy
On Thu, 2005-12-08 at 16:05, Tom Lane wrote:
[SNIP]
 There isn't any way for it to do that (ie, be sure everyone else has
 adjusted to the new state of affairs), short of acquiring some sort of
 short-term exclusive lock on the table, which is a really bad idea.
 The pending lock would block other incoming requests on the table until
 all the current users exited their transactions.
 

But it is an acceptable compromise to lock the table until all current
transactions are over... the alternative for reindexing a big table is
usually to schedule a down-time, which is even worse...

REINDEX is usually used to fix a big table's big index bloat, and that
won't fly without a downtime, or, with this short-term full table lock,
without a low-traffic time-slot. 

For my usage patterns I would vote for the table lock if it is just a
means of blocking new transactions until the running ones finish. I'll
just make sure there are no long-running ones when I issue the REINDEX...

Cheers,
Csaba.





---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread Darcy Buskermolen
On Wednesday 07 December 2005 20:24, Tom Lane wrote:
 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Anyone remember this patch?
  http://gorda.di.uminho.pt/community/pgsqlhooks/
  The discussion seems to be pretty minimal:
  http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php
  Does anyone see a need to investigate it further?

 I had hoped to see some comments from the Slony people about it.
 I'd feel better about the validity of a set of hooks if more than
 one project agreed that it was useful/appropriate ...

I missed seeing it altogether the first time through; I'll see what I can do 
about taking an in-depth look at it over the next few days and providing some 
feedback.


   regards, tom lane

 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

-- 
Darcy Buskermolen
Wavefire Technologies Corp.

http://www.wavefire.com
ph: 250.717.0200
fx: 250.763.1759

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] About my new work at Command Prompt Inc.

2005-12-08 Thread Jan Wieck

On 12/7/2005 9:37 AM, Devrim GUNDUZ wrote:


Hi,

I'd like to inform the people who do not read Planet PostgreSQL:

Command Prompt Inc. has just hired me for the community work I have been
doing so far (like the PostgreSQL RPM stuff and other PostgreSQL-related
RPMs, such as Slony-I, pgpool, PostGIS, etc.) and website things. That
means I'll spend more time on these. 


Congratulations Devrim,

and a big thanks for the work you're doing.


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Foreign key trigger timing bug?

2005-12-08 Thread Jan Wieck

On 12/7/2005 4:50 PM, Stephan Szabo wrote:


On Wed, 7 Dec 2005, Bruce Momjian wrote:


I had an open 8.1 item that was:

o  fix foreign trigger timing issue

Would someone supply text for a TODO entry on this, as I don't think we
fixed it in 8.1.


I'd split this into two separate items now.

 Fix before delete triggers on cascaded deletes to run after the cascaded
delete is done.  This is odd, but seems to be what the spec requires.


Ugh, that sounds ugly. One problem I see is: what do we do if the BEFORE 
trigger then returns NULL (to skip the delete)? The cascaded operations 
are already done. Do we have to execute the cascaded deletes in a 
subtransaction, or do we disallow the skip in this case?



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] forced to restart postgresql service yesterday

2005-12-08 Thread Merlin Moncure
 Subject: Re: [HACKERS] forced to restart postgresql service yesterday
 
 Merlin Moncure [EMAIL PROTECTED] writes:
  The login was hanging in a simple plpgsql login script which basically
  did an insert/update on a small table.  It would hang when called from
  within psql, and once hung the query would not respond to cancel
  requests (such as they are implemented on win32).  Investigating
  further, trying to select from this small table at all would also hang.
  Unfortunately, this table tells me which pids are safe to kill and which
  are not,
 
 Did you look at pg_locks or pg_stat_activity?
 
 There is pretty much nothing we can do with this report given the lack
 of detail.
 
Understood; I was in a big hurry to get the server back up.

pg_stat_activity worked OK... there were a lot of hung processes, and it's
possible pg was over the connection limit, although pretty much everything
logs in as superuser.  There was also a ton of stuff in pg_locks, but it
was hard to determine anything useful because my app makes heavy use
of userlocks and I don't get the benefit of the revamped 8.1 pg_locks
view.  

In any case, while waiting on a lock, win32 pg will respond to a query
cancel, but the server wouldn't while trying to do anything with this
particular table.  It was like a process-sucking black hole.  Also, CPU
load was 0, as was disk.

Merlin

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PATCHES] Inherited Constraints

2005-12-08 Thread Hannu Krosing
On Thu, 2005-12-08 at 11:10, Simon Riggs wrote:
 On Wed, 2005-12-07 at 21:24 +0000, Simon Riggs wrote:
  The following patch implements a record of whether a constraint is
  inherited or not, and prevents dropping of inherited constraints.
 
 Patch posted to -patches list.
 
  What it doesn't do:
  It doesn't yet prevent dropping the parent constraint, which is wrong,
  clearly, but what to do about it?
  1. make dropping a constraint drop all constraints dependent upon it
  (without any explicit cascade)
  2. add a new clause: ALTER TABLE ... DROP CONSTRAINT ... CASCADE 
  
  I prefer (1), since it is SQL Standard compliant, easier to remember and
  automatic de-inheritance is the natural opposite of the automatic
  inheritance process.
 
 Comments, please -hackers?

It would be logical to do the same as DROP TABLE does, i.e. (2).

hannu=# create table parent(i int);
CREATE TABLE
hannu=# create table child() inherits(parent);
CREATE TABLE
hannu=# drop table parent;
NOTICE:  table "child" depends on table "parent"
ERROR:  cannot drop table "parent" because other objects depend on it
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
hannu=#

Maybe there should be another option in addition to CASCADE, say
DISINHERIT, which leaves all child constraints as heads of new
inheritance hierarchies. DROP CASCADE + ADD BACK ALL CHILD CONSTRAINTS
may be prohibitively expensive for biggish tables.

Another nice (but no doubt more complex) thing would be the ability to
add multiple constraints at once, needing only one seqscan to check for
compliance with the added constraints, and/or making constraint checks
smarter, so that for example ADD CONSTRAINT CHECK (i > 0) could make use
of an index on i instead of doing a seqscan. Or if there is a constraint
CHECK (i > 0), then adding another like CHECK (i > -1) would not need to
check the actual data either.
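
For instance (sketch; names invented):

    ALTER TABLE t ADD CONSTRAINT t_i_pos CHECK (i > 0);    -- scans the table
    -- implied by the constraint above, so ideally no scan would be needed:
    ALTER TABLE t ADD CONSTRAINT t_i_gt_m1 CHECK (i > -1);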

 Which implementation should I pick (or another)?
 
  Further patch will utilise this new knowledge to reduce the number of
  tests made during constraint_exclusion.

So will hierarchical inheritance be the thing to do to take advantage of
it then?

year
 +- month1
 |   +- day1
 |   +- day2
 |   ...
 |   \- day31
 +- month2

etc. 

btw, will your DROP patch support multiple inheritance ?

---
Hannu



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-08 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 That's a fair point, and reinforces my instinct that having a large
 number of partitions would be a losing game.  But you are mistaken to
 think that the number of hot-spot tables is the only limit on the number
 of usable partitions.  It's the number of their indexes that matters most.

Hm, so hypothetically an insert or update on a table with 4 indexes, whose
locks have been split across 4 partitions, would need to touch each partition?

Would that defeat the benefits of the partitioning? Or enhance it?

Would it be better to ensure that the indexes of a single table ended up in
the same partition? Or to ensure they're spread across partitions?

-- 
greg


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] generalizing the planner knobs

2005-12-08 Thread Jaime Casanova

 BTW, there's another end to the 'enable_seqscan=false' problem... it
 sometimes doesn't work! Last I looked, enable_seqscan=false only added a
 fixed overhead cost to a seqscan (100 IIRC). The problem is, some
 queries will produce estimates for other methods that are more
 expensive than a seqscan even with the added burden. If, instead of
 adding a fixed amount, enable_seqscan=false multiplied the cost by some
 amount, then this would probably be impossible.

 (And before someone asks, no, I don't remember which query was actually
 faster...)
 --
 Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
 Pervasive Software  http://pervasive.comwork: 512-231-6117
 vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461


I have often considered that this is an indication that a seq scan is
actually the better plan... although, I have to admit, it is a little
confusing that enable_seqscan = false actually lets you use a seqscan
if the other plans are bad enough.
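
For what it's worth, the mechanism is additive, roughly like this
(simplified from cost_seqscan() in costsize.c; disable_cost is a large
constant, 1.0e8 in recent sources):

    Cost    startup_cost = 0;

    if (!enable_seqscan)
        startup_cost += disable_cost;   /* added, not multiplied */

so whenever every alternative plan costs more than the seqscan's cost
plus disable_cost, the seqscan still wins.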

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-08 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Hm, so hypothetically an insert or update on a table with 4 indexes which have
 been split into 4 partitions would need to touch each partition?

That would be the best case, actually, that each heavily-used lock ends
up in a different partition.  As Simon points out, we have no way to
guarantee that.

 Would that defeat the benefits of the partitioning? Or enhance it?

It'd be what you'd want, because it would reduce the odds that two
processes doing this concurrently would need to touch the same partition
at the same time.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PATCHES] Inherited Constraints

2005-12-08 Thread Rod Taylor
 Another nice (but no doubt more complex) thing would be the ability to
 add multiple constraints at once, needing only one seqscan to check for
 compliance with the added constraints, and/or making constraint checks
 smarter, so that for example ADD CONSTRAINT CHECK (i > 0) could make use
 of an index on i instead of doing a seqscan. Or if there is a constraint
 CHECK (i > 0), then adding another like CHECK (i > -1) would not need to
 check the actual data either.

Check out the comma in ALTER TABLE:

ALTER TABLE tab ADD COLUMN id serial NOT NULL UNIQUE,
    ADD CHECK (foo > 24),
    ADD CHECK (baz > 18),
    ADD COLUMN qty integer NOT NULL DEFAULT 32;

Table tab (and each of the tables that inherits from it) is scanned and
rewritten once.

I believe this was added for 8.0.

-- 


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] bgwriter leaks resources after errors

2005-12-08 Thread Tom Lane
I think that bug #2099
http://archives.postgresql.org/pgsql-bugs/2005-12/msg00075.php
is probably explained by the observation that the bgwriter fails to call
AtEOXact_Files() after an error; therefore, if it gets an error while
holding an open DIR pointer, the DIR will never get closed.  (There are
calls to AllocateDir in the checkpoint code path, eg SimpleLruTruncate.)

But while looking at this I realized we have worse problems: there's
no mechanism for releasing buffer pins after an error in the bgwriter.
So, for example, a write error on a dirty buffer will cause bgwriter
to permanently leak a buffer pin.  Even if the write problem is later
rectified, the buffer will remain pinned indefinitely.  Quite aside
from the consequences of resource-leakage, this could make it impossible
to complete subsequent VACUUMs on the affected relation.  That raises it
to a must-fix in my mind.  The bug exists in 8.0 and later.

The cleanest solution seems to be to make the bgwriter create and use a
ResourceOwner, and probably get rid of the special hack in resowner.c
that allows CurrentResourceOwner to be NULL.
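
In outline, the fix would be something like this sketch (using the
existing resowner API):

    /* at bgwriter startup: give it a real owner instead of NULL */
    CurrentResourceOwner = ResourceOwnerCreate(NULL, "bgwriter");

    /* in its error-recovery path: release leaked pins etc. */
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_BEFORE_LOCKS,
                         false, true);

plus an AtEOXact_Files() call to close any leaked DIR pointers.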

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly



Re: [HACKERS] generalizing the planner knobs

2005-12-08 Thread Tom Lane
Jaime Casanova [EMAIL PROTECTED] writes:
 BTW, there's another end to the 'enable_seqscan=false' problem... it
 sometimes doesn't work!

 I have often considered that this is an indication that seq scan is
 actually the better plan...

There are cases where it is the *only* plan, eg, you have no relevant
indexes.  I am not sure that applies to Jim's complaint though.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-08 Thread Simon Riggs
On Thu, 2005-12-08 at 10:26 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  The imbalance across partitions would be a major issue because of the
  difficulty of selecting a well-distributed partitioning key. If you use
  the LOCKTAG, then operations on the heaviest hit tables would go to the
  same partitions continually for LockRelation requests. The frequency of
  access per table usually drops off dramatically in rank order: look at
  TPC-B (pgbench) and TPC-C; my own experience would be that you seldom
  have even as many as 16 heavily hit tables. My guess would be that doing
  all of that would do little more than reduce contention to ~50%, and
  that this would show quickly diminishing returns for N > 4. Also, the
  more sharply defined your application profile, the worse this effect
  will be.
 
 That's a fair point, and reinforces my instinct that having a large
 number of partitions would be a losing game.  But you are mistaken to
 think that the number of hot-spot tables is the only limit on the number
 of usable partitions.  It's the number of their indexes that matters most.
 (The pgbench test is if anything probably understating the problem,
 because it has only a single index on each table.)  

True. So either 16 partitions, or maybe 8, sounds about right then.

 In any case, even a
 factor of 2 or so reduction in direct conflicts should have a useful
 impact on the number of semop waits, because it's a nonlinear effect...

Nonlinear effects work both ways. Factor of 2 is great, but not enough
to prevent this discussion becoming an annual event ;-)
(Thinks: Oh, it already is...)

Best Regards, Simon Riggs




---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing relation locking overhead

2005-12-08 Thread Simon Riggs
On Thu, 2005-12-08 at 16:23 +0100, Csaba Nagy wrote:
 On Thu, 2005-12-08 at 16:05, Tom Lane wrote:
 [SNIP]
  There isn't any way for it to do that (ie, be sure everyone else has
  adjusted to the new state of affairs), short of acquiring some sort of
  short-term exclusive lock on the table, which is a really bad idea.
  The pending lock would block other incoming requests on the table until
  all the current users exited their transactions.

It might be possible to do this with some new form of utility lock that
can be unlocked before end of transaction, then re-locked again later.
That would imply the use of optimistic locking, as already discussed.
Only DDL and VACUUMs *need* to be locked out during the REINDEX.

I'm hand-waving here, so I'll stop. But we do know there *is* a way,
because this is already implemented elsewhere, somehow.

 But it is an acceptable compromise to lock the table until all current
 transactions are over... the alternative for reindexing a big table is
 usually to schedule a down-time, which is even worse...
 
 REINDEX is usually used to fix a big tables big index bloat, and that
 won't fly without a downtime, or, with this short-term full table lock
 in a low-traffic time-slot. 
 
 For my usage patterns I would vote with the table lock if it is just a
 means of blocking new transactions until the running ones finish. I'll
 just make sure there are none long running when I issue the REINDEX...

Certainly that is my view. You only schedule these things when you have
to, picking as light a traffic period as you can. Most people would accept
some pretty hefty restrictions in order to get it to work.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Replication on the backend

2005-12-08 Thread Gustavo Tonini
Are you sure there is no way to implement a generic approach in the
backend? What does the specification say? Does Oracle 10g have a core
implementation of replication (clustering)?

Gustavo.
2005/12/7, Andrew Sullivan [EMAIL PROTECTED]:
 On Tue, Dec 06, 2005 at 12:35:43AM -0500, Jan Wieck wrote:
  We do not plan to implement replication inside the backend. Replication
  needs are so diverse that pluggable replication support makes a lot more
  sense. To me it even makes more sense than keeping transaction support
  outside of the database itself and add it via pluggable storage add-on.

 And, as I say every single time this comes up, Oracle's and IBM's and
 MS's and everybody else's replication systems are _also_ add ons.  If
 you don't believe me, look at the license costs.  You can get a
 system without it enabled, which means (by definition) it's a modular
 extension.

 A
 --
 Andrew Sullivan | [EMAIL PROTECTED]
 In the future this spectacle of the middle classes shocking the avant-
 garde will probably become the textbook definition of Postmodernism.
 --Brad Holland


[HACKERS] Improving free space usage (was: Reducing relation locking overhead)

2005-12-08 Thread Jim C. Nasby
On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote:
 On Thu, 2005-12-08 at 01:08, Jim C. Nasby wrote:
  On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
   On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
 Tom Lane [EMAIL PROTECTED] writes:
  What's worse, once you have excluded writes you have to rescan the entire
  table to be sure you haven't missed anything. So in the scenarios where
  this whole thing is actually interesting, ie enormous tables, you're
  still talking about a fairly long interval with writes locked out. Maybe
  not as long as a complete REINDEX, but long.
 
 I was thinking you would set a flag to disable use of the FSM for
 inserts/updates while the reindex was running. So you would know where to
 find the new tuples, at the end of the table after the last tuple you read.

What about keeping a separate list of new tuples? Obviously we'd only do
this when an index was being built on a table. 
   
   The problem with a separate list is that it can be huge. For example, on a
   table with 200 inserts/updates per second, an index build lasting 6 hours
   would accumulate a total of 6*3600*200 = 4,320,000 new tuples.
  
  Sure, but it's unlikely that such a table would be very wide, so 4.3M
  tuples would probably only amount to a few hundred MB of data. It's also
  possible that this list could be vacuumed by whatever the regular vacuum
  process is for the table.
 
 I think that keeping such a list as part of the table at a well-defined
 location (like pages from N to M) is the best strategy, as it will
 automatically make all new tuples available to parallel processes and
 avoids both duplicate storage and the need for changing insert/update
 code.

There's one thing I hate about that idea though: good luck trying to
move those tuples somewhere else after the index build is done and you
now want to shrink the table back down to a more normal size. If we had
a better way to do that it would be much more palatable, but right now
on a heavily updated table this would result in a lot of bloat.

Along those lines, I've wondered if it makes sense to add more
flexibility in how free space is reclaimed in a table. One obvious
possibility is to have a strategy where new tuples will always look to
the FSM for space (instead of going into the current page if possible),
and the FSM will always hand out the earliest page in the table it has.
This mode would have the effect of moving tuples towards the front of
the table, allowing for space reclamation. A variation might be that
this mode will not affect tuples that are generated as part of an UPDATE
and are in the first x% of the table, since it doesn't make sense to
move a tuple from page 2 to page 1 in a 1000 page table.

Another possibility is to always go to the FSM and to have the FSM hand
back the page that is closest to the new tuple according to a certain
index. This would allow for ALTER TABLE CLUSTER to be much more
self-maintaining. The downside is that you'd have to do a lookup on that
index, but presumably if the index is important enough to cluster on
then it should be well-cached. There's probably some other tweaks that
could be done as well to make this more performant.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Vertical Partitioning with TOAST

2005-12-08 Thread Jan Wieck

On 12/8/2005 1:42 PM, Jim C. Nasby wrote:


On Thu, Dec 08, 2005 at 10:03:43AM -0500, Bruce Momjian wrote:

Jim C. Nasby wrote:
 On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote:
  Jim C. Nasby [EMAIL PROTECTED] writes:
   This seems like a useful feature to add, allowing for easy built-in
   vertical partitioning. Are there issues with the patch as-is?
  
  Other than the ones mentioned by the poster?
  
  It seemed to me more like a not-too-successful experiment than something

  ready for application.  If you take the viewpoint that this is just
  another TOAST storage strategy, I think it's pretty useless.  A large
  field value is going to get toasted anyway with the regular strategy,
  and if your column happens to contain some values that are not large,
  forcing them out-of-line anyway is simply silly.  (You could make a case
  for making the threshold size user-controllable, but I don't see the
  case for setting the threshold to zero, which is what this amounts to.)
 
 Valid point. I do think there's a lot of benefit to being able to set

 the limit much lower than what it currently defaults to today. We have a
 client that has a queue-type table that is updated very frequently. One
 of the fields is text, that is not updated as frequently. Keeping this
 table vacuumed well enough has proven to be problematic, because any
 delay to vacuuming quickly results in a very large amount of bloat.
 Moving that text field into a separate table would most likely be a win.
 
 Presumably this would need to be settable on at least a per-table basis.
 
 Would adding such a variable be a good beginner TODO, or is it too

 invasive?

Well, we have now:

   ALTER TABLE ALTER [ COLUMN ] column 
		SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }


What else is needed?


As Tom suggested, I think it would be best to be able to change the size
at which a field gets stored externally. I think it also makes sense to
have this reverse the normal order of compress first, then if it still
doesn't fit, store it externally. I foresee this typically being useful
when you have fields that are between ~100 and 1000 bytes in size, and
I'm doubtful that compression would do much good there. But I wouldn't
rule out this being useful on fields that can also sometimes contain
much larger amounts of data, so I don't think it makes sense to disable
compression completely. So, I think this leaves two new options:


It's not the size of a field that triggers toasting. It is the size of 
the entire tuple. As long as that is > BLCKSZ/4, the toaster will pick 
the currently largest inline value and do "something" with it. 
"Something" is either compressing it or (if not allowed or already done) 
moving it external.
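
In code terms, the decision loop in tuptoaster.c is shaped roughly like
this (heavily simplified sketch, comments only in the body):

    /* simplified sketch of toast_insert_or_update() */
    while (heap_compute_data_size(tupleDesc, values, nulls) > TOAST_TUPLE_THRESHOLD)
    {
        /* pick the largest remaining inline varlena attribute;   */
        /* try to compress it first; if compression is disallowed */
        /* or already done, move the value out of line instead    */
    }

so a per-column size threshold would be a new trigger condition, not a
tweak to the existing one.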



Jan



SET STORAGE EXTERNAL [THRESHOLD x]
If a field is over x in size, it's stored externally.

SET STORAGE EXTENDED [THRESHOLD x]
If a field is over x in size, it's stored externally. If it's over
BLCKSZ/4 it will also be compressed (I think that's how things work
now).

Actually, that's rather ugly. I think it would be better to just break
external storage and compression out into their own attributes:

SET STORAGE [ ALLOW EXTERNAL [THRESHOLD x] ] [ ALLOW COMPRESSION [THRESHOLD x] ]

ALLOW EXTERNAL: if a field is larger than x bytes (BLCKSZ/4 by default)
then it will be stored externally. May be specified along with ALLOW
COMPRESSION.

ALLOW COMPRESSION: if a field is larger than x bytes (BLCKSZ/4 by default)
then it will be COMPRESSED. May be specified along with ALLOW EXTERNAL.
--
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Replication on the backend

2005-12-08 Thread Andrew Dunstan


What is the point of these questions? If you have a concrete, practical 
proposal to make, please do so. Otherwise, you have already got the 
answer from the people who are actually working on replication and 
understand it far beyond abstract considerations. If you think there is 
a good reason to do replication directly in the backend code rather than 
as an add-on, possibly using an agreed API, then you need to provide hard 
evidence, not mere assertion or conjecture.


cheers

andrew

Gustavo Tonini wrote:

Are you sure there is no way to implement a generic approach in the 
backend? What does the specification say? Does Oracle 10g have a core 
implementation of replication (clustering)?


Gustavo.


2005/12/7, Andrew Sullivan [EMAIL PROTECTED]:


On Tue, Dec 06, 2005 at 12:35:43AM -0500, Jan Wieck wrote:
 We do not plan to implement replication inside the backend. Replication
 needs are so diverse that pluggable replication support makes a lot more
 sense. To me it even makes more sense than keeping transaction support
 outside of the database itself and add it via pluggable storage add-on.

And, as I say every single time this comes up, Oracle's and IBM's and
MS's and everybody else's replication systems are _also_ add ons.  If
you don't believe me, look at the license costs.  You can get a
system without it enabled, which means (by definition) it's a modular
extension.

A

--
Andrew Sullivan  | [EMAIL PROTECTED]
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.
--Brad Holland

---(end of
broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread Jim C. Nasby
On Thu, Dec 08, 2005 at 08:33:59AM -0800, Darcy Buskermolen wrote:
 On Wednesday 07 December 2005 20:24, Tom Lane wrote:
  Christopher Kings-Lynne [EMAIL PROTECTED] writes:
   Anyone remember this patch?
   http://gorda.di.uminho.pt/community/pgsqlhooks/
   The discussion seems to be pretty minimal:
   http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php
   Does anyone see a need to investigate it further?
 
  I had hoped to see some comments from the Slony people about it.
  I'd feel better about the validity of a set of hooks if more than
  one project agreed that it was useful/appropriate ...
 
 I missed seeing it all together the first time through, I'll see what I can 
 do 
 about taking a indepth look at it over the next few days and provide some 
 feedback.

While this code might be useful, wouldn't it be much more valuable to
provide hooks into xlog so that we could do non-trigger-based
replication? (As well as non-trigger-based materialized views...)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] bgwriter leaks resources after errors

2005-12-08 Thread Qingqing Zhou

Tom Lane [EMAIL PROTECTED] wrote

 The cleanest solution seems to be to make the bgwriter create and use a
 ResourceOwner, and probably get rid of the special hack in resowner.c
 that allows CurrentResourceOwner to be NULL.


Is it true that at every moment the bgwriter has just one buffer pinned? 
Then remembering that buffer would be enough. This could avoid using a 
ResourceOwner, but it may need some hacking of pin/unpin.

Regards,
Qingqing 



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] generalizing the planner knobs

2005-12-08 Thread Jim C. Nasby
On Thu, Dec 08, 2005 at 01:07:10PM -0500, Tom Lane wrote:
 Jaime Casanova [EMAIL PROTECTED] writes:
  BTW, there's another end to the 'enable_seqscan=false' problem... it
  sometimes doesn't work!
 
  I have often considered that this is an indication that seq scan is
  actually the better plan...
 
 There are cases where it is the *only* plan, eg, you have no relevant
 indexes.  I am not sure that applies to Jim's complaint though.

IIRC I ran into this when I was working on generating some numbers about
how well a high correlation improves the performance of an index scan
(since afaict the cost estimator for index scans is rather broken :( ). In
that case, I had defined an index on a ~120M row table on a column with
a very low correlation. It's pretty much a given that a seqscan and sort
would be faster than the index scan, but it would have still been good
to be able to verify that. Because of how enable_seqscan works, I
couldn't.

BTW,
http://archives.postgresql.org/pgsql-performance/2005-04/msg00669.php is
where I first mentioned this, including the cost function that I think
is broken.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Replication on the backend

2005-12-08 Thread Jan Wieck

On 12/8/2005 1:28 PM, Gustavo Tonini wrote:


Are you sure there is no way to implement a generic approach in the backend? What


You mean generic as in a replication system that can do asynchronous 
master-slave, asynchronous multimaster with conflict resolution based on 
timestamps, system priority or user-defined resolution stubs, can do 
synchronous predicate locking but also supports thousands of 
asynchronous, partial replicas (salesmen on the road), and last but not 
least can run as a synchronous cluster in a LAN segment? All the above 
can of course be mixed together ... like a central cluster of 8 load 
balanced, fault tolerant systems, with async multimaster replicas in all 
external branch servers, partial multimaster replicas for the road 
warriors and some slave leaf nodes for reporting.


If you can present a prototype for the above, I am sure that we will 
change our opinion and finally settle for one single, built-in 
replication system.



does the specification say? Does Oracle 10g have a core implementation of
replication (clustering)?


What specification?


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread Jan Wieck

On 12/7/2005 11:24 PM, Tom Lane wrote:


Christopher Kings-Lynne [EMAIL PROTECTED] writes:

Anyone remember this patch?
http://gorda.di.uminho.pt/community/pgsqlhooks/
The discussion seems to be pretty minimal:
http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php
Does anyone see a need to investigate it further?


I had hoped to see some comments from the Slony people about it.
I'd feel better about the validity of a set of hooks if more than
one project agreed that it was useful/appropriate ...


I am going to look into this in the next couple of days.


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread Jonah H. Harris
On 12/8/05, Jim C. Nasby [EMAIL PROTECTED] wrote:
 While this code might be useful, wouldn't it be much more valuable to
 provide hooks into xlog so that we could do non-trigger-based
 replication? (As well as non-trigger-based materialized views...)

If we're going to do hooks for replication, I think we should look at
including xlog. Trigger-based replication is an OLTP performance
killer if you're just looking to do async replication.




Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread Dann Corbit

A general purpose log miner is also useful in many other areas besides
replication.


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jonah H. Harris
Sent: Thursday, December 08, 2005 11:32 AM
To: Jim C. Nasby
Cc: Darcy Buskermolen; pgsql-hackers@postgresql.org; Tom Lane; Christopher Kings-Lynne
Subject: Re: [HACKERS] HOOKS for Synchronous Replication?

On 12/8/05, Jim C. Nasby [EMAIL PROTECTED] wrote:

 While this code might be useful, wouldn't it be much more valuable to
 provide hooks into xlog so that we could do non-trigger-based
 replication? (As well as non-trigger-based materialized views...)

If we're going to do hooks for replication, I think we should look at
including xlog. Trigger-based replication is an OLTP performance killer
if you're just looking to do async replication.


Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread Jan Wieck

On 12/8/2005 2:05 PM, Jim C. Nasby wrote:


On Thu, Dec 08, 2005 at 08:33:59AM -0800, Darcy Buskermolen wrote:

On Wednesday 07 December 2005 20:24, Tom Lane wrote:
 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Anyone remember this patch?
  http://gorda.di.uminho.pt/community/pgsqlhooks/
  The discussion seems to be pretty minimal:
  http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php
  Does anyone see a need to investigate it further?

 I had hoped to see some comments from the Slony people about it.
 I'd feel better about the validity of a set of hooks if more than
 one project agreed that it was useful/appropriate ...

I missed seeing it all together the first time through, I'll see what I can do 
about taking an in-depth look at it over the next few days and providing some 
feedback.


While this code might be useful, wouldn't it be much more valuable to
provide hooks into xlog so that we could do non-trigger-based
replication? (As well as non-trigger-based materialized views...)


I don't see why these would be mutually exclusive. A generic API needs 
to have them all.


Without having looked at the patch yet, what I expect from an API is 
that the backend will (after initialization and becoming a member of a 
database) check if this database is replicated. If so, it loads the specific 
shared object that implements the backend part of the replication system 
and then calls an init() function in it. This init() function then will 
add callbacks to all the hooks where this particular replication system 
wants to be called. So one replication system might want to be called on 
commit, just before writing the WAL record, while some other system doesn't 
care about that but wants to see the WAL record after it was written.
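
As a rough sketch of that shape of API (all names here are hypothetical,
not from any proposed patch), the plugin's init() might register
callbacks like this:

/* Hook points a backend could expose to a replication plugin. */
typedef enum ReplHookPoint
{
    REPL_HOOK_PRE_COMMIT,       /* before the commit WAL record is written */
    REPL_HOOK_POST_WAL_WRITE,   /* after a WAL record has been written */
    REPL_HOOK_NUM_POINTS
} ReplHookPoint;

typedef void (*ReplHookFunc) (void *arg);

/* One slot per hook point; a real design would allow callback chains. */
static ReplHookFunc repl_hooks[REPL_HOOK_NUM_POINTS];

static void
register_repl_hook(ReplHookPoint point, ReplHookFunc func)
{
    repl_hooks[point] = func;
}

static void
my_pre_commit_callback(void *arg)
{
    /* ship the pending change set to the replicas ... */
}

/* Called by the backend after loading the plugin's shared object. */
void
my_replication_init(void)
{
    /*
     * This system wants to act at commit, just before the WAL write;
     * another system would register REPL_HOOK_POST_WAL_WRITE instead.
     */
    register_repl_hook(REPL_HOOK_PRE_COMMIT, my_pre_commit_callback);
}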



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Reducing relation locking overhead

2005-12-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I'm hand-waving here, so I'll stop. But we do know there *is* a way,
 because this is already implemented elsewhere, somehow.

That's not really the point --- the question is whether the cure is
worse than the disease.  It's entirely possible that the tradeoffs
needed to support fully concurrent REINDEX would represent a higher
price than the feature is worth, or that it's simply too much work
to get there from here.

For instance, I would imagine that the way Oracle does this relies
on their use of rollback segments, which is something we're certainly
unlikely to emulate.

Given the discussion so far, it seems likely to me that completely
concurrent REINDEX is indeed out of reach, and that what we ought to
be thinking about is what sort of compromise design (ie, partially
concurrent REINDEX) is reasonable.

Something that might work is:

1. Take ShareUpdateExclusiveLock (this blocks VACUUM and DDL changes),
then run existing CREATE INDEX code.  The finished index may be missing
some tuples inserted during the run.

2. Commit transaction so that index becomes visible (we assume it's
marked so that the planner will know not to rely on it).  Continue to
hold ShareUpdateExclusiveLock so VACUUM doesn't run.

3. Attempt to acquire ShareLock (possibly a ConditionalLockAcquire/sleep
loop instead of just flat-out LockAcquire).  Once we have this we know
there are no active writer transactions.  Release the lock immediately.

4. Make a new scan of the table and insert any rows not already present
in the index.  (This need not process anything inserted later than step
3, because any new transactions will know to insert rows in the index
anyway.)

5. Mark index good and commit, releasing all locks.
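
As a minimal sketch of the step-3 loop (helper names are illustrative,
assuming lmgr-style functions along the lines of ConditionalLockRelation(),
UnlockRelation() and pg_usleep(); this is not committed code):

static void
wait_for_writers_to_drain(Relation heapRel)
{
    for (;;)
    {
        /* Try to get ShareLock without blocking. */
        if (ConditionalLockRelation(heapRel, ShareLock))
        {
            /*
             * Got it: any writer that predates the commit in step 2 is
             * gone.  We only needed to observe that the lock was
             * obtainable, so release it immediately.
             */
            UnlockRelation(heapRel, ShareLock);
            return;
        }

        /* Writers still active; sleep and retry rather than queue. */
        pg_usleep(100000L);     /* 100 ms */
    }
}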

I don't think that it's worth the effort and complexity to try to avoid
a full table scan in step 4.  At best you would save much less than 50%
of the total work, and the impact on normal operations is not free.

If what you want is a REINDEX rather than creating an independent new
index, then at step 5 you need to do a swap operation which'll require
obtaining exclusive lock on the index.  This creates another opportunity
for deadlock failures, but again a conditional loop might help.

There are still some issues about the behavior when the index is UNIQUE.
Ideally you would like the CREATE INDEX to fail on a duplicate, not any
concurrent writer transaction, but I don't think it's possible to
guarantee that.

Also, I'm not sure how we get rid of the broken index if there is a
failure later than step 2.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-08 Thread Jonah H. Harris

True, I think we need hooks for both methods.

On 12/8/05, Jan Wieck [EMAIL PROTECTED] wrote:

 I don't see why these would be mutually exclusive. A generic API needs
 to have them all.


Re: [HACKERS] generic builtin functions

2005-12-08 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Still thinking a bit more about this ... how about we have output 
 functions take an optional second argument, which is the type oid?

No.  We just undid that for good and sufficient security reasons.
If an output function depends on anything more than the contents of
the object it's handed, it's vulnerable to being lied to.
http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php

I realize that being told the wrong type ID might be less than
catastrophic for enum_out, but I'm going to fiercely resist putting back
any extra arguments for output functions.  The temptation to use them
unsafely is just too strong --- we've learned that the hard way more
than once already, and I don't want to repeat the same mistake yet again.

 Input funcs get a typioparam and typmod argument in addition to the
 data value,

Entirely different situation because the only thing an input function
assumes is that it's been handed some text ... which it can validate.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] generic builtin functions

2005-12-08 Thread Andrew Dunstan



Tom Lane wrote:


Andrew Dunstan [EMAIL PROTECTED] writes:
 

Still thinking a bit more about this ... how about we have output 
functions take an optional second argument, which is the type oid?

No.  We just undid that for good and sufficient security reasons.
If an output function depends on anything more than the contents of
the object it's handed, it's vulnerable to being lied to.
http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php

I realize that being told the wrong type ID might be less than
catastrophic for enum_out, but I'm going to fiercely resist putting back
any extra arguments for output functions.  The temptation to use them
unsafely is just too strong --- we've learned that the hard way more
than once already, and I don't want to repeat the same mistake yet again.

Input funcs get a typioparam and typmod argument in addition to the
data value,

Entirely different situation because the only thing an input function
assumes is that it's been handed some text ... which it can validate.

OK, if you resist fiercely I'm sure it won't happen, so I'll put away 
the asbestos underpants ;-)


I see in that discussion you say:


Every rowtype Datum will carry its own concrete type.



Are we storing that on disk in every composite object? If not, where do 
we get it from before calling the output function?


cheers

andrew



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] 8.1.1 stamped

2005-12-08 Thread Bruce Momjian
I have stamped the 8.1.1 CVS tree and it is ready for testing/packaging.
The release notes are here:


http://candle.pha.pa.us/main/writings/pgsql/sgml/release.html#RELEASE-8-1-1

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Upcoming PG re-releases

2005-12-08 Thread Gavin Sherry
On Tue, 6 Dec 2005, Bruce Momjian wrote:


 Exactly what does vim do that iconv does not?  Fuzzy encoding sounds
 scary to me.


Right. It actually makes assumptions about the source encoding. People who
care about their data need, unfortunately, to spend a bit of time on this
problem. I've been discussing the same issue on the slony1 mailing list,
because the issue can affect people's ability to upgrade using slony1.

http://gborg.postgresql.org/pipermail/slony1-general/2005-December/003430.html

It would be good if we had the script I suggest in the email:

A script which identifies non-utf-8 characters and provides some
context, line numbers, etc, will greatly speed up the process of
remedying the situation.
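
As a minimal sketch of such a checker in C (illustrative only; it
validates lead/continuation structure and ignores overlong-form corner
cases):

/* utf8check.c: report line and byte position of invalid UTF-8 on stdin. */
#include <stdio.h>

/* Length the lead byte claims for its sequence, or 0 if it is invalid. */
static int
lead_len(unsigned char c)
{
    if (c < 0x80)               return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;       /* bare continuation byte or illegal lead byte */
}

int
main(void)
{
    int     c, line = 1, col = 0;

    while ((c = getchar()) != EOF)
    {
        int     len, i;

        col++;
        if (c == '\n')
        {
            line++;
            col = 0;
            continue;
        }

        len = lead_len((unsigned char) c);
        if (len == 0)
        {
            printf("line %d, byte %d: invalid byte 0x%02x\n", line, col, c);
            continue;
        }
        for (i = 1; i < len; i++)
        {
            c = getchar();
            if (c == EOF || (c & 0xC0) != 0x80)
            {
                printf("line %d, byte %d: truncated sequence\n", line, col);
                break;
            }
            col++;
        }
    }
    return 0;
}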

Thoughts?

Gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Upcoming PG re-releases

2005-12-08 Thread Bruce Momjian
Gavin Sherry wrote:
 On Tue, 6 Dec 2005, Bruce Momjian wrote:
 
 
  Exactly what does vim do that iconv does not?  Fuzzy encoding sounds
  scary to me.
 
 
 Right. It actually makes assumptions about the source encoding. People who
 care about their data need, unfortunately, to spend a bit of time on this
 problem. I've been discussing the same issue on the slony1 mailing list,
 because the issue can affect people's ability to upgrade using slony1.
 
 http://gborg.postgresql.org/pipermail/slony1-general/2005-December/003430.html
 
 It would be good if we had the script I suggest in the email:
 
   A script which identifies non-utf-8 characters and provides some
   context, line numbers, etc, will greatly speed up the process of
   remedying the situation.

I think the best we can do is the iconv -c with the diff idea, which
is already in the release notes.  I suppose we could merge the iconv and
diff into a single command, but I don't see a portable way to send the
iconv output to stdout, /dev/stdin not being portable.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PATCHES] Inherited Constraints

2005-12-08 Thread Simon Riggs
On Thu, 2005-12-08 at 11:10 +, Simon Riggs wrote:
 On Wed, 2005-12-07 at 21:24 +, Simon Riggs wrote:
  Following patch implements record of whether a constraint is inherited
  or not, and prevents dropping of inherited constraints.
 
 Patch posted to -patches list.
 
  What it doesn't do:
  It doesn't yet prevent dropping the parent constraint, which is wrong,
  clearly, but what to do about it?
  1. make dropping a constraint drop all constraints dependent upon it
  (without any explicit cascade)
  2. add a new clause to ALTER TABLE  DROP CONSTRAINT  CASCADE 
  
  I prefer (1), since it is SQL Standard compliant, easier to remember and
  automatic de-inheritance is the natural opposite of the automatic
  inheritance process.
 
 Comments, please -hackers?

Late night hacking again

ALTER TABLE  DROP CONSTRAINT  CASCADE

does of course already exist, so the following should cause dependency
violation ERRORs:
- omitting the CASCADE when attempting to delete parent constraint
- attempting to drop the child constraint

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Upcoming PG re-releases

2005-12-08 Thread Gregory Maxwell
On 12/8/05, Bruce Momjian pgman@candle.pha.pa.us wrote:
A script which identifies non-utf-8 characters and provides some
context, line numbers, etc, will greatly speed up the process of
remedying the situation.

 I think the best we can do is the iconv -c with the diff idea, which
 is already in the release notes.  I suppose we could merge the iconv and
 diff into a single command, but I don't see a portable way to output the
 iconv output to stdout., /dev/stdin not being portable.

No, what is needed for people who care about fixing their data is a
loadable strip_invalid_utf8() that works in older versions... then just
select * from bar where foo != strip_invalid_utf8(foo);  The function
would be useful in general: for example, if you have an application
which doesn't already have much UTF-8 logic, you want to use a text
field, and stripping is the behaviour you want, as in lots of
simple web applications.
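
As a userland sketch of the byte-level stripping such a function might do
(illustrative only; a real loadable function would need the usual
PostgreSQL C-function scaffolding, and this check ignores overlong forms):

#include <stddef.h>

/*
 * Copy src to dst, dropping bytes that do not form valid UTF-8
 * sequences.  dst must have room for len bytes; returns output length.
 */
static size_t
strip_invalid_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t  in = 0, out = 0;

    while (in < len)
    {
        unsigned char c = src[in];
        int     seq, i;

        if (c < 0x80)                       seq = 1;
        else if (c >= 0xC2 && c <= 0xDF)    seq = 2;
        else if (c >= 0xE0 && c <= 0xEF)    seq = 3;
        else if (c >= 0xF0 && c <= 0xF4)    seq = 4;
        else
        {
            in++;                           /* drop stray byte */
            continue;
        }

        /* Verify all continuation bytes before copying anything. */
        if (in + seq > len)
        {
            in++;
            continue;
        }
        for (i = 1; i < seq; i++)
            if ((src[in + i] & 0xC0) != 0x80)
                break;
        if (i < seq)
        {
            in++;                           /* drop the bad lead byte */
            continue;
        }

        for (i = 0; i < seq; i++)
            dst[out++] = src[in++];
    }
    return out;
}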

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] Another encoding issue

2005-12-08 Thread Gavin Sherry
Hi all,

Here's another interesting encoding issue. I cannot recall having seen it
on the lists.

---
[EMAIL PROTECTED] build7]$ bin/createdb -E LATIN1 test
CREATE DATABASE
[EMAIL PROTECTED] build7]$ cat break.sh
dat=`echo -en "\245\241"`

echo "create table test (d text);"
echo "insert into test values('$dat');"
[EMAIL PROTECTED] build7]$ sh break.sh | bin/psql test
CREATE TABLE
INSERT 0 1
[EMAIL PROTECTED] build7]$ bin/createdb -T test test2
CREATE DATABASE
[EMAIL PROTECTED] build7]$ bin/createdb -T test -E UTF-8 test2
CREATE DATABASE
[EMAIL PROTECTED] build7]$ bin/pg_dump -C test2 > test2.dmp
[EMAIL PROTECTED] build7]$ bin/dropdb test2
DROP DATABASE
[EMAIL PROTECTED] build7]$ bin/psql template1 -f test2.dmp
SET
SET
SET
CREATE DATABASE
ALTER DATABASE
You are now connected to database test2.
[...]
CREATE TABLE
ALTER TABLE
psql:test2.dmp:345: ERROR:  invalid UTF-8 byte sequence detected near byte
0xa5
CONTEXT:  COPY test, line 1, column d:   
[...]
---

Until createdb() is a lot more sophisticated, we cannot translate
characters between encodings. I don't think this is a huge issue though,
as most people are only going to be creating empty databases anyway.
Still, it probably requires documentation.

Thoughts?

Thanks,

Gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Foreign key trigger timing bug?

2005-12-08 Thread Stephan Szabo
On Thu, 8 Dec 2005, Jan Wieck wrote:

 On 12/7/2005 4:50 PM, Stephan Szabo wrote:

  On Wed, 7 Dec 2005, Bruce Momjian wrote:
 
  I had an open 8.1 item that was:
 
 o  fix foreign trigger timing issue
 
  Would someone supply text for a TODO entry on this, as I don't think we
  fixed it in 8.1.
 
  I'd split this into two separate items now.
 
   Fix before delete triggers on cascaded deletes to run after the cascaded
  delete is done.  This is odd, but seems to be what the spec requires.

 Ugh, that sounds ugly.

Yeah.  I really don't understand it, but it appears to me to be explicitly
different in the spec for on delete cascade even compared to the rest of
the referential actions.

 One problem I see is, what do we do if the BEFORE
 trigger then returns NULL (to skip the delete). The cascaded operations
 are already done. Do we have to execute the cascaded deletes in a
 subtransaction or do we disallow the skip in this case?

I think we'd have to disallow skipping.  Especially since skipping would
probably end up with a violated constraint.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] generic builtin functions

2005-12-08 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 I see in that discussion you say:

 Every rowtype Datum will carry its own concrete type.

 Are we storing that on disk in every composite object?

Yup.  For a rowtype datum it's not a serious overhead.  I realize
that the same is not true of enums :-(

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] Inherited Constraints

2005-12-08 Thread Trent Shipley
On Thursday 2005-12-08 15:47, Simon Riggs wrote:

 does of course already exist, so the following should cause dependency
 violation ERRORs:
 - omitting the CASCADE when attempting to delete parent constraint
 - attempting to drop the child constraint

Why should dropping the child constraint fail?  

Child tables are supposed to be able to over-ride parent constraints.  
Dropping a parent's constraint sounds like just a way to over-ride a 
constraint with no constraint at all.  (Making the column unconstrained.)

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Another encoding issue

2005-12-08 Thread Christopher Kings-Lynne
If we're bringing up odd encoding issues, why not talk about the mystery 
encoding of the shared catalogs? :)


Basically, which database you're logged into when you alter a catalog 
determines what encoding the new object's name appears in within the 
shared catalog.


This for one makes it impossible for us in phpPgAdmin to display a list 
of databases, where some database names are in EUC and some are in UTF-8 
and some are in LATIN5...


I bring it up as I notice that in MySQL 5 at least, all system object 
names (in our case that'd be all strings in the shared catalogs) are 
stored in UTF-8, always.


Chris


Gavin Sherry wrote:

Hi all,

Here's another interesting encoding issue. I cannot recall having seen it
on the lists.

---
[EMAIL PROTECTED] build7]$ bin/createdb -E LATIN1 test
CREATE DATABASE
[EMAIL PROTECTED] build7]$ cat break.sh
dat=`echo -en "\245\241"`

echo "create table test (d text);"
echo "insert into test values('$dat');"
[EMAIL PROTECTED] build7]$ sh break.sh | bin/psql test
CREATE TABLE
INSERT 0 1
[EMAIL PROTECTED] build7]$ bin/createdb -T test test2
CREATE DATABASE
[EMAIL PROTECTED] build7]$ bin/createdb -T test -E UTF-8 test2
CREATE DATABASE
[EMAIL PROTECTED] build7]$ bin/pg_dump -C test2 > test2.dmp
[EMAIL PROTECTED] build7]$ bin/dropdb test2
DROP DATABASE
[EMAIL PROTECTED] build7]$ bin/psql template1 -f test2.dmp
SET
SET
SET
CREATE DATABASE
ALTER DATABASE
You are now connected to database test2.
[...]
CREATE TABLE
ALTER TABLE
psql:test2.dmp:345: ERROR:  invalid UTF-8 byte sequence detected near byte
0xa5
CONTEXT:  COPY test, line 1, column d:   
[...]
---

Until createdb() is a lot more sophisticated, we cannot translate
characters between encodings. I don't think this is a huge issue though,
as most people are only going to be creating empty databases anyway.
Still, it probably requires documentation.

Thoughts?

Thanks,

Gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Foreign key trigger timing bug?

2005-12-08 Thread Tom Lane
Stephan Szabo [EMAIL PROTECTED] writes:
 Yeah.  I really don't understand it, but it appears to me to be explicitly
 different in the spec for on delete cascade even compared to the rest of
 the referential actions.

 One problem I see is, what do we do if the BEFORE
 trigger then returns NULL (to skip the delete). The cascaded operations
 are already done. Do we have to execute the cascaded deletes in a
 subtransaction or do we disallow the skip in this case?

 I think we'd have to disallow skipping.  Especially since skipping would
 probably end up with a violated constraint.

That seems to me to be a sufficient reason to not follow the spec in
this respect.  A BEFORE trigger should be run BEFORE anything happens,
full stop.  I can't think of any good reason why the spec's semantics
are better.  (It's not like our triggers are exactly spec-compatible
anyway.)

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Improving free space usage (was: Reducing relation locking overhead)

2005-12-08 Thread Jim C. Nasby
On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote:
  Along those lines, I've wondered if it makes sense to add more
  flexibility in how free space is reclaimed in a table. One obvious
  possibility is to have a strategy where new tuples will always look to
  the FSM for space (instead of going into the current page if possible),
  and the FSM will always hand out the earliest page in the table it has.
  This mode would have the effect of moving tuples towards the front of
  the table, allowing for space reclamation. A variation might be that
  this mode will not affect tuples that are generated as part of an UPDATE
  and are in the first x% of the table, since it doesn't make sense to
  move a tuple from page 2 to page 1 in a 1000 page table.
 
 This % could depend on some fill factor of the table, aiming not
 to move tuples that would end up in the final volume of a balanced
 table, which, in the case of a heavily updated table, would probably be 2 to 3
 times the volume of the densely populated table.
 
  Another possibility is to always go to the FSM and to have the FSM hand
  back the page that is closest to the new tuple according to a certain
  index. This would allow for ALTER TABLE CLUSTER to be much more
  self-maintaining. The downside is that you'd have to do a lookup on that
  index, but presumably if the index is important enough to cluster on
  then it should be well-cached. There's probably some other tweaks that
  could be done as well to make this more performant.
 
 Yes, I agree on all your points about better placement of new tuples;
 they would all be useful indeed.

Sounds like a TODO, barring objections...
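
As a toy illustration of the "earliest page first" policy discussed above
(hypothetical structures; the real FSM is a shared-memory structure, not a
flat array):

typedef struct FsmEntry
{
    unsigned long   blockno;    /* heap block number */
    unsigned long   freespace;  /* bytes believed free on that block */
} FsmEntry;

/*
 * Return the lowest-numbered block with enough room, or -1 if none.
 * Always preferring the earliest block migrates new tuples toward the
 * front of the table, so trailing pages can eventually be reclaimed.
 */
static long
fsm_earliest_page(const FsmEntry *entries, int n, unsigned long needed)
{
    long    best = -1;
    int     i;

    for (i = 0; i < n; i++)
    {
        if (entries[i].freespace >= needed &&
            (best < 0 || entries[i].blockno < (unsigned long) best))
            best = (long) entries[i].blockno;
    }
    return best;
}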
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] Warm-cache prefetching

2005-12-08 Thread Qingqing Zhou

I found an interesting paper on improving index speed by prefetching memory
data into the L1/L2 cache (there was discussion about prefetching disk
data into memory several days ago in the ice-breaker thread):
http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf

Also a related technique used to speed up memcpy:
http://people.redhat.com/arjanv/pIII.c

I wonder if we could use it to speed up in-memory scan operations for heap
or index. Tom's patch has made scans handle a page (vs. a row) at a
time, which is a basis for this optimization.

I wrote a program to try to simulate it, but I am not good at micro
optimization, and I just get a very weak but kind-of-stable improvement. I
wonder if any people here are interested in taking a look.

Regards,
Qingqing

--

Test results
--
Cache line size: 64
CPU: P4 2.4G
$#./prefetch 10 16
Sum: -951304192: with prefetch on - duration: 42.163 ms
Sum: -951304192: with prefetch off - duration: 42.838 ms
Sum: -951304192: with prefetch on - duration: 44.044 ms
Sum: -951304192: with prefetch off - duration: 42.792 ms
Sum: -951304192: with prefetch on - duration: 42.324 ms
Sum: -951304192: with prefetch off - duration: 42.803 ms
Sum: -951304192: with prefetch on - duration: 42.189 ms
Sum: -951304192: with prefetch off - duration: 42.801 ms
Sum: -951304192: with prefetch on - duration: 42.155 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.179 ms
Sum: -951304192: with prefetch off - duration: 42.798 ms
Sum: -951304192: with prefetch on - duration: 42.180 ms
Sum: -951304192: with prefetch off - duration: 42.804 ms
Sum: -951304192: with prefetch on - duration: 42.193 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.164 ms
Sum: -951304192: with prefetch off - duration: 42.810 ms
Sum: -951304192: with prefetch on - duration: 42.182 ms
Sum: -951304192: with prefetch off - duration: 42.826 ms

Test program


/*
 * prefetch.c
 *  PostgreSQL warm-cache sequential scan simulator with prefetch
 */

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <errno.h>
#include <sys/time.h>

typedef char bool;
#define true    ((bool) 1)
#define false   ((bool) 0)

#define BLCKSZ  8192
#define CACHESZ 64
#define NBLCKS  5000

int sum;

int
main(int argc, char *argv[])
{
    int     i, rounds;
    char   *blocks;
    int     cpu_cost;

    if (argc != 3)
    {
        fprintf(stderr, "usage: prefetch <rounds> <cpu_cost [1, 16]>\n");
        exit(-1);
    }

    rounds = atoi(argv[1]);
    cpu_cost = atoi(argv[2]);
    if (cpu_cost > 16)
        exit(-1);

    for (i = 0; i < 2*rounds; i++)
    {
        int     j, k;
        struct timeval start_t, stop_t;
        bool    enable = i % 2 ? false : true;
        char   *blck;

        blocks = (char *) malloc(BLCKSZ*NBLCKS);
        memset(blocks, 'a', BLCKSZ*NBLCKS);

        sum = 0;
        gettimeofday(&start_t, NULL);

        for (j = 0; j < NBLCKS; j++)
        {
            blck = blocks + j*BLCKSZ;
            for (k = 0; k < BLCKSZ; k += CACHESZ)
            {
                int    *s = (int *) (blck + k);
                int     u = cpu_cost;

                if (enable)
                {
                    /* prefetch ahead */
                    __asm__ __volatile__ (
                        "1: prefetchnta 128(%0)\n"
                        : : "r" (s) : "memory");
                }

                /* pretend to process current tuple */
                while (u--)
                    sum += (*(s+u)) * (*(s+u));
            }
        }
        gettimeofday(&stop_t, NULL);

        free(blocks);

        /* measure the time */
        if (stop_t.tv_usec < start_t.tv_usec)
        {
            stop_t.tv_sec--;
            stop_t.tv_usec += 1000000;
        }
        fprintf(stdout, "Sum: %d: with prefetch %s - duration: %ld.%03ld ms\n",
                sum,
                enable ? "on" : "off",
                (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
                        (stop_t.tv_usec - start_t.tv_usec) / 1000),
                (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);
    }

    exit(0);
}

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] Warm-cache prefetching

2005-12-08 Thread Luke Lonergan
Qingqing,

On 12/8/05 8:07 PM, Qingqing Zhou [EMAIL PROTECTED] wrote:

 /* prefetch ahead */
 __asm__ __volatile__ (
 "1: prefetchnta 128(%0)\n"
 : : "r" (s) : "memory" );

I think this kind / grain of prefetch is handled as a compiler optimization
in the latest GNU compilers, and further there are some memory streaming
operations for the Pentium 4 ISA that are now part of the standard compiler
optimizations done by gcc.

What I think would be tremendously beneficial is to implement L2 cache
blocking in certain key code paths like sort.  What I mean by cache
blocking is performing as many operations on a block of memory (maybe 128
pages worth for a 1MB cache) as possible, then moving to the next batch of
memory and performing all of the work on that batch, etc.

The other thing to consider in conjunction with this would be maximizing use
of the instruction cache, increasing use of parallel functional units and
minimizing pipeline stalls.  The best way to do this would be to group
operations into tighter groups and separating out branching:

So instead of structures like this:

  function_row_at_a_time(row)
if conditions
  do some work
else if other
  do different work
else if error
  print error_log

You'd have
  function_buffer_at_a_time(buffer_of_rows)
loop on sizeof(buffer_of_rows) / sizeof_L2cache
   do a lot of work on each row

loop on sizeof(buffer_of_rows) / sizeof_L2cache
   if error exit
   
The ideas in the above optimizations:
- Delay work until a buffer can be gathered
- Increase the computational intensity of the loops by putting more
instructions together
- While in loops doing lots of work, avoid branches / jumps
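
A bare-bones sketch of that blocking pattern in C (the 1MB L2 size and the
two work functions are assumptions for illustration):

#include <stddef.h>

#define L2_BYTES (1024 * 1024)              /* assumed L2 cache size */

static int transform(int x)  { return x * 2 + 1; }  /* stand-in pass-1 work */
static int accumulate(int x) { return x & 0xff; }   /* stand-in pass-2 work */

/*
 * Run both passes chunk by chunk, so each chunk stays L2-resident across
 * the passes instead of being evicted between whole-array sweeps.
 */
static long
process_blocked(int *data, size_t n)
{
    size_t  chunk = L2_BYTES / sizeof(int);
    long    total = 0;
    size_t  i, j;

    for (i = 0; i < n; i += chunk)
    {
        size_t  end = (i + chunk < n) ? i + chunk : n;

        for (j = i; j < end; j++)           /* pass 1 on this chunk */
            data[j] = transform(data[j]);
        for (j = i; j < end; j++)           /* pass 2, data still cached */
            total += accumulate(data[j]);
    }
    return total;
}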

- Luke



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Warm-cache prefetching

2005-12-08 Thread Qingqing Zhou

Luke Lonergan [EMAIL PROTECTED] wrote

 /* prefetch ahead */
 __asm__ __volatile__ (
 "1: prefetchnta 128(%0)\n"
 : : "r" (s) : "memory" );

 I think this kind / grain of prefetch is handled as a compiler 
 optimization
 in the latest GNU compilers, and further there are some memory streaming
 operations for the Pentium 4 ISA that are now part of the standard 
 compiler
 optimizations done by gcc.


Is there any special gcc optimization flag needed to support 
this? I just tried both 2.96 and 4.0.1 with -O2. Unfortunately, 
sse_clear_page() core-dumps under 4.0.1 at this line:

__asm__ __volatile__ ( "movntdq %%xmm0, %0" : : "m" (sse_save[0]) );

So I removed this test (sorry ...). There is no non-trivial difference 
AFAICS. The results are attached. I will look into the other parts of your 
thread tomorrow.

Regards,
Qingqing

---

*#ll prefp3-*
-rwx--1 zhouqq   jmgrp 38k Dec  9 00:49 prefp3-296
-rwx--1 zhouqq   jmgrp 16k Dec  9 00:49 prefp3-401
*#./prefp3-296
2392.975 MHz
clear_page function 'gcc clear_page()'   took 27142 cycles per page (172.7 
MB/s)
clear_page function 'normal clear_page()'took 27161 cycles per page 
(172.6 MB/s)
clear_page function 'mmx clear_page()   'took 17293 cycles per page 
(271.1 MB/s)
clear_page function 'gcc clear_page()'   took 27174 cycles per page (172.5 
MB/s)
clear_page function 'normal clear_page()'took 27142 cycles per page 
(172.7 MB/s)
clear_page function 'mmx clear_page()   'took 17291 cycles per page 
(271.1 MB/s)

copy_page function 'normal copy_page()'  took 18552 cycles per page (252.7 
MB/s)
copy_page function 'mmx copy_page()   '  took 12511 cycles per page (374.6 
MB/s)
copy_page function 'sse copy_page()   '  took 12318 cycles per page (380.5 
MB/s)
*#./prefp3-401
2392.970 MHz
clear_page function 'gcc clear_page()'   took 27120 cycles per page (172.8 
MB/s)
clear_page function 'normal clear_page()'took 27151 cycles per page 
(172.6 MB/s)
clear_page function 'mmx clear_page()   'took 17295 cycles per page 
(271.0 MB/s)
clear_page function 'gcc clear_page()'   took 27152 cycles per page (172.6 
MB/s)
clear_page function 'normal clear_page()'took 27114 cycles per page 
(172.9 MB/s)
clear_page function 'mmx clear_page()   'took 17296 cycles per page 
(271.0 MB/s)

copy_page function 'normal copy_page()'  took 18586 cycles per page (252.2 
MB/s)
copy_page function 'mmx copy_page()   '  took 12620 cycles per page (371.4 
MB/s)
copy_page function 'sse copy_page()   '  took 12698 cycles per page (369.1 
MB/s)



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Warm-cache prefetching

2005-12-08 Thread Min Xu (Hsu)
Perhaps because P4 is already doing H/W prefetching?

http://www.tomshardware.com/2000/11/20/intel/page5.html

I ran the test program on an opteron 2.2G:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 81.166 ms
Sum: -951304192: with prefetch off - duration: 79.769 ms
Sum: -951304192: with prefetch on - duration: 81.173 ms
Sum: -951304192: with prefetch off - duration: 79.718 ms
Sum: -951304192: with prefetch on - duration: 83.293 ms
Sum: -951304192: with prefetch off - duration: 79.731 ms
Sum: -951304192: with prefetch on - duration: 81.227 ms
Sum: -951304192: with prefetch off - duration: 79.851 ms
Sum: -951304192: with prefetch on - duration: 81.003 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.084 ms
Sum: -951304192: with prefetch off - duration: 79.728 ms
Sum: -951304192: with prefetch on - duration: 81.009 ms
Sum: -951304192: with prefetch off - duration: 79.723 ms
Sum: -951304192: with prefetch on - duration: 81.074 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms
Sum: -951304192: with prefetch on - duration: 81.188 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.075 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms

Got slowdown.

I ran it on a PIII 650M:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 284.952 ms
Sum: -951304192: with prefetch off - duration: 291.439 ms
Sum: -951304192: with prefetch on - duration: 290.690 ms
Sum: -951304192: with prefetch off - duration: 299.692 ms
Sum: -951304192: with prefetch on - duration: 295.287 ms
Sum: -951304192: with prefetch off - duration: 290.992 ms
Sum: -951304192: with prefetch on - duration: 285.116 ms
Sum: -951304192: with prefetch off - duration: 294.127 ms
Sum: -951304192: with prefetch on - duration: 286.986 ms
Sum: -951304192: with prefetch off - duration: 291.001 ms
Sum: -951304192: with prefetch on - duration: 283.233 ms
Sum: -951304192: with prefetch off - duration: 401.910 ms
Sum: -951304192: with prefetch on - duration: 297.021 ms
Sum: -951304192: with prefetch off - duration: 307.814 ms
Sum: -951304192: with prefetch on - duration: 287.201 ms
Sum: -951304192: with prefetch off - duration: 303.870 ms
Sum: -951304192: with prefetch on - duration: 286.962 ms
Sum: -951304192: with prefetch off - duration: 352.779 ms
Sum: -951304192: with prefetch on - duration: 283.245 ms
Sum: -951304192: with prefetch off - duration: 294.422 ms

looks like some speedup to me.

On Thu, 08 Dec 2005 Qingqing Zhou wrote:
 
 I found an interesting paper on improving index speed by prefetching memory
 data into the L1/L2 cache (there was discussion about prefetching disk
 data into memory several days ago in the ice-breaker thread):
 http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf
 
 Also a related technique used to speed up memcpy:
 http://people.redhat.com/arjanv/pIII.c
 
 I wonder if we could use it to speed up in-memory scan operations for heap
 or index. Tom's patch has made scans handle a page (vs. a row) at a
 time, which is a basis for this optimization.
 
 I wrote a program to try to simulate it, but I am not good at micro
 optimization, and I just get a very weak but kind-of-stable improvement. I
 wonder if any people here are interested in taking a look.
 
 Regards,
 Qingqing
 