Re: [HACKERS] Replication on the backend

2005-12-07 Thread Markus Schiltknecht
On Wed, 2005-12-07 at 01:04 -0800, J. Andrew Rogers wrote:
 Opteron boards get pretty damn close to Big Iron SMP fabric  
 performance in a cheap package.  Given how many companies have  
 announced plans to produce Opteron server boards with Infiniband  
 fabrics directly integrated into HyperTransport, I would say that  
 this is the future of server boards.

InfiniBand on-board? Wow, seems very interesting. Thank you for your
hints and numbers, very helpful!

 And if postgres could actually use an infiniband fabric for  
 clustering a single database instance across Opteron servers, that  
 would be very impressive...

full ACK

Regards

Markus




Re: [HACKERS] Replication on the backend

2005-12-07 Thread J. Andrew Rogers


On Dec 6, 2005, at 9:09 PM, Gregory Maxwell wrote:

Eh, why would light-limited delay be any slower than a disk on FC the
same distance away? :)

In any case, performance of PG on iscsi is just fine. You can't blame
the network... Doing multimaster replication is hard because the
locking primitives that are fine on a simple multiprocessor system
(with a VERY high bandwidth very low latency interconnect between
processors) just don't work across a network, so you're left finding
other methods and making them work...



Speed of light latency shows up pretty damn often in real networks,  
even relatively local ones.  The number of people that wonder why a  
transcontinental SLA of 10ms is not possible is astonishing.  The  
silicon fabrics are sufficiently fast that most well-designed  
networks are limited by how fast one can push photons through a  
fiber, which is significantly slower than photons through a vacuum.   
Silicon switch fabrics add latency measured in nanoseconds, which is  
effectively zero for many networks that leave the system board.
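
To put a number on that: light in fiber propagates at roughly c/1.5,
about 200,000 km/s, so distance alone sets a hard floor on RTT.  A
back-of-envelope check in C (the 4000 km figure is illustrative, not a
measurement):

    #include <stdio.h>

    int main(void)
    {
        const double fiber_km_per_ms = 200.0; /* ~c/1.5, per millisecond */
        const double distance_km = 4000.0;    /* e.g. a coast-to-coast run */

        double one_way_ms = distance_km / fiber_km_per_ms;
        printf("one-way: %.0f ms, RTT: %.0f ms\n", one_way_ms, 2 * one_way_ms);
        /* ~20 ms one way, ~40 ms RTT before any switching or queueing,
         * which is why a 10 ms transcontinental SLA cannot be met */
        return 0;
    }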


Compared to single system simple SMP, a local cluster built on a  
first-rate fabric will have about an order of magnitude higher  
latency but very similar bandwidth.  On the other hand, at those  
latencies you can increase the number of addressable processors with  
that kind of bandwidth by an order of magnitude, so it is a bit of a
trade-off.  However, latency matters enough that one would have to be
a lot smarter about partitioning synchronization across that fabric,
even though one would lose nothing in the bandwidth department.




But again, multimaster isn't hard because of some inherently
slow property of networks.



Eh?  As far as I know, the difficulty of multi-master is almost  
entirely a product of the latency of real networks such that they are  
too slow for scalable distributed locks.  SMP is little more than a  
distributed lock manager implemented in silicon.  Therefore,
multi-master is hard in practice because we cannot drive networks fast
enough.  That said, current state-of-the-art network fabrics are  
within an order of magnitude of SMP fabrics such that they could be  
real contenders, particularly once you get north of 8-16 processors.


The really sweet potential is in Opteron system boards with  
Infiniband directly attached to HyperTransport.  At that level of  
bandwidth and latency, both per node and per switch fabric, the  
architecture possibilities start to become intriguing.



J. Andrew Rogers





Re: [HACKERS] Oddity with extract microseconds?

2005-12-07 Thread Harald Fuchs
In article [EMAIL PROTECTED],
Christopher Kings-Lynne [EMAIL PROTECTED] writes:

mysql> SELECT EXTRACT(MICROSECOND FROM '2003-01-02 10:30:01.00123');
 +-------------------------------------------------------+
 | EXTRACT(MICROSECOND FROM '2003-01-02 10:30:01.00123') |
 +-------------------------------------------------------+
 |                                                  1230 |
 +-------------------------------------------------------+
 1 row in set (0.00 sec)
 Does contrary behavior from MySQL count as evidence that PostgreSQL's
 behavior is correct? :-)

 No...I happen to think that their way is more consistent though.  Pity
 it's not in the spec.

I'd say the comparison with MySQL is useless because MySQL is unable
to store microseconds in a DATETIME or TIMESTAMP column, although you
can extract microseconds from a date/time literal.




Re: [HACKERS] Replication on the backend

2005-12-07 Thread J. Andrew Rogers


On Dec 6, 2005, at 11:42 PM, Markus Schiltknecht wrote:
Does anybody have latency / roundtrip measurements for current  
hardware?

I'm interested in:
1Gb Ethernet,
10 Gb Ethernet,
InfiniBand,
probably even p2p usb2 or firewire links?



In another secret life, I know a bit about supercomputing fabrics.   
The latency metrics have to be thoroughly qualified.


First, most of the RTT latency numbers for network fabrics are for 0  
byte packet sizes, which really does not apply to anyone shuffling  
real data around.  For small packets, high-performance fabrics (HTX  
Infiniband, Quadrics, etc) have approximately an order of magnitude  
less latency than vanilla Ethernet, though the performance specifics  
depend greatly on the actual usage.  For large packet sizes, the  
differences in latency become far less obvious.  However, for real  
packets a performant fabric will still look very good compared to  
disk systems.  Switched fiber fabrics have enough relatively  
inexpensive throughput now to saturate most disk systems and CPU I/O  
busses; only platforms like HyperTransport can really keep up.  It is  
worth pointing out that the latency of high-end network fabrics is  
similar to large NUMA fabrics, which exposes some of the limits of  
SMP scalability.  As a point of reference, an organization that knows  
what they are doing should have no problem getting 500 microsecond  
RTT on a vanilla metropolitan area GigE fiber network -- a few  
network operators actually do deliver this on a regional scale.  For  
a local cluster, a competent design can best this by orders of  
magnitude.


There are a number of silicon limitations, but a system that connects  
the fabric directly to HyperTransport can drive several GB/s with  
very respectable microsecond latencies if the rest of the system is  
up to it.  There are Opteron system boards now that will drive  
Infiniband directly from HyperTransport.  I know Arima/Rioworks makes  
some (great server boards generally), and several other companies are  
either making them or have announced them in the pipeline.  These  
Opteron boards get pretty damn close to Big Iron SMP fabric  
performance in a cheap package.  Given how many companies have  
announced plans to produce Opteron server boards with Infiniband  
fabrics directly integrated into HyperTransport, I would say that  
this is the future of server boards.


And if postgres could actually use an infiniband fabric for  
clustering a single database instance across Opteron servers, that  
would be very impressive...


J. Andrew Rogers





Re: [HACKERS] Replication on the backend

2005-12-07 Thread Luke Lonergan
Andrew,
 
 And if postgres could actually use an infiniband fabric for 
 clustering a single database instance across Opteron servers, that 
 would be very impressive...

That's what we do with Bizgres MPP.  We've implemented an interconnect to do 
the data shuffling underneath the optimizer/executor and we currently use 
TCP/IP, though we could haul out SDP over Infiniband should we need it.
 
However, our optimizer effectively minimizes traffic over the interconnect now 
and that works well for all of the plans we've run so far.  It would be nice to 
characterize the improvements we could get from moving to 3x infiniband.
 
Regarding a direct Hypertransport to Infiniband bridge, have you looked at 
Pathscale? http://www.pathscale.com/  I know the fellow behind the scenes who 
designed it, and I think it's probably well thought out.  We were gunning for 
less than 1us RTT through the adapter and switch once, and I bet they are close.
 
- Luke




Re: [HACKERS] Feature Request: Multi-octet raw

2005-12-07 Thread Martijn van Oosterhout
On Tue, Dec 06, 2005 at 08:54:42PM -0700, Trent Shipley wrote:
 It would be nice if Postgresql supported multi-octet raw data.  Certainly a 
 lot of what you would do with it would be similar to bytea, but the basic 
 string functions would be overloaded so that the unit of work would be a 
 multi-octet word.  

Well, PostgreSQL allows you to create your own types so you could just
make one that does what you want. The only tricky part would be the
syntax on creation because user-defined types can't take parameters.
OTOH, maybe by the time someone writes the code for such a type the
details will have been worked out.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] SERIAL type feature request

2005-12-07 Thread Zoltan Boszormenyi

Hi,

Zoltan Boszormenyi wrote:
 Jan Wieck wrote:
  On 12/5/2005 1:03 PM, Zoltan Boszormenyi wrote:
   Jan Wieck wrote:
    On 12/4/2005 5:10 PM, Zoltan Boszormenyi wrote:
     I found this in the SQL2003 draft:

     4.14.7 Identity columns
     ... An identity column has a start value, an increment, a maximum
     value, a minimum value, and a cycle option. ...

     The exact properties of a sequence. It would be a good idea to be
     able to provide all these the same way PostgreSQL provides
     CREATE SEQUENCE.

    I think nobody would object to implementing support for the SQL2003
    syntax. Most of that would be providing all the values that will
    get forwarded into the internal sequence generation during CREATE
    TABLE.

    The other thing needed is an extension to the default value
    mechanism that overrides any given value to implement GENERATED
    ALWAYS. Not too hard either.

   Where can I find this syntax? (PDF file name, page#) Thanks.
   I think I'll modify my feature request for the standard behaviour.

  It's all in the Foundation paper inside this zip:
  http://www.wiscorp.com/sql/sql_2003_standard.zip

 Thanks, I found it. It's GENERATED { ALWAYS | BY DEFAULT } AS
 IDENTITY, isn't it? If I interpret it correctly, GENERATED ALWAYS AS
 IDENTITY means that no matter what I give in INSERT INTO MYTABLE
 (serial_id, ...) VALUES (N, ...), the sequence next value will be
 inserted into the database. I am all for it, it's much stronger than
 just watching for the 0 value and would fit my needs.

 The other behaviour is GENERATED BY DEFAULT AS IDENTITY,
 which is what PostgreSQL currently provides.

 Best regards,
 Zoltán Böszörményi

To reiterate it, I would like the following added to the PostgreSQL 8.2
TODO (I may have got the optional parameters wrong...):

- Extend the SERIAL type declaration and functionality with the SQL2003
  compliant sequence generation options:

    SERIAL [ GENERATED { ALWAYS | BY DEFAULT }
        [ AS IDENTITY ( [ START WITH startvalue ]
                        [ INCREMENT BY incrementvalue ]
                        [ MAXVALUE maxvalue ] [ MINVALUE minvalue ]
                        [ CYCLE | NO CYCLE ] ) ] ]





[HACKERS] About my new work at Command Prompt Inc.

2005-12-07 Thread Devrim GUNDUZ
Hi,

I'd like to inform the people who do not read Planet PostgreSQL:

Command Prompt Inc. has just hired me for the community work I have been
doing so far (like the PostgreSQL RPM stuff and other PostgreSQL-related
RPMs, such as Slony-I, pgpool, PostGIS, etc.) and website things. That
means I'll spend more time on these.

Thanks to Joshua. I'm very happy to be a member of CMD. Let's see what
else I can do for CMD.

Regards

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/




Re: [HACKERS] Oddity with extract microseconds?

2005-12-07 Thread Andrew Dunstan



Christopher Kings-Lynne wrote:
    Why aren't 'minutes' considered too?  Because they aren't 'seconds'.
   Well, seconds aren't microseconds either.
  Yeah, they are: it's just one field.  The other way of looking at it
  (that everything is seconds) is served by extract(epoch).
 Well, it's different in MySQL unfortunately - what does the standard
 say?  Out of interest, can someone try this for me in MySQL 5:

 SELECT EXTRACT (MICROSECOND FROM '2003-01-02 10:30:00.00123');
 SELECT EXTRACT (MICROSECOND FROM '2003-01-02 10:30:10.00123');

mysql 4.1.5 gives back 123 in both cases. I assume they haven't changed
that, although anything is possible.


cheers

andrew





[HACKERS] Foreign key trigger timing bug?

2005-12-07 Thread Bruce Momjian
I had an open 8.1 item that was:

o  fix foreign trigger timing issue

Would someone supply text for a TODO entry on this, as I don't think we
fixed it in 8.1.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



[HACKERS] Foreign key trigger timing bug?

2005-12-07 Thread Bruce Momjian

I think this is the foreign key trigger timing issue.

---

Darcy Buskermolen wrote:
 On Friday 09 September 2005 08:46, Stephan Szabo wrote:
  On Fri, 9 Sep 2005, Tom Lane wrote:
   Stephan Szabo [EMAIL PROTECTED] writes:
Is there a case other than a before trigger updating a row we will want
to act upon later in the statement where we'll get a row with xmax of
our transaction and cmax greater than the current command?
  
   The greater-cmax case could occur via any kind of function, not only a
   trigger, ie
  
 update tab set x = foo(x) where ...
  
   where foo() is a volatile function that internally updates the tab
   table.
 
  I *thought* I was missing a case, I just couldn't figure out what.
 
   I suppose you could say that this is horrible programming practice and
   anyone who tries it deserves whatever weird behavior ensues ... but
   it's not the case that every such situation involves a trigger.
 
  Well, the change I was thinking of would have made it an error if foo(x)
  updated a row that was then later selected by the update rather than the
  current behavior which I think would have ignored the already updated row,
  so that's probably not going to work.
 
 I see that this still is not addressed fully in beta 3.  Can anybody give a
 quick overview of where this is sitting, and if it's likely to make its way
 into 8.1 gold?
 
 
 
 -- 
 Darcy Buskermolen
 Wavefire Technologies Corp.
 
 http://www.wavefire.com
 ph: 250.717.0200
 fx: 250.763.1759
 
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing

2005-12-07 Thread Jochem van Dieten
On 12/6/05, Hannu Krosing wrote:

 1) run a transaction repeatedly, trying to hit a point of no concurrent
 transactions, enhance the odds by locking out starting other
 transactions for a few (tenths or hundredths of) seconds; if it
 succeeds, record SNAP1, commit and continue, else rollback, then
 sleep a little and retry.

Which locks can be released by committing here?


 2) build index on all rows inserted before SNAP1

 3) run a transaction repeatedly, trying to hit a point of no concurrent
 transactions by locking out other transactions for a few (tenths or
 hundredths of) seconds, if it succeeds, record SNAP2, mark index as
 visible for inserts, commit. Now all new transactions see the index and
 use it when inserting new tuples.

 4) scan over table, add all tuples between SNAP1 and SNAP2 to index

You cannot guarantee that every tuple inserted in the table will be
visible to SNAP2 if you take SNAP2 before the commit of the
insert-only index has dropped below the global XMIN-horizon.


 5) mark index as usable for query plans

How about:

- begin transaction X1
   - insert all visible tuples in an index
   - mark index incomplete
- commit

- wait for X1 to become visible to all running transactions (X1 is
known from the XMIN in pg_class / pg_index)

- begin transaction X2
   - insert all missing tuples in index
   - mark index complete
- commit

Jochem



Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing

2005-12-07 Thread Greg Stark

Hannu Krosing [EMAIL PROTECTED] writes:

  But that said, realistically *any* solution has to obtain a lock at some time
  to make the schema change. I would say pretty much any O(1) (constant time)
  outage is at least somewhat acceptable as contrasted with the normal index
  build which locks out other writers for at least O(n lg n) time. Anything on
  the order of 100ms is probably as good as it gets here.
 
 For me any delay less than the client timeout is acceptable and anything
 more than that is not. N sec is ok, N+1 is not. It's as simple as that.

I don't think the client timeout is directly relevant here. If your client
timeout is 20s and you take 19s, how many requests have queued up behind you?
If you normally process requests in under 200ms and receive 10 requests per
second (handling at least 2 simultaneously) then you now have 190 requests
queued up. Those requests take resources and will slow down your server. If
they slow things down too much then you will start failing to meet your 200ms
deadline.

It's more likely that your system is engineered to use queueing and
simultaneous dispatch to deal with spikes in load up to a certain margin. Say
you know it can deal with spikes in load of up to 2x the regular rate. Then
you can deal with service outage of up to the 200ms deadline. If you can deal
with spikes of up to 4x the regular rate then you can deal with an outage of
up to 600ms. 

Moreover even if you had the extra resources available to handle a 19s backlog
of requests, how long would it take you to clear that backlog? If you have a
narrow headroom on meeting the deadline in the first place, and now you have
even less headroom because of the resources dedicated to the queue, it'll take
you a long time to clear the backlog.
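
That backlog arithmetic generalizes to a simple queueing model: an
outage of T seconds at arrival rate lambda queues lambda*T requests,
and the queue then drains at (capacity - lambda) requests per second.
A toy calculation matching the numbers above (the 2x capacity is an
assumed figure, echoing the headroom example):

    #include <stdio.h>

    int main(void)
    {
        double lambda = 10.0;  /* arrival rate, requests/sec */
        double outage = 19.0;  /* seconds writers are locked out */
        double mu     = 20.0;  /* assumed capacity: 2x the regular rate */

        double backlog = lambda * outage;          /* 190 requests */
        double drain_s = backlog / (mu - lambda);  /* 19 more seconds */

        printf("backlog %.0f requests, %.0f s to drain\n", backlog, drain_s);
        return 0;
    }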

We periodically ran into problems with load spikes or other performance
problems causing things to get very slow and stay slow for a while. Letting
things settle out usually worked but occasionally we had to restart the whole
system to clear out the queue of requests.

-- 
greg




Re: [HACKERS] [pgsql-www] About my new work at Command Prompt Inc.

2005-12-07 Thread Lamar Owen
Devrim Gunduz wrote:
 Command Prompt Inc. has just hired me for the community work I have been
 doing so far (like the PostgreSQL RPM stuff and other PostgreSQL-related
 RPMs, such as Slony-I, pgpool, PostGIS, etc.) and website things. That
 means I'll spend more time on these.

Congratulations, Devrim.

You're doing a fine job on all those fronts; I believe the decision we
made last year to pass the RPM maintenance to you was a wise one indeed,
as you have stepped up to the plate nicely, and have furthered the RPMs
considerably from where my efforts had taken them. I just wanted to
mention that publicly to all those on these lists.
-- 
Lamar Owen
Director of Information Technology
Pisgah Astronomical Research Institute
1 PARI Drive
Rosman, NC  28772
(828)862-5554
www.pari.edu



Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing

2005-12-07 Thread Hannu Krosing
On Wed, 2005-12-07 at 13:36, Greg Stark wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
 
   But that said, realistically *any* solution has to obtain a lock at some time
   to make the schema change. I would say pretty much any O(1) (constant time)
   outage is at least somewhat acceptable as contrasted with the normal index
   build which locks out other writers for at least O(n lg n) time. Anything on
   the order of 100ms is probably as good as it gets here.
  
  For me any delay less than the client timeout is acceptable and anything
  more than that is not. N sec is ok, N+1 is not. It's as simple as that.
 
 I don't think the client timeout is directly relevant here. 

It is relevant. It is the ultimate check of success or failure :)

 If your client
 timeout is 20s and you take 19s, how many requests have queued up behind you?
 If you normally process requests in under 200ms and receive 10 requests per
 second (handling at least 2 simultaneously) then you now have 190 requests
 queued up.

Again, I'm handling 20 to 200 simultaneously quite nicely.

 Those requests take resources and will slow down your server. If
 they slow things down too much then you will start failing to meet your 200ms
 deadline.

If I can't meet the deadline, I've got a problem. The rest is
implementation detail.

 It's more likely that your system is engineered to use queueing and
 simultaneous dispatch to deal with spikes in load up to a certain margin. Say
 you know it can deal with spikes in load of up to 2x the regular rate.

I know it can, just that the 3x spike lasts for 6 hours :P

 Then
 you can deal with service outage of up to the 200ms deadline. If you can deal
 with spikes of up to 4x the regular rate then you can deal with an outage of
 up to 600ms. 

Small local fluctuations happen all the time. As a rule of thumb I
want to stay below 50% of resource usage on average for any noticeable
period, and will start looking for code optimisations or additional
hardware if this is crossed.

 Moreover even if you had the extra resources available to handle a 19s backlog
 of requests, how long would it take you to clear that backlog? If you have a
 narrow headroom on meeting the deadline in the first place, and now you have
 even less headroom because of the resources dedicated to the queue, it'll take
 you a long time to clear the backlog.

While it feels heroic to run at 90% capacity, it is not usually a good
policy. All kinds of unforeseen stuff happens all the time -
checkpoints, backups, vacuums, unexpected growth, system cronjobs, ...
With too little headroom you are screwed anyway.

What I am aiming at with this CONCURRENT CREATE INDEX proposal is being
no more disruptive than other stuff that keeps happening anyway. That
would be the baseline. Anything better is definitely desirable, but
should not be a stopper for implementing the baseline functionality.

-
Hannu


Re: [HACKERS] Replication on the backend

2005-12-07 Thread Andrew Sullivan
On Tue, Dec 06, 2005 at 12:35:43AM -0500, Jan Wieck wrote:
 We do not plan to implement replication inside the backend. Replication 
 needs are so diverse that pluggable replication support makes a lot more 
 sense. To me it even makes more sense than keeping transaction support 
 outside of the database itself and adding it via a pluggable storage add-on.

And, as I say every single time this comes up, Oracle's and IBM's and
MS's and everybody else's replication systems are _also_ add-ons.  If
you don't believe me, look at the license costs.  You can get a
system without it enabled, which means (by definition) it's a modular
extension.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism. 
--Brad Holland



Re: [HACKERS] Foreign key trigger timing bug?

2005-12-07 Thread Darcy Buskermolen
On Wednesday 07 December 2005 09:33, Bruce Momjian wrote:
 I had an open 8.1 item that was:

   o  fix foreign trigger timing issue

Stephan Szabo had this to say to me when I was asking him about his progress 
on this issue a while back.

There are some fundamental issues right now between before
triggers and foreign keys based on how we act upon rows for the same
statement that have been modified in the before trigger (which is to say
that the outer statement does not act upon them).




 Would someone supply text for a TODO entry on this, as I don't think we
 fixed it in 8.1.

No, it's not yet resolved.

-- 
Darcy Buskermolen
Wavefire Technologies Corp.

http://www.wavefire.com
ph: 250.717.0200
fx: 250.763.1759



Re: [HACKERS] Foreign key trigger timing bug?

2005-12-07 Thread Stephan Szabo
On Wed, 7 Dec 2005, Bruce Momjian wrote:

 I had an open 8.1 item that was:

   o  fix foreign trigger timing issue

 Would someone supply text for a TODO entry on this, as I don't think we
 fixed it in 8.1.

I'd split this into two separate items now.

 Fix before delete triggers on cascaded deletes to run after the cascaded
delete is done.  This is odd, but seems to be what the spec requires.

 Fix problems with referential action caused before triggers that modify
rows that would also be modified by the referential action.  Right now,
this has a few symptoms, either you can get spurious seeming errors from
the constraint or you can end up with invalid data in the referencing
table. As far as I can see, the spec doesn't have much to say about this
because the spec doesn't seem to allow before triggers to modify tables.



[HACKERS] Reducing contention for the LockMgrLock

2005-12-07 Thread Tom Lane
We've suspected for awhile that once we'd fixed the buffer manager's use
of a single global BufMgrLock, the next contention hotspot would be the
lock manager's LockMgrLock.  I've now seen actual evidence of that in
profiling pgbench: using a modified backend that counts LWLock-related
wait operations, the LockMgrLock is responsible for an order of magnitude
more blockages than the next highest LWLock:

PID 12971 lwlock LockMgrLock: shacq 0 exacq 50630 blk 3354
PID 12979 lwlock LockMgrLock: shacq 0 exacq 49706 blk 3323
PID 12976 lwlock LockMgrLock: shacq 0 exacq 50567 blk 3304
PID 12962 lwlock LockMgrLock: shacq 0 exacq 50635 blk 3278
PID 12974 lwlock LockMgrLock: shacq 0 exacq 50599 blk 3251
PID 12972 lwlock LockMgrLock: shacq 0 exacq 50204 blk 3243
PID 12973 lwlock LockMgrLock: shacq 0 exacq 50321 blk 3200
PID 12978 lwlock LockMgrLock: shacq 0 exacq 50266 blk 3177
PID 12977 lwlock LockMgrLock: shacq 0 exacq 50379 blk 3148
PID 12975 lwlock LockMgrLock: shacq 0 exacq 49790 blk 3124
PID 12971 lwlock WALInsertLock: shacq 0 exacq 24022 blk 408
PID 12972 lwlock WALInsertLock: shacq 0 exacq 24021 blk 393
PID 12976 lwlock WALInsertLock: shacq 0 exacq 24017 blk 390
PID 12977 lwlock WALInsertLock: shacq 0 exacq 24021 blk 388
PID 12973 lwlock WALInsertLock: shacq 0 exacq 24018 blk 379
PID 12962 lwlock WALInsertLock: shacq 0 exacq 24024 blk 377
PID 12974 lwlock WALInsertLock: shacq 0 exacq 24016 blk 367
PID 12975 lwlock WALInsertLock: shacq 0 exacq 24021 blk 366
PID 12978 lwlock WALInsertLock: shacq 0 exacq 24023 blk 354
PID 12979 lwlock WALInsertLock: shacq 0 exacq 24033 blk 321
PID 12973 lwlock ProcArrayLock: shacq 45214 exacq 6003 blk 241
PID 12971 lwlock ProcArrayLock: shacq 45355 exacq 6003 blk 225
(etc)

We had also seen evidence to this effect from OSDL:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php

So it seems it's time to start thinking about how to reduce contention
for the LockMgrLock.  There are no interesting read-only operations on the
shared lock table, so there doesn't seem to be any traction to be gained
by making some operations take just shared access to the LockMgrLock.

The best idea I've come up with after a bit of thought is to replace the
shared lock table with N independent tables representing partitions of the
lock space.  Each lock would be assigned to one of these partitions based
on, say, a hash of its LOCKTAG.  I'm envisioning N of 16 or so to achieve
(hopefully) about an order-of-magnitude reduction of contention.  There
would be a separate LWLock guarding each partition; the LWLock for a given
partition would be considered to protect the LOCK objects assigned to that
partition, all the PROCLOCK objects associated with each such LOCK, and
the shared-memory hash tables holding these objects (each partition would
need its own hash tables).  A PGPROC's lock-related fields are only
interesting when it is waiting for a lock, so we could say that the
LWLock for the partition containing the lock it is waiting for must be
held to examine/change these fields.
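
By way of illustration, a minimal standalone sketch of the partitioning
scheme (invented names throughout; the real backend would use LWLocks
and dynahash shared hash tables rather than pthread mutexes):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_PARTITIONS 16         /* the "N of 16 or so" above */

    typedef struct
    {
        uint32_t dbid;
        uint32_t relid;               /* toy stand-in for a LOCKTAG */
    } ToyLockTag;

    static pthread_mutex_t part_lock[NUM_PARTITIONS];

    static unsigned
    tag_to_partition(const ToyLockTag *tag)
    {
        /* any reasonable mixing hash will do for the sketch */
        uint32_t h = tag->dbid * 2654435761u ^ tag->relid * 40503u;
        return h % NUM_PARTITIONS;
    }

    static void
    toy_lock_acquire(const ToyLockTag *tag)
    {
        unsigned p = tag_to_partition(tag);

        /* contention is now per-partition, not on one global lock */
        pthread_mutex_lock(&part_lock[p]);
        /* ... find or insert the LOCK/PROCLOCK in partition p's table ... */
        pthread_mutex_unlock(&part_lock[p]);
    }

    int main(void)
    {
        for (int p = 0; p < NUM_PARTITIONS; p++)
            pthread_mutex_init(&part_lock[p], NULL);

        ToyLockTag tag = { 1, 16384 };
        toy_lock_acquire(&tag);
        printf("partition %u\n", tag_to_partition(&tag));
        return 0;
    }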

The per-PGPROC list of all PROCLOCKs belonging to that PGPROC is a bit
tricky to handle since it necessarily spans across partitions.  We might
be able to deal with this with suitable rules about when the list can be
touched, but I've not worked this out in detail.  Another possibility is
to break this list apart into N lists, one per partition, but that would
bloat the PGPROC array a bit, especially if we wanted larger N.

The basic LockAcquire and LockRelease operations would only need to
acquire the LWLock for the partition containing the lock they are
interested in; this is what gives us the contention reduction.
LockReleaseAll is also interesting from a performance point of view,
since it executes at every transaction exit.  If we divide PGPROC's
PROCLOCK list into N lists then it will be very easy for LockReleaseAll
to take only the partition locks it actually needs to release these locks;
if not, we might have to resort to scanning the list N times, once while
we hold the LWLock for each partition.
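
A matching sketch of the divided-list variant of LockReleaseAll (again
invented names, not backend code): it skips partitions in which nothing
is held and takes each needed partition lock exactly once.

    #include <pthread.h>
    #include <stdlib.h>

    #define NUM_PARTITIONS 16

    typedef struct ToyProcLock
    {
        struct ToyProcLock *next;          /* per-lock state omitted */
    } ToyProcLock;

    typedef struct
    {
        ToyProcLock *held[NUM_PARTITIONS]; /* one list per partition */
    } ToyProc;

    static pthread_mutex_t part_lock[NUM_PARTITIONS];

    static void
    toy_lock_release_all(ToyProc *proc)
    {
        for (int p = 0; p < NUM_PARTITIONS; p++)
        {
            if (proc->held[p] == NULL)
                continue;             /* nothing held here: skip the lock */

            pthread_mutex_lock(&part_lock[p]);
            while (proc->held[p] != NULL)
            {
                ToyProcLock *pl = proc->held[p];
                proc->held[p] = pl->next;
                free(pl);             /* stand-in for the real release work */
            }
            pthread_mutex_unlock(&part_lock[p]);
        }
    }

    int main(void)
    {
        ToyProc proc = { { NULL } };

        for (int p = 0; p < NUM_PARTITIONS; p++)
            pthread_mutex_init(&part_lock[p], NULL);

        proc.held[3] = calloc(1, sizeof(ToyProcLock)); /* pretend one lock */
        toy_lock_release_all(&proc);
        return 0;
    }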

I think that CheckDeadLock will probably require taking all the partition
LWLocks (as long as it does this in a predetermined order there is no risk
of deadlock on the partition LWLocks).  But one hopes this is not a
performance-critical operation.  Ditto for GetLockStatusData.
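
The predetermined order can be as simple as ascending partition index;
no two backends can then wait on the partition locks in opposite orders.
A fragment in the style of the sketches above (same invented names):

    static void
    toy_lock_all_partitions(void)
    {
        /* the fixed ascending order is what rules out deadlock */
        for (int p = 0; p < NUM_PARTITIONS; p++)
            pthread_mutex_lock(&part_lock[p]);
    }

    static void
    toy_unlock_all_partitions(void)
    {
        for (int p = NUM_PARTITIONS - 1; p >= 0; p--)
            pthread_mutex_unlock(&part_lock[p]);
    }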

One objection I can see to this idea is that having N lock hash tables
instead of one will eat a larger amount of shared memory in hashtable
overhead.  But the lock hashtables are fairly small relative to the
shared buffer array (given typical configuration parameters) so this
doesn't seem like a major problem.

Another objection is that LockReleaseAll will get slower (since it will
certainly call LWLockAcquire/Release more times) and in situations that
aren't heavily concurrent there won't be any compensating gain.  I think
this won't be a significant effect, but there's probably no way to tell
for sure without actually writing the code and ...

Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-07 Thread Jonah H. Harris
Tom,

This would also explain some things we've seen during benchmarking here
at EnterpriseDB. I like your idea and, as I'm on my way out, will
think about it a bit tonight.

Similarly, I don't see any forward-looking reason for keeping the
separate hash tables used for the LockMethodIds. Or, it may just
be that I haven't looked closely enough at what the differences are.

-Jonah
On 12/7/05, Tom Lane [EMAIL PROTECTED] wrote:
We've suspected for awhile that once we'd fixed the buffer manager's use
of a single global BufMgrLock, the next contention hotspot would be the
lock manager's LockMgrLock. [...]

Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-07 Thread Simon Riggs
On Wed, 2005-12-07 at 16:59 -0500, Tom Lane wrote:
 I've now seen actual evidence of that in
 profiling pgbench: using a modified backend that counts LWLock-related
 wait operations, 

 So it seems it's time to start thinking about how to reduce contention
 for the LockMgrLock

You're right to be following up this thought.

My concern, longer term, is our ability to determine contention issues
in an agreed way. I've long been thinking about wait-time measurement -
I think it's the only way to proceed.

There's always a next bottleneck, so I'd like to first agree the
diagnostic probes so we can decide how to determine that. That way we
can all work on solutions for various workloads, and prove that they
work, in those cases.

My view would be that the LockMgrLock is not relevant for all workloads,
but I want even more to be able to discuss whether it is, or is not, on
an accepted basis before discussions begin.

Best Regards, Simon Riggs




Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-07 Thread Alvaro Herrera
Tom Lane wrote:

Interesting proposal.

 LockReleaseAll is also interesting from a performance point of view,
 since it executes at every transaction exit.  If we divide PGPROC's
 PROCLOCK list into N lists then it will be very easy for LockReleaseAll
 to take only the partition locks it actually needs to release these locks;
 if not, we might have to resort to scanning the list N times, once while
 we hold the LWLock for each partition.

On the other hand, each scan would be shorter than the previous one; and
it's not necessary to hold each and every partition's LWLock, only the
one found in the first entry of the list on each scan until the list is
empty.  So it's N scans only in the worst case of a PGPROC holding locks
on all partitions.

 One objection I can see to this idea is that having N lock hash tables
 instead of one will eat a larger amount of shared memory in hashtable
 overhead.  But the lock hashtables are fairly small relative to the
 shared buffer array (given typical configuration parameters) so this
 doesn't seem like a major problem.

Is hashtable overhead all that large?  Each table could be made
initially size-of-current-table/N entries.  One problem is that
currently the memory freed from a hashtable is not put back into shmem
freespace, is it?

 While at it, I'm inclined to get rid of the current assumption that there
 are logically separate hash tables for different LockMethodIds.  AFAICS all
 that's doing for us is creating a level of confusion; there's nothing on
 the horizon suggesting we'd ever actually make use of the flexibility.

Yeah, please.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] row is too big: size 8916, maximum size 8136

2005-12-07 Thread Jan Wieck

On 12/6/2005 9:03 PM, Euler Taveira de Oliveira wrote:


Hi,

I'm doing some tests with a 700-column table. But when I try to load
some data with INSERT or COPY I get that message. I verified that
BLCKSZ is limiting the tuple size, but I don't have a clue why it's
not using TOAST. I'm using PostgreSQL 8.0.3 on a Slackware 10.1 box.
Let me know if you want a test case or other useful information.


The external reference of a toasted attribute is 20 bytes in size.

I might be wrong, but at 700 columns you have to ask your developers 
some serious questions about their qualification.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



[HACKERS] HOOKS for Synchronous Replication?

2005-12-07 Thread Christopher Kings-Lynne

Anyone remember this patch?

http://gorda.di.uminho.pt/community/pgsqlhooks/

The discussion seems to be pretty minimal:

http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php

Does anyone see a need to investigate it further?

Chris




Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-07 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Is hashtable overhead all that large?  Each table could be made
 initially size-of-current-table/N entries.  One problem is that
 currently the memory freed from a hashtable is not put back into shmem
 freespace, is it?

Yeah; the problem is mainly that we'd have to allocate extra space to
allow for unevenness of usage across the multiple hashtables.  It's hard
to judge how large the effect would be without testing, but I think that
this problem would inhibit us from having dozens or hundreds of separate
partitions.

A possible response is to try to improve dynahash.c to make its memory
management more flexible, but I'd prefer not to get into that unless
it becomes really necessary.  A shared freespace pool would create a
contention bottleneck of its own...

regards, tom lane



Re: [HACKERS] Reducing contention for the LockMgrLock

2005-12-07 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 My view would be that the LockMgrLock is not relevant for all workloads,
 but I want even more to be able to discuss whether it is, or is not, on
 an accepted basis before discussions begin.

Certainly.  I showed the evidence that it is currently a significant
problem for pgbench-like workloads, but pgbench is of course not
representative of everything.

My feeling about it is that different workloads are going to expose
different weak spots, and so as long as a given test case isn't
obviously artificial, whatever bottleneck it exposes is fair game
to work on.  pgbench seems reasonably representative of a class of
applications with relatively short transactions, so I don't doubt that
if pgbench has a problem with LockMgrLock contention, there are
real-world cases out there that do too.

regards, tom lane



Re: [HACKERS] HOOKS for Synchronous Replication?

2005-12-07 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 Anyone remember this patch?
 http://gorda.di.uminho.pt/community/pgsqlhooks/
 The discussion seems to be pretty minimal:
 http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php
 Does anyone see a need to investigate it further?

I had hoped to see some comments from the Slony people about it.
I'd feel better about the validity of a set of hooks if more than
one project agreed that it was useful/appropriate ...

regards, tom lane



Re: [HACKERS] Vertical Partitioning with TOAST

2005-12-07 Thread Jim C. Nasby
This seems like a useful feature to add, allowing for easy built-in
vertical partitioning. Are there issues with the patch as-is? (Other
than it probably should have gone to -patches...)

On Thu, Dec 01, 2005 at 05:59:08PM +0900, Junji TERAMOTO wrote:
 Hi all,
 
 I wrote an experimental patch for a vertical partitioning function.
 
 I decided to use the code of TOAST to create the function easily.
 In a word, the columns that the user specifies are forcibly driven
 out with TOAST.
 
 A performance gain of 10% was seen by driving out c_data of the
 customer table in the DBT-2 benchmark in our environment.
 
 The mechanism of TOAST is an overdesigned system to use for vertical
 partitioning. Because the overhead of processing is large, performance
 might go down depending on the environment.
 
 There are seriously a lot of things that should be considered if
 vertical partitioning is implemented.
 For instance, the TOAST index is omitted, and ctid is used for the link.
 
 Your comments are welcome. Thanks.
 
 ---
 How To Use
 ---
 Use the ALTER TABLE command.
 http://www.postgresql.org/docs/8.1/static/sql-altertable.html
 
  ALTER TABLE name ALTER COLUMN column SET STORAGE FORCEEXTERNAL;
 
 I do not know whether FORCEEXTERNAL is an appropriate word.
 Please tell me if there is a better word...
 
 
 -- 
 Junji Teramoto

 diff -purN postgresql-8.1.0.org/src/backend/access/heap/heapam.c postgresql-8.1.0/src/backend/access/heap/heapam.c
 --- postgresql-8.1.0.org/src/backend/access/heap/heapam.c 2005-10-15 11:49:08.0 +0900
 +++ postgresql-8.1.0/src/backend/access/heap/heapam.c 2005-12-01 15:31:38.307713257 +0900
 @@ -1070,6 +1070,36 @@ heap_get_latest_tid(Relation relation,
  	}	/* end of loop */
  }
 
 +// Add by junji from here
 +/*
 + *	has_rel_forceexternal - Is there SET STORAGE FORCEEXTERNALed rows?
 + */
 +bool
 +has_rel_forceexternal(Relation relation)
 +{
 +	TupleDesc	tupleDesc;
 +	Form_pg_attribute *att;
 +	int			numAttrs;
 +	int			i;
 +
 +	/*
 +	 * Get the tuple descriptor and break down the tuple(s) into fields.
 +	 */
 +	tupleDesc = relation->rd_att;
 +	att = tupleDesc->attrs;
 +	numAttrs = tupleDesc->natts;
 +
 +	for (i = 0; i < numAttrs; i++)
 +	{
 +		if (att[i]->attstorage == 'f')
 +			return true;
 +	}
 +
 +	return false;
 +}
 +// Add by junji to here
 +
 +
  /*
   *	heap_insert - insert tuple into a heap
   *
 @@ -1130,6 +1160,9 @@ heap_insert(Relation relation, HeapTuple
  	 * out-of-line attributes from some other relation, invoke the toaster.
  	 */
  	if (HeapTupleHasExternal(tup) ||
 +// Add by junji from here
 +		(has_rel_forceexternal(relation)) ||
 +// Add by junji to here
  		(MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD))
  		heap_tuple_toast_attrs(relation, tup, NULL);
 
 @@ -1762,6 +1795,9 @@ l2:
  	 */
  	need_toast = (HeapTupleHasExternal(oldtup) ||
  				  HeapTupleHasExternal(newtup) ||
 +// Add by junji from here
 +				  (has_rel_forceexternal(relation)) ||
 +// Add by junji to here
  				  (MAXALIGN(newtup->t_len) > TOAST_TUPLE_THRESHOLD));
 
  	newtupsize = MAXALIGN(newtup->t_len);
 diff -purN postgresql-8.1.0.org/src/backend/access/heap/tuptoaster.c postgresql-8.1.0/src/backend/access/heap/tuptoaster.c
 --- postgresql-8.1.0.org/src/backend/access/heap/tuptoaster.c 2005-10-15 11:49:09.0 +0900
 +++ postgresql-8.1.0/src/backend/access/heap/tuptoaster.c 2005-12-01 15:29:29.722579466 +0900
 @@ -512,6 +512,46 @@ toast_insert_or_update(Relation rel, Hea
  		}
  	}
 
 +// Add by junji from here
 +	/*
 +	 * We look for attributes of attstorage 'f'.
 +	 */
 +	if (rel->rd_rel->reltoastrelid != InvalidOid)
 +	{
 +		Datum		old_value;
 +
 +		/*--
 +		 * Search for the biggest yet inlined attribute with
 +		 * attstorage equals 'x' or 'e'
 +		 *--
 +		 */
 +		for (i = 0; i < numAttrs; i++)
 +		{
 +			if (toast_action[i] == 'p')
 +				continue;
 +			if (VARATT_IS_EXTERNAL(toast_values[i]))
 +				continue;
 +			if (att[i]->attstorage != 'f')
 +				continue;
 +
 +			/*
 +			 * Store this external
 +			 */
 +			old_value = toast_values[i];
 +			toast_action[i] = 'p';
 +			toast_values[i] = toast_save_datum(rel, toast_values[i]);
 +			if (toast_free[i])
 +				pfree(DatumGetPointer(old_value));
 +

Re: [HACKERS] generalizing the planner knobs

2005-12-07 Thread Jim C. Nasby
On Thu, Dec 01, 2005 at 12:32:12PM -0500, Qingqing Zhou wrote:
 
 Neil Conway [EMAIL PROTECTED] wrote
 
  This would also be useful when diagnosing bad query plans: for example,
  setting enable_seqscan=false often causes the planner to disregard the
  use of *any* sequential scan, anywhere in the plan. The ability to
  slightly bump up the cost of particular operations would allow more
  alternative plans to be examined.
 
 
 This method also has the problem of enable_seqscan=false in some 
 situations. I would vote we implement the final general solution like query 
 plan hints directly.

BTW, there's another end to the 'enable_seqscan=false' problem... it
sometimes doesn't work! Last I looked, enable_seqscan=false only added a
fixed overhead cost to a seqscan (100 IIRC). The problem is, some
queries will produce estimates for other methods that are more
expensive than a seqscan even with the added burden. If instead of
adding a fixed amount, enable_seqscan=false multiplied the cost by some
amount, this problem probably could not occur.
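
A toy comparison of the two penalty styles with made-up cost numbers
(IIRC the planner's actual disable constant is far larger than the 100
recalled above, on the order of 1e8; these values are purely
illustrative):

    #include <stdio.h>

    int main(void)
    {
        double seqscan   = 2.0e9;  /* huge table: seqscan estimate */
        double indexscan = 2.5e9;  /* an even worse index-scan estimate */
        double penalty   = 1.0e8;  /* additive "disable" surcharge */

        if (seqscan + penalty < indexscan)
            printf("additive: seqscan still wins\n");
        if (seqscan * 100.0 > indexscan)
            printf("multiplicative: index scan would win\n");
        return 0;
    }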

(And before someone asks, no, I don't remember which query was actually
faster...)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Vertical Partitioning with TOAST

2005-12-07 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 This seems like a useful feature to add, allowing for easy built-in
 verticle partitioning. Are there issues with the patch as-is?

Other than the ones mentioned by the poster?

It seemed to me more like a not-too-successful experiment than something
ready for application.  If you take the viewpoint that this is just
another TOAST storage strategy, I think it's pretty useless.  A large
field value is going to get toasted anyway with the regular strategy,
and if your column happens to contain some values that are not large,
forcing them out-of-line anyway is simply silly.  (You could make a case
for making the threshold size user-controllable, but I don't see the
case for setting the threshold to zero, which is what this amounts to.)

The poster was not actually suggesting applying it in the form of a
force-external TOAST strategy; he was using this as a prototype to try
to interest people in the idea of out-of-line storage mechanisms with
lower overhead than TOAST.  But that part is all speculation not code.

Personally, I'd rather look into whether we couldn't speed up TOAST
without changing any of its basic assumptions.  The current
implementation isn't awful, but it was built to allow the existing table
and index mechanisms to be re-used for TOAST data.  Now that we know for
certain TOAST is a good idea, it would be reasonable to take a second
look at whether we could improve the performance with another round of
implementation effort.

regards, tom lane



Re: [HACKERS] Reducing relation locking overhead

2005-12-07 Thread Jim C. Nasby
On Fri, Dec 02, 2005 at 03:25:58PM -0500, Greg Stark wrote:
 Postgres would have no trouble building an index of the existing data using
 only shared locks. The problem is that any newly inserted (or updated) records
 could be missing from such an index.
 
 To do it you would then have to gather up all those newly inserted records.
 And of course while you're doing that new records could be inserted. And so
 on. There's no guarantee it would ever finish, though I suppose you could
 detect the situation if the size of the new batch wasn't converging to 0 and
 throw an error.

Why throw an error? Just grab a lock that would prevent any new inserts
from occurring. Or at least make that an option.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Reducing relation locking overhead

2005-12-07 Thread Jim C. Nasby
On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
 Tom Lane [EMAIL PROTECTED] writes:
  What's worse, once you have excluded writes you have to rescan the entire
  table to be sure you haven't missed anything. So in the scenarios where this
  whole thing is actually interesting, ie enormous tables, you're still
  talking about a fairly long interval with writes locked out. Maybe not as
  long as a complete REINDEX, but long.
 
 I was thinking you would set a flag to disable use of the FSM for
 inserts/updates while the reindex was running. So you would know where to find
 the new tuples, at the end of the table after the last tuple you read.

What about keeping a separate list of new tuples? Obviously we'd only do
this when an index was being built on a table. Since it would probably
be problematic and expensive to check for this every time you accessed a
table, it would make sense to check only at the start of a transaction
and have an index build wait until all running transactions knew that an
index build was going to happen.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Reducing relation locking overhead

2005-12-07 Thread Hannu Krosing
On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
 On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
  Tom Lane [EMAIL PROTECTED] writes:
   What's worse, once you have excluded writes you have to rescan the entire
   table to be sure you haven't missed anything. So in the scenarios where this
   whole thing is actually interesting, ie enormous tables, you're still
   talking about a fairly long interval with writes locked out. Maybe not as
   long as a complete REINDEX, but long.
  
  I was thinking you would set a flag to disable use of the FSM for
  inserts/updates while the reindex was running. So you would know where to find
  the new tuples, at the end of the table after the last tuple you read.
 
 What about keeping a separate list of new tuples? Obviously we'd only do
 this when an index was being built on a table. 

The problem with a separate list is that it can be huge. For example, on a
table with 200 inserts/updates per second, an index build lasting 6 hours
would accumulate a total of 6*3600*200 = 4,320,000 new tuples.


Hannu





Re: [HACKERS] Reducing relation locking overhead

2005-12-07 Thread Jim C. Nasby
On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
 On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
  On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
   Tom Lane [EMAIL PROTECTED] writes:
    What's worse, once you have excluded writes you have to rescan the entire
    table to be sure you haven't missed anything. So in the scenarios where this
    whole thing is actually interesting, ie enormous tables, you're still
    talking about a fairly long interval with writes locked out. Maybe not as
    long as a complete REINDEX, but long.
   
   I was thinking you would set a flag to disable use of the FSM for
   inserts/updates while the reindex was running. So you would know where to find
   the new tuples, at the end of the table after the last tuple you read.
  
  What about keeping a separate list of new tuples? Obviously we'd only do
  this when an index was being built on a table. 
 
 The problem with a separate list is that it can be huge. For example, on a
 table with 200 inserts/updates per second, an index build lasting 6 hours
 would accumulate a total of 6*3600*200 = 4,320,000 new tuples.

Sure, but it's unlikely that such a table would be very wide, so 4.3M
tuples would probably only amount to a few hundred MB of data. It's also
possible that this list could be vacuumed by whatever the regular vacuum
process is for the table.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [GENERAL] 8.1, OID's and plpgsql

2005-12-07 Thread Jim C. Nasby
On Wed, Dec 07, 2005 at 12:06:23AM -0500, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  Rather than hard-wiring a special case for any of these things, I'd much
  rather see us implement INSERT...RETURNING and UPDATE...RETURNING as per
  previous suggestions.
 
  I wonder whether the ui tools need anything more low level than that. In
  general sticking their grubby fingers in the query the user entered seems
  wrong and they would have to tack on a RETURNING clause.
 
 That was mentioned before as a possible objection, but I'm not sure that
 I buy it.  The argument seems to be that a client-side driver would
 understand the query and table structure well enough to know what to do
 with a returned pkey value, but not well enough to understand how to
 tack on a RETURNING clause to request that value.  This seems a bit
 bogus.
 
 There may be some point in implementing a protocol-level equivalent of
 RETURNING just to reduce the overhead on both sides, but I think we
 ought to get the RETURNING functionality in place first and then worry
 about that...

Along those lines, I don't see anything on the TODO list about this...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Vertical Partitioning with TOAST

2005-12-07 Thread Jim C. Nasby
On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  This seems like a useful feature to add, allowing for easy built-in
  verticle partitioning. Are there issues with the patch as-is?
 
 Other than the ones mentioned by the poster?
 
 It seemed to me more like a not-too-successful experiment than something
 ready for application.  If you take the viewpoint that this is just
 another TOAST storage strategy, I think it's pretty useless.  A large
 field value is going to get toasted anyway with the regular strategy,
 and if your column happens to contain some values that are not large,
 forcing them out-of-line anyway is simply silly.  (You could make a case
 for making the threshold size user-controllable, but I don't see the
 case for setting the threshold to zero, which is what this amounts to.)

Valid point. I do think there's a lot of benefit to being able to set
the limit much lower than what it currently defaults to today. We have a
client that has a queue-type table that is updated very frequently. One
of the fields is text that is not updated as frequently. Keeping this
table vacuumed well enough has proven to be problematic, because any
delay to vacuuming quickly results in a very large amount of bloat.
Moving that text field into a separate table would most likely be a win.

Presumably this would need to be settable on at least a per-table basis.

Would adding such a variable be a good beginner TODO, or is it too
invasive?

 Personally, I'd rather look into whether we couldn't speed up TOAST
 without changing any of its basic assumptions.  The current
 implementation isn't awful, but it was built to allow the existing table
 and index mechanisms to be re-used for TOAST data.  Now that we know for
 certain TOAST is a good idea, it would be reasonable to take a second
 look at whether we could improve the performance with another round of
 implementation effort.

I've often wondered about all the overhead of storing toast data in what
amounts to a regular table. Sounds like another TODO...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
