Re: [PERFORM] Best settings to load a fresh database

2009-08-07 Thread Euler Taveira de Oliveira
Kenneth Marshall wrote:
 I have found that increasing maintenance_work_mem speeds up
 index rebuilds; turn off synchronous_commit or fsync if
 you really can afford to start over. Another big help is
 to use the parallel pg_restore from PostgreSQL 8.4.0 to
 perform the restore.
 
And make sure archive mode is turned off. Otherwise, you can't use the WAL
bypass facility.
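
For a scratch restore, that combination might look something like this (the
values and file names are just illustrative, and fsync = off is only safe if
you really can rebuild the cluster from scratch):

# postgresql.conf, temporarily, for the duration of the restore
maintenance_work_mem = 1GB      # bigger sort memory for index builds
synchronous_commit = off
fsync = off                     # only if losing the cluster is acceptable
archive_mode = off              # keeps the WAL-bypass path available

$ pg_restore -j 4 -d mydb dump.file    # parallel restore, needs 8.4

fsync needs a reload and archive_mode a restart to take effect; the first two
can also simply be SET in the restoring session.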


-- 
  Euler Taveira de Oliveira
  http://www.timbira.com/

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)

2009-08-07 Thread Kees van Dieren
Thanks for your response.

I think your analysis is correct. When there are more than 100 rows that
match this query, LIMIT 100 is fast.

However, we often have fewer than a hundred rows, so this is not sufficient
for us.

This suggestion (the 'OFFSET 0' trick) did not make a difference in response
time (the query still runs in 947ms).

One thing that does help is limiting the set by adding this to the WHERE
clause: and events_events.dateTime > '2009-07-24'
(the query then runs in approx. 500ms)

The workaround we implemented is query caching in our application (Java,
with the JPA / Hibernate second-level query cache). This actually solves the
problem for us, but I'd prefer to get this query to perform well in Postgres
as well. I'd think that the PostgreSQL query planner should be smarter in
handling LIMIT clauses.

Would it get attention if I submit this to
http://www.postgresql.org/support/submitbug ? (in fact it is not really a
bug, but an improvement request).

Best regards,

Kees


2009/8/5 Russell Smith mr-r...@pws.com.au

 Kees van Dieren wrote:
  Hi Folks,
 
  Thanks for your response.
 
  I have added the following index (suggested by other post):
 
  CREATE INDEX events_events_cleared_eventtype
ON events_events
USING btree
(eventtype_id, cleared)
WHERE cleared = false;
 
  Also with columns in reversed order.
 
  No changes in response time noticed.
 
  Index on cleared column already is there (indices are in sql file
  attached to initial post.). eventtype_id has a foreign key constraint,
  which adds an index automatically I believe?
 
  The explain analyze results for both queries:
  explain analyze select events_events.id FROM events_events
  left join events_event_types on events_events.eventType_id=events_event_types.id
  where events_event_types.severity=70
  and not events_events.cleared
  order by events_events.dateTime DESC LIMIT 100

  Limit  (cost=0.00..125.03 rows=100 width=16) (actual time=0.046..3897.094 rows=77 loops=1)
    ->  Nested Loop  (cost=0.00..120361.40 rows=96269 width=16) (actual time=0.042..3896.881 rows=77 loops=1)
          ->  Index Scan Backward using events_events_datetime_ind on events_events  (cost=0.00..18335.76 rows=361008 width=24) (actual time=0.025..720.345 rows=360637 loops=1)
                Filter: (NOT cleared)
          ->  Index Scan using events_event_types_pkey on events_event_types  (cost=0.00..0.27 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=360637)
                Index Cond: (events_event_types.id = events_events.eventtype_id)
                Filter: (events_event_types.severity = 70)
  Total runtime: 3897.268 ms
 
 The plan here is guessing that we will find the 100 rows we want pretty
 quickly by scanning the dateTime index, since we aren't expecting to have
 to look through many rows to find 100 that match the criteria.  With no
 cross-column statistics it's more a guess than a good calculation, so the
 guess is bad and we end up scanning 360k rows from the index before we
 find what we want.  My skills are not up to giving specific advice on how
 to avert this problem.  Maybe somebody else can help there.
  explain analyze select events_events.id FROM events_events
  left join events_event_types on events_events.eventType_id=events_event_types.id
  where events_event_types.severity=70
  and not events_events.cleared
  order by events_events.dateTime DESC

  Sort  (cost=20255.18..20495.85 rows=96269 width=16) (actual time=1084.842..1084.951 rows=77 loops=1)
    Sort Key: events_events.datetime
    Sort Method:  quicksort  Memory: 20kB
    ->  Hash Join  (cost=2.09..12286.62 rows=96269 width=16) (actual time=1080.789..1084.696 rows=77 loops=1)
          Hash Cond: (events_events.eventtype_id = events_event_types.id)
          ->  Seq Scan on events_events  (cost=0.00..9968.06 rows=361008 width=24) (actual time=0.010..542.946 rows=360637 loops=1)
                Filter: (NOT cleared)
          ->  Hash  (cost=1.89..1.89 rows=16 width=8) (actual time=0.077..0.077 rows=16 loops=1)
                ->  Seq Scan on events_event_types  (cost=0.00..1.89 rows=16 width=8) (actual time=0.010..0.046 rows=16 loops=1)
                      Filter: (severity = 70)
  Total runtime: 1085.145 ms
 
  Any suggestions?
 This plan is faster because you avoid the index scan.  The planner prefers
 to do a table scan to find what it needs, which is much faster than the
 360k random I/O index lookups.  You can force this type of plan with a
 subquery and the OFFSET 0 trick, but I'm not sure it's the best solution.

 eg

 explain analyze SELECT * FROM
    (SELECT events_events.id FROM events_events
     LEFT JOIN events_event_types on events_events.eventType_id=events_event_types.id
     WHERE events_event_types.severity=70
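
Spelled out in full, that rewrite would look roughly like the following (the
inner query here is reconstructed from the original query quoted above, so
treat it as a sketch rather than the exact statement):

 explain analyze SELECT * FROM
    (SELECT events_events.id
       FROM events_events
       LEFT JOIN events_event_types
         ON events_events.eventType_id = events_event_types.id
      WHERE events_event_types.severity = 70
        AND NOT events_events.cleared
      ORDER BY events_events.dateTime DESC
     OFFSET 0) AS sub
 LIMIT 100;

The OFFSET 0 keeps the subquery from being flattened, so the LIMIT above it
is not pushed down into the subquery's plan.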
 

Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)

2009-08-07 Thread Robert Haas
On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren <keesvandie...@gmail.com> wrote:
 Would it get attention if I submit this to
 http://www.postgresql.org/support/submitbug ? (in fact it is not really a
 bug, but an improvement request).

I think that many of the people who read that mailing list also read
this one, including, critically, Tom Lane, and you're not the first
person to run into a problem caused by lack of cross-column
statistics.  I don't think you're going to get very far by submitting
this as a bug.  There are already several people interested in this
problem, but as most of us don't get paid to hack on PostgreSQL, it's
a question of finding enough round tuits; this is not an easy thing to
fix.

If you are sufficiently bothered by this problem that you are willing
to pay someone to fix it for you, there are several companies with
whom you can contract to get this feature developed and committed for
the next release of PostgreSQL.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] postgresql and syslog

2009-08-07 Thread Michael Nacos
we have actually gone the opposite way and switched to using syslog
for logging purposes some time ago, with no performance issues.

syslog files are easily read by a lot of applications out there. We have
been using rsyslog for aggregating logs from multiple servers, splunk
for analysis purposes and pgfouine for routine reports.
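
For anyone who wants to try the same thing, the postgresql.conf side of it is
small; the facility and ident shown here are just the defaults:

log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'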

I would be very surprised if logging had a significant overhead with any
method you choose. There's probably something very wrong with your setup if
that is the case.

just another dimension, Michael


Re: [PERFORM] postgresql and syslog

2009-08-07 Thread Alvaro Herrera
Michael Nacos wrote:

 I would be very surprised if logging had a significant overhead any method
 you choose. there's probably something very wrong with your setup if this
 is the case.

Either something very wrong, or the load is extremely high.  In the
latter case perhaps it would make sense to ship syslog to a remote
machine.  Since it uses UDP sockets, it wouldn't block when overloaded
but rather lose messages (besides, it means it has low overhead).
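
A minimal sketch of that, assuming PostgreSQL logs to the local0 facility and
with a placeholder hostname, is a single line in syslog.conf (or rsyslog.conf):

# @ forwards via UDP (rsyslog also accepts @@ for TCP)
local0.*    @logserver.example.com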

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)

2009-08-07 Thread Scott Carey

On 8/7/09 5:53 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren <keesvandie...@gmail.com>
 wrote:
 Would it get attention if I submit this to
 http://www.postgresql.org/support/submitbug ? (in fact it is not really a
 bug, but an improvement request).
 
 I think that many of the people who read that mailing list also read
 this one, including, critically, Tom Lane, and you're not the first
 person to run into a problem caused by lack of cross-column
 statistics.  

Critically, it should be understood that this general problem is not just
born from lack of cross-column statistics.

It also stems from using the statistical expected value to calculate cost
without considering the variance.  Often, the cost of a plan varies widely
and nonlinearly with a small change in the expected value of the statistics
used to estimate that cost.

The problem is acute with LIMIT and various other boundary conditions where
there is a steep 'cost cliff' for certain plan types.  When LIMIT is
applied, the planner changes its estimates, but does not take into account
the _greatly_ increased uncertainty of those estimates.

Imagine a case where the planner's estimate is 100% correct, and on average
one side of a join will have 2 tuples.  The planner chooses nested loops to
do that join.  But the distribution of the number of tuples at this node is
skewed, so although the expected value is 2, values of 10 and 0 are both
common.  When values of 10 occur, the execution time goes up significantly.
An alternate plan might be slower for the case where the actual values in
execution equal the expected values, but faster in the average case!
The average cost of a plan is NOT the cost of the query with average
statistics, due to variance, nonlinearity, and skew.  And even if they were
equal, it might not be preferable to choose the plan that is best on average
over the one that is best at the 90th percentile.
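
To put made-up numbers on that: suppose that side of the join produces 0
tuples 80% of the time and 10 tuples 20% of the time, so the expected value
is 2.  If the nested loop costs about 100 per tuple and the hash join a flat
350, then costing at the expected value picks the nested loop (200 vs. 350),
and even its average cost is no worse (0.8*0 + 0.2*1000 = 200).  But in the
20% of executions where 10 tuples show up, the nested loop costs 1000 against
the hash join's 350 -- the plan that wins at the expected value and on
average is nearly 3x worse at the 90th percentile.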

Getting back to the case above with the nestloop -- if the planner estimated
the cost of the nestloop join versus other joins with some idea of the
variance in the estimations it could favor more 'stable' execution plans.

So in short, cross-column statistics don't have to be gathered and used to
make this problem less acute.  The planner just has to be more aware of
variance and the things that lead to it, such as column correlation.
Thus with a LIMIT applied, the expected value for the number of tuples
scanned before a match will shrink, but the uncertainty of this estimate
grows significantly as well, so the right plan to choose is one that hedges
against the uncertainty, not the one that assumes the expected value is
correct.
Gathering and using cross column statistics will change the expected value
for some plans, but more importantly will allow the uncertainty to be
reduced.  Better stats go hand in hand with the uncertainty analysis because
one can never have all cross column statistics, across all tables and all
join-function spaces, analyzed.  Stats gathering can never be complete or
without flaws.  The planner can never be perfect.


   

 I don't think you're going to get very far by submitting
 this as a bug.  There are already several people interested in this
 problem, but as most of us don't get paid to hack on PostgreSQL, it's
 a question of finding enough round tuits; this is not an easy thing to
 fix.
 
 If you are sufficiently bothered by this problem that you are willing
 to pay someone to fix it for you, there are several companies with
 whom you can contract to get this feature developed and committed for
 the next release of PostgreSQL.
 
 ...Robert
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Need suggestions on kernel settings for dedicated FreeBSD/Postgresql machine

2009-08-07 Thread Culley Harrelson
Hi Everyone,

I manage a freeBSD server that is dedicated to postgresql.  The
machine has 4 gigs of ram and there is a single database powering a
web application that is hosted on a neighboring machine.  The web
application is mostly reading the database but there are considerable
writes and I don't want to tune the machine exclusively for writes.  I
realize more information would be needed to optimally tune the machine
but I am seeking advice on making some sane kernel settings for a
general purpose database on a dedicated system.  Currently I have:

$ cat /etc/sysctl.conf

kern.ipc.shmmax=268435456
kern.ipc.shmall=65536

and

$ cat /boot/loader.conf
kern.ipc.semmni=256
kern.ipc.semmns=512
kern.ipc.semmnu=256

In postgresql.conf I have:

max_connections = 180
shared_buffers = 28MB

I would like to increase this to 256 connections and make sure the kernel
settings give postgresql enough breathing room.  I suspect my settings are
conservative, and since the machine is dedicated to postgresql I would like
to give it more resources if they could be used.  Any suggestions?
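
For what it's worth, the arithmetic behind those two sysctls (example numbers
only, not a recommendation): kern.ipc.shmmax is in bytes and kern.ipc.shmall
is in 4KB pages, so the current 268435456 / 65536 pair caps shared memory at
256MB either way.  Allowing roughly 1GB of shared memory on this 4GB box would
look something like:

kern.ipc.shmmax=1174405120    # ~1.1GB, headroom above shared_buffers
kern.ipc.shmall=286720        # shmmax / 4096, since shmall is counted in pages

The existing semaphore settings should already cover 256 connections, if I am
reading the kernel-resources section of the manual right (it wants roughly
ceil((max_connections + autovacuum workers + 4) / 16) semaphore sets of 17
semaphores each).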

culley

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] PG-related ACM Article: The Pathologies of Big Data

2009-08-07 Thread Josh Kupershmidt
Just stumbled across this recent article published in the
Communications of the ACM:

http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext

The author shares some insights relating to difficulties processing a
6.75 billion-row
table, a dummy table representing census-type data for everyone on earth, in
Postgres.

I'd really like to replicate the author's experiment, but it's not clear from
the article what his table definition looks like. He claims to be using a
16-byte record to store the several columns he needs for each row, so perhaps
he's using a user-defined type?
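
For what it's worth, one way to get a 16-byte payload per row without any
user-defined type is simply four 4-byte integers, something like this (column
names invented):

CREATE TABLE census_rows (
    col1 integer,   -- 4 x 4 bytes = 16 bytes of user data per row,
    col2 integer,   -- before Postgres's per-row overhead
    col3 integer,
    col4 integer
);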

The author implies with his definition of big data that the dataset he
analyzed is ... too large to be placed in a relational database... . From
Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
run on 6.75 billion rows. This amount of time for the query didn't seem
surprising to me given how many rows he has to process, but in a recent post
on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
ran the same SELECT query in 105 minutes! This would be very impressive (a
10-fold improvement over Postgres) if true.

The author complained that on larger tables [Postgres' planner] switched to
sorting by grouping columns, which he blamed for the slow query execution. I
don't personally see this plan as a problem, but maybe someone can enlighten
me.

One intriguing tidbit I picked up from the article: in modern systems, as
demonstrated in the figure, random access to memory is typically slower than
sequential access to disk. In hindsight, this seems plausible (since modern
disks can sustain sequential reads at well over 100MB/sec).

Anyway, it would be very interesting to attempt to speed up the author's query
if at all possible.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data

2009-08-07 Thread Greg Stark
On Fri, Aug 7, 2009 at 9:17 PM, Josh Kupershmidt <schmi...@gmail.com> wrote:
 Just stumbled across this recent article published in the
 Communications of the ACM:

 http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext

 The author shares some insights relating to difficulties processing a
 6.75 billion-row
 table, a dummy table representing census-type data for everyone on earth, in
 Postgres.

 I'd really like to replicate the author's experiment, but it's not clear from
 the article what his table definition looks like. He claims to be using a
 16-byte record to store the several columns he needs for each row, so perhaps
 he's using a user-defined type?

or four integers, or who knows. Postgres's per-row overhead is 24
bytes plus a 16-bit line pointer so you're talking about 42 bytes per
row. There's per-page overhead and alignment but in this case it
shouldn't be much.



 The author implies with his definition of big data that the dataset he
 analyzed is ... too large to be placed in a relational database... . From
 Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
 run on 6.75 billion rows. This amount of time for the query didn't seem
 surprising to me given how many rows he has to process, but in a recent post
 on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
 ran the same SELECT query in 105 minutes! This would be very impressive (a
 10-fold improvement over Postgres) if true.

6.75 billion * 42 bytes is 283.5GB.

Assuming you stick that on a single spindle capable of 100MB/s:

You have: 283.5GB / (100MB/s)
You want: min
        * 47.25

That's about 47 minutes for a single full sequential pass, so something's
not adding up.

 One intriguing tidbit I picked up from the article: in modern systems, as
 demonstrated in the figure, random access to memory is typically slower than
 sequential access to disk. In hindsight, this seems plausible (since modern
 disks can sustain sequential reads at well over 100MB/sec).

Sure, but the slowest PCIe bus can sustain 1GB/s and your memory
bandwidth is probably at least 8GB/s.

-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data

2009-08-07 Thread Scott Marlowe
On Fri, Aug 7, 2009 at 2:17 PM, Josh Kupershmidt <schmi...@gmail.com> wrote:
 Just stumbled across this recent article published in the
 Communications of the ACM:

 http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext

 The author shares some insights relating to difficulties processing a
 6.75 billion-row
 table, a dummy table representing census-type data for everyone on earth, in
 Postgres.

 I'd really like to replicate the author's experiment, but it's not clear from
 the article what his table definition looks like. He claims to be using a
 16-byte record to store the several columns he needs for each row, so perhaps
 he's using a user-defined type?

 The author implies with his definition of big data that the dataset he
 analyzed is ... too large to be placed in a relational database... . From
 Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
 run on 6.75 billion rows. This amount of time for the query didn't seem
 surprising to me given how many rows he has to process, but in a recent post
 on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
 ran the same SELECT query in 105 minutes! This would be very impressive (a
 10-fold improvement over Postgres) if true.

Well, from the article, I got the feeling he never showed up here on
the list to ask for help, and he just assumed he knew enough about
postgresql to say it couldn't scale well.  I just checked the
archives, and his name doesn't show up.

When you look at his slides, this one makes me wonder about a few points:

http://deliveryimages.acm.org/10.1145/154/1536632/figs/f3.jpg

He was using 8 15k SAS drives in RAID-5.  Just the fact that he's using
RAID-5 to test makes me wonder, but for his mostly-read workload it's useful.
But on his machine he was only getting 53MB/second sequential reads?  That
makes no sense.  I was getting 50MB/s from a 4 disk SATA RAID on older 120G
hard drives years ago.  SAS drives haven't been around that long really, so
I can't imagine having 7 disks (1 for parity) and only getting 53/7 or
7.5MB/second from them.  That's horrible.  I had 9GB 5.25-inch full-height
drives faster than that back in the day, on eight-bit SCSI controllers.  His
memory read speed was pretty bad too at only 350MB/s.  I have a 12 drive
RAID-10 that can outrun his memory reads.  So I tend to think his OS was set
up poorly, or his hardware was broken, or something like that.
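
(For comparison, raw sequential read speed is easy to measure outside the
database with something as crude as dd -- the path below is a placeholder,
and the file needs to be bigger than RAM so caching doesn't flatter the
number:

$ dd if=/path/to/big/file of=/dev/null bs=1M

A healthy 8-drive array should report several hundred MB/s here.)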

 The author complained that on larger tables [Postgres' planner] switched to
 sorting by grouping columns, which he blamed for the slow query execution. I
 don't personally see this plan as a problem, but maybe someone can enlighten
 me.

I'm sure that if he was on faster hardware it might have been quite a
bit faster.  I'd love to try his test on a real server with RAID-10
and lots of memory.  I'm certain I could get the run time down by a
couple factors.

I wonder if he cranked up work_mem? I wonder if he even upped shared_buffers?

 One intriguing tidbit I picked up from the article: in modern systems, as
 demonstrated in the figure, random access to memory is typically slower than
 sequential access to disk. In hindsight, this seems plausible (since modern
 disks can sustain sequential reads at well over 100MB/sec).

This is generally true.  But his numbers are off by factors for a modern
system.  Pentium IIs could sequentially read several hundred megabytes per
second from memory.  Any modern piece of kit, including my laptop, can do
much better than 350MB/second from memory.

I wonder if he'd make his work available to mess with, as it seems he
did a pretty poor job setting up his database server / OS for this
test.  At the very least I wonder if he has a colleague on this list
who might point him to us so we can try to help him improve the dismal
performance he seems to be getting.  Or maybe he could just google
postgresql performance tuning and take it from there...

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data

2009-08-07 Thread Scott Marlowe
Oh, I just noticed his graphic is in values per second, but he had
originally said they were 16-bit values.  Even if they were 32- or 64-bit
values, I'd expect way more than what he's getting there.

On Fri, Aug 7, 2009 at 6:40 PM, Scott Marlowe <scott.marl...@gmail.com> wrote:
 Well, from the article, I got the feeling he never showed up here on
 the list to ask for help, and he just assumed he knew enough about
 postgresql to say it couldn't scale well.  I just checked the
 archives, and his name doesn't show up.

 When you look at his slides, this one makes me wonder about a few points:

 http://deliveryimages.acm.org/10.1145/154/1536632/figs/f3.jpg


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data

2009-08-07 Thread Scott Carey
Well, there is CPU overhead for reading postgres pages and tuples.  On a
disk subsystem that gets 1GB/sec sequential reads, I can't get more than
about 700MB/sec of I/O on a select count(*) query on very large tables
with large rows (600 bytes), and it's closer to 300MB/sec if the rows are
smaller (75 bytes). In both cases it is CPU bound with little i/o wait and
disk utilization under 65% in iostat.

I also get over 13GB/sec to RAM from a single thread (Nehalem processor).

I don't see how on any recent hardware, random access to RAM is slower than
sequential from disk.  RAM access, random or not, is measured in GB/sec...



On 8/7/09 5:42 PM, Scott Marlowe scott.marl...@gmail.com wrote:

 Oh, I just noticed his graphic is in values per second, but he had
 originally said they were 16-bit values.  Even if they were 32- or 64-bit
 values, I'd expect way more than what he's getting there.
 
 On Fri, Aug 7, 2009 at 6:40 PM, Scott Marlowe <scott.marl...@gmail.com> wrote:
 Well, from the article, I got the feeling he never showed up here on
 the list to ask for help, and he just assumed he knew enough about
 postgresql to say it couldn't scale well.  I just checked the
 archives, and his name doesn't show up.
 
 When you look at his slides, this one makes me wonder about a few points:
 
 http://deliveryimages.acm.org/10.1145/154/1536632/figs/f3.jpg
 
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data

2009-08-07 Thread Scott Marlowe
On Fri, Aug 7, 2009 at 7:34 PM, Scott Carey <sc...@richrelevance.com> wrote:
 Well, there is CPU overhead for reading postgres pages and tuples.  On a
 disk subsystem that gets 1GB/sec sequential reads, I can't get more than
 about 700MB/sec of I/O on a select count(*) query on very large tables
 with large rows (600 bytes), and it's closer to 300MB/sec if the rows are
 smaller (75 bytes). In both cases it is CPU bound with little i/o wait and
 disk utilization under 65% in iostat.

 I also get over 13GB/sec to RAM from a single thread (Nehalem processor).

 I don't see how on any recent hardware, random access to RAM is slower than
 sequential from disk.  RAM access, random or not, is measured in GB/sec...

I don't think anybody's arguing that.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)

2009-08-07 Thread Robert Haas
On Fri, Aug 7, 2009 at 5:09 PM, Scott Carey <sc...@richrelevance.com> wrote:
 On 8/7/09 5:53 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren <keesvandie...@gmail.com>
 wrote:
 Would it get attention if I submit this to
 http://www.postgresql.org/support/submitbug ? (in fact it is not really a
 bug, but an improvement request).

 I think that many of the people who read that mailing list also read
 this one, including, critically, Tom Lane, and you're not the first
 person to run into a problem caused by lack of cross-column
 statistics.

 Critically, it should be understood that this general problem is not just
 born from lack of cross-column statistics.

 It also stems from using the statistical expected value to calculate cost
 without considering the variance.  Often, the cost of a plan varies widely
 and nonlinearly with a small change in the expected value of the statistics
 used to estimate that cost.

 The problem is acute with LIMIT and various other boundary conditions where
 there is a steep 'cost cliff' for certain plan types.  When LIMIT is
 applied, the planner changes its estimates, but does not take into account
 the _greatly_ increased uncertainty of those estimates.

 Imagine a case where the planner's estimate is 100% correct, and on average
 one side of a join will have 2 tuples.  The planner chooses nested loops to
 do that join.  But the distribution of the number of tuples at this node is
 skewed, so although the expected value is 2, values of 10 and 0 are both
 common.  When values of 10 occur, the execution time goes up significantly.
 An alternate plan might be slower for the case where the actual values in
 execution equal the expected values, but faster in the average case!
 The average cost of a plan is NOT the cost of the query with average
 statistics, due to variance, nonlinearity, and skew.  And even if they were
 equal, it might not be preferable to choose the plan that is best on average
 over the one that is best at the 90th percentile.

 Getting back to the case above with the nestloop -- if the planner estimated
 the cost of the nestloop join versus other joins with some idea of the
 variance in the estimations it could favor more 'stable' execution plans.

 So in short, cross-column statistics don't have to be gathered and used to
 make this problem less acute.  The planner just has to be more aware of
 variance and the things that lead to it, such as column correlation.
 Thus with a LIMIT applied, the expected value for the number of tuples
 scanned before a match will shrink, but the uncertainty of this estimate
 grows significantly as well, so the right plan to choose is one that hedges
 against the uncertainty, not the one that assumes the expected value is
 correct.

This is a good analysis.  I think I proposed some kind of idea about
tracking uncertainty in the planner a while back, but I never got very
far with it.  The problem is, first, figuring out how to estimate the
uncertainty, and second, figuring out what to do with the result once
you've estimated it.  The concept of hedging against uncertainty is
the right one, I think, but it's not obvious how to fit that into the
cost-comparison algorithm that the planner uses.  A sticking point
here too is that the comparison of path costs is already a hot spot;
making it more complex will likely have a noticeable negative impact
on query planning time.  For queries against huge databases that may
not matter much, but for OLTP queries it certainly does.

There are a couple of methods that have been proposed to deal with
this in the past.  The most obvious one that's been talked about a
couple of times is switching a nested loop to a hash join if the
number of iterations exceeds some bound, which would require some
executor support.

Empirically, it seems to me that the planner generally follows a
pretty consistent pattern.  If the inner relation is tiny or the
number of loops is estimated to be very small, it uses a nested loop.
When the inner rel is a bit bigger, or the number of iterations is
nontrivial, it switches to a hash join.  When the inner relation gets
too big to fit in work_mem, or just big enough that hashing it looks
too slow, it switches to a nested loop with inner index-scan or,
especially if a useful sort is available, a merge join.
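
(You can watch that progression on a toy setup by toggling the enable_* GUCs
and re-running EXPLAIN; for instance, borrowing the tables from the earlier
thread:

SET enable_nestloop = off;    -- take nested loops out of consideration
EXPLAIN SELECT e.id
  FROM events_events e
  JOIN events_event_types t ON e.eventtype_id = t.id
 WHERE t.severity = 70;
RESET enable_nestloop;

The plan you actually get depends on the data and settings, of course.)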

Just handling better the case where we pick a straight nested loop rather
than a hash join would help a lot of people.  Some basic conservatism about
the number of outer rows would be useful here (in particular, we should
probably assume that there will be at least 2 when costing a nested loop,
unless the outer side is known unique), and it's also worth thinking about
the fact that a hash join won't build the hash table unless there is at
least 1 outer row, which I don't think the current costing algorithm takes
into account.  Or maybe we should estimate the amount by which the nested
loop figures to beat out the hash join and only accepted