Re: [PERFORM] Best settings to load a fresh database
Kenneth Marshall wrote: I have found that increasing maintenance_work_mem speeds index rebuilds; turn off synchronous_commit, or fsync if you really can afford to start over. Another big help is to use the parallel pg_restore from PostgreSQL 8.4.0 to perform the restore.

And make sure archive mode is turned off; otherwise you can't use the WAL bypass facility.

-- Euler Taveira de Oliveira  http://www.timbira.com/
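For reference, a minimal sketch of what those settings can look like for a bulk restore (the values and the database name are illustrative, not from the posts above; revert the unsafe ones once the load finishes):

    # postgresql.conf, for the duration of the restore only
    maintenance_work_mem = 1GB     # speeds up the CREATE INDEX steps
    synchronous_commit = off       # at worst loses the last few commits on a crash
    fsync = off                    # only if you can afford to start the load over
    archive_mode = off             # required for the WAL-bypass path on COPY / CREATE INDEX

    # parallel restore from a custom-format dump, new in 8.4
    pg_restore -j 4 -d mydb mydb.dump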
Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)
Thanks for your response. I think your analysis is correct: when there are more than 100 rows that match this query, LIMIT 100 is fast. However, we often have fewer than a hundred rows, so this is not sufficient for us.

The suggested 'OFFSET 0' trick did not show any difference in response time (the query runs in 947ms). One thing that does help is limiting the set by adding this to the WHERE clause: and events_events.dateTime > '2009-07-24' (the query then runs in approx 500ms).

The workaround we implemented is query caching in our application (Java, with the JPA / Hibernate second-level query cache). This actually solves the problem for us, but I'd prefer to get this query to perform well in Postgres as well.

I'd think that the PostgreSQL query planner should be smarter in handling LIMIT statements. Would it get attention if I submit this to http://www.postgresql.org/support/submitbug ? (In fact it is not really a bug, but an improvement request.)

Best regards, Kees

2009/8/5 Russell Smith mr-r...@pws.com.au wrote:

Kees van Dieren wrote: Hi Folks, Thanks for your response. I have added the following index (suggested by another post):

CREATE INDEX events_events_cleared_eventtype
    ON events_events USING btree (eventtype_id, cleared)
    WHERE cleared = false;

Also with the columns in reversed order. No change in response time noticed. An index on the cleared column is already there (the indexes are in the SQL file attached to the initial post). eventtype_id has a foreign key constraint, which adds an index automatically I believe?

The EXPLAIN ANALYZE results for both queries:

explain analyze select events_events.id
  FROM events_events
  left join events_event_types on events_events.eventType_id = events_event_types.id
  where events_event_types.severity = 70
    and not events_events.cleared
  order by events_events.dateTime DESC
  LIMIT 100

Limit  (cost=0.00..125.03 rows=100 width=16) (actual time=0.046..3897.094 rows=77 loops=1)
  ->  Nested Loop  (cost=0.00..120361.40 rows=96269 width=16) (actual time=0.042..3896.881 rows=77 loops=1)
        ->  Index Scan Backward using events_events_datetime_ind on events_events  (cost=0.00..18335.76 rows=361008 width=24) (actual time=0.025..720.345 rows=360637 loops=1)
              Filter: (NOT cleared)
        ->  Index Scan using events_event_types_pkey on events_event_types  (cost=0.00..0.27 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=360637)
              Index Cond: (events_event_types.id = events_events.eventtype_id)
              Filter: (events_event_types.severity = 70)
Total runtime: 3897.268 ms

The plan here is guessing that we will find the 100 rows we want pretty quickly by scanning the dateTime index, as we aren't expecting to have to look through many rows to find 100 that match the criteria. With no cross-column statistics it's more a guess than a good calculation. So the guess is bad and we end up scanning 360k rows from the index before we find what we want. My skills are not up to giving specific advice on how to avert this problem. Maybe somebody else can help there.
explain analyze select events_events.id
  FROM events_events
  left join events_event_types on events_events.eventType_id = events_event_types.id
  where events_event_types.severity = 70
    and not events_events.cleared
  order by events_events.dateTime DESC

Sort  (cost=20255.18..20495.85 rows=96269 width=16) (actual time=1084.842..1084.951 rows=77 loops=1)
  Sort Key: events_events.datetime
  Sort Method: quicksort  Memory: 20kB
  ->  Hash Join  (cost=2.09..12286.62 rows=96269 width=16) (actual time=1080.789..1084.696 rows=77 loops=1)
        Hash Cond: (events_events.eventtype_id = events_event_types.id)
        ->  Seq Scan on events_events  (cost=0.00..9968.06 rows=361008 width=24) (actual time=0.010..542.946 rows=360637 loops=1)
              Filter: (NOT cleared)
        ->  Hash  (cost=1.89..1.89 rows=16 width=8) (actual time=0.077..0.077 rows=16 loops=1)
              ->  Seq Scan on events_event_types  (cost=0.00..1.89 rows=16 width=8) (actual time=0.010..0.046 rows=16 loops=1)
                    Filter: (severity = 70)
Total runtime: 1085.145 ms

Any suggestions?

This plan is faster because you avoid the index scan. The planner prefers to do a table scan to find what it needs, which is much faster than the 360k random-I/O index lookups. You can force this type of plan with a subquery and the OFFSET 0 trick, but I'm not sure it's the best solution. e.g.

explain analyze SELECT * FROM (SELECT events_events.id
  FROM events_events
  LEFT JOIN events_event_types on events_events.eventType_id = events_event_types.id
  WHERE events_event_types.severity = 70
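Russell's example is cut off above; a minimal sketch of how the OFFSET 0 rewrite is usually completed (the exact form he intended is an assumption; table and column names are taken from the query above):

    SELECT id FROM (
        SELECT events_events.id, events_events.dateTime
        FROM events_events
        LEFT JOIN events_event_types
               ON events_events.eventType_id = events_event_types.id
        WHERE events_event_types.severity = 70
          AND NOT events_events.cleared
        OFFSET 0   -- acts as an optimization fence, so the join is planned without the LIMIT
    ) AS sub
    ORDER BY dateTime DESC
    LIMIT 100;

The idea is that the inner query is planned as if all matching rows were needed (favoring the hash join shown above), and the sort and LIMIT are applied afterward.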
Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)
On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren keesvandie...@gmail.com wrote: Would it get attention if I submit this to http://www.postgresql.org/support/submitbug ? (in fact it is not really a bug, but an improvement request).

I think that many of the people who read that mailing list also read this one, including, critically, Tom Lane, and you're not the first person to run into a problem caused by lack of cross-column statistics. I don't think you're going to get very far by submitting this as a bug. There are already several people interested in this problem, but as most of us don't get paid to hack on PostgreSQL, it's a question of finding enough round tuits; this is not an easy thing to fix.

If you are sufficiently bothered by this problem that you are willing to pay someone to fix it for you, there are several companies with whom you can contract to get this feature developed and committed for the next release of PostgreSQL.

...Robert
Re: [PERFORM] postgresql and syslog
We have actually gone the opposite way and switched to using syslog for logging purposes some time ago, with no performance issues. Syslog files are easily read by a lot of applications out there. We have been using rsyslog for aggregating logs from multiple servers, Splunk for analysis purposes, and pgFouine for routine reports. I would be very surprised if logging had a significant overhead with any method you choose; there's probably something very wrong with your setup if this is the case. Just another dimension, Michael
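For reference, a minimal sketch of pointing PostgreSQL at syslog (the facility and ident values here are illustrative; the syslog daemon must also be configured to accept the chosen facility):

    # postgresql.conf
    log_destination = 'syslog'
    syslog_facility = 'LOCAL0'
    syslog_ident = 'postgres'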
Re: [PERFORM] postgresql and syslog
Michael Nacos wrote: I would be very surprised if logging had a significant overhead with any method you choose. There's probably something very wrong with your setup if this is the case.

Either something very wrong, or the load is extremely high. In the latter case perhaps it would make sense to ship syslog to a remote machine. Since it uses UDP sockets, it wouldn't block when overloaded but rather lose messages (besides, it means it has low overhead).

-- Alvaro Herrera  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
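A minimal sketch of shipping that facility to a remote log host with classic syslogd (the facility and hostname are assumptions; rsyslog accepts the same single-@ UDP syntax):

    # /etc/syslog.conf on the database server
    local0.*    @logserver.example.com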
Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)
On 8/7/09 5:53 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren keesvandie...@gmail.com wrote: Would it get attention if I submit this to http://www.postgresql.org/support/submitbug ? (in fact it is not really a bug, but an improvement request).

I think that many of the people who read that mailing list also read this one, including, critically, Tom Lane, and you're not the first person to run into a problem caused by lack of cross-column statistics.

Critically, it should be understood that this general problem is not just born from lack of cross-column statistics. It is also one of using the statistical expected value to calculate cost without consideration of the variance. Often, the cost of a plan varies widely and nonlinearly with a small change in the expected value of the stats used to estimate cost. The problem is acute with LIMIT and various other boundary conditions where there is a steep 'cost cliff' for certain plan types. When LIMIT is applied, the planner changes its estimates, but does not take into account the _greatly_ increased uncertainty of those estimates.

Imagine a case where the planner's estimate is 100% correct, and on average one side of a join will have 2 tuples. The planner chooses nested loops to do that join. But the distribution of the number of tuples at this node is skewed, so although the expected value is 2, values of 10 and 0 are both common. When values of 10 occur, the execution time goes up significantly. An alternate plan might be slower for the case where the actual values in execution equal the expected values, but faster in the average case! The average cost of a plan is NOT the cost of the query with average statistics, due to variance, nonlinearity, and skew. And even if they were equal, it might not be preferable to choose the plan that is best on average over the one that is best at the 90th percentile.

Getting back to the case above with the nestloop -- if the planner estimated the cost of the nestloop join versus other joins with some idea of the variance in the estimations, it could favor more 'stable' execution plans.

So in short, cross-column statistics don't have to be gathered and used to make this problem less acute. The planner just has to be more aware of variance and the things that lead to it, such as column correlation. Thus with a LIMIT applied, the expected value for the number of tuples scanned before a match will shrink, but the uncertainty of this estimate grows significantly as well, so the right plan to choose is one that hedges against the uncertainty, not the one that assumes the expected value is correct.

Gathering and using cross-column statistics will change the expected value for some plans, but more importantly will allow the uncertainty to be reduced. Better stats go hand in hand with the uncertainty analysis, because one can never have all cross-column statistics, across all tables and all join-function spaces, analyzed. Stats gathering can never be complete or without flaws. The planner can never be perfect.

I don't think you're going to get very far by submitting this as a bug. There are already several people interested in this problem, but as most of us don't get paid to hack on PostgreSQL, it's a question of finding enough round tuits; this is not an easy thing to fix.
If you are sufficiently bothered by this problem that you are willing to pay someone to fix it for you, there are several companies with whom you can contract to get this feature developed and committed for the next release of PostgreSQL. ...Robert
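To make the skewed-join illustration in the message above concrete, here is a small worked example with invented numbers (the cost functions are assumptions made up for the sake of argument, not PostgreSQL's actual cost model). Suppose the inner-join row count $N$ has distribution
\[
P(N=0)=0.8,\qquad P(N=10)=0.2,\qquad E[N]=2,
\]
and the two candidate plans cost
\[
c_{\mathrm{nestloop}}(N)=\begin{cases}10N, & N\le 2\\ 100N, & N>2\end{cases}
\qquad\qquad
c_{\mathrm{hash}}(N)=60+N .
\]
Costing at the expected value picks the nested loop, since $c_{\mathrm{nestloop}}(2)=20 < c_{\mathrm{hash}}(2)=62$. Averaged over the real distribution, though,
\[
E[c_{\mathrm{nestloop}}(N)] = 0.8\cdot 0 + 0.2\cdot 1000 = 200
\;>\;
E[c_{\mathrm{hash}}(N)] = 60 + 2 = 62,
\]
so the hash join is the better hedge even though it looks worse at the expected value.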
[PERFORM] Need suggestions on kernel settings for dedicated FreeBSD/Postgresql machine
Hi Everyone, I manage a FreeBSD server that is dedicated to PostgreSQL. The machine has 4 GB of RAM, and there is a single database powering a web application that is hosted on a neighboring machine. The web application mostly reads from the database, but there are considerable writes and I don't want to tune the machine exclusively for writes. I realize more information would be needed to tune the machine optimally, but I am seeking advice on making some sane kernel settings for a general-purpose database on a dedicated system. Currently I have:

$ cat /etc/sysctl.conf
kern.ipc.shmmax=268435456
kern.ipc.shmall=65536

and

$ cat /boot/loader.conf
kern.ipc.semmni=256
kern.ipc.semmns=512
kern.ipc.semmnu=256

In postgresql.conf I have:

max_connections = 180
shared_buffers = 28MB

I would like to increase this to 256 connections and make sure the kernel settings are giving PostgreSQL enough breathing room. I suspect my settings are conservative, and since the machine is dedicated to PostgreSQL I would like to give it more resources if they could be used. Any suggestions?

culley
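For comparison, a rough sketch of settings sized for a dedicated 4 GB machine with 256 connections (all numbers are illustrative guesses, not recommendations from this thread; the kernel-resources section of the PostgreSQL docs has the exact formulas):

    # /etc/sysctl.conf
    kern.ipc.shmmax=1610612736    # ~1.5 GB, must exceed shared_buffers plus PostgreSQL's other shared memory
    kern.ipc.shmall=393216        # shmmax divided by the 4 kB page size

    # /boot/loader.conf -- PostgreSQL needs roughly one semaphore set per 16 backends,
    # so 256 connections fit comfortably within these values
    kern.ipc.semmni=256
    kern.ipc.semmns=512
    kern.ipc.semmnu=256

    # postgresql.conf
    max_connections = 256
    shared_buffers = 1GB          # ~25% of RAM is a common starting point on a dedicated box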
[PERFORM] PG-related ACM Article: The Pathologies of Big Data
Just stumbled across this recent article published in the Communications of the ACM: http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext

The author shares some insights relating to difficulties processing a 6.75 billion-row table (a dummy table representing census-type data for everyone on earth) in Postgres. I'd really like to replicate the author's experiment, but it's not clear from the article what his table definition looks like. He claims to be using a 16-byte record to store the several columns he needs for each row, so perhaps he's using a user-defined type?

The author implies with his definition of "big data" that the dataset he analyzed is "... too large to be placed in a relational database ...". From Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when run on 6.75 billion rows. This amount of time for the query didn't seem surprising to me given how many rows he has to process, but in a recent post on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres ran the same SELECT query in 105 minutes! This would be very impressive (a 10-fold improvement over Postgres) if true.

The author complained that on larger tables "[Postgres' planner] switched to sorting by grouping columns", which he blamed for the slow query execution. I don't personally see this plan as a problem, but maybe someone can enlighten me.

One intriguing tidbit I picked up from the article: "in modern systems, as demonstrated in the figure, random access to memory is typically slower than sequential access to disk." In hindsight, this seems plausible (since modern disks can sustain sequential reads at well over 100MB/sec).

Anyway, it would be very interesting to attempt to speed up the author's query if at all possible.
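If anyone wants to try reproducing it, one plausible reading of the 16-byte record is simply four 4-byte integers; this is a guess at the schema, not the author's actual definition:

    CREATE TABLE world_census (
        person_id  integer NOT NULL,   -- 4 bytes
        age        integer NOT NULL,   -- 4 bytes
        region_id  integer NOT NULL,   -- 4 bytes
        income     integer NOT NULL    -- 4 bytes
    );
    -- 16 bytes of user data per row; PostgreSQL adds its own per-row header on disk.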
Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data
On Fri, Aug 7, 2009 at 9:17 PM, Josh Kupershmidt schmi...@gmail.com wrote: Just stumbled across this recent article published in the Communications of the ACM: http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext The author shares some insights relating to difficulties processing a 6.75 billion-row table, a dummy table representing census-type data for everyone on earth, in Postgres. I'd really like to replicate the author's experiment, but it's not clear from the article what his table definition looks like. He claims to be using a 16-byte record to store the several columns he needs for each row, so perhaps he's using a user-defined type?

Or four integers, or who knows. Postgres's per-row overhead is 24 bytes plus a 16-bit line pointer, so you're talking about 42 bytes per row. There's per-page overhead and alignment, but in this case it shouldn't be much.

The author implies with his definition of big data that the dataset he analyzed is "... too large to be placed in a relational database ...". From Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when run on 6.75 billion rows. This amount of time for the query didn't seem surprising to me given how many rows he has to process, but in a recent post on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres ran the same SELECT query in 105 minutes! This would be very impressive (a 10-fold improvement over Postgres) if true.

6.75 billion * 42 bytes is 283.5GB. Assuming you stick that on a single spindle capable of 100MB/s:

You have: 283.5GB / (100MB/s)
You want: min
        * 47.25

So something's not adding up.

One intriguing tidbit I picked up from the article: "in modern systems, as demonstrated in the figure, random access to memory is typically slower than sequential access to disk." In hindsight, this seems plausible (since modern disks can sustain sequential reads at well over 100MB/sec).

Sure, but the slowest PCIe bus can sustain 1GB/s and your memory bandwidth is probably at least 8GB/s.

-- greg
http://mit.edu/~gsstark/resume.pdf
Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data
On Fri, Aug 7, 2009 at 2:17 PM, Josh Kupershmidt schmi...@gmail.com wrote: Just stumbled across this recent article published in the Communications of the ACM: http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext The author shares some insights relating to difficulties processing a 6.75 billion-row table, a dummy table representing census-type data for everyone on earth, in Postgres. I'd really like to replicate the author's experiment, but it's not clear from the article what his table definition looks like. He claims to be using a 16-byte record to store the several columns he needs for each row, so perhaps he's using a user-defined type? The author implies with his definition of big data that the dataset he analyzed is "... too large to be placed in a relational database ...". From Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when run on 6.75 billion rows. This amount of time for the query didn't seem surprising to me given how many rows he has to process, but in a recent post on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres ran the same SELECT query in 105 minutes! This would be very impressive (a 10-fold improvement over Postgres) if true.

Well, from the article, I got the feeling he never showed up here on the list to ask for help, and he just assumed he knew enough about postgresql to say it couldn't scale well. I just checked the archives, and his name doesn't show up. When you look at his slides, this one makes me wonder about a few points: http://deliveryimages.acm.org/10.1145/154/1536632/figs/f3.jpg

He was using eight 15k SAS drives in RAID-5. Just the fact that he's using RAID-5 to test makes me wonder, but for his mostly-read workload it's useful. But on his machine he was only getting 53MB/second sequential reads? That makes no sense. I was getting 50MB/s from a 4-disk SATA RAID on older 120GB hard drives years ago. SAS drives haven't been around that long really, so I can't imagine having 7 disks (1 for parity) and only getting 53/7 or 7.5MB/second from them. That's horrible. I had 9GB 5.25" full-height drives faster than that back in the day, on eight-bit SCSI controllers.

His memory read speed was pretty bad too, at only 350MB/s. I have a 12-drive RAID-10 that can outrun his memory reads. So I tend to think his OS was set up poorly, or his hardware was broken, or something like that.

The author complained that on larger tables "[Postgres' planner] switched to sorting by grouping columns", which he blamed for the slow query execution. I don't personally see this plan as a problem, but maybe someone can enlighten me.

I'm sure that if he was on faster hardware it might have been quite a bit faster. I'd love to try his test on a real server with RAID-10 and lots of memory. I'm certain I could get the run time down by a couple of factors. I wonder if he cranked up work_mem? I wonder if he even upped shared_buffers?

One intriguing tidbit I picked up from the article: "in modern systems, as demonstrated in the figure, random access to memory is typically slower than sequential access to disk." In hindsight, this seems plausible (since modern disks can sustain sequential reads at well over 100MB/sec).

This is generally always true. But his numbers are off by factors for a modern system. Pentium IIs could sequentially read several hundred megs per second from memory. Any modern piece of kit, including my laptop, can do much, much better than 350MB/second from memory.
I wonder if he'd make his work available to mess with, as it seems he did a pretty poor job setting up his database server / OS for this test. At the very least I wonder if he has a colleague on this list who might point him to us so we can try to help him improve the dismal performance he seems to be getting. Or maybe he could just google postgresql performance tuning and take it from there...
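On the work_mem / shared_buffers question raised above, a rough sketch of what a dedicated server of that class might start from (the values are illustrative guesses for a machine with around 32 GB of RAM, not figures from the article or this thread):

    # postgresql.conf
    shared_buffers = 4GB            # the compiled-in default is far too small for a dedicated server
    effective_cache_size = 24GB     # roughly what the OS can cache; only influences plan choice
    work_mem = 256MB                # per sort/hash node; keep an eye on concurrent query count
    maintenance_work_mem = 1GB      # helps index builds and vacuums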
Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data
Oh, I just noticed his graphic is values per second, but he had originally said they were 16 bit values. Even if they were 32 or 64 bit values, I'd expect way more than what he's getting there.

On Fri, Aug 7, 2009 at 6:40 PM, Scott Marlowe scott.marl...@gmail.com wrote: Well, from the article, I got the feeling he never showed up here on the list to ask for help, and he just assumed he knew enough about postgresql to say it couldn't scale well. I just checked the archives, and his name doesn't show up. When you look at his slides, this one makes me wonder about a few points: http://deliveryimages.acm.org/10.1145/154/1536632/figs/f3.jpg
Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data
Well, there is CPU overhead for reading Postgres pages and tuples. On a disk subsystem that gets 1GB/sec sequential reads, I can't get more than about 700MB/sec of I/O on a select count(*) query on very large tables with large rows (600 bytes), and it's closer to 300MB/sec if the rows are smaller (75 bytes). In both cases it is CPU bound with little I/O wait and disk utilization under 65% in iostat. I also get over 13GB/sec to RAM from a single thread (Nehalem processor).

I don't see how, on any recent hardware, random access to RAM is slower than sequential from disk. RAM access, random or not, is measured in GB/sec...

On 8/7/09 5:42 PM, Scott Marlowe scott.marl...@gmail.com wrote: Oh I just noticed his graphic is values per second but he had originally said they were 16 bit values. Even if they were 32 or 64 bit values, I'd expect way more than what he's getting there.

On Fri, Aug 7, 2009 at 6:40 PM, Scott Marlowe scott.marl...@gmail.com wrote: Well, from the article, I got the feeling he never showed up here on the list to ask for help, and he just assumed he knew enough about postgresql to say it couldn't scale well. I just checked the archives, and his name doesn't show up. When you look at his slides, this one makes me wonder about a few points: http://deliveryimages.acm.org/10.1145/154/1536632/figs/f3.jpg
Re: [PERFORM] PG-related ACM Article: The Pathologies of Big Data
On Fri, Aug 7, 2009 at 7:34 PM, Scott Carey sc...@richrelevance.com wrote: Well, there is CPU overhead for reading postgres pages and tuples. On a disk subsystem that gets 1GB/sec sequential reads, I can't get more than about 700MB/sec of I/O and on a select count(*) query on very large tables with large rows (600 bytes) and its closer to 300MB/sec if the rows are smaller (75 bytes). In both cases it is CPU bound with little i/o wait and disk utilization under 65% in iostat. I also get over 13GB/sec to RAM from a single thread (Nehalem processor). I don't see how on any recent hardware, random access to RAM is slower than sequential from disk. RAM access, random or not, is measured in GB/sec...

I don't think anybody's arguing that.
Re: [PERFORM] SQL select query becomes slow when using limit (with no offset)
On Fri, Aug 7, 2009 at 5:09 PM, Scott Carey sc...@richrelevance.com wrote:

On 8/7/09 5:53 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren keesvandie...@gmail.com wrote: Would it get attention if I submit this to http://www.postgresql.org/support/submitbug ? (in fact it is not really a bug, but an improvement request). I think that many of the people who read that mailing list also read this one, including, critically, Tom Lane, and you're not the first person to run into a problem caused by lack of cross-column statistics.

Critically, it should be understood that this general problem is not just born from lack of cross-column statistics. It is also one of using the statistical expected value to calculate cost without consideration of the variance. Often, the cost of a plan varies widely and nonlinearly with a small change in the expected value of the stats used to estimate cost. The problem is acute with LIMIT and various other boundary conditions where there is a steep 'cost cliff' for certain plan types. When LIMIT is applied, the planner changes its estimates, but does not take into account the _greatly_ increased uncertainty of those estimates.

Imagine a case where the planner's estimate is 100% correct, and on average one side of a join will have 2 tuples. The planner chooses nested loops to do that join. But the distribution of the number of tuples at this node is skewed, so although the expected value is 2, values of 10 and 0 are both common. When values of 10 occur, the execution time goes up significantly. An alternate plan might be slower for the case where the actual values in execution equal the expected values, but faster in the average case! The average cost of a plan is NOT the cost of the query with average statistics, due to variance, nonlinearity, and skew. And even if they were equal, it might not be preferable to choose the plan that is best on average over the one that is best at the 90th percentile.

Getting back to the case above with the nestloop -- if the planner estimated the cost of the nestloop join versus other joins with some idea of the variance in the estimations, it could favor more 'stable' execution plans.

So in short, cross-column statistics don't have to be gathered and used to make this problem less acute. The planner just has to be more aware of variance and the things that lead to it, such as column correlation. Thus with a LIMIT applied, the expected value for the number of tuples scanned before a match will shrink, but the uncertainty of this estimate grows significantly as well, so the right plan to choose is one that hedges against the uncertainty, not the one that assumes the expected value is correct.

This is a good analysis. I think I proposed some kind of idea about tracking uncertainty in the planner a while back, but I never got very far with it. The problem is, first, figuring out how to estimate the uncertainty, and second, figuring out what to do with the result once you've estimated it. The concept of hedging against uncertainty is the right one, I think, but it's not obvious how to fit that into the cost-comparison algorithm that the planner uses.

A sticking point here, too, is that the comparison of path costs is already a hot spot; making it more complex will likely have a noticeable negative impact on query planning time. For queries against huge databases that may not matter much, but for OLTP queries it certainly does.
There are a couple of methods that have been proposed to deal with this in the past. The most obvious one that's been talked about a couple of times is switching a nested loop to a hash join if the number of iterations exceeds some bound, which would require some executor support.

Empirically, it seems to me that the planner generally follows a pretty consistent pattern. If the inner relation is tiny or the number of loops is estimated to be very small, it uses a nested loop. When the inner rel is a bit bigger, or the number of iterations is nontrivial, it switches to a hash join. When the inner relation gets too big to fit in work_mem, or just big enough that hashing it looks too slow, it switches to a nested loop with inner index-scan or, especially if a useful sort is available, a merge join.

Just handling better the case where we pick a straight nested loop rather than a hash join would help a lot of people. Some basic conservatism about the number of outer rows would be useful here (in particular, we should probably assume that there will be at least 2 when costing a nested loop, unless the outer side is known unique), and it's also worth thinking about the fact that a hash join won't build the table unless there is at least 1 outer row, which I don't think the current costing algorithm takes into account. Or maybe we should estimate the amount by which the nested loop figures to beat out the hash join and only accepted