Re: Out of Memory errors are frustrating as heck!

2019-10-08 Thread Merlin Moncure
On Tue, Oct 8, 2019 at 12:44 PM Merlin Moncure  wrote:
> On Sun, Apr 14, 2019 at 3:51 PM Gunther  wrote:
> >
> > For weeks now, I am banging my head at an "out of memory" situation. There 
> > is only one query I am running on an 8 GB system, whatever I try, I get 
> > knocked out on this out of memory. It is extremely impenetrable to 
> > understand and fix this error. I guess I could add a swap file, and then I 
> > would have to take the penalty of swapping. But how can I actually address 
> > an out of memory condition if the system doesn't tell me where it is 
> > happening?
> > We can't really see anything too worrisome. There is always lots of memory 
> > used by cache, which could have been mobilized. The only possible 
> > explanation I can think of is that in that moment of the crash the memory 
> > utilization suddenly skyrocketed in less than a second, so that the 2 
> > second vmstat interval wouldn't show it??? Nah.
> >
> > I have already much reduced work_mem, which has helped in some other cases 
> > before. Now I am going to reduce shared_buffers too, but that seems 
> > counter-intuitive because we are sitting on all that cache memory unused!
> >
> > Might this be a bug? It feels like a bug. It feels like those out of memory 
> > issues should be handled more gracefully (garbage collection attempt?) and 
> > that somehow there should be more information so the person can do anything 
> > about it.
>
> I kind of agree that nothing according to vmstat suggests you have a
> problem.  One thing you left out is the precise mechanics of the
> failure; is the database getting nuked by the oom killer?  Do you have
> the logs?

oops, I missed quite a bit of context upthread.  sorry for repeat noise.

merlin




Re: Out of Memory errors are frustrating as heck!

2019-10-08 Thread Merlin Moncure
On Sun, Apr 14, 2019 at 3:51 PM Gunther  wrote:
>
> For weeks now, I am banging my head at an "out of memory" situation. There is 
> only one query I am running on an 8 GB system, whatever I try, I get knocked 
> out on this out of memory. It is extremely impenetrable to understand and fix 
> this error. I guess I could add a swap file, and then I would have to take 
> the penalty of swapping. But how can I actually address an out of memory 
> condition if the system doesn't tell me where it is happening?
> We can't really see anything too worrisome. There is always lots of memory 
> used by cache, which could have been mobilized. The only possible explanation 
> I can think of is that in that moment of the crash the memory utilization 
> suddenly skyrocketed in less than a second, so that the 2 second vmstat 
> interval wouldn't show it??? Nah.
>
> I have already much reduced work_mem, which has helped in some other cases 
> before. Now I am going to reduce shared_buffers too, but that seems 
> counter-intuitive because we are sitting on all that cache memory unused!
>
> Might this be a bug? It feels like a bug. It feels like those out of memory 
> issues should be handled more gracefully (garbage collection attempt?) and 
> that somehow there should be more information so the person can do anything 
> about it.

I kind of agree that nothing according to vmstat suggests you have a
problem.  One thing you left out is the precise mechanics of the
failure; is the database getting nuked by the oom killer?  Do you have
the logs?

*) what are the values of shared_buffers, work_mem, and maintenance_work_mem? (see the snippet after this list)

*) Is this a 32-bit build? (I'm guessing no, but worth asking)

*) I see that you've disabled swap.  Maybe it should be enabled?

*) Can you get the query to run through?  An 'explain analyze' might
point to gross misses in the plan; say, sort memory overuse

*) If you're still getting failures, maybe we need to look at the
sampling frequency of memory usage.

*) iowait is super high.

*) I see optimization potential in this query; explain analyze would
help here too.
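
(For reference, a quick way to answer the settings questions from psql -
illustrative, adjust to your setup:)

  show shared_buffers;
  show work_mem;
  show maintenance_work_mem;
  select version();   -- the platform string also hints at 32- vs 64-bit build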

merlin




Re: Out of Memory errors are frustrating as heck!

2019-10-06 Thread Tomas Vondra

On Fri, Aug 23, 2019 at 09:17:38AM -0400, Gunther wrote:
Hi all, I am connecting to a discussion back from April this year. My 
data has grown and now I am running into new out of memory situations. 
Meanwhile the world turned from 11.2 to 11.5 which I just installed 
only to find the same out of memory error.




As Tom already said, this seems like a quite independent issue. Next
time it'd be better to share it in a new thread, not to mix it up with
the old discussion.

Have any of the things discussed and proposed, especially this last 
one by Tomas Vondra, been applied to the 11 releases? Should I try 
these older patches from April?




Unfortunately, no. We're still discussing what would be the right fix
(it's rather tricky and the patches I shared were way too experimental
for that). But I'm pretty sure whatever we end up doing it's going to be
way too invasive for backpatch. I.e. the older branches will likely have
this issue until EOL.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-10-06 Thread Tomas Vondra

On Sat, Aug 24, 2019 at 11:40:09AM -0400, Gunther wrote:
Thanks Tom, yes I'd say it's using a lot of memory, but wouldn't call 
it "leak" as it doesn't grow during the 30 min or so that this query 
runs. It explodes to 4GB and then stays flat until done.




Well, the memory context stats you've shared however show this:

total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks); 1340816136 used

That's only ~1.3GB, and ~1.1GB of that is the expression context. So
when you say 4GB, when does that happen and can you share stats showing
state at that point?

Yes, and this time the query is super complicated, with many joins and 
tables involved. The query plan has 100 lines. Not easy to share or 
reproduce, and I have my issue under control by adding some swap just 
in case. The swap space was never actually used.




Still, without the query plan we can hardly make any guesses about what
might be the issue.
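
Even a plan for a long-running query can be captured directly, e.g.
(illustrative, using the view from this thread):

  explain (analyze, buffers) select * from reports.v_BusinessOperation;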


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-08-23 Thread Tom Lane
Gunther  writes:
> Hi all, I am connecting to a discussion back from April this year. My 
> data has grown and now I am running into new out of memory situations. 

It doesn't look like this has much of anything to do with the hash-table
discussion.  The big hog is an ExprContext:

> ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks); 
> 1107289928 used

So there's something leaking in there, but this isn't enough info
to guess what.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-08-23 Thread Gunther
Hi all, I am connecting to a discussion back from April this year. My 
data has grown and now I am running into new out of memory situations. 
Meanwhile the world turned from 11.2 to 11.5 which I just installed only 
to find the same out of memory error.


Have any of the things discussed and proposed, especially this last one 
by Tomas Vondra, been applied to the 11 releases? Should I try these 
older patches from April?


regards,
-Gunther

For what it is worth, this is what I am getting:

TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8388608 total in 11 blocks; 3094872 free (4 chunks); 5293736 used
    JoinRelHashTable: 16384 total in 2 blocks; 5576 free (1 chunks); 10808 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 12720 free (8 chunks); 20048 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:
      ExecutorState: 202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used
        HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
          HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 10607168 used
        HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 used
          HashBatchContext: 13079304 total in 336 blocks; 7936 free (0 chunks); 13071368 used
        TupleSort main: 49208 total in 3 blocks; 8552 free (7 chunks); 40656 used
          Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
        Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
        Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used

Re: Out of Memory errors are frustrating as heck!

2019-04-28 Thread Tomas Vondra

On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:


...

I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.



OK, so I took a stab at this, and overall it seems to be workable. The
patches I have are nowhere near committable, but I think the approach
works fairly well - the memory is kept in check, and the performance is
comparable to the "balancing" approach tested before.

To explain it a bit, the idea is that we can compute how many BufFile
structures we can keep in memory - we can't use more than work_mem/2 for
that, because then we'd mostly eliminate space for the actual data. For
example with 4MB, we know we can keep 128 batches - we need 128 for
outer and inner side, so 256 in total, and 256*8kB = 2MB.
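
In code, that budget computation is just simple arithmetic (illustrative
values and names, not the actual patch):

  #include <stdio.h>

  int main(void)
  {
      size_t work_mem  = 4 * 1024 * 1024;   /* work_mem = 4MB */
      size_t budget    = work_mem / 2;      /* at most half of it for BufFiles */
      size_t per_batch = 2 * 8192;          /* outer + inner file, 8kB buffer each */

      printf("in-memory batches: %zu\n", budget / per_batch);   /* prints 128 */
      return 0;
  }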

And then, we just increase the number of batches but instead of adding
the BufFile entries, we split batches into slices that we can keep in
memory (say, the 128 batches). And we keep BufFiles for the current one
and an "overflow file" for the other slices. After processing a slice,
we simply switch to the next one, and use the overflow file as a temp
file for the first batch - we redistribute it into the other batches in
the slice and another overflow file.

That's what the v3 patch (named 'single overflow file') does. It does
work, but unfortunately it significantly inflates the amount of data
written to temporary files. Assume we need e.g. 1024 batches, but only
128 fit into memory. That means we'll need 8 slices, and during the
first pass we'll handle 1/8 of the data and write 7/8 to the overflow
file.  Then after processing the slice and switching to the next one, we
repeat this dance - 1/8 gets processed, 6/8 written to another overflow
file. So essentially we "forward" about

   7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5

of data between slices, and we need to re-shuffle data in each slice,
which amounts to additional 1x data. That's pretty significant overhead,
as will be clear from the measurements I'll present shortly.

But luckily, there's a simple solution to this - instead of writing the
data into a single overflow file, we can create one overflow file for
each slice. That will leave us with the ~1x of additional writes when
distributing data into batches in the current slice, but it eliminates
the main source of write amplification - the avalanche-like forwarding of
data between slices.

This relaxes the memory limit a bit again, because we can't really keep
the number of overflow files constrained by work_mem, but we should only
need a few of them (much fewer than when adding one file per batch right
away). For example with 128 in-memory batches, this reduces the amount
of necessary memory 128x.

And this is what v4 (per-slice overflow file) does, pretty much.


Two more comments, regarding memory accounting in previous patches. It
was a bit broken, because we actually need 2x the number of BufFiles. We
needed nbatch files for outer side and nbatch files for inner side, but
we only considered one of those - both when deciding when to increase
the number of batches / increase spaceAllowed, and when reporting the
memory usage. So with large number of batches the reported amount of
used memory was roughly 1/2 of the actual value :-/

The memory accounting was a bit bogus for another reason - spaceUsed
simply tracks the amount of memory for hash table contents. But at the
end we were simply adding the current space for BufFile stuff, ignoring
the fact that that's likely much larger than when the spacePeak value
got stored. For example we might have kept early spaceUsed when it was
almost work_mem, and then added the final large BufFile allocation.

I've fixed both issues in the patches attached to this message. It does
not make a huge difference in practice, but it makes it easier to
compare values between patches.


Now, some test results - I've repeated the simple test with uniform data
set, which is pretty much ideal for hash joins (no unexpectedly large
batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M
and 50M rows in the large table (which gets picked for the "hash" side),
and measured how much memory gets used, how many batches, how long it
takes and how much data gets written to temp files.

See the hashjoin-test.sh script for more details.

So, here are the results with work_mem = 4MB (so the number of in-memory
batches for the last two entries is 128). The columns are:

* nbatch - the final number of batches
* memory - memory usage, as reported by explain analyze
* time - duration of the query (without explain analyze) in seconds
* size - size of the large table
* temp - amount of data written to temp files
* amplif - write amplification (temp / size)


 1M rows
 

Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 07:09:00PM -0400, Gunther wrote:

  On 4/23/2019 16:43, Justin Pryzby wrote:

It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.

  I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to
  16MB. Need to set vm.overcommit_ratio from 50 to 75 (percent, with
  vm.overcommit_memory = 2 as it is.)

We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.

  That may be so, but mis-estimates happen. And I can still massively
  improve this query logically I am sure.  In fact it sticks out like a sore
  thumb, logically it makes no sense to churn over 100 million rows here,
  but the point is that hopefully PostgreSQL runs stable in such outlier
  situations, comes back and presents you with 2 hours of work time, 40 GB
  temp space, or whatever, and then we users can figure out how to make it
  work better. The frustrating thing is to get an out of memory error
  without knowing what we can possibly do about it.



Sure. And I think the memory balancing algorithm implemented in the v2
patch is a step in that direction. I think we can do better in terms of
memory consumption (essentially keeping it closer to work_mem) but it's
unlikely to be any faster.

In a way this is similar to underestimates in hash aggregate, except
that in that case we don't have any spill-to-disk fallback at all.


  From my previous attempt with this tmp_r and tmp_q table, I also know that
  the Sort/Unique step is taking a lot of extra time. I can cut that out too
  by addressing the causes of the "repeated result" rows. But again, that is
  all secondary optimizations.

Merge Cond: (((documentinformationsubject.documentinternalid)::text = 
(documentinformationsubject_1.documentinternalid)::text) AND 
((documentinformationsubject.documentid)::text = 
(documentinformationsubject_1.documentid)::text) AND 
((documentinformationsubject.actinternalid)::text = 
(documentinformationsubject_1.actinternalid)::text))

If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.

  I understand. You are saying that by reducing the number of columns in the
  join condition, somehow you might be able to reduce the size of the hash
  temporary table?



No. When estimating the join result size with multiple join clauses, the
optimizer essentially has to compute

   (1)  P((x1 = y1) && (x2 = y2) && (x3 = y3))

so it assumes statistical independence of those conditions and splits
that into

   (2)  P(x1 = y1) * P(x2 = y2) * P(x3 = y3)

But when those conditions are dependent - for example when (x1=y1) implies
((x2=y2) && (x3=y3)) - this results in a significant underestimate. E.g.
let's assume that each of those conditions matches 1/100 rows, but that
essentially x1=x2=x3 and y1=y2=y3. Then (1) is 1/100 but (2) ends up
being 1/1,000,000, so 10,000x off.

Chances are this is what's happening with the inner side of the hash
join, which is estimated to return 14425 rows but ends up returning
37826042.

There's one trick you might try, though - using indexes on composite types:

 create table t1 (a int, b int);
 create table t2 (a int, b int);

 insert into t1 select mod(i,1000), mod(i,1000)
   from generate_series(1,10) s(i);

 insert into t2 select mod(i,1000), mod(i,1000)
   from generate_series(1,10) s(i);

 analyze t1;
 analyze t2;

 explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                               QUERY PLAN
 ------------------------------------------------------------------------
  Merge Join  (cost=19495.72..21095.56 rows= width=16)
              (actual time=183.043..10360.276 rows=1000 loops=1)
    Merge Cond: ((t1.a = t2.a) AND (t1.b = t2.b))
 ...

 create type composite_id as (a int, b int);

 create index on t1 (((a,b)::composite_id));
 create index on t2 (((a,b)::composite_id));

 analyze t1;
 analyze t2;

 explain analyze select * from t1 join t2
   on ((t1.a,t1.b)::composite_id = (t2.a,t2.b)::composite_id);

                               QUERY PLAN
 ------------------------------------------------------------------------
  Merge Join  (cost=0.83..161674.40 rows=460 width=16)
              (actual time=0.020..12726.767 rows=1000 loops=1)
    Merge Cond: (ROW(t1.a, t1.b)::composite_id = ROW(t2.a, t2.b)::composite_id)

Obviously, that's not exactly free - you have to pay the price for index
creation, maintenance, and storage.


...

  Unique/Sort actual time   6,150,303.060 ms = 6,150 s <~ 2 h.
  Hash Right Join actual time 325,240.679 ms.

  So really all time is wasted in that sort, no need for you guys to worry
  about anything else with these 2 hours.  Tomas just stated the same thing.

Right. Chances are that with a better estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.

Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Gunther

On 4/23/2019 16:43, Justin Pryzby wrote:

It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.


I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to 
16MB. Need to set vm.overcommit_ratio from 50 to 75 (percent, with 
vm.overcommit_memory = 2 as it is.)
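
In other words, something like this (illustrative) in /etc/sysctl.conf,
applied with sysctl -p:

  vm.overcommit_memory = 2
  vm.overcommit_ratio = 75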



We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.


That may be so, but mis-estimates happen. And I can still massively 
improve this query logically I am sure.  In fact it sticks out like a 
sore thumb, logically it makes no sense to churn over 100 million rows 
here, but the point is that hopefully PostgreSQL runs stable in such 
outlier situations, comes back and presents you with 2 hours of work 
time, 40 GB temp space, or whatever, and then we users can figure out 
how to make it work better. The frustrating thing is to get an out of 
memory error without knowing what we can possibly do about it.


From my previous attempt with this tmp_r and tmp_q table, I also know 
that the Sort/Unique step is taking a lot of extra time. I can cut that 
out too by addressing the causes of the "repeated result" rows. But 
again, that is all secondary optimizations.



Merge Cond: (((documentinformationsubject.documentinternalid)::text = 
(documentinformationsubject_1.documentinternalid)::text) AND 
((documentinformationsubject.documentid)::text = 
(documentinformationsubject_1.documentid)::text) AND 
((documentinformationsubject.actinternalid)::text = 
(documentinformationsubject_1.actinternalid)::text))

If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.
I understand. You are saying that by reducing the number of columns in the 
join condition, somehow you might be able to reduce the size of the hash 
temporary table?

Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns (and
ANALYZE).


I think documentId and documentInternalId are 1:1; they are both primary / 
alternate keys. So I could go with only one of them, but since I end up 
needing both elsewhere inside the query I like to throw them all into 
the natural join key, so that I don't have to deal with the duplicate 
result columns.


Now running:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: nbatch=8 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=16 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=16777216
WARNING:  ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING:  ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING:  ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING:  ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING:  ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING:  ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING:  ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736


I am waiting now, probably for that Sort/Unique to finish. I think that 
the vast majority of the time is spent in this sort:


Unique  (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)
  ->  Sort  (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)
        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, 
Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:

  On 4/21/2019 23:09, Tomas Vondra wrote:

What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.

  Thanks for this, and I am trying this now.

  So far it is promising.

  I see the memory footprint contained under 1 GB. I see it go up, but also
  down again. CPU, IO, all being live.

foo=# set enable_nestloop to off;
SET
foo=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=6291456
WARNING:  ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=12582912
WARNING:  ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING:  ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING:  ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING:  ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING:  ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING:  ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING:  ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736

  Aand, it's a winner!



Good ;-)


 Unique  (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual 
time=6150303.060..6895451.210 rows=435274 loops=1)
   ->  Sort  (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual 
time=6150303.058..6801372.192 rows=113478386 loops=1)
 Sort Key: ...
 Sort Method: external merge  Disk: 40726720kB
 ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 
width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)
   Hash Cond: ...
...
 Planning Time: 40.559 ms
 Execution Time: 6896581.566 ms
(70 rows)


  For the first time this query has succeeded now. Memory was bounded. The
  time of nearly 2 hours is crazy, but things sometimes take that long. The
  important thing was not to get an out of memory error.



TBH I don't think there's much we can do to improve this further - it's
a rather desperate effort to keep the memory usage as low as possible,
without any real guarantees.

Also, the hash join only takes about 1000 seconds out of the 6900 total.
So even if we got it much faster, the query would still take almost two
hours, give or take.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 03:43:48PM -0500, Justin Pryzby wrote:

On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:

On 4/21/2019 23:09, Tomas Vondra wrote:
>What I think might work better is the attached v2 of the patch, with a
Thanks for this, and I am trying this now.

...

Aand, it's a winner!

Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual 
time=6150303.060..6895451.210 rows=435274 loops=1)
-> Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual 
time=6150303.058..6801372.192 rows=113478386 loops=1)
Sort Method: external merge Disk: 40726720kB

For the first time this query has succeeded now. Memory was bounded. The
time of nearly 2 hours is crazy, but things sometimes take that long


It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.



That's unlikely to reduce the amount of data written to temporary files;
it just means there will be fewer, larger files - in total it's still
going to be ~40GB. And it's not guaranteed it'll improve performance,
because work_mem=4MB might fit into CPU caches and larger values almost
certainly won't. I don't think there's much to gain, really.


We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.



Right. Chances are that with a better estimate the optimizer would pick
merge join instead. I wonder if that would be significantly faster.


Merge Cond: (((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text) AND
((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text) AND
((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text))


If they're completely redundant and you can get the same result after
dropping one or two of those conditions, then you should.

Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns
(and ANALYZE).



That's not going to help, because we don't use functional dependencies
in join estimation yet.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Justin Pryzby
On Tue, Apr 23, 2019 at 04:37:50PM -0400, Gunther wrote:
> On 4/21/2019 23:09, Tomas Vondra wrote:
> >What I think might work better is the attached v2 of the patch, with a
> Thanks for this, and I am trying this now.
...
> Aand, it's a winner!
> 
> Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual 
> time=6150303.060..6895451.210 rows=435274 loops=1)
> -> Sort (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual 
> time=6150303.058..6801372.192 rows=113478386 loops=1) 
> Sort Method: external merge Disk: 40726720kB
> 
> For the first time this query has succeeded now. Memory was bounded. The
> time of nearly 2 hours is crazy, but things sometimes take that long

It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
query time.

We didn't address it yet, but your issue was partially caused by a misestimate.
It's almost certainly because these conditions are correlated, or maybe
redundant.

> Merge Cond: (((documentinformationsubject.documentinternalid)::text = 
> (documentinformationsubject_1.documentinternalid)::text) AND 
> ((documentinformationsubject.documentid)::text = 
> (documentinformationsubject_1.documentid)::text) AND 
> ((documentinformationsubject.actinternalid)::text = 
> (documentinformationsubject_1.actinternalid)::text))

If they're completely redundant and you can get the same result after dropping
one or two of those conditions, then you should.

Alternately, if they're correlated but not redundant, you can use PG10
"dependency" statistics (CREATE STATISTICS) on the correlated columns (and
ANALYZE).
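
For example, something along these lines (the statistics name is made up;
the columns are the ones from the merge condition above):

  CREATE STATISTICS docinfosubj_dep (dependencies)
      ON documentinternalid, documentid, actinternalid
      FROM documentinformationsubject;
  ANALYZE documentinformationsubject;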

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
>  Hash Right Join  (cost=4203858.53..5475530.71 rows=34619 width=4) (actual 
> time=309603.384..459480.863 rows=113478386 loops=1)
...
>  ->  Hash  (cost=1310249.63..1310249.63 rows=13 width=111) (actual 
> time=51077.049..51077.049 rows=236 loops=1)
...
>->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual 
> time=199792.446..199792.446 rows=113478127 loops=1)
>  Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  
> Memory Usage: 189207kB
>  ->  Gather Merge  (cost=2845073.40..2861845.87 rows=34619 width=74) 
> (actual time=107620.262..156256.432 rows=113478127 loops=1)
>Workers Planned: 2
>Workers Launched: 2
>->  Merge Left Join  (cost=2844073.37..2856849.96 rows=14425 
> width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3)
>  Merge Cond: 
> (((documentinformationsubject.documentinternalid)::text = 
> (documentinformationsubject_1.documentinternalid)::text) AND 
> ((documentinformationsubject.documentid)::text = 
> (documentinformationsubject_1.documentid)::text) AND 
> ((documentinformationsubject.actinternalid)::text = 
> (documentinformationsubject_1.actinternalid)::text))
>  ->  Sort  (cost=1295969.26..1296005.32 rows=14425 
> width=111) (actual time=57700.723..58134.751 rows=231207 loops=3)
>Sort Key: 
> documentinformationsubject.documentinternalid, 
> documentinformationsubject.documentid, 
> documentinformationsubject.actinternalid
>Sort Method: external merge  Disk: 26936kB
>Worker 0:  Sort Method: external merge  Disk: 
> 27152kB
>Worker 1:  Sort Method: external merge  Disk: 
> 28248kB
>->  Parallel Seq Scan on 
> documentinformationsubject  (cost=0.00..1294972.76 rows=14425 width=111) 
> (actual time=24866.656..57424.420 rows=231207 loops=3)
>  Filter: (((participationtypecode)::text = 
> ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND 
> ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = 
> '2.16.840.1.113883.3.26.1.1'::text))
>  Rows Removed by Filter: 2584355
>  ->  Materialize  (cost=1548104.12..1553157.04 
> rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 
> loops=3)
>->  Sort  (cost=1548104.12..1550630.58 
> rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 
> loops=3)
>  Sort Key: 
> documentinformationsubject_1.documentinternalid, 
> documentinformationsubject_1.documentid, 
> documentinformationsubject_1.actinternalid
>  Sort Method: external merge  Disk: 122192kB
>  Worker 0:  Sort Method: external merge  
> Disk: 122192kB
>  Worker 1:  Sort Method: external merge  
> Disk: 122192kB
>  ->  Seq Scan on documentinformationsubject 
> documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=111) 
> (actual time=20366.166..47751.267 rows=1031106 loops=3)
>Filter: 

Re: Out of Memory errors are frustrating as heck!

2019-04-23 Thread Gunther

On 4/21/2019 23:09, Tomas Vondra wrote:

What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.
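
A rough sketch of that heuristic in code form (hypothetical - the names
and the exact shape of the actual patch may differ):

  #include <stdbool.h>
  #include <stddef.h>

  #define BLCKSZ 8192   /* size of one PGAlignedBlock buffer */

  /* Hypothetical sketch of the described heuristic - not the actual patch. */
  static bool
  worth_increasing_nbatch(size_t spaceUsed, size_t *spaceAllowed,
                          int nbatch, size_t work_mem)
  {
      /* each batch needs an outer and an inner BufFile, one buffer each */
      size_t bufFileMem = (size_t) nbatch * 2 * BLCKSZ;

      if (spaceUsed + bufFileMem <= *spaceAllowed)
          return false;                 /* combined usage still within budget */

      if (bufFileMem > work_mem / 3)
      {
          *spaceAllowed *= 2;           /* buffers dominate: relax the limit */
          return false;                 /* ...instead of splitting batches */
      }
      return true;                      /* splitting should still save memory */
  }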


Thanks for this, and I am trying this now.

So far it is promising.

I see the memory footprint contained under 1 GB. I see it go up, but 
also down again. CPU, IO, all being live.


foo=# set enable_nestloop to off;
SET
foo=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: nbatch=32 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=64 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=128 spaceAllowed=4194304
WARNING:  ExecHashIncreaseNumBatches: nbatch=256 spaceAllowed=6291456
WARNING:  ExecHashIncreaseNumBatches: nbatch=512 spaceAllowed=12582912
WARNING:  ExecHashIncreaseNumBatches: nbatch=1024 spaceAllowed=25165824
WARNING:  ExecHashIncreaseNumBatches: nbatch=2048 spaceAllowed=50331648
WARNING:  ExecHashIncreaseNumBatches: nbatch=4096 spaceAllowed=100663296
WARNING:  ExecHashIncreaseNumBatches: nbatch=8192 spaceAllowed=201326592
WARNING:  ExecHashIncreaseNumBatches: nbatch=16384 spaceAllowed=402653184
WARNING:  ExecHashIncreaseNumBatches: nbatch=32768 spaceAllowed=805306368
WARNING:  ExecHashIncreaseNumBatches: nbatch=65536 spaceAllowed=1610612736


Aand, it's a winner!

Unique  (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual time=6150303.060..6895451.210 rows=435274 loops=1)
  ->  Sort  (cost=5551524.36..5551610.91 rows=34619 width=1197) (actual time=6150303.058..6801372.192 rows=113478386 loops=1)
        Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
        Sort Method: external merge  Disk: 40726720kB
        ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197) (actual time=325240.679..1044194.775 rows=113478386 loops=1)
              Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
              ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 width=341) (actual time=81093.327..81093.446 rows=236 loops=1)
                    Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
                    ->  Gather  (cost=31291.54..1301884.52 rows=1 width=219) (actual time=41920.563..41929.780 rows=0 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Parallel Hash Left Join  (cost=30291.54..1300884.42 rows=1 width=219) (actual time=41915.960..41915.960 rows=0 loops=3)
                                Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)
                                ->  Parallel Hash Left Join  (cost=28606.13..1299199.00 rows=1 width=204) (actual time=41862.767..41862.769 rows=0 loops=3)
                                      Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)
                                      ->  Parallel Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1268800.85 rows=1 width=177) (actual time=40805.337..40805.337 rows=0 loops=3)
                                            Filter: ((participationtypecode)::text = 'AUT'::text)
                                            Rows Removed by Filter: 2815562
                                      ->  Parallel Hash  (cost=24733.28..24733.28 rows=166628 width=64) (actual time=981.000..981.001 rows=133303 loops=3)
                                            Buckets: 65536  Batches: 16  Memory Usage: 3136kB
                                            ->  Parallel Seq Scan on bestname agencyname  (cost=0.00..24733.28 rows=166628 width=64) (actual time=0.506..916.816 rows=133303 loops=3)
                                ->  Parallel Hash  (cost=1434.07..1434.07 rows=20107 width=89) (actual time=52.350..52.350 rows=11393 loops=3)
                                      Buckets: 65536  Batches: 1  Memory Usage: 4680kB
                                      ->

Re: Out of Memory errors are frustrating as heck!

2019-04-22 Thread Tom Lane
Jeff Janes  writes:
> Is there a reason to not just elog the HJDEBUG stuff?

Yes --- it'd be expensive (a "no op" elog is far from free) and
useless to ~ 99.999% of users.

Almost all the conditionally-compiled debug support in the PG executor
is legacy leftovers from Berkeley days.  If it were useful more often
than once in a blue moon, we probably would have made it more easily
reachable long ago.  I'm a bit surprised we haven't just ripped it
out, TBH.  When I find myself needing extra debug output, it's almost
never the case that any of that old code does what I need.

There might be a case for changing it all to print to stderr not
stdout, so that it plays more nicely with elog/ereport output
when you do have it turned on, though.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-22 Thread Jeff Janes
On Sat, Apr 20, 2019 at 4:48 PM Tom Lane  wrote:

> Gunther  writes:
> > and checked my log file and there was nothing before the call
> > MemoryContextStats(TopPortalContext) so I don't understand where this
> > printf stuff is ending up.
>
> It's going to stdout, which is likely block-buffered whereas stderr
> is line-buffered, so data from the latter will show up in your log
> file much sooner.  You might consider adding something to startup
> to switch stdout to line buffering.
>

Is there a reason to not just elog the HJDEBUG stuff?  With some of the
other DEBUG defines, we will probably be using them before the logging
system is set up, but surely we won't be doing Hash Joins that early?

I think we could just get rid of the conditional compilation and elog this
at DEBUG1 or DEBUG2.  Or keep the conditional compilation and elog it at
LOG.

Cheers,

Jeff
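
(The line-buffering switch Tom suggests above is a single stdio call -
a standalone C sketch, not PostgreSQL code:)

  #include <stdio.h>

  int main(void)
  {
      /* make stdout line-buffered so debug output shows up promptly
       * and interleaves sanely with stderr in the log */
      setvbuf(stdout, NULL, _IOLBF, 0);
      printf("flushed at the newline\n");
      return 0;
  }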


Re: Out of Memory errors are frustrating as heck!

2019-04-22 Thread Tomas Vondra

On Mon, Apr 22, 2019 at 10:07:52AM +0200, Gaetano Mendola wrote:

  Batch splitting shouldn't be followed by a hash function change? 


What would be the value? That can help with hash collisions, but that's
not the issue with the data sets discussed in this thread. The issue
reported originally is about underestimates, and the sample data set has
a large number of duplicate values (a single value representing ~10% of
the data set). Neither of those issues is about hash collisions.

The data set I used to demonstrate how the algorithms work is pretty
perfect, with uniform distribution and no hash collisions.

Furthermore, I don't think we can just change the hash function, for a
couple of technical reasons.

Firstly, it's not like we totally redistribute the whole dataset from N
old batches to (2*N) new ones. By using the same 32-bit hash value and
considering one extra bit, the tuples either stay in the same batch
(when the new bit is 0) or move to a single new batch (when it's 1). So
each batch is split in 1/2. By changing the hash function this would no
longer be true, and we'd redistribute pretty much the whole data set.
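
The "stay, or move to one new batch" property is easy to see from the
batch-number arithmetic (simplified here - PostgreSQL carves the bucket
and batch bits out of the hash differently, but the principle is the same):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint32_t hash   = 0xDEADBEEFu;  /* any 32-bit hash value */
      int      nbatch = 8;            /* always a power of two */

      int batchno = hash & (nbatch - 1);      /* low bits pick the batch */
      int newno   = hash & (2 * nbatch - 1);  /* doubling looks at one more bit */

      /* newno is either batchno (new bit = 0) or batchno + nbatch (bit = 1) */
      printf("batch %d -> batch %d\n", batchno, newno);
      return 0;
  }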

The other issue is even more significant - we don't redistribute the
tuples immediately. We only redistribute the current batch, but leave
the other batches alone and handle them when we actually get to them.
This is possible, because the tuples never move backwards - when
splitting batch K, the tuples either stay in K or move to 2K. Or
something like that, I'm too lazy to recall the exact formula now.

And if I recall correctly, I think we can increment the number of
batches while already performing the join, after some rows were already
processed. That would probably be no longer true if we just switched the
hash function, because it might move rows backwards (to the already
processed region).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-22 Thread Gaetano Mendola
Batch splitting shouldn't be followed by a hash function change?

On Mon, Apr 22, 2019, 05:09 Tomas Vondra 
wrote:

> On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
> >On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
> >> Jeff Janes  writes:
> >> > The growEnabled stuff only prevents infinite loops.  It doesn't
> prevent
> >> > extreme silliness.
> >>
> >> > If a single 32 bit hash value has enough tuples by itself to not fit
> in
> >> > work_mem, then it will keep splitting until that value is in a batch
> by
> >> > itself before shutting off
> >>
> >> I suspect, however, that we might be better off just taking the
> existence
> >> of the I/O buffers into account somehow while deciding whether it's
> worth
> >> growing further.  That is, I'm imagining adding a second independent
> >> reason for shutting off growEnabled, along the lines of "increasing
> >> nbatch any further will require an unreasonable amount of buffer
> memory".
> >> The question then becomes how to define "unreasonable".
> >
> >On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
> >> I think the question the code needs to be asking is "If we double the
> >> number of batches, does the amount of memory we need drop?" And the
> >> memory needs to account both for the buffers and per-batch data.
> >>
> >> I don't think we can just stop increasing the number of batches when the
> >> memory for BufFile exceeds work_mem, because that entirely ignores the
> >> fact that by doing that we force the system to keep the per-batch stuff
> >> in memory (and that can be almost arbitrary amount).
> >...
> >> Of course, this just stops enforcing work_mem at some point, but it at
> >> least attempts to minimize the amount of memory used.
> >
> >This patch defines reasonable as "additional BatchFiles will not
> themselves
> >exceed work_mem; OR, exceeded work_mem already but additional BatchFiles
> are
> >going to save us RAM"...
> >
>
> OK.
>
> >I think the first condition is insensitive and not too important to get
> right,
> >it only allows work_mem to be exceeded by 2x, which maybe already happens
> for
> >multiple reasons, related to this thread and otherwise.  It'd be fine to
> slap
> >on a factor of /2 or /4 or /8 there too.
> >
>
> TBH I'm not quite sure I understand all the conditions in the patch - it
> seems unnecessarily complicated. And I don't think it actually minimizes
> the amount of memory used for hash table + buffers, because it keeps the
> same spaceAllowed (which pushes nbatches up). At some point it actually
> makes sense to bump spaceAllowed and make larger batches instead of adding
> more batches, and the patch does not seem to do that.
>
> Also, the patch does this:
>
> if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
> {
> ExecHashIncreaseNumBatches(hashtable);
> }
> else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
> {
> /* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing
> nbatches to increase */
> /* I think this branch would be hit almost same as below branch */
> ExecHashIncreaseNumBatches(hashtable);
> }
> ...
>
> but the reasoning for the second branch seems wrong, because
>
> (spaceUsed/2 >= spaceAllowed)
>
> is not enough to guarantee that we actually save memory by doubling the
> number of batches. To do that, we need to make sure that
>
>(spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))
>
> But that may not be true - it certainly is not guaranteed by not getting
> into the first branch.
>
> Consider an ideal example with uniform distribution:
>
> create table small (id bigint, val text);
> create table large (id bigint, val text);
>
> insert into large select 10 * random(), md5(i::text)
> from generate_series(1, 7) s(i);
>
> insert into small select 10 * random(), md5(i::text)
> from generate_series(1, 1) s(i);
>
> vacuum analyze large;
> vacuum analyze small;
>
> update pg_class set (relpages, reltuples) = (100, 1)
> where relname = 'large';
>
> update pg_class set (relpages, reltuples) = (1, 100)
> where relname = 'small';
>
> set work_mem = '1MB';
>
> explain analyze select * from small join large using (id);
>
> A log after each call to ExecHashIncreaseNumBatches says this:
>
> nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
> nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
> nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
> nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
> nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
> nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
> nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
> nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
> nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:

On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:

Jeff Janes  writes:
> The growEnabled stuff only prevents infinite loops.  It doesn't prevent
> extreme silliness.

> If a single 32 bit hash value has enough tuples by itself to not fit in
> work_mem, then it will keep splitting until that value is in a batch by
> itself before shutting off

I suspect, however, that we might be better off just taking the existence
of the I/O buffers into account somehow while deciding whether it's worth
growing further.  That is, I'm imagining adding a second independent
reason for shutting off growEnabled, along the lines of "increasing
nbatch any further will require an unreasonable amount of buffer memory".
The question then becomes how to define "unreasonable".


On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:

I think the question the code needs to be asking is "If we double the
number of batches, does the amount of memory we need drop?" And the
memory needs to account both for the buffers and per-batch data.

I don't think we can just stop increasing the number of batches when the
memory for BufFile exceeds work_mem, because that entirely ignores the
fact that by doing that we force the system to keep the per-batch stuff
in memory (and that can be almost arbitrary amount).

...

Of course, this just stops enforcing work_mem at some point, but it at
least attempts to minimize the amount of memory used.


This patch defines reasonable as "additional BatchFiles will not themselves
exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
going to save us RAM"...



OK.


I think the first condition is insensitive and not too important to get right,
it only allows work_mem to be exceeded by 2x, which maybe already happens for
multiple reasons, related to this thread and otherwise.  It'd be fine to slap
on a factor of /2 or /4 or /8 there too.



TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes sense to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.

Also, the patch does this:

   if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
   {
   ExecHashIncreaseNumBatches(hashtable);
   }
   else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
   {
   /* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches 
to increase */
   /* I think this branch would be hit almost same as below branch */
   ExecHashIncreaseNumBatches(hashtable);
   }
   ...

but the reasoning for the second branch seems wrong, because

   (spaceUsed/2 >= spaceAllowed)

is not enough to guarantee that we actually save memory by doubling the
number of batches. To do that, we need to make sure that

  (spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))

But that may not be true - it certainly is not guaranteed by not getting
into the first branch.

Consider an ideal example with uniform distribution:

   create table small (id bigint, val text);
   create table large (id bigint, val text);

   insert into large select 10 * random(), md5(i::text)
   from generate_series(1, 7) s(i);

   insert into small select 10 * random(), md5(i::text)
   from generate_series(1, 1) s(i);

   vacuum analyze large;
   vacuum analyze small;

   update pg_class set (relpages, reltuples) = (100, 1)
   where relname = 'large';

   update pg_class set (relpages, reltuples) = (1, 100)
   where relname = 'small';

   set work_mem = '1MB';

   explain analyze select * from small join large using (id);

A log after each call to ExecHashIncreaseNumBatches says this:

   nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
   nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
   nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
   nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
   nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
   nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
   nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
   nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
   nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
   nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
   nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
   nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
   nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
   nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864
   nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728

So we've 

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Gunther

Now to Justin's patch. First undo Tomas' patch and apply:

$ mv src/include/executor/hashjoin.h.orig src/include/executor/hashjoin.h
$ mv src/backend/executor/nodeHash.c.orig src/backend/executor/nodeHash.c
$ mv src/backend/executor/nodeHashjoin.c.orig 
src/backend/executor/nodeHashjoin.c
$ patch -p1 <../limit-hash-nbatches-v2.patch
patching file src/backend/executor/nodeHash.c
Hunk #1 succeeded at 570 (offset -3 lines).
Hunk #2 succeeded at 917 (offset -3 lines).
Hunk #3 succeeded at 930 (offset -3 lines).
Hunk #4 succeeded at 1037 (offset -3 lines).
Hunk #5 succeeded at 1658 (offset -4 lines).

$ make
$ make install
$ pg_ctl -c restart

and go ...

lots of CPU% again and very limited memory use as of yet.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1302772  90316  58004 R  94.4  1.1   4:38.05 postgres: 
postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 R  97.7  0.9   2:03.54 postgres: 
parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 S   0.0  0.9   2:08.28 postgres: 
parallel worker for PID 11054

that's a pretty decent sign so far. Slight increase ... but still 
relatively steady ...


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1379704 167140  58004 R  95.0  2.1   5:56.28 postgres: 
postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 S  25.6  0.9   2:36.59 postgres: 
parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 R  61.8  0.9   2:29.65 postgres: 
parallel worker for PID 11054

aaand break out ...

  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
9 postgres  20   0 1271660   1.0g   1.0g D   0.0 13.4   0:03.10 postgres: 
parallel worker for PID 11054
11054 postgres  20   0 1380940   1.0g 950508 D   0.0 13.4   6:56.09 postgres: 
postgres integrator [local] EXPLAIN
8 postgres  20   0 1271660 884540 882724 D   0.0 11.2   0:02.84 postgres: 
parallel worker for PID 11054

and crash:

foo=# explain analyze select * from reports.v_BusinessOperation;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

what happened? ouch, no space left on root device, too much logging? 
Maybe the core dump ... Log file content is simple:


2019-04-22 00:07:56.104 UTC [11048] LOG:  database system was shut down at 
2019-04-22 00:07:55 UTC
2019-04-22 00:07:56.108 UTC [11046] LOG:  database system is ready to accept 
connections
Hashjoin 0x2122458: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x2122458: increasing nbatch to 32 because space = 4128933
Hashjoin 0x2122458: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x2122458: increasing nbatch to 64 because space = 4128826
Hashjoin 0x2122458: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x2122458: increasing nbatch to 128 because space = 4128846
Hashjoin 0x2122458: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x2122458: increasing nbatch to 256 because space = 4128829
Hashjoin 0x2122458: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x2122458: increasing nbatch to 512 because space = 4128908
Hashjoin 0x2122458: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x3ac9918: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac91a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac93c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41018: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41048: initial nbatch = 1, nbuckets = 1024
2019-04-22 00:16:55.273 UTC [11046] LOG:  background worker "parallel worker" 
(PID 9) was terminated by signal 11: Segmentation fault
2019-04-22 00:16:55.273 UTC [11046] DETAIL:  Failed process was running: 
explain analyze select * from reports.v_BusinessOperation;
2019-04-22 00:16:55.273 UTC [11046] LOG:  terminating any other active server 
processes
2019-04-22 00:16:55.274 UTC [11058] WARNING:  terminating connection because of 
crash of another server process
2019-04-22 00:16:55.274 UTC [11058] DETAIL:  The postmaster has commanded this 
server process to roll back the current transaction and exit, because another 
server process exited abnormally and possibly corrupted shared memory.
2019-04-22 00:16:55.274 UTC [11058] HINT:  In a moment you should be able to 
reconnect to the database and repeat your command.
2019-04-22 00:16:55.277 UTC [11052] LOG:  could not write temporary statistics file 
"pg_stat/db_16903.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file 
"pg_stat/db_0.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file 
"pg_stat/global.tmp": No space left on device
2019-04-22 00:16:55.315 UTC [11046] LOG:  all server processes terminated; 
reinitializing

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Gunther

After applying Tomas' patch 0002 as corrected, over 0001, same thing:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 
128
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 
256
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 
512
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 
1024
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 
2048
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 
4096
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 
8192
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 
16384
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 
to 32768
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 
to 65536
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 
to 131072
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 
to 262144
ERROR:  out of memory
DETAIL:  Failed on request of size 8272 in memory context "HashBatchFiles".

And from the log:

2019-04-21 23:29:33.497 UTC [8890] LOG:  database system was shut down at 
2019-04-21 23:29:33 UTC
2019-04-21 23:29:33.507 UTC [] LOG:  database system is ready to accept 
connections
Hashjoin 0x1732b88: initial nbatch = 16, nbuckets = 8192
2019-04-21 23:31:54.447 UTC [8896] WARNING:  ExecHashIncreaseNumBatches: 
increasing number of batches from 16 to 32
2019-04-21 23:31:54.447 UTC [8896] LOG:  ExecHashIncreaseNumBatches === 
context stats start ===
2019-04-21 23:31:54.447 UTC [8896] STATEMENT:  explain analyze select * from 
reports.v_BusinessOperation;
TopMemoryContext: 120544 total in 7 blocks; 10016 free (6 chunks); 110528 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 
chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 
used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 
used:
  ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 
used
TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 
3973120 used
TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 
3973120 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashBatchFiles: 258032 total in 31 blocks; 6208 free (0 chunks); 251824 
used
HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
  HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 
4271416 used
TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 
used
  Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
...
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 
812656 used
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_toast_2619_index
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: 
entity_id_fkidx
...
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_attribute_relid_attnum_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 07:25:15PM -0400, Gunther wrote:

  After applying Tomas' corrected patch 0001, and routing HJDEBUG messages
  to stderr:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;

...
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 
to 262144
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 262144 
to 524288
ERROR:  out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".

  Now 

TopMemoryContext: 4347672 total in 9 blocks; 41688 free (18 chunks); 4305984 
used
...
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 
used:
  ExecutorState: 2449896 total in 16 blocks; 1795000 free (3158 chunks); 
654896 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 
40120 used
...
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashBatchFiles: 2242545904 total in 266270 blocks; 3996232 free (14164 
chunks); 2238549672 used
HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
  HashBatchContext: 168165080 total in 5118 blocks; 7936 free (0 
chunks); 168157144 used
TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 
326632 used
  Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 
1026968 used
...
Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 
2416968160 used



IMO this pretty much proves that the memory allocated for BufFile really
is the root cause of the issues with this query. 524288 batches means
up to 1048576 BufFiles which, at an 8kB buffer apiece, is a bit more than
~8GB of RAM. However those for the inner relation were not allocated yet,
so at this point only about 4GB might be allocated. And it seems ~1/2 of
them did not receive any tuples, so only about 2GB got allocated so far.

The second batch will probably make it fail much sooner, because it
allocates the BufFile stuff eagerly (both for inner and outer side).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Gunther
After applying Tomas' corrected patch 0001, and routing HJDEBUG messages 
to stderr:


integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 64 to 
128
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 128 to 
256
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 256 to 
512
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 512 to 
1024
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 
2048
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 
4096
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 
8192
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 
16384
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 16384 
to 32768
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 32768 
to 65536
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 65536 
to 131072
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 131072 
to 262144
WARNING:  ExecHashIncreaseNumBatches: increasing number of batches from 262144 
to 524288
ERROR:  out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".

Now

TopMemoryContext: 4347672 total in 9 blocks; 41688 free (18 chunks); 4305984 
used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 
chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 
used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 
used:
  ExecutorState: 2449896 total in 16 blocks; 1795000 free (3158 chunks); 
654896 used
HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 16384 total in 2 blocks; 3008 free (6 chunks); 13376 
used
  HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used
HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
  HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used
HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
  HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used
HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
  HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
HashBatchFiles: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
  HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
TupleSort main: 1073512 total in 11 blocks; 246792 free (39 chunks); 
826720 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 
40120 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashBatchFiles: 2242545904 total in 266270 blocks; 3996232 free (14164 
chunks); 2238549672 used
HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used
  HashBatchContext: 168165080 total in 5118 blocks; 7936 free (0 
chunks); 168157144 used
TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 
326632 used
  Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 
1026968 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tom Lane
Tomas Vondra  writes:
> On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes  writes:
>>> If a single 32 bit hash value has enough tuples by itself to not fit in
>>> work_mem, then it will keep splitting until that value is in a batch by
>>> itself before shutting off

>> Right, that's the code's intention.  If that's not good enough for this
>> case, we'll need to understand the details a bit better before we can
>> design a better(?) heuristic.

> I think we only disable growing when there are no other values in the
> batch, but that seems rather easy to defeat - all you need is a single
> tuple with a hash that falls into the same batch, and it's over. Maybe
> we should make this a bit less accurate - say, if less than 5% memory
> gets freed, don't add more batches.

Yeah, something like that, but it's hard to design it without seeing some
concrete misbehaving examples.
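
For concreteness, a sketch of what such a rule could look like inside
ExecHashIncreaseNumBatches(), using its existing nfreed/ninmemory counters
(the 5% cutoff is just the number floated above, not a tested value):

    /* Sketch: today growth is disabled only when nfreed == 0 (or when
     * every tuple moved); a fuzzier rule could be: */
    if (nfreed < ninmemory / 20)        /* the split freed less than ~5% */
        hashtable->growEnabled = false;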

I think though that this is somewhat independent of the problem that we're
not including the I/O buffers in our reasoning about memory consumption.

> An alternative would be spilling the extra tuples into a special
> overflow file, as I explained earlier. That would actually enforce
> work_mem I think.

Well, no, it won't.  If you have umpteen gigabytes of RHS tuples with the
exact same hash code, nothing we can do here is going to prevent you from
having to process those in a single table load.  (If it's a plain inner
join, maybe you could break that into subsections anyway ... but that
won't work for left or full joins where you need per-tuple match status.)

I think our ambition here should just be to not have the code go crazy
trying to keep its memory consumption under work_mem when it's ultimately
going to fail to do so anyhow.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Justin Pryzby
On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
> Jeff Janes  writes:
> > The growEnabled stuff only prevents infinite loops.  It doesn't prevent
> > extreme silliness.
> 
> > If a single 32 bit hash value has enough tuples by itself to not fit in
> > work_mem, then it will keep splitting until that value is in a batch by
> > itself before shutting off
> 
> I suspect, however, that we might be better off just taking the existence
> of the I/O buffers into account somehow while deciding whether it's worth
> growing further.  That is, I'm imagining adding a second independent
> reason for shutting off growEnabled, along the lines of "increasing
> nbatch any further will require an unreasonable amount of buffer memory".
> The question then becomes how to define "unreasonable".

On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
> I think the question the code needs to be asking is "If we double the
> number of batches, does the amount of memory we need drop?" And the
> memory needs to account both for the buffers and per-batch data.
> 
> I don't think we can just stop increasing the number of batches when the
> memory for BufFile exceeds work_mem, because that entirely ignores the
> fact that by doing that we force the system to keep the per-batch stuff
> in memory (and that can be almost arbitrary amount).
...
> Of course, this just stops enforcing work_mem at some point, but it at
> least attempts to minimize the amount of memory used.

This patch defines reasonable as "additional BatchFiles will not themselves
exceed work_mem; OR, work_mem is already exceeded but additional BatchFiles are
going to save us RAM"...

I think the first condition is not very sensitive and not too important to get
right; it only allows work_mem to be exceeded by 2x, which may already happen
for multiple reasons, related to this thread and otherwise.  It'd be fine to
slap on a factor of /2 or /4 or /8 there too.

The current patch doesn't unset growEnabled, since there's no point at which
the hashtable should grow without bound: if hash tables are *already*
exceeding work_mem by 2x, nbatches should be doubled.

Justin
From 7f303c3635adabe2d75f5bd3cb6a6b2ab5102012 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 21 Apr 2019 05:04:09 -0500
Subject: [PATCH v1] account for size of BatchFile structure in hashJoin

Avoid runaway number of Batches, overhead of which was previously not counted.
The size of BatchFiles is now included in instrumentation for output of explain
analyze.

Analyzed-By: Tom Lane, Tomas Vondra, Justin Pryzby
Discussion: bc138e9f-c89e-9147-5395-61d51a757...@gusw.net
---
 src/backend/executor/nodeHash.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ee93030..9a87265 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1670,10 +1670,33 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
+
+		/*
+		 * If addition of 2x as many batch files would cause the structure itself
+		 * (without the hashtable) to exceed spaceAllowed, then do not grow...unless the
+		 * spaceUsed has already exceeded Allowed by more than 2x...
+		 * Note, spaceUsed doesn't include the overhead of BatchFile structure.
+		 */
 		if (hashtable->spaceUsed +
 			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
-			> hashtable->spaceAllowed)
-			ExecHashIncreaseNumBatches(hashtable);
+			// + hashtable->nbatch * sizeof(PGAlignedBlock)
+			> hashtable->spaceAllowed) {
+			if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed) {
+ExecHashIncreaseNumBatches(hashtable);
+			} else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed) {
+/* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches to increase */
+/* I think this branch would be hit almost same as below branch */
+ExecHashIncreaseNumBatches(hashtable);
+			} else if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceUsed) {
+/*
+ * spaceUsed is so large that it will save RAM to allow the BatchFile structures themselves to exceed work_mem.
+ * Increasing batches and exceeding work mem will use less space and be less bad than not increasing...
+ */
+elog(WARNING, "exceeding work mem: used:%ld allowed:%ld add:%ld",
+		hashtable->spaceUsed, hashtable->spaceAllowed, hashtable->nbatch*sizeof(PGAlignedBlock));
+ExecHashIncreaseNumBatches(hashtable);
+			}
+		}
 	}
 	else
 	{
@@ -2693,6 +2716,8 @@ ExecHashGetInstrumentation(HashInstrumentation *instrument,
 	instrument->nbatch = hashtable->nbatch;
 	instrument->nbatch_original = hashtable->nbatch_original;
 	instrument->space_peak = hashtable->spacePeak;
+	/* Finally, account for size of structure: */
+	instrument->space_peak += 

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:

Jeff Janes  writes:

The growEnabled stuff only prevents infinite loops.  It doesn't prevent
extreme silliness.



If a single 32 bit hash value has enough tuples by itself to not fit in
work_mem, then it will keep splitting until that value is in a batch by
itself before shutting off


Right, that's the code's intention.  If that's not good enough for this
case, we'll need to understand the details a bit better before we can
design a better(?) heuristic.



I think we only disable growing when there are no other values in the
batch, but that seems rather easy to defeat - all you need is a single
tuple with a hash that falls into the same batch, and it's over. Maybe
we should make this a bit less accurate - say, if less than 5% memory
gets freed, don't add more batches.


I suspect, however, that we might be better off just taking the existence
of the I/O buffers into account somehow while deciding whether it's worth
growing further.  That is, I'm imagining adding a second independent
reason for shutting off growEnabled, along the lines of "increasing
nbatch any further will require an unreasonable amount of buffer memory".
The question then becomes how to define "unreasonable".



I think the question the code needs to be asking is "If we double the
number of batches, does the amount of memory we need drop?" And the
memory needs to account both for the buffers and per-batch data.

I don't think we can just stop increasing the number of batches when the
memory for BufFile exceeds work_mem, because that entirely ignores the
fact that by doing that we force the system to keep the per-batch stuff
in memory (and that can be almost arbitrary amount).

What I think we should be doing instead is instead make the threshold
dynamic - instead of just checking at work_mem, we need to increment the
number of batches when the total amount of memory exceeds

   Max(work_mem, 3 * memory_for_buffiles)

This is based on the observation that by increasing the number of
batches, we double memory_for_buffiles and split the per-batch data in
half. By adding more batches, we'd actually increase the amount of
memory used.
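
A rough C sketch of that dynamic threshold (variable names are mine; it
charges one 8kB buffer per batch file on both the inner and outer side):

    /* Sketch only: grow further while total memory exceeds the larger of
     * work_mem and 3x the BufFile buffer overhead. */
    Size    memory_for_buffiles = (Size) hashtable->nbatch * 2 * BLCKSZ;
    Size    threshold = Max(hashtable->spaceAllowed, 3 * memory_for_buffiles);

    if (hashtable->spaceUsed + memory_for_buffiles > threshold)
        ExecHashIncreaseNumBatches(hashtable);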

Of course, this just stops enforcing work_mem at some point, but it at
least attempts to minimize the amount of memory used.

An alternative would be spilling the extra tuples into a special
overflow file, as I explained earlier. That would actually enforce
work_mem I think.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tom Lane
Jeff Janes  writes:
> The growEnabled stuff only prevents infinite loops.  It doesn't prevent
> extreme silliness.

> If a single 32 bit hash value has enough tuples by itself to not fit in
> work_mem, then it will keep splitting until that value is in a batch by
> itself before shutting off

Right, that's the code's intention.  If that's not good enough for this
case, we'll need to understand the details a bit better before we can
design a better(?) heuristic.

I suspect, however, that we might be better off just taking the existence
of the I/O buffers into account somehow while deciding whether it's worth
growing further.  That is, I'm imagining adding a second independent
reason for shutting off growEnabled, along the lines of "increasing
nbatch any further will require an unreasonable amount of buffer memory".
The question then becomes how to define "unreasonable".

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Jeff Janes
On Sat, Apr 20, 2019 at 4:26 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > Considering how rare this issue likely is, we need to be looking for a
> > solution that does not break the common case.
>
> Agreed.  What I think we need to focus on next is why the code keeps
> increasing the number of batches.  It seems like there must be an undue
> amount of data all falling into the same bucket ... but if it were simply
> a matter of a lot of duplicate hash keys, the growEnabled shutoff
> heuristic ought to trigger.
>

The growEnabled stuff only prevents infinite loops.  It doesn't prevent
extreme silliness.

If a single 32 bit hash value has enough tuples by itself to not fit in
work_mem, then it will keep splitting until that value is in a batch by
itself before shutting off (or at least until the split-point bit of
whatever else is in the that bucket happens to be the same value as the
split-point-bit of the degenerate one, so by luck nothing or everything
gets moved)

Probabilistically we keep splitting until every batch, other than the one
containing the degenerate value, has about one tuple in it.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Tomas Vondra

On Sun, Apr 21, 2019 at 03:08:22AM -0500, Justin Pryzby wrote:

On Sun, Apr 21, 2019 at 01:03:50AM -0400, Gunther wrote:

On 4/20/2019 21:14, Tomas Vondra wrote:
>Maybe. But before wasting any more time on the memory leak investigation,
>I suggest you first try the patch moving the BufFile allocations to a
>separate context. That'll either confirm or disprove the theory.

OK, fair enough. So, first patch 0001-* applied, recompiled and

2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was 
terminated by signal 11: Segmentation fault

...

turns out the MemoryContext is NULL:

(gdb) p context
$1 = (MemoryContext) 0x0


I updated Tomas' patch to unconditionally set the context.
(Note, oldctx vs oldcxt is fairly subtle but I think deliberate?)



I don't follow - is there a typo confusing oldctx vs. oldcxt? I don't
think so, but I might have missed something. (I always struggle with which
spelling is the right one).

I think the bug is actually much simpler - the memory context was created
only in ExecHashIncreaseNumBatches() when starting with nbatch=1. But when
the initial nbatch value was higher (i.e. when starting with 2 or more
batches), it was left NULL. That was OK for testing with the contrived
data set, but it may easily break on other examples.

So here is an updated patch - hopefully this version works. I don't have
time to do much more testing now, though. But it compiles.
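
The gist of the fix, as the attached patch does it, is simply to create the
context unconditionally in ExecHashTableCreate() so it can never be NULL:

    hashtable->fileCtx = AllocSetContextCreate(CurrentMemoryContext,
                                               "HashBatchFiles",
                                               ALLOCSET_DEFAULT_SIZES);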


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 3292e67135ef67f4bcf1ec0595fe79497c49d47c Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 21 Apr 2019 13:31:14 +0200
Subject: [PATCH 1/2] move BufFile stuff into separate context

---
 src/backend/executor/nodeHash.c | 54 +
 src/backend/executor/nodeHashjoin.c |  7 
 src/include/executor/hashjoin.h |  1 +
 3 files changed, 62 insertions(+)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ffaa751f2..f7c92e78e9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -498,6 +498,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, bool keepNulls)
hashtable->skewTuples = 0;
hashtable->innerBatchFile = NULL;
hashtable->outerBatchFile = NULL;
+   hashtable->fileCtx = NULL;
hashtable->spaceUsed = 0;
hashtable->spacePeak = 0;
hashtable->spaceAllowed = space_allowed;
@@ -527,6 +528,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, bool keepNulls)
 											"HashBatchContext",
 											ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "HashBatchFiles",
+											   ALLOCSET_DEFAULT_SIZES);
+
/* Allocate data that will live for the life of the hashjoin */
 
oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -559,10 +564,14 @@ ExecHashTableCreate(HashState *state, List *hashOperators, bool keepNulls)
 
if (nbatch > 1 && hashtable->parallel_state == NULL)
{
+   MemoryContext oldctx;
+
		/*
		 * allocate and initialize the file arrays in hashCxt (not needed for
		 * parallel case which uses shared tuplestores instead of raw files)
		 */
+   oldctx = MemoryContextSwitchTo(hashtable->fileCtx);
+
hashtable->innerBatchFile = (BufFile **)
palloc0(nbatch * sizeof(BufFile *));
		hashtable->outerBatchFile = (BufFile **)
			palloc0(nbatch * sizeof(BufFile *));
@@ -570,6 +579,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, bool keepNulls)
		/* The files will not be opened until needed... */
		/* ... but make sure we have temp tablespaces established for them */
PrepareTempTablespaces();
+
+   MemoryContextSwitchTo(oldctx);
}
 
MemoryContextSwitchTo(oldcxt);
@@ -900,6 +911,11 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
nbatch = oldnbatch * 2;
Assert(nbatch > 1);
 
+	elog(WARNING, "ExecHashIncreaseNumBatches: increasing number of batches from %d to %d", oldnbatch, nbatch);
+
+	elog(LOG, "ExecHashIncreaseNumBatches === context stats start ===");
+   MemoryContextStats(TopMemoryContext);
+
 #ifdef HJDEBUG
printf("Hashjoin %p: increasing nbatch to %d because space = %zu\n",
   hashtable, nbatch, hashtable->spaceUsed);
@@ -909,6 +925,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
if (hashtable->innerBatchFile == NULL)
{
+   

Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Justin Pryzby
I was able to reproduce in a somewhat contrived way:

sh -c 'ulimit -v 1024000 ; /usr/local/pgsql/bin/postgres -D ./pg12dev5 
-cport=1234' &

postgres=# SET work_mem='64kB';SET client_min_messages =debug1;SET 
log_statement_stats=on;explain(analyze) WITH v AS MATERIALIZED (SELECT * FROM 
generate_series(1,999)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 
AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND 
i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i;

HashTableContext: 8192 total in 1 blocks; 7696 free (7 chunks); 496 used
  hash batch files: 852065104 total in 101318 blocks; 951896 free (20 
chunks); 851113208 used
  HashBatchContext: 73888 total in 4 blocks; 24280 free (6 chunks); 
49608 used

2019-04-21 04:11:02.521 CDT [4156] ERROR:  out of memory
2019-04-21 04:11:02.521 CDT [4156] DETAIL:  Failed on request of size 8264 in 
memory context "hash batch files".




Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Justin Pryzby
On Sun, Apr 21, 2019 at 01:03:50AM -0400, Gunther wrote:
> On 4/20/2019 21:14, Tomas Vondra wrote:
> >Maybe. But before wasting any more time on the memory leak investigation,
> >I suggest you first try the patch moving the BufFile allocations to a
> >separate context. That'll either confirm or disprove the theory.
> 
> OK, fair enough. So, first patch 0001-* applied, recompiled and
> 
> 2019-04-21 04:08:04.364 UTC [11304] LOG:  server process (PID 11313) was 
> terminated by signal 11: Segmentation fault
...
> turns out the MemoryContext is NULL:
> 
> (gdb) p context
> $1 = (MemoryContext) 0x0

I updated Tomas' patch to unconditionally set the context.
(Note, oldctx vs oldcxt is fairly subtle but I think deliberate?)

Justin
From ff76679b27c9decc93f7a5527367d98d3e99ddb8 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 21 Apr 2019 02:54:10 -0500
Subject: [PATCH v2] move BufFile stuff into separate context

Author: Tomas Vondra
---
 src/backend/executor/nodeHash.c | 42 +
 src/backend/executor/nodeHashjoin.c |  3 +++
 src/include/executor/hashjoin.h |  1 +
 3 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91..4361aac 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -499,6 +499,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->fileCtx = NULL;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -527,6 +528,9 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->batchCxt = AllocSetContextCreate(hashtable->hashCxt,
 "HashBatchContext",
 ALLOCSET_DEFAULT_SIZES);
+	hashtable->fileCtx = AllocSetContextCreate(hashtable->hashCxt,
+			 "hash batch files",
+			 ALLOCSET_DEFAULT_SIZES);
 
 	/* Allocate data that will live for the life of the hashjoin */
 
@@ -562,10 +566,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
-		/*
-		 * allocate and initialize the file arrays in hashCxt (not needed for
-		 * parallel case which uses shared tuplestores instead of raw files)
-		 */
+		MemoryContext oldctx;
+		oldctx = MemoryContextSwitchTo(hashtable->fileCtx);
 		hashtable->innerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
@@ -573,6 +575,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
+		MemoryContextSwitchTo(oldctx);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -903,6 +906,11 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	nbatch = oldnbatch * 2;
 	Assert(nbatch > 1);
 
+	elog(WARNING, "ExecHashIncreaseNumBatches: increasing number of batches from %d to %d", oldnbatch, nbatch);
+
+	elog(LOG, "ExecHashIncreaseNumBatches === context stats start ===");
+	MemoryContextStats(TopMemoryContext);
+
 #ifdef HJDEBUG
 	printf("Hashjoin %p: increasing nbatch to %d because space = %zu\n",
 		   hashtable, nbatch, hashtable->spaceUsed);
@@ -910,6 +918,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
 
+	MemoryContextSwitchTo(hashtable->fileCtx);
 	if (hashtable->innerBatchFile == NULL)
 	{
 		/* we had no file arrays before */
@@ -1002,12 +1011,19 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			}
 			else
 			{
+MemoryContext oldctx;
+
 /* dump it out */
 Assert(batchno > curbatch);
+
+oldctx = MemoryContextSwitchTo(hashtable->fileCtx);
+
 ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
 	  hashTuple->hashvalue,
 	  &hashtable->innerBatchFile[batchno]);
 
+MemoryContextSwitchTo(oldctx);
+
 hashtable->spaceUsed -= hashTupleSize;
 nfreed++;
 			}
@@ -1045,6 +1061,9 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			   hashtable);
 #endif
 	}
+
+	elog(LOG, "ExecHashIncreaseNumBatches === context stats end ===");
+	MemoryContextStats(TopMemoryContext);
 }
 
 /*
@@ -1660,13 +1679,20 @@ ExecHashTableInsert(HashJoinTable hashtable,
 	}
 	else
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * put the tuple into a temp file for later batches
 		 */
 		Assert(batchno > hashtable->curbatch);
+
+		oldctx = MemoryContextSwitchTo(hashtable->fileCtx);
+
 		ExecHashJoinSaveTuple(tuple,
 			  hashvalue,
 			  &hashtable->innerBatchFile[batchno]);
+
+		MemoryContextSwitchTo(oldctx);
 	}
 
 	if (shouldFree)
@@ -2508,10 +2534,18 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
 		}
 		else
 		{
+			MemoryContext 

Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Gunther
I  am now running Justin's patch after undoing Tomas' patches and any of 
my own hacks (which would not have interfered with Tomas' patch)


On 4/20/2019 15:30, Justin Pryzby wrote:

With my patch, the number of batches is nonlinear WRT work_mem, and reaches a
maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.


Now I seem to be in that slow massive growth phase, or maybe still in an 
earlier step, but I can already see that the top RES behavior is different.  The 
size lingers around 400 MB.  But then it's growing too, at high CPU%, 
goes past 700, 800, 900 MB now 1.5 GB, 1.7 GB, 1.8 GB, 1.9 GB, 2.0 GB, 
2.1, and still 98% CPU. 2.4 GB, wow, it has never been that big ... and 
BOOM!


TopMemoryContext: 120544 total in 7 blocks; 9760 free (7 chunks); 110784 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 
chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 
used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1482752 total in 184 blocks; 11216 free (9 chunks); 1471536 
used:
  ExecutorState: 2361536 total in 27 blocks; 1827536 free (3163 chunks); 
534000 used
TupleSort main: 3957712 total in 22 blocks; 246792 free (39 chunks); 
3710920 used
TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 
3973120 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8192 total in 1 blocks; 7336 free (6 chunks); 856 used
  HashBatchContext: 2523874568 total in 76816 blocks; 7936 free (0 
chunks); 2523866632 used
TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 
used
  Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 
812656 used
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 

Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 08:33:46PM -0400, Gunther wrote:

On 4/20/2019 16:01, Tomas Vondra wrote:

For me, this did the trick:
 update pg_class set (relpages, reltuples) = (100, 1) where 
relname = 'tmp_r';
 update pg_class set (relpages, reltuples) = (1, 100) where 
relname = 'tmp_q';



YES! For me too. My EXPLAIN ANALYZE actually succeeded.

Hash Right Join  (cost=11009552.27..11377073.28 rows=11 width=4271) (actual 
time=511580.110..1058354.140 rows=113478386 loops=1)
  Hash Cond: (((q.documentinternalid)::text = (r.documentinternalid)::text) AND 
((q.actinternalid)::text = (r.actinternalid)::text))
  ->  Seq Scan on tmp_q q  (cost=0.00..210021.00 rows=2100 width=3417) 
(actual time=1.148..1.387 rows=236 loops=1)
  ->  Hash  (cost=11009552.11..11009552.11 rows=11 width=928) (actual 
time=511577.002..511577.002 rows=113478127 loops=1)
Buckets: 16384 (originally 1024)  Batches: 131072 (originally 1)  
Memory Usage: 679961kB
->  Seq Scan on tmp_r r  (cost=0.00..11009552.11 rows=11 width=928) 
(actual time=4.077..344558.954 rows=113478127 loops=1)
Planning Time: 0.725 ms
Execution Time: 1064128.721 ms

But it used a lot of resident memory, and now it seems like I actually 
have a leak! Because after the command returned as shown above, the 
memory is still allocated:


 PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
7100 postgres  20   0 2164012   1.1g 251364 S   0.0 14.5  23:27.23 postgres: 
postgres integrator [local] idle

and let's do the memory map dump:

2019-04-20 22:09:52.522 UTC [7100] LOG:  duration: 1064132.171 ms  statement: 
explain analyze
   SELECT *
 FROM tmp_q q
 RIGHT OUTER JOIN tmp_r r
   USING(documentInternalId, actInternalId);
TopMemoryContext: 153312 total in 8 blocks; 48168 free (70 chunks); 105144 used
 HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
 Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 
used
 TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
 Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
 RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
 MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
 Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
 smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
 TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
 Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
 TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
 Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
 CacheMemoryContext: 1154080 total in 20 blocks; 151784 free (1 chunks); 
1002296 used
   index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: 
pg_class_tblspc_relfilenode_index
   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_toast_2619_index
   index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: 
entity_id_fkidx
   index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: 
entity_id_idx
   ...
   index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: 
pg_attribute_relid_attnum_index
   index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: 
pg_class_oid_index
 WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 
used
 PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
 MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
 LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
 Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
 ErrorContext: 8192 total in 1 blocks; 7936 free (3 chunks); 256 used
Grand total: 2082624 bytes in 240 blocks; 382760 free (175 chunks); 1699864 used

Strange, it shows no leak here. Now I run this test again to see if 
the memory grows further in top. This time I also leave the DISTINCT 
step in the query, trying to hit the out of memory situation. 
Well, I clearly saw memory growing now:




Unfortunately, interpreting RES is way more complicated. The trouble is
PostgreSQL does not get memory from the kernel directly, it gets it through
glibc. So when we do free(), it's not guaranteed the kernel gets it back.

Also, I think glibc has multiple ways of getting memory from the kernel.
It can use either mmap or sbrk, and AFAIK it's easy to create "islands" that
make it impossible to undo sbrk after freeing memory.

Memory leaks in PostgreSQL are usually about allocating memory in the
wrong context, and so are visible in MemoryContextStats. Permanent leaks
that don't show there are quite rare.
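
For illustration (a generic sketch, not code from this thread): the usual
discipline is to allocate in the context whose lifetime matches the data, so
a misplaced allocation shows up under the wrong heading in
MemoryContextStats. The econtext variable here is hypothetical:

    /* Per-tuple scratch data belongs in a short-lived context; it is
     * freed en masse when the per-tuple context is reset. */
    MemoryContext oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
    char   *scratch = palloc(128);
    MemoryContextSwitchTo(oldcxt);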



Also, I think while we might have focused in on a peculiar planning 
situation where a very unfortunate plan is chosen which stresses the 
memory situation, the real reason for the final 

Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tom Lane
Justin Pryzby  writes:
> On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
>> Maybe we just need to account for the per-batch buffers while estimating
>> the amount of memory used during planning.  That would force this case
>> into a mergejoin instead, given that work_mem is set so small.

> Do you mean by adding disable_cost if work_mem is so small that it's estimated
> to be exceeded ?

No, my point is that ExecChooseHashTableSize fails to consider the
I/O buffers at all while estimating hash table size.  It's not
immediately obvious how to factor that in, but we should.
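
Just to sketch the shape of it (not committed behavior; inner_rel_bytes and
hash_table_bytes are the budget variables ExecChooseHashTableSize already
works with): the nbatch computation could stop doubling once the buffer
overhead itself would eat the budget:

    /* Sketch: split further only while the per-batch inner size exceeds
     * the budget AND the resulting buffers would still fit within it. */
    while (inner_rel_bytes / nbatch > hash_table_bytes &&
           (Size) nbatch * 2 * BLCKSZ < hash_table_bytes)
        nbatch *= 2;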

If Tomas is right that there's also an underestimate of the number
of rows here, that might not solve Gunther's immediate problem; but
it seems like a clear costing oversight.

There's also the angle that the runtime code acts as though increasing
the number of batches is free, while it clearly isn't when you think
about the I/O buffers.  So at some point we should probably stop
increasing the number of batches on the grounds of needing too many
buffers.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Justin Pryzby
On Sat, Apr 20, 2019 at 04:46:03PM -0400, Tom Lane wrote:
> Tomas Vondra  writes:
> > I think it's really a matter of underestimate, which convinces the planner
> > to hash the larger table. In this case, the table is 42GB, so it's
> > possible it actually works as expected. With work_mem = 4MB I've seen 32k
> > batches, and that's not that far off, I'd day. Maybe there are more common
> > values, but it does not seem like a very contrived data set.
> 
> Maybe we just need to account for the per-batch buffers while estimating
> the amount of memory used during planning.  That would force this case
> into a mergejoin instead, given that work_mem is set so small.

Do you mean by adding disable_cost if work_mem is so small that it's estimated
to be exceeded ?

Justin




Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 10:36:50PM +0200, Tomas Vondra wrote:

On Sat, Apr 20, 2019 at 04:26:34PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

Considering how rare this issue likely is, we need to be looking for a
solution that does not break the common case.


Agreed.  What I think we need to focus on next is why the code keeps
increasing the number of batches.  It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.



I think it's really a matter of underestimation, which convinces the planner
to hash the larger table. In this case, the table is 42GB, so it's
possible it actually works as expected. With work_mem = 4MB I've seen 32k
batches, and that's not that far off, I'd say. Maybe there are more common
values, but it does not seem like a very contrived data set.



Actually, I might have spoken too soon. I've done some stats on the sample
data. There are 113478127 rows in total, and while most UUIDs are unique,
there are UUIDs that represent ~10% of the data. So maybe there really is
something broken in disabling the growth.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

test=# select c1, count(*) as c2 from (select count(*) as c1 from tmp_r group 
by documentinternalid) foo group by c1;
c1|c2 
--+---
1 | 67884
2 | 33033
3 |  9478
4 | 11205
5 |  2725
6 |  3457
7 |   677
8 |  1992
9 |   961
   10 |   857
   11 |   199
   12 |  1397
   13 |81
   14 |   289
   15 |   630
   16 |   561
   17 |69
   18 |   360
   19 |42
   20 |   367
   21 |   232
   22 |   121
   23 |49
   24 |   380
   25 |   113
   26 |85
   27 |93
   28 |   117
   29 |21
   30 |   136
   31 |32
   32 |   102
   33 |24
   34 |17
   35 |24
   36 |   145
   37 |37
   38 |15
   39 |17
   40 |54
   41 |15
   42 |65
   43 | 8
   44 |48
   45 |26
   46 |12
   47 | 5
   48 |88
   49 |80
   50 |26
   51 |10
   52 |10
   53 | 1
   54 |68
   55 | 6
   56 |31
   57 | 6
   58 | 7
   59 | 3
   60 |43
   61 | 1
   62 | 5
   63 | 7
   64 |24
   65 |10
   66 |24
   68 | 8
   69 | 2
   70 | 9
   71 | 2
   72 |31
   76 | 7
   77 | 5
   78 | 7
   79 | 3
   80 |22
   81 | 7
   82 | 4
   84 |26
   85 | 2
   86 | 4
   88 | 6
   90 |17
   91 | 5
   94 | 5
   96 |23
   98 |86
   99 | 9
  100 | 5
  102 | 2
  104 | 5
  105 | 2
  106 | 1
  108 |12
  110 |17
  112 | 6
  113 | 1
  114 | 9
  120 | 6
  121 | 8
  122 | 1
  124 |10
  126 | 3
  128 | 6
  129 | 2
  132 |10
  135 | 2
  136 | 1
  138 | 2
  140 | 2
  142 | 1
  144 | 9
  146 | 1
  147 |15
  148 | 2
  150 | 5
  152 | 4
  154 | 2
  156 | 3
  158 | 1
  160 | 9
  162 | 3
  168 | 2
  174 | 3
  176 | 3
  177 | 1
  178 | 1
  180 | 4
  181 | 2
  183 | 1
  186 | 2
  188 | 5
  192 | 2
  194 | 4
  196 |35
  198 | 2
  200 | 3
  208 | 4
  213 | 1
  216 | 4
  220 | 2
  223 | 2
  224 | 9
  227 | 1
  228 | 4
  234 | 1
  240 | 4
  241 | 1
  245 | 9
  248 | 1
  252 | 5
  256 | 7
  268 | 1
  272 | 3
  280 | 2
  282 | 1
  283 | 3
  288 | 4
  294 |24
  300 | 1
  311 | 1
  315 | 1
  320 | 2
  336 | 4
  343 | 1
  352 | 1
  356 | 5
  357 | 2
  360 | 2
  372 | 2
  375 | 1
  378 | 1
  384 | 3
  392 | 9
  393 | 1
  396 | 2
  400 | 3
  403 | 1
  405 | 1
  418 | 1
  432 | 3
  441 |11
  442 | 1
  443 | 1
  448 | 2
  450 | 1
  454 | 1
  456 | 2
  462 | 2
  475 | 1
  476 | 1
  480 | 2
  488 | 2
  490 | 6
  495 |  

Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tom Lane
Gunther  writes:
> and checked my log file and there was nothing before the call 
> MemoryContextStats(TopPortalContext) so I don't understand where this 
> printf stuff is ending up.

It's going to stdout, which is likely block-buffered whereas stderr
is line-buffered, so data from the latter will show up in your log
file much sooner.  You might consider adding something to startup
to switch stdout to line buffering.
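
For example (one way to do that, not something the tree does today): a single
setvbuf() call early in startup makes stdout line-buffered too, so the HJDEBUG
printf output reaches the log about as promptly as stderr does:

    #include <stdio.h>

    /* call once, early in process startup, before any printf */
    setvbuf(stdout, NULL, _IOLBF, 0);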

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 04:26:34PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

Considering how rare this issue likely is, we need to be looking for a
solution that does not break the common case.


Agreed.  What I think we need to focus on next is why the code keeps
increasing the number of batches.  It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.



I think it's really a matter of underestimation, which convinces the planner
to hash the larger table. In this case, the table is 42GB, so it's
possible it actually works as expected. With work_mem = 4MB I've seen 32k
batches, and that's not that far off, I'd say. Maybe there are more common
values, but it does not seem like a very contrived data set.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tom Lane
Tomas Vondra  writes:
> Considering how rare this issue likely is, we need to be looking for a
> solution that does not break the common case.

Agreed.  What I think we need to focus on next is why the code keeps
increasing the number of batches.  It seems like there must be an undue
amount of data all falling into the same bucket ... but if it were simply
a matter of a lot of duplicate hash keys, the growEnabled shutoff
heuristic ought to trigger.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Gunther

On Tue, Apr 16, 2019 at 11:46:51PM -0500, Justin Pryzby wrote:


I wonder if it'd be useful to compile with
./configure CFLAGS=-DHJDEBUG=1

Could you try this, too ?


OK, doing it now, here is what I'm getting in the log file now. I am 
surprised I get so few rows here when there


2019-04-20 17:12:16.077 UTC [7093] LOG:  database system was shut down at 
2019-04-20 17:12:15 UTC
2019-04-20 17:12:16.085 UTC [7091] LOG:  database system is ready to accept 
connections
Hashjoin 0x118e0c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x118e0f8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1194e78: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119b518: initial nbatch = 16, nbuckets = 65536
Hashjoin 0x1194e78: initial nbatch = 1, nbuckets = 65536
Hashjoin 0x119bb38: initial nbatch = 16, nbuckets = 65536
TopMemoryContext: 4347672 total in 9 blocks; 41784 free (19 chunks); 4305888 
used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 
chunks); 6736 used
  TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 
used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 32768 total in 3 blocks; 13488 free (1 chunks); 19280 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 
used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
PortalContext: 1474560 total in 183 blocks; 6152 free (8 chunks); 1468408 
used:
  ExecutorState: 2234501600 total in 266274 blocks; 3696112 free (17274 
chunks); 2230805488 used
HashTableContext: 32768 total in 3 blocks; 17272 free (8 chunks); 15496 
used
  HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used
HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
  HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used
HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
  HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 
used
HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
  HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
  HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 
40120 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 
40120 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 
8389408 used
  HashBatchContext: 177003344 total in 5387 blocks; 7936 free (0 
chunks); 176995408 used
TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 
326632 used
  Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 
1026968 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 

Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 02:30:09PM -0500, Justin Pryzby wrote:

On Sun, Apr 14, 2019 at 11:24:59PM -0400, Tom Lane wrote:

Gunther  writes:
>ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 
chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.


On Tue, Apr 16, 2019 at 11:30:19AM -0400, Tom Lane wrote:

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the


On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:

   ->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual 
time=199792.446..199792.446 rows=113478127 loops=1)
 Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  
Memory Usage: 189207kB


Is it significant that there are ~2x as many ExecutorState blocks as there are
batches ?  266261/131072 => 2.03...



IMO that confirms this is the issue with BufFile I just described, because
the struct is >8K, so it's allocated as a separate block (it exceeds the
threshold in AllocSet). And we have two BufFile(s) for each batch, because
we need to batch both the inner and outer relations.


If there was 1 block leaked when batch=2, and 2 blocks leaked when batch=4,
and 4 blocks leaked when batch=8, then when batch=131072, there'd be 64k
leaked blocks, and 131072 total blocks.

I'm guessing Tom probably already thought of this, but:
2230341056/266261 => ~8376


Well, the BufFile is 8272 on my system, so that's pretty close ;-)


which is pretty close to the 8kB I/O buffer you were talking about (if the
number of same-sized buffers is much greater than other allocations).

If Tom thinks (as I understand) that the issue is *not* a memory leak, but out
of control increasing of nbatches, and failure to account for their size...then
this patch might help.

The number of batches is increased to avoid exceeding work_mem.  With very low
work_mem (or a very large number of tuples hashed), it'll try to use a large
number of batches.  At some point the memory used by the BatchFiles structure
(increasing by powers of two) itself exceeds work_mem.

With larger work_mem, there's less need for more batches.  So the number of
batches used for small work_mem needs to be constrained, whether based on
work_mem or by an absolute cap.

With my patch, the number of batches is nonlinear WRT work_mem, and reaches a
maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.



Yeah. The patch might be enough for debugging, but it's clearly not
something we could adopt as is, because we increase the number of batches
for a reason - we need to do that to keep the amount of memory needed for
the hash table contents (i.e. rows) below work_mem. If you just cap the
number of batches, you'll keep the amount of memory for BufFile under
control, but the hash table may exceed work_mem.

Considering how rare this issue likely is, we need to be looking for a
solution that does not break the common case.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Tomas Vondra

On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:


...

It would be so nice if there was a way to force a specific plan for 
purposes of the testing.  I tried giving false data in pg_class 
reltuples and relpages:


foo=# analyze tmp_q;
ANALYZE
foo=# analyze tmp_r;
ANALYZE
foo=# select relname, relpages, reltuples from pg_class where relname in 
('tmp_q', 'tmp_r');
 relname | relpages |  reltuples
---------+----------+-------------
 tmp_r   |  5505039 | 1.13467e+08
 tmp_q   |        7 |         236
(2 rows)

foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where 
relname = 'tmp_q';
UPDATE 1
foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 
'tmp_r';
UPDATE 1

but that didn't help. Somehow the planner outsmarts every such trick, 
so I can't get it to follow my right outer join plan where the big 
table is hashed.  I am sure y'all know some way to force it.




That does not work, because the planner does not actually use these values
directly - it only computes the density from them, and then multiplies
that by the current number of pages in the file. That behaves much nicer
when the table grows/shrinks between refreshes of the pg_class values.
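
To make that concrete (illustrative numbers): the planner's estimate is
roughly density times current size,

    estimated rows ~= (reltuples / relpages) * actual_pages_now

so faking (relpages, reltuples) = (100, 1) for tmp_r gives a density of
0.01 tuples/page, and 0.01 * 5505039 actual pages ~= 55k estimated rows
instead of the true 113M.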

So what you need to do is tweak these values to skew the density in a way
that then results in the desired estimate when multiplied by the actual
number of pages. For me, this did the trick:

 update pg_class set (relpages, reltuples) = (100, 1)
 where relname = 'tmp_r';

 update pg_class set (relpages, reltuples) = (1, 100)
 where relname = 'tmp_q';

after which I get a plan like this:

  Hash Right Join
Hash Cond: (...)
->  Seq Scan on tmp_q q
->  Hash
  ->  Seq Scan on tmp_r r

As for the issue, I have a theory that I think would explain the issues.
It is related to the number of batches, as others speculated over here.
It's not a memory leak, though, it's just that each batch requires a lot
of extra memory and we don't account for that.

The trouble is - each batch is represented by BufFile, which is a whopping
8272 bytes, because it includes PGAlignedBlock. Now, with 131072 batches,
that's a nice 1GB of memory right there. And we don't actually account for
this memory in hashjoin code, so it's not counted against work_mem and we
just increase the number of batches.
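
For illustration, a heavily abridged sketch of why the struct is that big
(the real definition lives in src/backend/storage/file/buffile.c; the
bookkeeping fields are elided here):

    typedef union PGAlignedBlock  /* from c.h: an 8 kB block, alignment forced */
    {
        char    data[8192];       /* BLCKSZ */
        double  force_align_d;
        int64   force_align_i64;
    } PGAlignedBlock;

    typedef struct BufFile
    {
        int     numFiles;         /* ...plus several more bookkeeping fields... */
        /* ... */
        PGAlignedBlock buffer;    /* the 8 kB I/O buffer that dominates sizeof() */
    } BufFile;

That makes sizeof(BufFile) come out to 8272 bytes here, and 2 sides x
131072 batches x 8272 bytes ~= 2.2 GB, which matches the ~2.23 GB
ExecutorState totals reported upthread.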

Attached are two patches that should help us confirm that's actually
what's happening when running the query on actual data. The first patch
moves the BufFile stuff into a separate memory context, to make it more
obvious where the memory went. It also adds a bunch of logging into the
ExecHashIncreaseNumBatches() function.

The second patch makes sure all the BufFiles are allocated right when
increasing the number of batches - otherwise we allocate them only when we
actually find a row for that batch, and I suspect the sample data shared
on this thread are somewhat correlated (I see long runs of the same UUID
value). That might slow down the memory growth. Of course, the real data
probably don't have such correlation, resulting in faster failures.

With the patch, I see stuff like this with 256k batches:

 ExecutorState: 65536 total in 4 blocks; 28136 free (4 chunks); 37400 used
   HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
 hash batch files: 4404002656 total in 524302 blocks; 8387928 free (20 
chunks); 4395614728 used

so it's conceivable it's the root cause.

As for a fix, I'm not sure. I'm pretty sure we need to consider the amount
of memory for BufFile(s) when increasing the number of batches. But we
can't just stop incrementing the batches, because that would mean the
current batch may easily get bigger than work_mem :-(

I think we might cap the number of batches kept in memory, and at some
point start spilling data into an "overflow batch." So for example we'd
allow 32k batches, and then instead of increasing nbatch to 64k, we'd
create a single "overflow batch" representing batches 32k - 64k. After
processing the first 32k batches, we'd close the files and reuse the
memory for the next 32k batches. We'd read the overflow batch, split it
into the 32k batches, and just process them as usual. Of course, there
might be multiple rounds of this, for example we might end up with 32k
concurrent batches but 128k virtual ones, which means we'd do 4 rounds of
this dance.

It's a bit inefficient, but situations like this should be rather rare,
and it's more graceful than just crashing with OOM.
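
A toy illustration of that bookkeeping (not PostgreSQL code; the constant
and names are invented for the sketch) - how a tuple's virtual batch
number would map to one of the open files, or to the single overflow
file, in each round:

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_CONCURRENT 32768    /* batch files kept open at once */

    /* Which file does a tuple go to in a given round?  'batchno' is the
     * tuple's final (virtual) batch number; rounds count from 0. */
    static int file_for_tuple(uint32_t batchno, int round)
    {
        uint32_t lo = (uint32_t) round * MAX_CONCURRENT;

        if (batchno < lo)
            return -1;                      /* already processed */
        if (batchno < lo + MAX_CONCURRENT)
            return (int) (batchno - lo);    /* one of the open files */
        return MAX_CONCURRENT;              /* the single overflow file */
    }

    int main(void)
    {
        uint32_t nbatch = 131072;           /* virtual batches */
        int rounds = (int) ((nbatch + MAX_CONCURRENT - 1) / MAX_CONCURRENT);

        printf("%u virtual batches, %d concurrent => %d rounds\n",
               nbatch, MAX_CONCURRENT, rounds);
        printf("batch 40000, round 0 -> file %d (the overflow file)\n",
               file_for_tuple(40000, 0));
        printf("batch 40000, round 1 -> file %d\n",
               file_for_tuple(40000, 1));
        return 0;
    }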

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From 431f2442f463c16b4f8f7a4c268c1533ed1aca84 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 20 Apr 2019 21:44:37 +0200
Subject: [PATCH 1/2] move BufFile stuff into separate context

---
 src/backend/executor/nodeHash.c | 48 +
 src/backend/executor/nodeHashjoin.c |  7 +
 src/include/executor/hashjoin.h |  1 

Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Justin Pryzby
On Sun, Apr 14, 2019 at 11:24:59PM -0400, Tom Lane wrote:
> Gunther  writes:
> >ExecutorState: 2234123384 total in 266261 blocks; 3782328 free 
> > (17244 chunks); 2230341056 used
> 
> Oooh, that looks like a memory leak right enough.  The ExecutorState
> should not get that big for any reasonable query.

On Tue, Apr 16, 2019 at 11:30:19AM -0400, Tom Lane wrote:
> Hmm ... this matches up with a vague thought I had that for some reason
> the hash join might be spawning a huge number of separate batches.
> Each batch would have a couple of files with associated in-memory
> state including an 8K I/O buffer, so you could account for the

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
>->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual 
> time=199792.446..199792.446 rows=113478127 loops=1)
>  Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  
> Memory Usage: 189207kB

Is it significant that there are ~2x as many ExecutorState blocks as there are
batches ?  266261/131072 => 2.03...

If there was 1 block leaked when batch=2, and 2 blocks leaked when batch=4,
and 4 blocks leaked when batch=8, then when batch=131072, there'd be 64k
leaked blocks, and 131072 total blocks.

I'm guessing Tom probably already thought of this, but:
2230341056/266261 => ~8376
which is pretty close to the 8kB I/O buffer you were talking about (if the
number of same-sized buffers is much greater than other allocations).

If Tom thinks (as I understand) that the issue is *not* a memory leak, but out
of control increasing of nbatches, and failure to account for their size...then
this patch might help.

The number of batches is increased to avoid exceeding work_mem.  With very low
work_mem (or a very large number of tuples hashed), it'll try to use a large
number of batches.  At some point the memory used by the BatchFiles structure
(increasing by powers of two) itself exceeds work_mem.
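
(Concretely: at ~8 kB of buffer per batch per side, batch-file overhead
alone passes work_mem = 4MB once nbatch exceeds about 4MB / 8kB = 512 per
side - long before the 131072 batches seen in this thread.)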

With larger work_mem, there's less need for more batches.  So the number of
batches used for small work_mem needs to be constrained, whether based on
work_mem or by an absolute cap.

With my patch, the number of batches is nonlinear WRT work_mem, and reaches a
maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.

Justin
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91..4e99093 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1036,8 +1036,11 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * Increasing nbatch will not fix it since there's no way to subdivide the
 	 * group any more finely. We have to just gut it out and hope the server
 	 * has enough RAM.
+	 * Also avoid increasing nbatch if an additional nbatch would cause
+	 * overhead of batchFiles alone to exceed work_mem.
 	 */
 	if (nfreed == 0 || nfreed == ninmemory)
		// && (hashtable->nbatch < hashtable->spaceAllowed/8192))
 	{
 		hashtable->growEnabled = false;
 #ifdef HJDEBUG
@@ -1655,8 +1658,18 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed +
 			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
-			> hashtable->spaceAllowed)
-			ExecHashIncreaseNumBatches(hashtable);
+			> hashtable->spaceAllowed && hashtable->growEnabled) {
+			/*
+			 * spaceUsed doesn't include the overhead of
+			 * BatchFile structure.  If the overhead of an
+			 * additional 2x batch files would use more than the
+			 * space itself, do not grow...
+			 */
+			if (2 * hashtable->nbatch < hashtable->spaceUsed/8192)
+ExecHashIncreaseNumBatches(hashtable);
+			else
+hashtable->growEnabled = false;
+		}
 	}
 	else
 	{


Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Justin Pryzby
> The only problem is that I can't test that this actually would trigger the
> memory problem, because I can't force the plan to use the right join, it
> always reverts to the left join hashing the tmp_q:

I think the table on the "OUTER" side is the one which needs to be iterated
over (not hashed), in order to return each of its rows even if there are no
join partners in the other table.  In your original query, the small table was
being hashed and the large table iterated; maybe that's whats important.

> which is of course much better, but when tmp_q and tmp_r are the results of
> complex stuff that the planner can't estimate, then it gets it wrong, and
> then the issue gets triggered because we are hashing on the big tmp_r, not
> tmp_q.

I was able to get something maybe promising ?  "Batches: 65536 (originally 1)"

I didn't get "Out of memory" error yet, but did crash the server with this one:
postgres=# explain analyze WITH v AS( SELECT * FROM 
generate_series(1,)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 
AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND 
i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;

Note, on pg12dev this needs to be "with v AS MATERIALIZED".

postgres=# SET work_mem='128kB';SET client_min_messages =log;SET 
log_statement_stats=on;explain(analyze,timing off) WITH v AS( SELECT * FROM 
generate_series(1,99)i WHERE i%10<10 AND i%11<11 AND i%12<12 AND i%13<13 
AND i%14<14 AND i%15<15 AND i%16<16 AND i%17<17 AND i%18<18 AND i%19<19 AND 
i%20<20 AND i%21<21 ) SELECT * FROM generate_series(1,99)k JOIN v ON k=i ;
 Hash Join  (cost=70.04..83.84 rows=5 width=8) (actual rows=99 loops=1)
   Hash Cond: (k.k = v.i)
   CTE v
 ->  Function Scan on generate_series i  (cost=0.00..70.00 rows=1 width=4) 
(actual rows=99 loops=1)
   Filter: (((i % 10) < 10) AND ((i % 11) < 11) AND ((i % 12) < 12) AND 
((i % 13) < 13) AND ((i % 14) < 14) AND ((i % 15) < 15) AND ((i % 16) < 16) AND 
((i % 17) < 17) AND ((i % 18) < 18) AND ((i % 19) < 19) AND ((i % 20) < 20) AND 
((i % 21) < 21))
   ->  Function Scan on generate_series k  (cost=0.00..10.00 rows=1000 width=4) 
(actual rows=99 loops=1)
   ->  Hash  (cost=0.02..0.02 rows=1 width=4) (actual rows=99 loops=1)
 Buckets: 4096 (originally 1024)  Batches: 512 (originally 1)  Memory 
Usage: 101kB
 ->  CTE Scan on v  (cost=0.00..0.02 rows=1 width=4) (actual 
rows=99 loops=1)

Justin




Re: Out of Memory errors are frustrating as heck!

2019-04-20 Thread Justin Pryzby
On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:
> On 4/19/2019 17:01, Justin Pryzby wrote:
> >Were you able to reproduce the issue in some minimized way ?  Like after
> >joining fewer tables or changing to join with fewer join conditions ?
> >
> >On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
> >>It would be possible to do at least one of these two things:
> 
> Thanks, and sorry for my pushiness. Yes, I have pinpointed the HashJoin,
> and I have created the two tables involved.
> 
> The data distribution of the join keys, they are all essentially UUIDs and
> essentially random.
> 
> I am sharing this data with you. However, only someone who can actually
> control the planner can use it to reproduce the problem. I have tried but
> not succeeded. But I am sure the problem is reproduced by this material.
> 
> Here is the part of the plan that generates this massive number of calls to
> 
> ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 
> width=1197)
>   Hash Cond: (((q.documentinternalid)::text = 
> (documentinformationsubject.documentinternalid)::text) AND 
> ((r.targetinternalid)::text = 
> (documentinformationsubject.actinternalid)::text))
>   ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 
> width=341)
> Hash Cond: 
> (((documentinformationsubject_2.documentinternalid)::text = 
> (q.documentinternalid)::text) AND 
> ((documentinformationsubject_2.actinternalid)::text = 
> (q.actinternalid)::text))
>   ... let's call this tmp_q ...
>   ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)
> ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 
> width=930)
> ... let's call this tmp_r ...

Would you send basic stats for these ?
q.documentinternalid, documentinformationsubject.documentinternalid, 
r.targetinternalid, documentinformationsubject.actinternalid

Like from this query
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

Justin




Re: Out of Memory errors are frustrating as heck!

2019-04-19 Thread Jeremy Schneider
pg_hint_plan extension might be able to force a plan.
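
Something along these lines might do it (a sketch, assuming pg_hint_plan
is installed; hint spellings per that extension's documentation, with
Leading((q r)) making q the outer side so that r gets hashed):

    /*+ Leading((q r)) HashJoin(q r) */
    EXPLAIN ANALYZE
    SELECT *
      FROM tmp_q q
      RIGHT OUTER JOIN tmp_r r
     USING (documentinternalid, actinternalid);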

Also, I don't know if perf probes & perf record/script could be useful for
creating a log of all the memory-allocation calls along with their unwound
call stacks, and then analyzing that file? At least this can be done for a
single process, and just while the problematic SQL is running.

-Jeremy

Sent from my TI-83

> On Apr 19, 2019, at 20:34, Gunther  wrote:
> 
>> On 4/19/2019 17:01, Justin Pryzby wrote:
>> Were you able to reproduce the issue in some minimized way ?  Like after
>> joining fewer tables or changing to join with fewer join conditions ?
>> 
>> On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
>>> It would be possible to do at least one of these two things:
> Thanks, and sorry for my pushiness. Yes, I have pinpointed the HashJoin, and
> I have created the two tables involved.
> 
> The data distribution of the join keys, they are all essentially UUIDs and 
> essentially random.
> 
> I am sharing this data with you. However, only someone who can actually 
> control the planner can use it to reproduce the problem. I have tried but not 
> succeeded. But I am sure the problem is reproduced by this material.
> 
> Here is the part of the plan that generates this massive number of calls to 
> 
> ->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 
> width=1197)
>   Hash Cond: (((q.documentinternalid)::text = 
> (documentinformationsubject.documentinternalid)::text) AND 
> ((r.targetinternalid)::text = 
> (documentinformationsubject.actinternalid)::text))
>   ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 
> width=341)
> Hash Cond: 
> (((documentinformationsubject_2.documentinternalid)::text = 
> (q.documentinternalid)::text) AND 
> ((documentinformationsubject_2.actinternalid)::text = 
> (q.actinternalid)::text))
>   ... let's call this tmp_q ...
>   ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)
> ->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 
> width=930)
> ... let's call this tmp_r ...
> This can be logically reduced to the following query
> 
> SELECT *
>   FROM tmp_q q
>   RIGHT OUTER JOIN tmp_r r
> USING(documentInternalId, actInternalId);
> with the following two tables
> 
> CREATE TABLE xtmp_q (
>  documentinternalidcharacter varying(255),
>  operationqualifiercodecharacter varying(512),
>  operationqualifiernamecharacter varying(512),
>  actinternalid character varying(255),
>  approvalinternalidcharacter varying(255),
>  approvalnumbercharacter varying(555),
>  approvalnumbersystem  character varying(555),
>  approvalstatecode character varying(512),
>  approvalstatecodesystem   character varying(512),
>  approvaleffectivetimelow  character varying(512),
>  approvaleffectivetimehigh character varying(512),
>  approvalstatuscodecharacter varying(32),
>  licensecode   character varying(512),
>  agencyid  character varying(555),
>  agencynametext
> );
> 
> CREATE TABLE tmp_r (
>  documentinternalid  character varying(255),
>  is_current  character(1),
>  documentid  character varying(555),
>  documenttypecodecharacter varying(512),
>  subjectroleinternalid   character varying(255),
>  subjectentityinternalid character varying(255),
>  subjectentityid character varying(555),
>  subjectentityidroot character varying(555),
>  subjectentityname   character varying,
>  subjectentityteltext,
>  subjectentityemail  text,
>  otherentityinternalid   character varying(255),
>  confidentialitycode character varying(512),
>  actinternalid   character varying(255),
>  operationcode   character varying(512),
>  operationname   text,
>  productitemcode character varying(512),
>  productinternalid   character varying(255)
> );
> you can download the data here (URLs just a tiny bit obfuscated):
> 
> The small table http:// gusw dot net/tmp_q.gz
> 
> The big table is in the form of 9 parts of 20 MB each, http:// gusw dot 
> net/tmp_r.gz.00, .01, .02, ..., .09, maybe you need only the first part.
> 
> Download as many as you have patience to grab, and then import the data like 
> this:
> 
> \copy tmp_q from program 'zcat tmp_q.gz'
> \copy tmp_r from program 'cat tmp_r.gz.* |zcat'
> The only problem is that I can't test that this actually would trigger the 
> memory problem, because I can't force the plan to use the right join, it 
> always reverts to the left join hashing the tmp_q:
> 
> ->  Hash Left Join  (cost=10.25..5601401.19 rows=5505039 width=12118)
>   Hash Cond: (((r.documentinternalid)::text = 
> (q.documentinternalid)::text) AND ((r.actinternalid)::text = 
> (q.actinternalid)::text))
>   ->  Seq 

Re: Out of Memory errors are frustrating as heck!

2019-04-19 Thread Gunther

On 4/19/2019 17:01, Justin Pryzby wrote:

Were you able to reproduce the issue in some minimized way ?  Like after
joining fewer tables or changing to join with fewer join conditions ?

On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:

It would be possible to do at least one of these two things:


Thanks, and sorry for my pushiness. Yes, I have pinpointed the
HashJoin, and I have created the two tables involved.


The data distribution of the join keys: they are all essentially UUIDs,
and essentially random.


I am sharing this data with you. However, only someone who can actually 
control the planner can use it to reproduce the problem. I have tried 
but not succeeded. But I am sure the problem is reproduced by this material.


Here is the part of the plan that generates this massive number of calls to

->  Hash Right Join  (cost=4255031.53..5530808.71 rows=34619 width=1197)
  Hash Cond: (((q.documentinternalid)::text = 
(documentinformationsubject.documentinternalid)::text) AND 
((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
  ->  Hash Right Join  (cost=1341541.37..2612134.36 rows=13 
width=341)
Hash Cond: 
(((documentinformationsubject_2.documentinternalid)::text = 
(q.documentinternalid)::text) AND 
((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
  ... let's call this tmp_q ...
  ->  Hash  (cost=2908913.87..2908913.87 rows=34619 width=930)
->  Gather Merge  (cost=2892141.40..2908913.87 rows=34619 
width=930)
... let's call this tmp_r ...

This can be logically reduced to the following query

SELECT *
  FROM tmp_q q
  RIGHT OUTER JOIN tmp_r r
USING(documentInternalId, actInternalId);

with the following two tables

CREATE TABLE xtmp_q (
 documentinternalidcharacter varying(255),
 operationqualifiercodecharacter varying(512),
 operationqualifiernamecharacter varying(512),
 actinternalid character varying(255),
 approvalinternalidcharacter varying(255),
 approvalnumbercharacter varying(555),
 approvalnumbersystem  character varying(555),
 approvalstatecode character varying(512),
 approvalstatecodesystem   character varying(512),
 approvaleffectivetimelow  character varying(512),
 approvaleffectivetimehigh character varying(512),
 approvalstatuscodecharacter varying(32),
 licensecode   character varying(512),
 agencyid  character varying(555),
 agencynametext
);

CREATE TABLE tmp_r (
 documentinternalid  character varying(255),
 is_current  character(1),
 documentid  character varying(555),
 documenttypecodecharacter varying(512),
 subjectroleinternalid   character varying(255),
 subjectentityinternalid character varying(255),
 subjectentityid character varying(555),
 subjectentityidroot character varying(555),
 subjectentityname   character varying,
 subjectentityteltext,
 subjectentityemail  text,
 otherentityinternalid   character varying(255),
 confidentialitycode character varying(512),
 actinternalid   character varying(255),
 operationcode   character varying(512),
 operationname   text,
 productitemcode character varying(512),
 productinternalid   character varying(255)
);

you can download the data here (URLs just a tiny bit obfuscated):

The small table http:// gusw dot net/tmp_q.gz

The big table is in the form of 9 parts of 20 MB each, http:// gusw dot 
net/tmp_r.gz.00, .01, .02, ..., .09, maybe you need only the first part.


Download as many as you have patience to grab, and then import the data 
like this:


\copy tmp_q from program 'zcat tmp_q.gz'
\copy tmp_r from program 'cat tmp_r.gz.* |zcat'

The only problem is that I can't test that this actually would trigger 
the memory problem, because I can't force the plan to use the right 
join, it always reverts to the left join hashing the tmp_q:


->  Hash Left Join  (cost=10.25..5601401.19 rows=5505039 width=12118)
  Hash Cond: (((r.documentinternalid)::text = 
(q.documentinternalid)::text) AND ((r.actinternalid)::text = 
(q.actinternalid)::text))
  ->  Seq Scan on tmp_r r  (cost=0.00..5560089.39 rows=5505039 
width=6844)
  ->  Hash  (cost=10.10..10.10 rows=10 width=6306)
->  Seq Scan on tmp_q q  (cost=0.00..10.10 rows=10 
width=6306)

which is of course much better, but when tmp_q and tmp_r are the results 
of complex stuff that the planner can't estimate, then it gets it wrong, 
and then the issue gets triggered because we are hashing on the big 
tmp_r, not tmp_q.


It would be so nice if there was a way to force a specific plan for 
purposes of the testing.  I tried giving false data in pg_class 
reltuples and relpages:


foo=# analyze tmp_q;
ANALYZE

Re: Out of Memory errors are frustrating as heck!

2019-04-19 Thread Justin Pryzby
On Wed, Apr 17, 2019 at 11:52:44PM -0400, Gunther wrote:
> Hi guys. I don't want to be pushy, but I found it strange that after so much

Were you able to reproduce the issue in some minimized way ?  Like after
joining fewer tables or changing to join with fewer join conditions ?

On Thu, Apr 18, 2019 at 05:21:28PM +0200, Tomas Vondra wrote:
> As for the issue - I think the current hypothesis is that the data
> distribution is skewed in some strange way, triggering some unexpected
> behavior in hash join. That seems plausible, but it's really hard to
> investigate without knowing anything about the data distribution :-(
> 
> It would be possible to do at least one of these two things:
> 
> (a) export pg_stats info about distribution of the join keys

For starts, could you send the MCVs, maybe with some variation on this query ?
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram
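
For example, something like this (a sketch against the pg_stats view;
adjust table and column names as needed):

    SELECT tablename, attname, null_frac, n_distinct,
           most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE tablename IN ('tmp_q', 'tmp_r')
       AND attname IN ('documentinternalid', 'actinternalid');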

Justin




Re: Out of Memory errors are frustrating as heck!

2019-04-19 Thread Gaetano Mendola
On Thu, Apr 18, 2019 at 6:01 AM Gunther  wrote:

> Hi guys. I don't want to be pushy, but I found it strange that after so
> much lively back and forth getting to the bottom of this, suddenly my last
> nights follow-up remained completely without reply. I wonder if it even got
> received. For those who read their emails with modern readers (I know I too
> am from a time where I wrote everything in plain text) I marked some
> important questions in bold.
>

Give Tom Lane remote access to your server, around 10 years ago I had the
server crashing and the fastest we could do is let him access the server.

G.


Re: Out of Memory errors are frustrating as heck!

2019-04-18 Thread Tomas Vondra

On Wed, Apr 17, 2019 at 11:52:44PM -0400, Gunther wrote:
Hi guys. I don't want to be pushy, but I found it strange that after 
so much lively back and forth getting to the bottom of this, suddenly 
my last nights follow-up remained completely without reply. I wonder 
if it even got received. For those who read their emails with modern 
readers (I know I too am from a time where I wrote everything in plain 
text) I marked some important questions in bold.




It was received (and it's visible in the archives). It's right before
Easter, so I guess some people may already be on vacation.

As for the issue - I think the current hypothesis is that the data
distribution is skewed in some strange way, triggering some unexpected
behavior in hash join. That seems plausible, but it's really hard to
investigate without knowing anything about the data distribution :-(

It would be possible to do at least one of these two things:

(a) export pg_stats info about distribution of the join keys

The number of tables involved in the query is not that high, and this
would allow us to generate a data set approximating your data. The one
thing this can't do is show how it's affected by WHERE conditions.

(b) export data for join keys

This is similar to (a), but it would allow filtering data by the WHERE
conditions first. The amount of data would be higher, although we only
need data from the columns used as join keys.

Of course, if those key values contain sensitive data, it may not be
possible, but perhaps you could hash it in some way.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Out of Memory errors are frustrating as heck!

2019-04-17 Thread Gunther
Hi guys. I don't want to be pushy, but I found it strange that after so 
much lively back and forth getting to the bottom of this, suddenly my 
last nights follow-up remained completely without reply. I wonder if it 
even got received. For those who read their emails with modern readers 
(I know I too am from a time where I wrote everything in plain text) I 
marked some important questions in bold.


On 4/16/2019 11:30, Tom Lane wrote:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718 {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x0084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x00702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at 
buffile.c:119
#3  0x00702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x0061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, 
hashvalue=, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x0060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, 
slot=, hashvalue=)
  at nodeHash.c:1663
#7  0x00610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at 
nodeHashjoin.c:1051

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the
"slow growth" behavior you're seeing by periodic decisions to
increase the number of batches.

You might try watching calls to ExecHashIncreaseNumBatches
and see if that theory holds water.


OK, checking that ... well yes, this breaks quickly into that, here is 
one backtrace


#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at 
nodeHash.c:893
#1  0x0060d84a in ExecHashTableInsert 
(hashtable=hashtable@entry=0x2ae8ca8, slot=slot@entry=0x2ae0238,
hashvalue=) at nodeHash.c:1655
#2  0x0060fd9c in MultiExecPrivateHash (node=) at 
nodeHash.c:186
#3  MultiExecHash (node=node@entry=0x2ac6dc8) at nodeHash.c:114
#4  0x005fe42f in MultiExecProcNode (node=node@entry=0x2ac6dc8) at 
execProcnode.c:501
#5  0x0061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) 
at nodeHashjoin.c:290
#6  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#7  0x005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at 
execProcnode.c:461
#8  0x0061ce6e in ExecProcNode (node=0x2a1dd40) at 
../../../src/include/executor/executor.h:247
#9  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#10 0x005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at 
execProcnode.c:461
#11 0x0061d2e4 in ExecProcNode (node=0x2a1dc30) at 
../../../src/include/executor/executor.h:247
#12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#13 0x005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at 
execProcnode.c:461
#14 0x005f75da in ExecProcNode (node=0x2a1d9b0) at 
../../../src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=, dest=0xcc60e0 , 
direction=, numberTuples=0,
sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x2a1d9b0, estate=0x2a1d6c0)
at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=, count=0, 
execute_once=) at execMain.c:364
#17 0x0059c718 in ExplainOnePlan 
(plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, 
es=es@entry=0x28f1048,
queryString=, params=0x0, queryEnv=queryEnv@entry=0x0, 
planduration=0x7ffcbf930080) at explain.c:535

But this is still in the warm-up phase, we don't know if it is at the 
place where memory grows too much.



This could only happen with a very unfriendly distribution of the
hash keys, I think.  There's a heuristic in there to shut off
growth of nbatch if we observe that we're making no progress at
all, but perhaps this is a skewed distribution that's not quite
skewed enough to trigger that.


Your hunch is pretty right on. There is something very weirdly 
distributed in this particular join situation.


Let's see if I can count the occurrences ... I do cont 100. Now resident
memory slowly grows, but not too much, just 122 kB, and CPU is at 88%. I
think we haven't hit the problematic part of the plan. There is a sort
merge at some leaf, which I believe is innocent. My gut feeling from the
high CPU% is that we are in one of those, since NL is disabled.


Next stage is that memory shot up to 264 kB and CPU% down to 8.6.  Heavy 
IO (write and read).


Yes! And now entering the 3rd stage, where memory shoots up to 600 kB.
This is where it starts "breaking out". And only now the 100 breakpoint
conts are used up. And within a second another 100. And even 1000 go by
in a second. cont 10000 goes by in 4 seconds. And during that time
resident memory increased to over 700 kB. Let's measure:


736096 + cont 10000 --> 740056, that is 3960 bytes for 10000 conts, or

Re: Out of Memory errors are frustrating as heck!

2019-04-17 Thread Gavin Flower

On 17/04/2019 18:01, Jean-David Beyer wrote:

On 4/16/19 6:39 PM, Gavin Flower wrote:

I suspect that most things will run a little better with some swap space.

Not always.

$ free
              total       used       free     shared    buffers     cached
Mem:       16254616   13120960    3133656      20820     646676   10765380
-/+ buffers/cache:     1708904   14545712
Swap:       4095996      17436    4078560

It's unclear what point you're trying to make, and the stats you quote
don't enlighten me.






Re: Out of Memory errors are frustrating as heck!

2019-04-17 Thread Jean-David Beyer
On 4/16/19 6:39 PM, Gavin Flower wrote:
> I suspect that most things will run a little better with some swap space.

Not always.

$ free
              total       used       free     shared    buffers     cached
Mem:       16254616   13120960    3133656      20820     646676   10765380
-/+ buffers/cache:     1708904   14545712
Swap:       4095996      17436    4078560

-- 
  .~.  Jean-David Beyer
  /V\  PGP-Key:166D840A 0C610C8B
 /( )\ Shrewsbury, New Jersey
 ^^-^^ 01:50:01 up 5 days, 56 min, 2 users, load average: 4.51, 4.59, 4.90




Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Justin Pryzby
I wonder if it'd be useful to compile with 
./configure CFLAGS=-DHJDEBUG=1
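
(With that defined, nodeHash.c emits its diagnostics with plain printf,
roughly along these lines:

    #ifdef HJDEBUG
        printf("Hashjoin %p: initial nbatch = %d, nbuckets = %d\n",
               hashtable, nbatch, nbuckets);
    #endif

so the output goes to stdout rather than through the normal logging
machinery.)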




Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Gunther

On 4/16/2019 11:30, Tom Lane wrote:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718 {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x0084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x00702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at 
buffile.c:119
#3  0x00702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x0061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, 
hashvalue=, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x0060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, 
slot=, hashvalue=)
  at nodeHash.c:1663
#7  0x00610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at 
nodeHashjoin.c:1051

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the
"slow growth" behavior you're seeing by periodic decisions to
increase the number of batches.

You might try watching calls to ExecHashIncreaseNumBatches
and see if that theory holds water.


OK, checking that ... well yes, this breaks quickly into that, here is 
one backtrace




This could only happen with a very unfriendly distribution of the
hash keys, I think.  There's a heuristic in there to shut off
growth of nbatch if we observe that we're making no progress at
all, but perhaps this is a skewed distribution that's not quite
skewed enough to trigger that.


Your hunch is pretty right on. There is something very weirdly 
distributed in this particular join situation.


#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at 
nodeHash.c:893
#1  0x0060d84a in ExecHashTableInsert 
(hashtable=hashtable@entry=0x2ae8ca8, slot=slot@entry=0x2ae0238,
hashvalue=) at nodeHash.c:1655
#2  0x0060fd9c in MultiExecPrivateHash (node=) at 
nodeHash.c:186
#3  MultiExecHash (node=node@entry=0x2ac6dc8) at nodeHash.c:114
#4  0x005fe42f in MultiExecProcNode (node=node@entry=0x2ac6dc8) at 
execProcnode.c:501
#5  0x0061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) 
at nodeHashjoin.c:290
#6  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#7  0x005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at 
execProcnode.c:461
#8  0x0061ce6e in ExecProcNode (node=0x2a1dd40) at 
../../../src/include/executor/executor.h:247
#9  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#10 0x005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at 
execProcnode.c:461
#11 0x0061d2e4 in ExecProcNode (node=0x2a1dc30) at 
../../../src/include/executor/executor.h:247
#12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#13 0x005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at 
execProcnode.c:461
#14 0x005f75da in ExecProcNode (node=0x2a1d9b0) at 
../../../src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=, dest=0xcc60e0 , 
direction=, numberTuples=0,
sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x2a1d9b0, estate=0x2a1d6c0)
at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=, count=0, 
execute_once=) at execMain.c:364
#17 0x0059c718 in ExplainOnePlan 
(plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, 
es=es@entry=0x28f1048,
queryString=, params=0x0, queryEnv=queryEnv@entry=0x0, 
planduration=0x7ffcbf930080) at explain.c:535

But this is still in the warm-up phase, we don't know if it is at the 
place where memory grows too much.


Let's see if I can count the occurrences ... I do cont 100. Now resident
memory slowly grows, but not too much, just 122 kB, and CPU is at 88%. I
think we haven't hit the problematic part of the plan.  There is a sort
merge at some leaf, which I believe is innocent. My gut feeling from the
high CPU% is that we are in one of those, since NL is disabled.


Next stage is that memory shot up to 264 kB and CPU% down to 8.6.  Heavy 
IO (write and read).


Yes! And now entering the 3rd stage, where memory shoots up to 600 kB.
This is where it starts "breaking out". And only now the 100 breakpoint
conts are used up. And within a second another 100. And even 1000 go by
in a second. cont 10000 goes by in 4 seconds. And during that time
resident memory increased to over 700 kB. Let's measure:


736096 + cont 10000 --> 740056, that is 3960 bytes for 10000 conts, or
0.396 bytes per cont. Prediction: cont 10000 will now arrive at 744016?
Aaaand ... BINGO! 744016 exactly! cont 50000 will take about 20 seconds
and will boost memory to 763816. Bets are on ... drumroll ... 35, 36 ,
... nope. This time didn't pan out. Breakpoint already hit 75727 times
ignore next 5585 hits ... memory now 984052. So it took longer this time
and memory increment was larger. We

Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Gavin Flower

On 15/04/2019 08:23, Gunther wrote:


For weeks now, I am banging my head at an "out of memory" situation. 
There is only one query I am running on an 8 GB system, whatever I 
try, I get knocked out on this out of memory. It is extremely 
impenetrable to understand and fix this error. I guess I could add a 
swap file, and then I would have to take the penalty of swapping. But 
how can I actually address an out of memory condition if the system 
doesn't tell me where it is happening?



[...]

I strongly suggest having a swap file. I've got 32GB, and I've used
2.9GB of my swap space after 4 days, but I'm not really pushing my
system.  For me, mostly stuff that is only used once, or not at all, is
swapped out.  If you do have a memory leak, then it might be easier to
diagnose if you don't run out of memory.


I suspect that most things will run a little better with some swap space.


Cheers,
Gavin







Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Tom Lane
Jeff Janes  writes:
> On Mon, Apr 15, 2019 at 9:49 PM Gunther  wrote:
>> Isn't there some other way?

> I wonder of valgrind or something like that could be of use.  I don't know
> enough about those tools to know.  One problem is that this is not really a
> leak.  If the query completely successfully, it would have freed the
> memory.  And when the query completed with an error, it also freed the
> memory.  So it just an inefficiency, not a true leak, and leak-detection
> tools might not work.  But as I said, I have not studied them.

valgrind is a useful idea, given that Gunther is building his own
postgres (so he could compile it with -DUSE_VALGRIND + --enable-cassert,
which are needed to get valgrind to understand palloc allocations).
I don't recall details right now, but it is possible to trigger
a valgrind report intra-session similar to what you get by default
at process exit.  You could wait till the memory has bloated a
good deal and then ask for one of those reports that classify
allocations by call chain (I think you want the memcheck tool for
this, not the default valgrind tool).
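
A sketch of that workflow (flag and monitor-command spellings per the
valgrind manual; details may vary by version):

    $ ./configure --enable-cassert --enable-debug CFLAGS='-DUSE_VALGRIND'
    $ valgrind --tool=memcheck --leak-check=no --vgdb=yes --vgdb-error=0 \
          postgres -D $PGDATA

    # later, once ExecutorState has bloated, from another terminal:
    $ gdb -ex 'target remote | vgdb --pid=<pid-of-backend>' /path/to/postgres
    (gdb) monitor leak_check full reachable any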

However --- at least for the case involving hash joins, I think we
have a decent fix on the problem location already: it seems to be a
matter of continually deciding to increase nbatch, and now what we
need to investigate is why that's happening.

If there's a leak that shows up without any hash joins in the plan,
then that's a separate matter for investigation.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Jeff Janes
On Mon, Apr 15, 2019 at 9:49 PM Gunther  wrote:

> Jeff Janes had more
>
> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
>> 715 {
>> (gdb) p context->name
>> $8 = 0x96ce5b "ExecutorState"
>>
>>
> I think that the above one might have been the one you wanted.
>
> Not sure how you could tell that? It's the same place as everything else.
> If we can find out what you're looking for, may be we can set a break point
> earlier up the call chain?
>

It is just a hunch.  That size is similar to one you reported for the
last-straw allocation on the case with the nested loops rather than hash
joins. So it is reasonable (but surely not guaranteed) that they are coming
from the same place in the code.  And since that other one occurred as the
last straw, that one must not have been part of the start-up
allocations, but rather the part where we should be at steady state, but
are not.  So maybe this one is too.


>
> I guess I should run this for a little longer. So I disable my breakpoints
>>
>>
> it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing
>> fast, so now back to gdb and break:
>>
> Unfortunately, I think this means you missed your opportunity and are now
> getting backtraces of the innocent bystanders.
>
> But why? If I see the memory still go up insanely fast, isn't that a sign
> for the leak?
>

You said it was no longer going up insanely fast by the time you got your
breakpoints re-activated.

> Particularly since you report that the version using nested loops rather
> than hash joins also leaked, so it is probably not the hash-join specific
> code that is doing it.
>
> How about it's in the DISTINCT? I noticed while peeking up the call chain,
> that it was already in the UNIQUE sort thing also.  I guess it's streaming
> the results from the hash join right into the unique sort step.
>

Isn't it crashing before any rows are emitted from the hash join?  I thought
it was, but I am not very confident on that.


> What I've done before is compile with the comments removed from
> src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */
>
> I have just done that and it creates an insane amount of output from all
> the processes, I'm afraid there will be no way to keep that stuff
> separated. If there was a way of turning that one and off for one process
> only, then we could probably get more info...
>

Are you doing all this stuff on a production server in use by other people?

> Everything is also extremely slow that way. Like in a half hour the memory
> didn't even reach 100 MB.
>
> and then look for allocations sizes which are getting allocated but not
> freed, and then you can go back to gdb to look for allocations of those
> specific sizes.
>
> I guess I should look for both, address and size to match it better.
>

You can analyze the log for specific addresses which are allocated but not
freed, but once you find them you can't do much with them.  Those specific
addresses probably won't repeat on the next run, so, you so can't do
anything with the knowledge.  If you find a request size that is
systematically allocated but not freed, you can condition logging (or gdb)
on that size.


> This generates a massive amount of output, and it bypasses the logging
> configuration and goes directly to stderr--so it might not end up where you
> expect.
>
> Yes, massive, like I said. Impossible to use. File system fills up
> rapidly. I made it so that it can be turned on and off, with the debugger.
>
> int _alloc_info = 0;
> #ifdef HAVE_ALLOCINFO
> #define AllocFreeInfo(_cxt, _chunk) \
> if(_alloc_info) \
> fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
> (_cxt)->header.name, (_chunk), (_chunk)->size)
> #define AllocAllocInfo(_cxt, _chunk) \
> if(_alloc_info) \
> fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
> (_cxt)->header.name, (_chunk), (_chunk)->size)
> #else
> #define AllocFreeInfo(_cxt, _chunk)
> #define AllocAllocInfo(_cxt, _chunk)
> #endif
>
> so with this I do
>
> (gdb) b AllocSetAlloc
> (gdb) cont
> (gdb) set _alloc_info=1
> (gdb) disable
> (gdb) cont
>
>
Thanks for this trick, I'll save this so I can refer back to it if I need
to do this again some time.


> then I wait, ... until it crashes again ... no, it's too much. It fills up
> my filesystem in no time with the logs.  It produced 3 GB in just a minute
> of run time.
>

You don't need to run it until it crashes, only until the preliminaries are
done and the leak has started to happen.  I would think a minute would be
more than enough, unless the leak doesn't start until some other join runs
to completion or something.  You can turn on the logging only after
ExecutorState has obtained a size which is larger than we think it has any
reason to be.  If you still have the log around, you can still analyze it.


> And also, I doubt we can find anything specifically by allocation size.
> It's just going to be 512 or whatever.
>

For 

Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Tom Lane
Gunther  writes:
> And there we go:

> Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> 718 {
> (gdb) bt 8
> #0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
> #1  0x0084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
> #2  0x00702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at 
> buffile.c:119
> #3  0x00702e4c in makeBufFile (firstfile=68225) at buffile.c:138
> #4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
> #5  0x0061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, 
> hashvalue=, fileptr=0x6305b00) at nodeHashjoin.c:1220
> #6  0x0060d766 in ExecHashTableInsert 
> (hashtable=hashtable@entry=0x2b50ad8, slot=, 
> hashvalue=)
>  at nodeHash.c:1663
> #7  0x00610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at 
> nodeHashjoin.c:1051

Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the
"slow growth" behavior you're seeing by periodic decisions to
increase the number of batches.

You might try watching calls to ExecHashIncreaseNumBatches
and see if that theory holds water.

This could only happen with a very unfriendly distribution of the
hash keys, I think.  There's a heuristic in there to shut off
growth of nbatch if we observe that we're making no progress at
all, but perhaps this is a skewed distribution that's not quite
skewed enough to trigger that.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Gunther
It is confirmed: these two call paths are the only ones. At least
probably the only ones that occur with any frequency.


  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
28576 postgres  20   0 2695304   1.0g 200764 R  11.3 13.8   4:20.13 postgres: 
postgres integrator [local] EXPLAIN
28580 postgres  20   0  646616 432784  36968 S  98.7  5.5   8:53.28 gdb -p 28576

there is a problem with gdb: it also has a memory leak and is very
expensive with the checking of my conditional breakpoint. So I can't run 
it all the way through.


Also here captured with

(gdb) call MemoryContextStats(TopPortalContext)

TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
  PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
  PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 
used:
ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 
1369088328 used
  HashTableContext: 32768 total in 3 blocks; 17304 free (10 chunks); 15464 
used
HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 
used
  HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 
used
  TupleSort main: 32824 total in 2 blocks; 144 free (0 chunks); 32680 used
Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 
8389408 used
HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 
used
  TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 
used
    Caller tuples: 4194304 total in 10 blocks; 1496136 free (20 chunks); 2698168 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ...
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 1384601904 bytes in 163660 blocks; 2303840 free (145 chunks); 
1382298064 used

there is the developing memory leak. Now let's see if we can trace 
individual increments ...


(gdb) info break
Num Type   Disp Enb AddressWhat
1   breakpoint keep y   0x00849030 in AllocSetAlloc at 
aset.c:718
stop only if  (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp 
!= 0x84e7dd && 0x84e8ad != *(int *)$rsp
breakpoint already hit 4 times
(gdb) delete 1
(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 
&& *(int *)$rsp != 0x84e7dd
Breakpoint 2 at 0x849030: file aset.c, line 718.
(gdb) cont
Continuing.
^CError in testing breakpoint condition:
Quit

Breakpoint 2, AllocSetAlloc (context=0x2a1d190, size=381) at aset.c:718
718 {
(gdb) bt 4
#0  AllocSetAlloc (context=0x2a1d190, size=381) at aset.c:718
#1  0x0084e7dd in palloc (size=381) at mcxt.c:938
#2  0x006101bc in ExecHashJoinGetSavedTuple (file=file@entry=0x4b4a198, 
hashvalue=hashvalue@entry=0x7ffcbf92fe5c,
tupleSlot=0x2ae0ab8, hjstate=0x2a1d920) at nodeHashjoin.c:1277
#3  0x00610ca3 in ExecHashJoinNewBatch (hjstate=0x2a1d920) at 
nodeHashjoin.c:1042
(More stack frames follow...)
(gdb)  call MemoryContextStats(TopPortalContext)

doesn't show an increase of ExecutorState total:

TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
  PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
  PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 
used:
ExecutorState: 1369337168 total in 

Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther
I saw your replies. If there were a way of using gdb commands to have a
conditional breakpoint that only fires if the n-th caller in the
chain is not a certain source location, then one could exclude the bulk
of these allocations and focus better.


But I decided I try to re-factor this query. And I made an interesting 
observation.


There is a left outer join in parentheses

... LEFT OUTER JOIN (SELECT ...) q ...

the biggest parenthesized subquery. I turned it into a temporary
table, tmp_bulk, and then changed the main query to


... LEFT OUTER JOIN tmp_bulk q ...
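Spelled out, the refactoring is just the usual temp-table pattern (a
sketch; the column lists are elided here):

CREATE TEMPORARY TABLE tmp_bulk AS
  SELECT ... ;       -- the body of the big parenthesized subquery
ANALYZE tmp_bulk;    -- give the planner the real row count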

now I am running it again. But what I noticed is that the tmp_bulk table
is tiny! It only has about 250 rows. So this means the vast majority of
the left-side rows in that join are unmatched. The plan is all
different now. Heavy CPU% load. Must be merge sorting? No memory growth,
not yet.


  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
 5394 postgres  20   0 1284448 287880 271528 R  99.3  3.6   9:21.83 postgres: 
postgres integrator [local] EXPLAIN
 5425 postgres  20   0 1278556  93184  82296 S  27.6  1.2   0:38.72 postgres: 
parallel worker for PID 5394

No, I never trust it when a database job has high CPU% and low IO for a
long time. So I do


SET ENABLE_MERGEJOIN TO OFF;

and then do it again.  Now I have high IO and low CPU%.

  PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM TIME+ COMMAND
 5394 postgres  20   0 1280904 282036 273616 D   2.3  3.6  13:01.51 postgres: 
postgres integrator [local] EXPLAIN
 5510 postgres  20   0 1278892  87524  80804 D   2.3  1.1   0:05.20 postgres: 
parallel worker for PID 5394
 5509 postgres  20   0 1278860  87484  80796 D   2.3  1.1   0:05.30 postgres: 
parallel worker for PID 5394

Still, I slip into the high CPU% situation; I guess I'll have to wait it
out ...


... and still waiting. No growth whatsoever. The plan is now so totally 
different that it probably won't trigger the problem.


The original plan that caused the leak involved right joins; this one
involves only left joins. Even after ANALYZE tmp_bulk it still comes up
with the same plan. And that plan isn't quick to finish, but it also
doesn't trigger the memory leak.


So what I can tell is this: that growth to 1.5 GB is consistently 
happening. It isn't just happening in the beginning and then the rest is 
just a follow-up problem. Also there seems to be a final spike in growth 
from 1.5 GB to 2.2 GB that happens inside a second. That seems very 
strange.


Back to the debugger and do a better job of conditional breakpoints ... 
I already have an idea how I'll do that. I set a flag when I enter the



Anyway, the upshot is that you need to investigate what's happening
while the memory consumption is increasing.  The behavior before
that starts to happen isn't going to be very interesting.  It might
be a bit tricky to catch that if it only takes a few seconds to blow
up, but you could try "c 1000" or so to step through a lot of
AllocSetAlloc calls, repeating till the bad stuff starts to happen,
and then going back to looking at just where the calls are coming
from.

Isn't 1.5 GB already way too big? There are 3 phases really.

1. steady state at less than 500 MB
2. slow but massive growth, from about 1 GB up to 1.5 - 1.8 GB
3. explosion within 1 second, from whatever size the slow massive
   growth reached, to the final 2.2 GB

I thought that slow massive growth is already a sign of a leak?

I will now filter the calls that come through ExecHashJoinGetSavedTuple

I figured I can do this:

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fdd0:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e7dd
 called by frame at 0x7ffcbf92fdf0
 source language c.
 Arglist at 0x7ffcbf92fdc0, args: context=0x29a6450, size=371
 Locals at 0x7ffcbf92fdc0, Previous frame's sp is 0x7ffcbf92fdd0
 Saved registers:
  rip at 0x7ffcbf92fdc8

So if the saved $rip is 0x84e7dd then we are coming this way. Therefore
I set my new breakpoint like this:


(gdb) b AllocSetAlloc if  (int)strcmp(context->name, "ExecutorState") == 0 && 
*(int *)$rsp != 0x84e7dd
Breakpoint 6 at 0x849030: file aset.c, line 718.
(gdb) info b
Num Type   Disp Enb Address    What
6   breakpoint keep y   0x00849030 in AllocSetAlloc at 
aset.c:718
    stop only if  (int)strcmp(context->name, "ExecutorState") == 0 && *(int 
*)$rsp != 0x84e7dd

And there we go:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718 {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x0084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x00702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at 
buffile.c:119
#3  0x00702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x0061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, 
hashvalue=, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  

Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tom Lane
Gunther  writes:
> So what I am wondering now, is there seems to be an EXPLOSION of memory 
> consumption near the time of the crash. That ExecutorState has 
> 2,238,648,944 but just until the very last second(s) the RES memory as 
> per top was 1.5 GB I swear.

That's not hugely surprising really, especially in a complex query.
It could be performing some preliminary join that doesn't leak, and
then when it starts to perform the join that does have the leak,
kaboom.  Also, given that you seem to be invoking multi-batch joins,
maybe the preliminary phase is fine and there's only a leak when
reading back a batch.

Anyway, the upshot is that you need to investigate what's happening
while the memory consumption is increasing.  The behavior before
that starts to happen isn't going to be very interesting.  It might
be a bit tricky to catch that if it only takes a few seconds to blow
up, but you could try "c 1000" or so to step through a lot of
AllocSetAlloc calls, repeating till the bad stuff starts to happen,
and then going back to looking at just where the calls are coming
from.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tom Lane
Gunther  writes:
> Is there a way of dumping that memory map info during normal runtime, by 
> calling a function with the debugger?

Sure, "call MemoryContextStats(TopMemoryContext)"

(or actually, since you know which context is the problematic one,
just print that one context)

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther

On 4/15/2019 21:49, Gunther wrote:


I'm going to try without that DISTINCT step, or perhaps by dismantling 
this query until it works without this excessive memory growth.


It also failed. Out of memory. The resident memory size of the backend 
was 1.5 GB before it crashed.


TopMemoryContext: 4335600 total in 8 blocks; 41208 free (16 chunks); 4294392 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 524288 total in 7 blocks; 186848 free (7 chunks); 337440 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 1105920 total in 138 blocks; 10368 free (8 chunks); 1095552 used:
      ExecutorState: 2238648944 total in 266772 blocks; 3726944 free (16276 chunks); 2234922000 used
        HashTableContext: 16384 total in 2 blocks; 4032 free (5 chunks); 12352 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used
          HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
          HashBatchContext: 66935744 total in 2037 blocks; 7936 free (0 chunks); 66927808 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 383480 free (0 chunks); 717848 used
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks);

Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Alvaro Herrera
On 2019-Apr-15, Tom Lane wrote:

> It's barely conceivable that in your particular query, there's something
> acting to break that which doesn't manifest typically; but I think it's
> much more likely that you simply haven't found the culprit allocation.
> It's quite feasible that many many ExecHashJoinGetSavedTuple calls would
> go by in between problem allocations.

A possibly useful thing to do is use "commands" in gdb to print out a
stack trace for each allocation that touches the problem memory context
and collect them into some output, then classify the allocations based
on the stack trace on each.  No need to do it manually.
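
A sketch of what that could look like, reusing the ExecutorState
condition from elsewhere in the thread (gdb's logging collects every
backtrace into a file for offline classification):

(gdb) set pagination off
(gdb) set logging file /tmp/alloc-traces.log
(gdb) set logging on
(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0
(gdb) commands
> silent
> bt 8
> cont
> end
(gdb) cont

Afterwards, grouping identical backtraces in the log (e.g. with sort
and uniq -c) shows which call site dominates.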

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther

Wow, we are getting somewhere.

Tom (BTW, your mail server rejects my direct mail, but I'm glad you got 
it through the list), you say:



I'm pretty sure that's not the droid we're looking for.
ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
sticks it into a TupleTableSlot that will be responsible for freeing
it (when the next tuple is stuck into the same slot).  I'd suggest
continuing a few times and looking for other code paths leading
to AllocSetAlloc in this context.


I did continue a "few times", but few as in a dozen, and it's always the same:

(gdb) bt 6
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x0084e6cd in palloc (size=375) at mcxt.c:938
#2  0x0061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, 
hashvalue=hashvalue@entry=0x7fff2e4ca76c,
tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x00610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at 
nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
(More stack frames follow...)

So I decided to just let it go until it exits the ExecHashJoin function:

#6  0x005fde68 in ExecProcNodeInstr (node=0x11688e0) at 
execProcnode.c:461
461 result = node->ExecProcNodeReal(node);
(gdb) list
456 {
457 TupleTableSlot *result;
458
459 InstrStartNode(node->instrument);
460
461 result = node->ExecProcNodeReal(node);
462
463 InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
464
465 return result;
(gdb) break 463
Breakpoint 3 at 0x5fde68: file execProcnode.c, line 463.
(gdb) disable 2
(gdb) cont
Continuing.

Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463
463 InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

oops, that was fast, so up further ...

(gdb) cont
Continuing.

Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109
109 if (TupIsNull(slot))
(gdb) cont
Continuing.

Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463
463 InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
(gdb) cont
Continuing.

Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109
109 if (TupIsNull(slot))
(gdb) up
#1  0x005fde68 in ExecProcNodeInstr (node=0x11687d0) at 
execProcnode.c:461
461 result = node->ExecProcNodeReal(node);
(gdb) up
#2  0x0061d2c4 in ExecProcNode (node=0x11687d0) at 
../../../src/include/executor/executor.h:247
247 return node->ExecProcNode(node);
(gdb) up
#3  ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
73  slot = ExecProcNode(outerPlan);
(gdb) list
68  for (;;)
69  {
70  /*
71   * fetch a tuple from the outer subplan
72   */
73  slot = ExecProcNode(outerPlan);
74  if (TupIsNull(slot))
75  {
76  /* end of subplan, so we're done */
77  ExecClearTuple(resultTupleSlot);

... but whatever I do, ultimately I get to that allocation routine 
through the same path.


Since that is the bulk of the activity, and memory was still growing 
while we come through this path, I assume that this is it.



My first thought on noticing the SELECT DISTINCT was that you might be
hitting the grouping-function-related leak that Andres fixed in 9cf37a527;
but that fix did make it into 11.2 (by just a couple of days...).  Still,
maybe there's another issue in the same area.


I don't know about that one, I only know that I am running 11.2 freshly 
compiled.


The change suggested by Alvaro Herrera wasn't applicable.

Jeff Janes had more:


Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715 {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"


I think that the above one might have been the one you wanted.
Not sure how you could tell that? It's the same place as everything
else. If we can find out what you're looking for, maybe we can set a
breakpoint earlier up the call chain?


I guess I should run this for a little longer. So I disable my
breakpoints

it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped
growing fast, so now back to gdb and break:

Unfortunately, I think this means you missed your opportunity and are 
now getting backtraces of the innocent bystanders.
But why? If I see the memory still going up insanely fast, isn't that a
sign of the leak?
Particularly since you report that the version using nested loops 
rather than hash joins also leaked, so it is probably not the 
hash-join specific code that is doing it.
Could it be in the DISTINCT? I noticed, while peeking up the call
chain,

Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Mon, Apr 15, 2019 at 12:34 PM Gunther  wrote:

> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
> 715 {
> (gdb) p context->name
> $8 = 0x96ce5b "ExecutorState"
>
>
I think that the above one might have been the one you wanted.


> I guess I should run this for a little longer. So I disable my breakpoints
>
>
it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing
> fast, so now back to gdb and break:
>
Unfortunately, I think this means you missed your opportunity and are now
getting backtraces of the innocent bystanders.

Particularly since you report that the version using nested loops rather
than hash joins also leaked, so it is probably not the hash-join specific
code that is doing it.

What I've done before is compile with the comments removed from
src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */

and then look for allocations sizes which are getting allocated but not
freed, and then you can go back to gdb to look for allocations of those
specific sizes.  This generates a massive amount of output, and it bypasses
the logging configuration and goes directly to stderr--so it might not end
up where you expect.
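
For that last step, the suspicious size can simply go into the
breakpoint condition; for instance, with the 8272-byte request size
from the error message earlier in the thread (purely illustrative,
use whatever size the ALLOCINFO output implicates):

(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && size == 8272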


Thanks for the view definition.  Nothing in it stood out to me as risky.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tom Lane
Alvaro Herrera  writes:
> Seems that ExecHashJoinGetSavedTuple stores a minimalTuple and sets the
> shouldFree flag to "true", and then in ExecHashJoinNewBatch, callee
> ExecFetchSlotMinimalTuple sets shouldFree to false unconditionally when
> the slot uses minimal tuple ops.  Maybe that's correct, but it does
> sound like a memory leak is not entirely impossible.  I wonder if this
> fixes it, without causing crashes elsewhere.

This discussion is about v11, not HEAD.  Still, I agree that that
coding in HEAD seems a bit fishy.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Alvaro Herrera
On 2019-Apr-15, Gunther wrote:

> #0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
> #1  0x0084e6cd in palloc (size=385) at mcxt.c:938
> #2  0x0061019c in ExecHashJoinGetSavedTuple 
> (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,
> tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
> #3  0x00610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at 
> nodeHashjoin.c:1042

Seems that ExecHashJoinGetSavedTuple stores a minimalTuple and sets the
shouldFree flag to "true", and then in ExecHashJoinNewBatch, callee
ExecFetchSlotMinimalTuple sets shouldFree to false unconditionally when
the slot uses minimal tuple ops.  Maybe that's correct, but it does
sound like a memory leak is not entirely impossible.  I wonder if this
fixes it, without causing crashes elsewhere.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 546db02cad0..d47a1b48eec 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1648,7 +1648,7 @@ ExecFetchSlotMinimalTuple(TupleTableSlot *slot,
 	if (slot->tts_ops->get_minimal_tuple)
 	{
 		if (shouldFree)
-			*shouldFree = false;
+			*shouldFree = TTS_SHOULDFREE(slot);
 		return slot->tts_ops->get_minimal_tuple(slot);
 	}
 	else


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tom Lane
Gunther  writes:
> Now I give you a bt so we have something to look at:

> #0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
> #1  0x0084e6cd in palloc (size=385) at mcxt.c:938
> #2  0x0061019c in ExecHashJoinGetSavedTuple 
> (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,
>  tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277

I'm pretty sure that's not the droid we're looking for.
ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
sticks it into a TupleTableSlot that will be responsible for freeing
it (when the next tuple is stuck into the same slot).  I'd suggest
continuing a few times and looking for other code paths leading
to AllocSetAlloc in this context.

My first thought on noticing the SELECT DISTINCT was that you might be
hitting the grouping-function-related leak that Andres fixed in 9cf37a527;
but that fix did make it into 11.2 (by just a couple of days...).  Still,
maybe there's another issue in the same area.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Gunther

OK Guys, you are very kind to continue taking an interest in this matter.

I will try what I can to help squish the bug.

Tomas Vondra just added a good idea that explains why I get the out of
memory error while still having so much cache available:


# sysctl vm.overcommit_memory
vm.overcommit_memory = 2
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

as he predicted.

# cat /proc/meminfo |grep Commit
CommitLimit: 3955192 kB
Committed_AS:    2937352 kB

So I think that explains why it turns into an out of memory error. We
don't worry or wonder about that any more. I will change that parameter 
in the future to allow for some spikes. But it's not going to resolve 
the underlying memory leak issue.
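
(For example, by raising the ratio, or by going back to the kernel's
heuristic mode; the value here is only illustrative:)

# sysctl -w vm.overcommit_ratio=90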


Now I run explain analyze SELECT ... without the INSERT.

integrator=#  \set VERBOSITY verbose
integrator=#
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# \set VERBOSITY verbose
integrator=#
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# explain analyze SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

And since that failed already, I guess we don't need to worry about the 
temporary table insert.


About adding LIMIT, I don't think it makes sense in the outer query, 
since the error is probably happening earlier. I did put a LIMIT 100 on 
one of the tables we join to, and it helped. But that doesn't really 
tell us anything I think.


Then yes, I can try the backtrace with the NLs enabled. It will just
take a long, long time, and unfortunately it is extremely likely that
I'll lose the console and then be unable to get back to it. OK,
screen(1) resolves that problem too. Will do, after I've reported the above.
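
For example:

$ screen -S dbg     # start a named session and run psql/gdb inside it
$ screen -r dbg     # reattach after the console drops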


But now you have already produced more ideas ...

  Maybe it is memory for trigger or constraint checking, although I 
don't
  know why that would appear instantly.  What triggers or constraints 
do you
  have on businessoperation? 


Yeah, that would be my guess too. If I had to guess, something likely 
gets

confused and allocates memory in es_query_ctx instead of the per-tuple
context (es_per_tuple_exprcontext).

Triggers, constraints and expr evaluation all seem like a plausible
candidates. It's going to be hard to nail the exact place, though 


I think triggers and constraints are ruled out, because the problem
happens without the INSERT.


That leaves us with expression evaluation. And OK, now you really wanna 
see the query, although it should be in the plan too. But for what it is 
worth:


SELECT DISTINCT
documentInternalId, is_current,
documentId,
documentTypeCode,
subjectRoleInternalId,
subjectEntityInternalId,
subjectEntityId,
subjectEntityIdRoot,
subjectEntityName,
subjectEntityTel,
subjectEntityEmail,
otherEntityInternalId,
confidentialityCode,
actInternalId,
code_code as operationCode,
code_displayName AS operationName,
operationQualifierCode,
operationQualifierName,
approvalNumber,
approvalNumberSystem,
approvalStateCode,
approvalStateCodeSystem,
approvalEffectiveTimeLow,
approvalEffectiveTimeHigh,
approvalStatusCode,
licenseCode,
agencyId,
agencyName,
productItemCode,
productInternalId
  FROM reports.DocumentInformationSubject
  LEFT OUTER JOIN (SELECT documentInternalId, documentId, actInternalId,
  subjectEntityCode as productItemCode,
  subjectEntityInternalId as productInternalId
 FROM reports.DocumentInformationSubject
WHERE participationTypeCode = 'PRD') prd
USING(documentInternalId, documentId, actInternalId)
  LEFT OUTER JOIN (
SELECT documentInternalId,
   q.code_code AS operationQualifierCode,
   q.code_displayName AS operationQualifierName,
   r.targetInternalId AS actInternalId,
   actInternalId AS approvalInternalId,
   an.extension AS approvalNumber,
   an.root AS approvalNumberSystem,
   qs.subjectEntityCode AS approvalStateCode,
   qs.subjectEntityCodeSystem AS approvalStateCodeSystem,
   qs.effectivetime_low AS approvalEffectiveTimeLow,
   qs.effectivetime_high AS approvalEffectiveTimeHigh,
   qs.statusCode AS approvalStatusCode,
   qs.code_code AS licenseCode,
   agencyId.extension AS agencyId,
   agencyName.trivialName AS agencyName
  FROM reports.DocumentInformation q
  LEFT OUTER JOIN (SELECT * FROM reports.DocumentInformationSubject WHERE 
participationTypeCode = 'AUT') qs
USING(documentInternalId, actInternalId)
  INNER JOIN integrator.ActRelationship r
ON(

Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Mon, Apr 15, 2019 at 11:28 AM Tom Lane  wrote:

> Jeff Janes  writes:
> > To get it to happen faster, maybe you could run the server with a small
> > setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
> > Unfortunately I don't know how to set a breakpoint for allocations into a
> > specific context, and setting a breakpoint for any memory allocation is
> > probably going to fire too often to be useful.
>
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
>

> 1. Let query run long enough for memory usage to start increasing,
> then attach to backend with gdb.
>
> 2. Set breakpoint at, probably, AllocSetAlloc.  (In some cases,
> reallocs could be the problem, but I doubt it here.)  Then "c".
>
> 3. When it stops, "p *context" and see if this is the context
> you're looking for.  In this case, since we want to know about
> allocations into ExecutorState and we know there's only one
> active one, you just have to look at the context name.  In general
> you might have to look at the backtrace.  Anyway, if it isn't the
> one you want, just "c" until you get to an allocation into the
> one you do want.
>
> 4. Once you have found out the address of the context you care
> about, make the breakpoint conditional on the context argument
> being that one.  It might look like this:
>
> Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
> 715 {
> (gdb) p *context
> $1 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
> false,
>   methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild =
> 0x0,
>   nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0,
>   reset_cbs = 0x0}
> (gdb) cond 1  context == 0x1483be0
>
> 5. Now repeatedly "c", and check the stack trace each time, for a
> dozen or two times to get a feeling for where the allocations are
> being requested.
>
> In some cases you might be able to find the context address in a
> more efficient way than what I suggest in #3 --- for instance,
> you could instead set a breakpoint where the context is created
> and snag its address immediately, or you could dig around in
> backend data structures to find it.  But these ways generally
> require more familiarity with the code than just watching the
> requests go by.
>


Thanks for the recipe.  I can use gdb at all, just not very skillfully :)

With that as a starting point, experimentally, this seems to work to short
circuit the loop described in your step 3 (which I fear could be thousands
of iterations in some situations):

cond 1 strcmp(context.name,"ExecutorState")==0

Also, I've found that in the last few versions of PostgreSQL, processes
might get unreasonable numbers of SIGUSR1 (maybe due to parallelization?)
and so to avoid having to stand on the 'c' button, you might need this:

handle SIGUSR1 noprint nostop

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tom Lane
I wrote:
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
> 1. Let query run long enough for memory usage to start increasing,
> then attach to backend with gdb.

BTW, just to clarify that strategy a bit: the usage pattern we expect
for ExecutorState is that there are a bunch of individual allocations
during executor startup, but then none while the query is running
(or at least, if there are any in that phase, they get freed before
moving on to the next row).  The form of the leak, almost certainly,
is that some allocation is happening per-row and *not* getting freed.
So there's no point in groveling through the startup behavior.  What
we want to know about is what happens after we reach the ought-to-be
steady state behavior.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tomas Vondra

On Sun, Apr 14, 2019 at 05:19:50PM -0400, Jeff Janes wrote:

  On Sun, Apr 14, 2019 at 4:51 PM Gunther  wrote:

For weeks now, I am banging my head at an "out of memory" situation.
There is only one query I am running on an 8 GB system, whatever I try,
I get knocked out on this out of memory.

  Is PostgreSQL throwing an error with OOM, or is getting killed -9 by the
  OOM killer?  Do you get a core file you can inspect with gdb?

You might want to see the query, but it is a huge plan, and I can't
really break this down. It shouldn't matter though. But just so you can
get a glimpse here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
   


  Maybe it is memory for trigger or constraint checking, although I don't
  know why that would appear instantly.  What triggers or constraints do you
  have on businessoperation? 


Yeah, that would be my guess too. If I had to guess, something likely gets
confused and allocates memory in es_query_ctx instead of the per-tuple
context (es_per_tuple_exprcontext).

Triggers, constraints and expr evaluation all seem like a plausible
candidates. It's going to be hard to nail the exact place, though :-(


  What if you just run the SELECT without the INSERT?  Or insert into a temp
  table rather than into businessoperation?  And if that doesn't crash, what
  if you then insert to businessoperation from the temp table?
   


Yeah. What's the schema of "businessoperation"? Anything special about
it? Triggers, expression indexes, check constraints, ...

Gunther, you mentioned you build postgres from sources. Would it be
possible to add some sort of extra debugging to see where the memory is
allocated from? It's a bit heavy-handed, though.

Or maybe splitting es_query_ctx into smaller contexts. That might be
easier to evaluate than sifting through god-knows-how-many GBs of log.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Tom Lane
Jeff Janes  writes:
> To get it to happen faster, maybe you could run the server with a small
> setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
> Unfortunately I don't know how to set a breakpoint for allocations into a
> specific context, and setting a breakpoint for any memory allocation is
> probably going to fire too often to be useful.

If you can use gdb at all, it's not that hard to break on allocations
into a specific context; I've done it many times.  The strategy is
basically

1. Let query run long enough for memory usage to start increasing,
then attach to backend with gdb.

2. Set breakpoint at, probably, AllocSetAlloc.  (In some cases,
reallocs could be the problem, but I doubt it here.)  Then "c".

3. When it stops, "p *context" and see if this is the context
you're looking for.  In this case, since we want to know about
allocations into ExecutorState and we know there's only one
active one, you just have to look at the context name.  In general
you might have to look at the backtrace.  Anyway, if it isn't the
one you want, just "c" until you get to an allocation into the
one you do want.

4. Once you have found out the address of the context you care
about, make the breakpoint conditional on the context argument
being that one.  It might look like this:

Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
715 {
(gdb) p *context
$1 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false, 
  methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild = 0x0, 
  nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0, 
  reset_cbs = 0x0}
(gdb) cond 1  context == 0x1483be0

5. Now repeatedly "c", and check the stack trace each time, for a
dozen or two times to get a feeling for where the allocations are
being requested.

In some cases you might be able to find the context address in a
more efficient way than what I suggest in #3 --- for instance,
you could instead set a breakpoint where the context is created
and snag its address immediately, or you could dig around in
backend data structures to find it.  But these ways generally
require more familiarity with the code than just watching the
requests go by.
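
Put together, a session following steps 1-5 might look roughly like
this (all addresses and sizes are illustrative; the pid is the one
from top):

$ gdb -p 28576
(gdb) break AllocSetAlloc
(gdb) c
Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
(gdb) p context->name
$1 = 0x96ce5b "ExecutorState"
(gdb) cond 1 context == 0x1483be0
(gdb) c
(gdb) bt 6
... and so on, a dozen or two times.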

> Are you not showing the view definition for proprietary reasons, or just
> because you don't think it will be useful?

As far as that goes, I think the most likely theory right now is that
some particular function being used in the view is leaking memory.
So yes, we need to see the view ... or you can try removing bits of
it to see if the leak goes away.

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Sun, Apr 14, 2019 at 11:59 PM Gunther  wrote:


> Is there any doubt that this might be a problem with Linux? Because if
> you want, I can whip out a FreeBSD machine, compile pgsql, and attach
> the same disk, and try it there. I am longing to have a reason to move
> back to FreeBSD anyway. But I have tons of stuff to do, so if you do not
> have reason to suspect Linux to do wrong here, I prefer skipping that
> futile attempt
>

I think the PostgreSQL leaking in the first place would be independent of
Linux being ungraceful about it.  So repeating it on BSD probably wouldn't
help us here.  If you want to take up the 2nd issue with the kernel folks,
having some evidence from BSD might (but not very likely) be helpful for
that, but that would be for a different mailing list.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Sun, Apr 14, 2019 at 11:04 PM Gunther  wrote:

> Could you rerun the query with \set VERBOSITY verbose to show the file/line
> that's failing ?
>
> Here goes:
>
> integrator=# \set VERBOSITY verbose
> integrator=# SET ENABLE_NESTLOOP TO OFF;
> SET
> integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM 
> reports.v_BusinessOperation;
> ERROR:  53200: out of memory
> DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
> LOCATION:  MemoryContextAlloc, mcxt.c:798
>
> you notice that I set ENABLE_NESTLOOP to off, that is because the planner
> goes off thinking the NL plan is marginally more efficient, but in fact it
> will take 5 hours to get to the same out of memory crash, while the no NL
> plan gets there in half an hour. That verbose setting didn't help much I
> guess.
>
I think the backtrace of the enable_nestloop=on plan would be more useful.
Here someone has filled up memory, and then we see HashBatchContext trip
over it.  But it isn't the one that caused the problem, so the
backtrace doesn't help.  With the previous plan, it was an allocation into
ExecutorState which tripped over the problem, and it is likely that it is
coming from the same series of allocations that caused the problem.

To get it to happen faster, maybe you could run the server with a small
setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
Unfortunately I don't know how to set a breakpoint for allocations into a
specific context, and setting a breakpoint for any memory allocation is
probably going to fire too often to be useful.

Yes, the verbose option didn't help (but the gdb backtrace made up for
it--kind of--we really need the backtrace of the allocations into
ExecutorState).  It isn't helpful to know that a memory allocation failed
in the mcxt.c code.  Too bad it doesn't report the location of the caller of
that code.  I know in Perl you can use Carp::croak to do that, but I don't
know how to do it in C.
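
(The usual C idiom is a wrapper macro that forwards __FILE__ and
__LINE__ to the allocator; a standalone sketch, not anything that
mcxt.c actually does today:)

#include <stdio.h>
#include <stdlib.h>

/* report the call site of every allocation going through the macro */
#define my_alloc(sz) my_alloc_impl((sz), __FILE__, __LINE__)

static void *
my_alloc_impl(size_t sz, const char *file, int line)
{
    fprintf(stderr, "alloc(%zu) at %s:%d\n", sz, file, line);
    return malloc(sz);
}

int
main(void)
{
    void *p = my_alloc(8272);   /* reports this file and line on stderr */
    free(p);
    return 0;
}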

But really the first thing I want to know now is what if you just do the
select, without the insert?

explain analyze SELECT * FROM reports.v_BusinessOperation

If that works, what about "create temporary table foo as SELECT * FROM
reports.v_BusinessOperation" ?

And if that works, what about "INSERT INTO reports.BusinessOperation SELECT
* FROM foo"?

If the ERROR happens in the first or last of these, it might be much easier
to analyze in that simplified context. If it happens in the middle one,
then we probably haven't achieved much. (And if there is no ERROR at all,
then you have workaround, but we still haven't found the fundamental bug).

Are you not showing the view definition for proprietary reasons, or just
because you don't think it will be useful? If the latter, please send it as
an attachment, I can't guarantee it will be useful, but there is only one
way find out.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Justin Pryzby
On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:
> On 4/14/2019 23:24, Tom Lane wrote:
> >Any chance of extracting a self-contained test case that reproduces this?
> With 18 million rows involved in the base tables, hardly.

Were you able to reproduce the problem with SELECT (without INSERT) ?
How many rows does it output ?  Show explain analyze if possible.  If that
still errors, can you make it work with a small enough LIMIT ?

We haven't seen the view - maybe it's very complicated, but can you reproduce
with a simpler one ?  Fewer joins ?  Or fewer join conditions ?

Justin




Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Gunther

On 4/14/2019 23:24, Tom Lane wrote:

ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 
chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.

2.2 GB is massive, yes.

Your error and stack trace show a failure in HashBatchContext,
which is probably the last of these four:


HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
41360 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 
chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M
when there's a 2.2G problem elsewhere.

Yes.

   I think it's likely that it was
just coincidence that the failure happened right there.  Unfortunately,
that leaves us with no info about where the actual leak is coming from.


Strange, though, that the vmstat tracking never showed the
cache-allocated memory going much below 6 GB. Even if this 2.2 GB memory
leak is there, and even if I had 2 GB of shared_buffers, I would still
have enough for the OS to give me.


Is there any doubt that this might be a problem with Linux? Because if 
you want, I can whip out a FreeBSD machine, compile pgsql, and attach 
the same disk, and try it there. I am longing to have a reason to move 
back to FreeBSD anyway. But I have tons of stuff to do, so if you do not 
have reason to suspect Linux to do wrong here, I prefer skipping that 
futile attempt



The memory map shows that there were three sorts and four hashes going
on, so I'm not sure I believe that this corresponds to the query plan
you showed us before.
Like I said, the first explain was not using the same constraints (no 
NL). Now what I sent last should all be consistent. Memory dump and 
explain plan and gdb backtrace.

Any chance of extracting a self-contained test case that reproduces this?


With 18 million rows involved in the base tables, hardly.

But I am ready to try some other things with the debugger that you want 
me to try. If we have a memory leak issue, we might just as well try to  
plug it!


I could even to give someone of you access to the system that runs this.

thanks,
-Gunther




Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Tom Lane
Gunther  writes:
>ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 
> chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.

Your error and stack trace show a failure in HashBatchContext,
which is probably the last of these four:

>HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
> 41360 used
>HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
> 74216 used
>HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
> 74216 used
>HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 
> chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M
when there's a 2.2G problem elsewhere.  I think it's likely that it was
just coincidence that the failure happened right there.  Unfortunately,
that leaves us with no info about where the actual leak is coming from.

The memory map shows that there were three sorts and four hashes going
on, so I'm not sure I believe that this corresponds to the query plan
you showed us before.

Any chance of extracting a self-contained test case that reproduces this?

regards, tom lane




Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Gunther
Thanks Justin Pryzby too, and Jeff Janes, responding to both of you for 
efficiency. Answers and more logs and the gdb backtrace below.

The version is 10.2 latest.

v10.7 is available; could you upgrade ?

Sorry I meant 11.2 actually latest.

What are these set to ?  shared_buffers? work_mem?


shared_buffers=2G (of 8 total), then 1G, didn't help.

work_mem=4M by now (I had once succeeded in avoiding out of memory by
reducing work_mem from 64M to 8M). But as Tom Lane says, it shouldn't
be using more than 5 x work_mem in this query plan.


Jeff Janes said:

 I don't know why an 8GB system with a lot of cache that could be 
evicted would get an OOM when something using 1.5GB asks for 8272 
bytes more.  But that is a question of how the kernel works, rather 
than how PostgreSQL works.  But I also think the log you quote above 
belongs to a different event than the vmstat trace in your first email.
and I agree, except that the vmstat log and the error really belong 
together, same timestamp. Nothing else running on that machine this 
Sunday. Yes I ran this several times with different parameters, so some 
mixup is possible, but always ending in the same crash anyway. So here 
again, without the vmstat log, which really wouldn't be any different 
than I showed you. (See below for the ENABLE_NESTLOOP=off setting, not 
having those settings same between explain and actual execution might 
account for the discrepancy that you saw.)


integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# \set VERBOSITY verbose
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM 
reports.v_BusinessOperation;
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM 
reports.v_BusinessOperation;
QUERY PLAN
Insert on businessoperation  (cost=5850091.58..5853120.74 rows=34619 width=1197)
  ->  Unique  (cost=5850091.58..5852774.55 rows=34619 width=1197)
->  Sort  (cost=5850091.58..5850178.13 rows=34619 width=1197)
  Sort Key: documentinformationsubject.documentinternalid, 
documentinformationsubject.is_current, documentinformationsubject.documentid, 
documentinformationsubject.documenttypecode, 
documentinformationsubject.subjectroleinternalid, 
documentinformationsubject.subjectentityinternalid, 
documentinformationsubject.subjectentityid, 
documentinformationsubject.subjectentityidroot, 
documentinformationsubject.subjectentityname, 
documentinformationsubject.subjectentitytel, 
documentinformationsubject.subjectentityemail, 
documentinformationsubject.otherentityinternalid, 
documentinformationsubject.confidentialitycode, 
documentinformationsubject.actinternalid, documentinformationsubject.code_code, 
documentinformationsubject.code_displayname, q.code_code, q.code_displayname, 
an.extension, an.root, documentinformationsubject_2.subjectentitycode, 
documentinformationsubject_2.subjectentitycodesystem, 
documentinformationsubject_2.effectivetime_low, 
documentinformationsubject_2.effectivetime_high, 
documentinformationsubject_2.statuscode, 
documentinformationsubject_2.code_code, agencyid.extension, 
agencyname.trivialname, documentinformationsubject_1.subjectentitycode, 
documentinformationsubject_1.subjectentityinternalid
  ->  Hash Right Join  (cost=4489522.06..5829375.93 rows=34619 
width=1197)
Hash Cond: (((q.documentinternalid)::text = 
(documentinformationsubject.documentinternalid)::text) AND 
((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
->  Hash Right Join  (cost=1473632.24..2808301.92 rows=13 
width=341)
  Hash Cond: 
(((documentinformationsubject_2.documentinternalid)::text = 
(q.documentinternalid)::text) AND 
((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
  ->  Hash Left Join  (cost=38864.03..1373533.69 rows=1 
width=219)
Hash Cond: 
((documentinformationsubject_2.otherentityinternalid)::text = 
(agencyname.entityinternalid)::text)
->  Hash Left Join  (cost=2503.10..1332874.75 
rows=1 width=229)
  Hash Cond: 
((documentinformationsubject_2.otherentityinternalid)::text = 
(agencyid.entityinternalid)::text)
  ->  Seq Scan on 
documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 
rows=1 width=177)
Filter: 
((participationtypecode)::text = 'AUT'::text)
  ->  Hash  (cost=1574.82..1574.82 
rows=34182 width=89)
->  Seq Scan on entity_id agencyid  
(cost=0.00..1574.82 rows=34182 width=89)
->  Hash  (cost=27066.08..27066.08 rows=399908 
width=64)
  

Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Jeff Janes
On Sun, Apr 14, 2019 at 9:06 PM Gunther  wrote:

> Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not
> having given enough detail.
>
> The version is 10.2 latest. The database was originally built with 10.1
> and then just started with 10.2.
>
Do you mean 11.2?  The latest in the 10 series is 10.7.  If you do mean
10.2, there a fix for a memory leak bug since then that might plausibly be
relevant (bdc7f686d1b8f423cb)

>
> I said "crash" and that is wrong. Not a signal nor core dump. It is the
> ERROR:  out of memory. Only the query crashes. Although I don't know if may
> be the backend server might have left a core dump?
>
I don't think there would be a core dump on only an ERROR, and probably not
worthwhile to trick it into generating one.


> The short version is:
>
> Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 
> 1425134928 used
> 2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
> 2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 
> in memory context "ExecutorState".
>
>  I don't know why an 8GB system with a lot of cache that could be evicted
would get an OOM when something using 1.5GB asks for 8272 bytes more.  But
that is a question of how the kernel works, rather than how PostgreSQL
works.  But I also think the log you quote above belongs to a different
event than the vmstat trace in your first email.


>   ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 
> chunks); 1408127768 used
> HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 
> used
>   HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
> 41360 used
>
>
This does not seem to match your query plan.  Why would a plan with no Hash
Joins have a HashBatchContext?  I think this log must be from a different
query than the one that generated the plan you posted.  Perhaps one was
before you lowered work_mem and one was after?

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Peter
On Sun, Apr 14, 2019 at 05:19:11PM -0400, Tom Lane wrote:
! Gunther  writes:
! > For weeks now, I am banging my head at an "out of memory" situation. 
! > There is only one query I am running on an 8 GB system, whatever I try, 
! > I get knocked out on this out of memory. It is extremely impenetrable to 
! > understand and fix this error. I guess I could add a swap file, and then 
! > I would have to take the penalty of swapping. But how can I actually 
! > address an out of memory condition if the system doesn't tell me where 
! > it is happening?

Well, exactly: with a swap space. No offense intended, but if You
don't have a swap space, You should not complain about unintelligible
out-of-memory situations.
Swap space is not usually used to run applications from (that would
indeed give horrible performance); it is used to avoid out-of-memory
errors. With a swap space, the out-of-memory situation will persist,
and so one has time to take measurements and analyze system
behaviour, and from that one can better understand what is causing
the problem and decide what actions should be taken, on an informed
basis (e.g. correct flaws in the system tuning, fix a bad query, buy
more memory, or whatever may be applicable).
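
On Linux, a swap file takes only a minute to set up; a sketch, with an
illustrative 4 GB size:

# dd if=/dev/zero of=/swapfile bs=1M count=4096
# chmod 600 /swapfile
# mkswap /swapfile
# swapon /swapfile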

If I remember correctly, I did even see a DTRACE flag in my build,
so what more is to wish? :))

P.




Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Gunther
Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not 
having given enough detail.


The version is 10.2 latest. The database was originally built with 10.1 
and then just started with 10.2. No dump and reload or pg_upgrade. 
Underlying system is 64bit Amazon Linux (CentOS like) running on an 
AMD64 VM (m5a) right now.


I said "crash" and that is wrong. Not a signal nor core dump. It is the 
ERROR:  out of memory. Only the query crashes. Although I don't know if 
may be the backend server might have left a core dump? Where would that 
be? Would it help anyone if I started the server with the -c option to 
get a core dump? I guess I could re-compile with gcc -g debugging 
symbols all on and then run with that -c option, and then use gdb to 
find out which line it was failing at and then inspect the query plan 
data structure? Would that be helpful? Does anyone want the coredump to 
inspect?


The short version is:

Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 
1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory 
context "ExecutorState".

Here is the out of memory error dump in its full glory.

TopMemoryContext: 2197400 total in 7 blocks; 42952 free (15 chunks); 2154448 
used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 
used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 
chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (2 chunks); 472 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 2097152 total in 9 blocks; 396480 free (10 chunks); 1700672 
used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 
used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 
used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:
  ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 
chunks); 1408127768 used
HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
  HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
41360 used
HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
  HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
  HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 
74216 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 
40120 used
TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 
40120 used
HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 
8389408 used
  HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 
98704 used
TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 
326632 used
  Caller tuples: 4194304 total in 10 blocks; 1434888 free (20 chunks); 
2759416 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used

Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Jeff Janes
On Sun, Apr 14, 2019 at 4:51 PM Gunther  wrote:

> For weeks now, I am banging my head at an "out of memory" situation. There
> is only one query I am running on an 8 GB system, whatever I try, I get
> knocked out on this out of memory.
>
Is PostgreSQL throwing an error with OOM, or is getting killed -9 by the
OOM killer?  Do you get a core file you can inspect with gdb?

You might want to see the query, but it is a huge plan, and I can't really
> break this down. It shouldn't matter though. But just so you can get a
> glimpse here is the plan:
>
> Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 
> width=1197)
>   ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
>
>
>
Maybe it is memory for trigger or constraint checking, although I don't
know why that would appear instantly.  What triggers or constraints do you
have on businessoperation?

What if you just run the SELECT without the INSERT?  Or insert into a temp
table rather than into businessoperation?  And if that doesn't crash, what
if you then insert to businessoperation from the temp table?

Also, what version?

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Tom Lane
Gunther  writes:
> For weeks now, I am banging my head at an "out of memory" situation. 
> There is only one query I am running on an 8 GB system, whatever I try, 
> I get knocked out on this out of memory. It is extremely impenetrable to 
> understand and fix this error. I guess I could add a swap file, and then 
> I would have to take the penalty of swapping. But how can I actually 
> address an out of memory condition if the system doesn't tell me where 
> it is happening?

> You might want to see the query, but it is a huge plan, and I can't 
> really break this down. It shouldn't matter though. But just so you can 
> get a glimpse here is the plan:

Is that the whole plan?  With just three sorts and two materializes,
it really shouldn't use more than more-or-less 5X work_mem.  What do
you have work_mem set to, anyway?  Is this a 64-bit build of PG?

Also, are the estimated rowcounts shown here anywhere close to what
you expect in reality?  If there are any AFTER INSERT triggers on the
insertion target table, you'd also be accumulating per-row trigger
queue entries ... but if there's only circa 35K rows to be inserted,
it's hard to credit that eating more than a couple hundred KB, either.

> Might this be a bug?

It's conceivable that you've hit some memory-leakage bug, but if so you
haven't provided nearly enough info for anyone else to reproduce it.
You haven't even shown us the actual error message :-(

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane