Re: [GENERAL] Out of memory/corrupted shared memory problem on server

2017-08-29 Thread Tom Lane
Johann Spies  writes:
> On 25 August 2017 at 13:48, Tom Lane  wrote:
>> Remember that "work_mem" is "work memory per plan node", so a complex
>> query could easily chew up a multiple of that number --- and that's
>> with everything going according to plan.  If, say, the planner
>> underestimates the number of table entries involved in a hash
>> aggregation, the actual consumption might be much larger.

> The main source of this query (doing a lot of calculations) is another
> Materialized View with more than 700 million records. I then analyzed
> that MV and this morning the good news was:

> # REFRESH MATERIALIZED VIEW wos_2017_1.citation_window_mv ;
> REFRESH MATERIALIZED VIEW
> Time: 27128469.899 ms

OK, so almost certainly the crash was caused by a hash aggregate
using so much memory that it triggered the OOM killer.  Whether
a hash aggregate's hashtable will stay within work_mem is dependent
on whether the planner correctly predicts the number of entries needed.
Analyzing the input MV must have improved that estimate and led the
planner to choose some other plan.
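
A hedged way to double-check that theory after the fact (the view's
defining query isn't shown in the thread, so this pulls it from the
catalog first):

SELECT definition
  FROM pg_matviews
 WHERE schemaname = 'wos_2017_1'
   AND matviewname = 'citation_window_mv';
-- run EXPLAIN on that definition and compare the HashAggregate's
-- "rows=" estimate with the known ~700-million-row input; a gross
-- underestimate is what lets the hashtable blow past work_mem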

regards, tom lane




Re: [GENERAL] Out of memory/corrupted shared memory problem on server

2017-08-29 Thread Johann Spies
On 25 August 2017 at 13:48, Tom Lane  wrote:

> How complex is "complex"?  I can think of two likely scenarios:
> 1. You've stumbled across some kind of memory-leak bug in Postgres.
> 2. The query's just using too much memory.  In this connection, it's
> not good that you've got
>> work_mem = 2GB
> Remember that "work_mem" is "work memory per plan node", so a complex
> query could easily chew up a multiple of that number --- and that's
> with everything going according to plan.  If, say, the planner
> underestimates the number of table entries involved in a hash
> aggregation, the actual consumption might be much larger.
>
> My first move would be to reduce work_mem by an order of magnitude
> or two.  If that doesn't help, check the plan for the view's query
> and see if it contains any hash aggregation steps --- if so, does
> "set enable_hashagg = off" help?  (Also, make sure the view's input
> tables have been ANALYZEd recently.)
>
> If none of that helps, we should investigate the memory-leak-bug
> theory.  One thing you could do in that direction is to run
> the postmaster with a "ulimit -v" size less than what will trigger
> the ire of the OOM killer, so that the query encounters a normal
> ENOMEM error rather than SIGKILL when it's eaten too much memory.
> That should result in it dumping a memory consumption map to stderr,
> which would give some clue where the problem is.  We'd need to see
> that map as well as details about your query to make progress.


Thanks Tom and Christoph Moench-Tegeder.

I first tried to refresh it after bringing work_mem down to 1 GB.
It failed again.
The main source of this query (doing a lot of calculations) is another
Materialized View with more than 700 million records. I then analyzed
that MV and this morning the good news was:

# REFRESH MATERIALIZED VIEW wos_2017_1.citation_window_mv ;
REFRESH MATERIALIZED VIEW
Time: 27128469.899 ms
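
The step that made the difference, in sketch form (the input MV's real
name isn't given above, so the one below is hypothetical):

ANALYZE wos_2017_1.citation_sources_mv;   -- gives the planner real row counts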

Regards
Johann
-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)




Re: [GENERAL] Out of memory/corrupted shared memory problem on server

2017-08-25 Thread Tom Lane
Johann Spies  writes:
> While restoring a dump from our development server (768G RAM) to the
> production server (PG 9.6.3 on Debian Stretch with 128G RAM), the
> refresh of a Materialized View fails like this:

> [local] js@wos=# REFRESH MATERIALIZED VIEW wos_2017_1.citation_window_mv ;
> server closed the connection unexpectedly

> In the log:
> 2017-08-24 19:23:26 SAST [7532-18] LOG:  server process (PID 4890) was
> terminated by signal 9: Killed

As Christoph said, this looks a lot like the kernel OOM killer decided
you'd eaten too much memory.

> The Materialized View uses a complex query and should contain
> 69 772 381 records.

How complex is "complex"?  I can think of two likely scenarios:
1. You've stumbled across some kind of memory-leak bug in Postgres.
2. The query's just using too much memory.  In this connection, it's
not good that you've got
> work_mem = 2GB
Remember that "work_mem" is "work memory per plan node", so a complex
query could easily chew up a multiple of that number --- and that's
with everything going according to plan.  If, say, the planner
underestimates the number of table entries involved in a hash
aggregation, the actual consumption might be much larger.
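
To make the multiplier concrete, here is a hedged illustration (table
names hypothetical; the planner may of course choose other node types):

SET work_mem = '2GB';
EXPLAIN
SELECT b.k, count(*)
  FROM big_a a
  JOIN big_b b USING (k)    -- hash join: one work_mem-sized hashtable
 GROUP BY b.k               -- HashAggregate: a second hashtable
 ORDER BY count(*) DESC;    -- Sort: a third work_mem-sized allocation
-- three such nodes at up to 2GB apiece is ~6GB, before any misestimate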

My first move would be to reduce work_mem by an order of magnitude
or two.  If that doesn't help, check the plan for the view's query
and see if it contains any hash aggregation steps --- if so, does
"set enable_hashagg = off" help?  (Also, make sure the view's input
tables have been ANALYZEd recently.)
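
In sketch form, for the session doing the refresh (the name of the
view's input MV below is hypothetical):

SET work_mem = '64MB';        -- an order of magnitude or two below 2GB
SET enable_hashagg = off;     -- test whether hash aggregation is the culprit
ANALYZE wos_2017_1.citation_sources_mv;
REFRESH MATERIALIZED VIEW wos_2017_1.citation_window_mv;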

If none of that helps, we should investigate the memory-leak-bug
theory.  One thing you could do in that direction is to run
the postmaster with a "ulimit -v" size less than what will trigger
the ire of the OOM killer, so that the query encounters a normal
ENOMEM error rather than SIGKILL when it's eaten too much memory.
That should result in it dumping a memory consumption map to stderr,
which would give some clue where the problem is.  We'd need to see
that map as well as details about your query to make progress.
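
In sketch form, assuming the postmaster is started from a shell script
(the limit and data directory are illustrative; ulimit -v takes
kilobytes):

ulimit -v 67108864               # cap per-process virtual memory at 64GB
pg_ctl -D /path/to/data start    # start the postmaster under that limit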

regards, tom lane




Re: [GENERAL] Out of memory/corrupted shared memory problem on server

2017-08-25 Thread Christoph Moench-Tegeder
## Johann Spies (johann.sp...@gmail.com):

> --
> 2017-08-24 19:23:26 SAST [7532-18] LOG:  server process (PID 4890) was
> terminated by signal 9: Killed

That looks like out-of-memory. Check the kernel log/dmesg to verify.

If it's the dreaded OOM killer, you should check your overcommit
settings (sysctl vm.overcommit_*) and fix them so that the kernel
isn't forced to kill processes (that is, reduce overcommit).
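
For example, strict overcommit accounting in /etc/sysctl.conf could
look like this (values illustrative; size them to your RAM and swap):

# allocations fail with ENOMEM instead of the OOM killer firing later
vm.overcommit_memory = 2
# commit limit = swap + 80% of RAM
vm.overcommit_ratio = 80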

Finally, in some cases it has been helpful to reduce work_mem -
that way PostgreSQL may be steered away from memory-intensive
operations (at the cost of processing time and/or disk IO - but
that's still better than having processes killed and getting no
result at all).
You could also check the query plan for the query behind your view
(EXPLAIN) for potentially memory-hungry operations.
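
A hedged sketch of that check (the table below is a hypothetical
stand-in for the view's real query):

EXPLAIN
SELECT source_id, count(*)
  FROM big_input
 GROUP BY source_id;
-- memory-hungry nodes to watch for in the output: HashAggregate,
-- Hash (the inner side of a hash join), Sort, and Materialize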

> max_worker_processes = 24              # (change requires restart)
> max_parallel_workers_per_gather = 4    # taken from max_worker_processes

In case the query updating the materialized view uses parallel
processing, you could save quite a bit of memory by turning that off
(more processes -> more memory usage -> not helpful in your case).
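
E.g., just for the session doing the refresh (a sketch; the setting
can also be changed in postgresql.conf):

SET max_parallel_workers_per_gather = 0;   -- plan without parallel workers
REFRESH MATERIALIZED VIEW wos_2017_1.citation_window_mv;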

> # (for 60GB)
> kernel.shmall = 15728640
> kernel.shmmax = 64424509440

These settings have been obsolete since PostgreSQL 9.3 ("Greatly reduce
System V shared memory requirements", as the Release Notes put it).

Regards,
Christoph

-- 
Spare Space.

