Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-12 Thread Scott Marlowe
On Tue, Oct 10, 2017 at 4:28 PM, pinker wrote: > > Yes, it would be much easier if it would be just single query from the top, > but the most cpu is eaten by the system itself and I'm not sure why. You are experiencing a context switch storm. The OS is spending so much time
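One way to check the context switch storm hypothesis is to sample the kernel's cumulative context-switch counter and turn it into a rate. The sketch below assumes a Linux host where /proc/stat exposes a `ctxt` line; the function names are made up for illustration and are not from the thread.

```python
import time


def parse_ctxt(proc_stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(before: int, after: int, interval_s: float) -> float:
    """Convert two cumulative counter readings into a per-second rate."""
    return (after - before) / interval_s


def sample(interval_s: float = 1.0, path: str = "/proc/stat") -> float:
    """Read the counter twice, interval_s apart, and return switches/s."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return switches_per_second(before, after, interval_s)
```

Calling `sample()` during an incident and again during a quiet period gives two rates to compare. What counts as "too high" depends on the hardware, so the comparison matters more than the absolute number.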

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Tomas Vondra
On 10/11/2017 02:26 AM, pinker wrote: > Tomas Vondra-4 wrote >> I'm probably a bit dumb (after all, it's 1AM over here), but can you >> explain the CPU chart? I'd understand percentages (say, 75% CPU used) >> but what do the seconds / fractions mean? E.g. when the system time >> reaches 5

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Justin Pryzby
On Tue, Oct 10, 2017 at 01:40:07PM -0700, pinker wrote: > Hi to all! > > We've got problem with a very serious repetitive incident on our core > system. Namely, cpu load spikes to 300-400 and the whole db becomes > unresponsive. From db point of view nothing special is happening, memory > looks

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Andres Freund wrote > Others mentioned already that that's worth improving. Yes, we are just setting up pgbouncer Andres Freund wrote > Some versions of this kernel have had serious problems with transparent > hugepages. I'd try turning that off. I think it defaults to off even in > that
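Before turning transparent hugepages off it helps to see what the kernel currently reports. This is a minimal sketch, assuming the mainline sysfs location /sys/kernel/mm/transparent_hugepage/enabled; some distribution kernels use a different path, so treat the default below as an assumption. The file lists the available modes and marks the active one with square brackets.

```python
def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the THP 'enabled' file contents."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed mode found")


def read_thp_mode(path: str = "/sys/kernel/mm/transparent_hugepage/enabled") -> str:
    """Read the sysfs file (path is an assumption) and return the active mode."""
    with open(path) as f:
        return active_thp_mode(f.read())
```

A result of `never` means transparent hugepages are already off, and `always` or `madvise` means they are at least partly in use.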

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Tomas Vondra-4 wrote > I'm probably a bit dumb (after all, it's 1AM over here), but can you > explain the CPU chart? I'd understand percentages (say, 75% CPU used) > but what do the seconds / fractions mean? E.g. when the system time > reaches 5 seconds, what does that mean? hehe, no you've just

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Andres Freund
Hi, On 2017-10-10 13:40:07 -0700, pinker wrote: > and the total number of connections are increasing very fast (but I suppose > it's the symptom not the root cause of cpu load) and exceed max_connections > (1000). Others mentioned already that that's worth improving. > System: > * CentOS Linux

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Tomas Vondra
On 10/11/2017 12:28 AM, pinker wrote: > Tomas Vondra-4 wrote >> What is "CPU load"? Perhaps you mean "load average"? > > Yes, I wasn't exact: I mean system cpu usage, it can be seen here - it's the > graph from yesterday's failure (after 6p.m.): >

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Victor Yegorov wrote > Looks like `sdg` and `sdm` are the ones used most. > Can you describe what's on those devices? Do you have WAL and DB sitting > together? > Where DB log files are stored? it's multipath with the same LUN for PGDATA and pg_log, but separate one for xlogs and archives.

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread John R Pierce
On 10/10/2017 3:28 PM, pinker wrote: It was exactly my first guess. work_mem is set to ~ 350MB and I see a lot of stored procedures with unnecessary WITH clauses (i.e. materialization) and right after it IN query with results of that (hash). 1000 connections all doing queries that need 1
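A rough upper bound on the memory those settings could ask for is simple arithmetic. The sketch below uses the two figures given in the thread (work_mem of about 350MB, max_connections of 1000). The `nodes_per_query` parameter is a hypothetical knob added here because work_mem applies per sort or hash operation, so one query can use it more than once.

```python
def worst_case_work_mem_gib(connections: int, work_mem_mb: float,
                            nodes_per_query: int = 1) -> float:
    """Upper bound, in GiB, if every connection used work_mem in full
    for nodes_per_query sort/hash operations at the same time."""
    return connections * work_mem_mb * nodes_per_query / 1024


# Figures from the thread: 1000 connections, work_mem ~350MB.
print(f"{worst_case_work_mem_gib(1000, 350):.1f} GiB")
```

This is a ceiling, not a prediction: most backends will use far less than work_mem most of the time, but the bound shows how much headroom the configuration leaves.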

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Tomas Vondra-4 wrote > What is "CPU load"? Perhaps you mean "load average"? Yes, I wasn't exact: I mean system cpu usage, it can be seen here - it's the graph from yesterday's failure (after 6p.m.): So as one can see connections spikes

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Victor Yegorov
2017-10-11 0:53 GMT+03:00 pinker : > > Can you provide output of `iostat -myx 10` at the “peak” moments, please? > > sure, please find it here: > https://pastebin.com/f2Pv6hDL Looks like `sdg` and `sdm` are the ones used most. Can you describe what's on those devices? Do you

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Scott Marlowe-2 wrote > Ouch, unless I'm reading that wrong, your IO subsystem seems to be REALLY > slow. it's a huge array where a lot is happening, for instance data snapshots :/ the LUN this db sits on is dm-7. I'm a DBA with no knowledge about arrays so any advice will be much

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Scott Marlowe
On Tue, Oct 10, 2017 at 3:53 PM, pinker wrote: > Victor Yegorov wrote >> Can you provide output of `iostat -myx 10` at the “peak” moments, please? > > sure, please find it here: > https://pastebin.com/f2Pv6hDL Ouch, unless I'm reading that wrong, your IO subsystem seems to be

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Victor Yegorov wrote > Can you provide output of `iostat -myx 10` at the “peak” moments, please? sure, please find it here: https://pastebin.com/f2Pv6hDL Victor Yegorov wrote > Also, it'd be good to look in more detailed bgwriter/checkpointer stats. > You can find more details in this post:

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Tomas Vondra
On 10/10/2017 10:40 PM, pinker wrote: > Hi to all! > > We've got problem with a very serious repetitive incident on our core > system. Namely, cpu load spikes to 300-400 and the whole db becomes What is "CPU load"? Perhaps you mean "load average"? Also, what are the basic system parameters
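To make the distinction concrete: load average counts runnable (and, on Linux, uninterruptibly waiting) tasks, so it is usually read relative to the number of CPUs rather than as a percentage. A small sketch follows, with function names invented for illustration and a Unix-like host assumed for `os.getloadavg()`.

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def current_load_per_cpu() -> float:
    """1-minute load average of this host divided by its CPU count."""
    one_min, _, _ = os.getloadavg()
    return load_per_cpu(one_min, os.cpu_count() or 1)
```

For example, a load average of 300 on a hypothetical 60-CPU machine gives `load_per_cpu(300.0, 60)` = 5.0, meaning roughly five tasks competing for each CPU.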

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Victor Yegorov
2017-10-10 23:40 GMT+03:00 pinker : > We've got problem with a very serious repetitive incident on our core > system. Namely, cpu load spikes to 300-400 and the whole db becomes > unresponsive. From db point of view nothing special is happening, memory > looks fine, disks io's are

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Thank you Scott, we are planning to do it today. But are you sure it will help in this case?

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Scott Marlowe
On Tue, Oct 10, 2017 at 2:40 PM, pinker wrote: > Hi to all! > > We've got problem with a very serious repetitive incident on our core > system. Namely, cpu load spikes to 300-400 and the whole db becomes > unresponsive. From db point of view nothing special is happening, memory >

[GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Hi to all! We've got problem with a very serious repetitive incident on our core system. Namely, cpu load spikes to 300-400 and the whole db becomes unresponsive. From db point of view nothing special is happening, memory looks fine, disks io's are ok and the only problem is huge cpu load. Kernel