Re: [HACKERS] pgstat wait timeout (RE: contrib/cache_scan)

2014-03-12 Thread Tom Lane

Jeff Janes  writes:
> On Wed, Mar 12, 2014 at 7:42 AM, Tom Lane  wrote:
>> We've seen sporadic reports of that sort of behavior for years, but no
>> developer has ever been able to reproduce it reliably.  Now that you've
>> got a reproducible case, do you want to poke into it and see what's going
>> on?

> I didn't know we were trying to reproduce it, nor that it was a mystery.
>  Do anything that causes serious IO constipation, and you will probably see
> that message.

The cases that are a mystery to me are where there's no reason to believe
that I/O is particularly overloaded.  But perhaps Kaigai-san's example is
only that ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgstat wait timeout (RE: contrib/cache_scan)

2014-03-12 Thread Jeff Janes

On Wed, Mar 12, 2014 at 7:42 AM, Tom Lane  wrote:

> Kouhei Kaigai  writes:
> > WARNING:  pgstat wait timeout
> > WARNING:  pgstat wait timeout
> > WARNING:  pgstat wait timeout
> > WARNING:  pgstat wait timeout
>
> > Once I got above messages, write performance is dramatically
> > degraded, even though I didn't take detailed investigation.
>
> > I could reproduce it on the latest master branch without my
> > enhancement, so I guess it is not a problem something special
> > to me.
> > One other strangeness is, right now, this problem is only
> > happen on my virtual machine environment - VMware ESXi 5.5.0.
> > I couldn't reproduce the problem on my physical environment
> > (Fedora20, core i5-4570S).
>
> We've seen sporadic reports of that sort of behavior for years, but no
> developer has ever been able to reproduce it reliably.  Now that you've
> got a reproducible case, do you want to poke into it and see what's going
> on?
>

I didn't know we were trying to reproduce it, nor that it was a mystery.
 Do anything that causes serious IO constipation, and you will probably see
that message.  For example, turn off synchronous_commit and run the default
pgbench transaction at a large scale but that still comfortably fits in
RAM, and wait for a checkpoint sync phase to kick in.

The pgstat wait timeout is a symptom, not the cause.

Cheers,

Jeff

Re: [HACKERS] pgstat wait timeout (RE: contrib/cache_scan)

2014-03-12 Thread Tomas Vondra

On 12 March 2014, 14:54, Kouhei Kaigai wrote:
> It is another topic from the main thread,
>
> I noticed the following message under the test cases that
> takes heavy INSERT workload; provided by Haribabu.
>
> [kaigai@iwashi ~]$ createdb mytest
> [kaigai@iwashi ~]$ psql -af ~/cache_scan.sql mytest
> \timing
> Timing is on.
> --cache scan select 5 million
> create table test(f1 int, f2 char(70), f3 float, f4 char(100));
> CREATE TABLE
> Time: 22.373 ms
> truncate table test;
> TRUNCATE TABLE
> Time: 17.705 ms
> insert into test values (generate_series(1,500), 'fujitsu', 1.1,
> 'Australia software tech pvt ltd');
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
>:
>
> Once I got above messages, write performance is dramatically
> degraded, even though I didn't take detailed investigation.
>
> I could reproduce it on the latest master branch without my
> enhancement, so I guess it is not a problem something special
> to me.
> One other strangeness is, right now, this problem is only
> happen on my virtual machine environment - VMware ESXi 5.5.0.
> I couldn't reproduce the problem on my physical environment
> (Fedora20, core i5-4570S).
> Any ideas?

I've seen this happen in cases where it was impossible to write
the stat file for some reason. IIRC there were three basic causes I've seen
in the past:

(1) writing the stat copy failed - for example when the temporary stat
directory was placed in tmpfs, but it was too small

(2) writing the stat copy took too long - e.g. with tmpfs and memory
pressure, forcing the system to swap to free space for the stat copy

(3) IIRC the inquiry (backend -> postmaster) to write the file is sent
using UDP, which may be dropped in some cases (e.g. when the system is
overloaded), so the postmaster does not even know it should write the file

I'm not familiar with VMware ESXi virtualization, but I suppose it might
be relevant to all three causes.
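
Causes (1) and (2) can be checked directly on a suspect machine. The following is a minimal sketch, not anything PostgreSQL ships: `probe_stats_dir` is a hypothetical helper that reports whether a file of a given size fits in the temporary stats directory and how long writing and syncing it takes. The directory and size are assumptions, to be replaced with the local `stats_temp_directory` and the observed stat file size.

```python
import os
import shutil
import tempfile
import time


def probe_stats_dir(directory, size_bytes):
    """Check free space in `directory` and time one write of `size_bytes`."""
    free_bytes = shutil.disk_usage(directory).free
    if free_bytes < size_bytes:
        # Cause (1): the copy cannot fit, so the write would fail.
        return {"fits": False, "free_bytes": free_bytes, "write_seconds": None}

    fd, path = tempfile.mkstemp(dir=directory)
    start = time.monotonic()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"\0" * size_bytes)
            handle.flush()
            os.fsync(handle.fileno())
        elapsed = time.monotonic() - start
    finally:
        os.remove(path)  # never leave the probe file behind
    return {"fits": True, "free_bytes": free_bytes, "write_seconds": elapsed}
```

A write time measured in seconds rather than milliseconds would point towards cause (2); a `fits` value of `False` would point towards cause (1).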

regards
Tomas




Re: [HACKERS] pgstat wait timeout (RE: contrib/cache_scan)

2014-03-12 Thread Tom Lane

Kouhei Kaigai  writes:
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout

> Once I got above messages, write performance is dramatically
> degraded, even though I didn't take detailed investigation.

> I could reproduce it on the latest master branch without my
> enhancement, so I guess it is not a problem something special
> to me.
> One other strangeness is, right now, this problem is only
> happen on my virtual machine environment - VMware ESXi 5.5.0.
> I couldn't reproduce the problem on my physical environment
> (Fedora20, core i5-4570S).

We've seen sporadic reports of that sort of behavior for years, but no
developer has ever been able to reproduce it reliably.  Now that you've
got a reproducible case, do you want to poke into it and see what's going
on?

regards, tom lane



[HACKERS] pgstat wait timeout (RE: contrib/cache_scan)

2014-03-12 Thread Kouhei Kaigai

This is a separate topic from the main thread.

I noticed the following messages while running the heavy-INSERT test
case provided by Haribabu.

[kaigai@iwashi ~]$ createdb mytest
[kaigai@iwashi ~]$ psql -af ~/cache_scan.sql mytest
\timing
Timing is on.
--cache scan select 5 million
create table test(f1 int, f2 char(70), f3 float, f4 char(100));
CREATE TABLE
Time: 22.373 ms
truncate table test;
TRUNCATE TABLE
Time: 17.705 ms
insert into test values (generate_series(1,500), 'fujitsu', 1.1,
                         'Australia software tech pvt ltd');
WARNING:  pgstat wait timeout
WARNING:  pgstat wait timeout
WARNING:  pgstat wait timeout
WARNING:  pgstat wait timeout
   :

Once I got the above messages, write performance degraded
dramatically, though I have not investigated in detail.

I could reproduce it on the latest master branch without my
enhancement, so I guess the problem is not specific to my
changes.
One other oddity is that, right now, this problem only
happens in my virtual machine environment (VMware ESXi 5.5.0).
I couldn't reproduce the problem on my physical environment
(Fedora20, core i5-4570S).
Any ideas?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Wednesday, March 12, 2014 3:26 PM
> To: Haribabu Kommi; Kohei KaiGai
> Cc: Tom Lane; PgHacker; Robert Haas
> Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only
> table scan?)
> 
> Thanks for your efforts!
> >                   Head      patched   Diff
> > Select - 500K     772ms     2659ms    -200%
> > Insert - 400K     3429ms    1948ms    43% (I am not sure how it improved in this case)
> > delete - 200K     2066ms    3978ms    -92%
> > update - 200K     3915ms    5899ms    -50%
> >
> > This patch shown how the custom scan can be used very well but coming
> > to patch as It is having some performance problem which needs to be
> > investigated.
> >
> > I attached the test script file used for the performance test.
> >
> First of all, it seems to me your test case has too small data set that
> allows to hold all the data in memory - briefly 500K of 200bytes record
> will consume about 100MB. Your configuration allocates 512MB of
> shared_buffer, and about 3GB of OS-level page cache is available.
> (Note that Linux uses free memory as disk cache adaptively.)
> 
> This cache is designed to hide latency of disk accesses, so this test case
> does not fit its intention.
> (Also, the primary purpose of this module is a demonstration for
> heap_page_prune_hook to hook vacuuming, so simple code was preferred than
> complicated implementation but better performance.)
> 
> I could reproduce the overall trend, no cache scan is faster than cached
> scan if buffer is in memory. Probably, it comes from the cost to walk down
> T-tree index using ctid per reference.
> Performance penalty around UPDATE and DELETE likely come from trigger
> invocation per row.
> I could observe performance gain on INSERT a little bit.
> It's strange for me, also. :-(
> 
> On the other hand, the discussion around custom-plan interface effects this
> module because it uses this API as foundation.
> Please wait for a few days to rebase the cache_scan module onto the newer
> custom-plan interface; that I submitted just a moment before.
> 
> Also, is it really necessary to tune the performance stuff in this example
> module of the heap_page_prune_hook?
> Even though I have a few ideas to improve the cache performance, like
> insertion of multiple rows at once or local chunk copy instead of t-tree
> walk down, I'm not sure whether it is productive in the current v9.4
> timeframe. ;-(
> 
> Thanks,
> --
> NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei
> 
> 
> 
> > -----Original Message-----
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Haribabu
> > Kommi
> > Sent: Wednesday, March 12, 2014 1:14 PM
> > To: Kohei KaiGai
> > Cc: Kaigai Kouhei(海外 浩平); Tom Lane; PgHacker; Robert Haas
> > Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for
> > cache-only table scan?)
> >
> > On Thu, Mar 6, 2014 at 10:15 PM, Kohei KaiGai  wrote:
> > > 2014-03-06 18:17 GMT+09:00 Haribabu Kommi :
> > >> I will update you later regarding the performance test results.
> > >>
> >
> > I ran the performance test on the cache scan patch and below are the
> readings.
> >
> > Configuration:
> >
> > Shared_buffers - 512MB
> > cache_scan.num_blocks - 600
> > checkpoint_segments - 255
> >
> > Machine:
> > OS - centos - 6.4
> > CPU - 4 core 2.5 GHZ
> > Memory - 4GB
> >
> >                   Head      patched   Diff
> > Select - 500K     772ms     2659ms    -200%
> > Insert - 400K

Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-12 Thread Magnus Hagander

On Fri, May 11, 2012 at 3:35 PM, Tom Lane  wrote:
> Magnus Hagander  writes:
>> On Thu, May 10, 2012 at 6:52 PM, Tom Lane  wrote:
>>> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
>>> nasty bug: it does not reset ImmediateInterruptOK before returning.
>>> How is it that Windows machines aren't falling over constantly?
>
>> Hmm. the commit you made to fix it says it changes how
>> ImmediateInterruptOK is handled, but there was not a single line of
>> code that actually changed that? Or am I misreading this completely?
>
> Exit is now out the bottom of the loop, not by a raw "return;".

oh, d'uh. Sorry, missed that one completely.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-11 Thread Tom Lane

Magnus Hagander  writes:
> On Thu, May 10, 2012 at 6:52 PM, Tom Lane  wrote:
>> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
>> nasty bug: it does not reset ImmediateInterruptOK before returning.
>> How is it that Windows machines aren't falling over constantly?

> Hmm. the commit you made to fix it says it changes how
> ImmediateInterruptOK is handled, but there was not a single line of
> code that actually changed that? Or am I misreading this completely?

Exit is now out the bottom of the loop, not by a raw "return;".

regards, tom lane


Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-11 Thread Magnus Hagander

On Thu, May 10, 2012 at 6:52 PM, Tom Lane  wrote:
> I wrote:
>> Hence I think we oughta swap the order of those two array
>> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
>> pgwin32_select.)
>
> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
> nasty bug: it does not reset ImmediateInterruptOK before returning.
> How is it that Windows machines aren't falling over constantly?

Hmm. the commit you made to fix it says it changes how
ImmediateInterruptOK is handled, but there was not a single line of
code that actually changed that? Or am I misreading this completely?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-10 Thread Tom Lane

I wrote:
> Hence I think we oughta swap the order of those two array
> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.)

Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
nasty bug: it does not reset ImmediateInterruptOK before returning.
How is it that Windows machines aren't falling over constantly?

regards, tom lane


Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-10 Thread Tom Lane

Magnus Hagander  writes:
> On May 10, 2012 4:59 PM, "Tom Lane"  wrote:
>> I spent some time staring at the Windows WaitLatchOrSocket code myself.
>> The only thing I could find that seemed wrong is that in the event
>> array, we list the latch's event before pgwin32_signal_event.  The
>> Microsoft documentation I looked at says that if more than one event
>> is ready, WaitForMultipleObjects reports the first such array member.
>> This means that if the latch is already set when control gets here,
>> signal handlers will not be serviced.

> Yeah, that does seem wrong.

>> That doesn't match what would
>> happen on a Unix machine, so it seems like at least a violation of the
>> POLA.  Hence I think we oughta swap the order of those two array
>> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
>> pgwin32_select.)  I do not however

> Maybe we need a loop that checks for all events?

I don't think so.  It's already the case that WaitLatch doesn't
guarantee that all possible flags are set in its result.  In connection
with Peter G's observation that we could simplify the API by rechecking
PostmasterIsAlive for WL_POSTMASTER_DEATH, I was planning to clarify
the API spec as "result bits that are set are guaranteed to reflect
reality, but it's not guaranteed that we set every bit that could
possibly be set".  This should not break any caller since the same
result could occur given a slight change in timing anyway; the caller
has to be prepared to come back and check for more conditions after it
services whatever WaitLatch does report.  However, signal service is
not a condition the caller is supposed to deal with, so I think we
want a guarantee that that happens inside WaitLatch.
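
The caller-side consequence of that contract can be illustrated with a toy model. This is a sketch in Python rather than PostgreSQL's C, and `wait_latch_model` is a hypothetical stand-in that deliberately reports only one ready condition per call. The flag names mirror the `WL_*` naming, but their values here are illustrative.

```python
WL_LATCH_SET = 1 << 0        # illustrative values, not taken from the source tree
WL_SOCKET_READABLE = 1 << 1


def wait_latch_model(ready_mask):
    """Report a true, but possibly incomplete, subset of the ready conditions.

    This model returns only the lowest set bit, which is the worst case the
    contract allows: every reported bit is real, but others may be missing.
    """
    return ready_mask & -ready_mask


def service_until_idle(ready_mask):
    """Loop the way a caller must: handle what was reported, then wait again."""
    handled = []
    while ready_mask:
        reported = wait_latch_model(ready_mask)
        handled.append(reported)
        ready_mask &= ~reported  # servicing a condition clears it
    return handled
```

A caller written this way still services every condition, just over more than one wakeup, which is why the weaker guarantee should not break it.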

>> see a way that that would explain the
>> pgstat failures, because the stats collector's latch really shouldn't
>> ever get set during normal regression test runs.

> So could there be something wrong in the other end, meaning the latch
> *does* get set?

Even if it did, it'd get cleared at the top of the loop, so that the
next call ought to handle things.  Tis a puzzlement.  AFAICS the only
condition WaitForMultipleObjects is going to see in these tests is
read-ready on the socket; surely it wouldn't fail to notice that?

regards, tom lane


Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-10 Thread Magnus Hagander

On May 10, 2012 4:59 PM, "Tom Lane"  wrote:
>
> I wrote:
> > Last night I changed the stats collector process to use
> > WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> > the postmaster has died.  This morning I observe that several Windows
> > buildfarm members are showing regression test failures caused by
> > unexpected "pgstat wait timeout" warnings.  Everybody else is fine.
>
> > This suggests that there is something broken in the Windows
> > implementation of WaitLatchOrSocket.  I wonder whether it also
> > tells us something we did not know about the underlying cause of
> > those messages.  Not sure what though.  Ideas?  Can anyone who
> > knows Windows take another look at WaitLatchOrSocket?
>
> Anybody have any clues about that?  If not, I think I'll have to revert
> the pgstat changes for beta1, which isn't really forward progress.

Haven't had time to look at the code itself, and won't before wrap time.
Sorry.

> I spent some time staring at the Windows WaitLatchOrSocket code myself.
> The only thing I could find that seemed wrong is that in the event
> array, we list the latch's event before pgwin32_signal_event.  The
> Microsoft documentation I looked at says that if more than one event
> is ready, WaitForMultipleObjects reports the first such array member.
> This means that if the latch is already set when control gets here,
> signal handlers will not be serviced.

Yeah, that does seem wrong.

>  That doesn't match what would
> happen on a Unix machine, so it seems like at least a violation of the
> POLA.  Hence I think we oughta swap the order of those two array
> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.)  I do not however

Maybe we need a loop that checks for all events?

> see a way that that would explain the
> pgstat failures, because the stats collector's latch really shouldn't
> ever get set during normal regression test runs.

So could there be something wrong in the other end, meaning the latch
*does* get set?

/Magnus

Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-10 Thread Tom Lane

I wrote:
> Last night I changed the stats collector process to use
> WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> the postmaster has died.  This morning I observe that several Windows
> buildfarm members are showing regression test failures caused by
> unexpected "pgstat wait timeout" warnings.  Everybody else is fine.

> This suggests that there is something broken in the Windows
> implementation of WaitLatchOrSocket.  I wonder whether it also
> tells us something we did not know about the underlying cause of
> those messages.  Not sure what though.  Ideas?  Can anyone who
> knows Windows take another look at WaitLatchOrSocket?

Anybody have any clues about that?  If not, I think I'll have to revert
the pgstat changes for beta1, which isn't really forward progress.

I spent some time staring at the Windows WaitLatchOrSocket code myself.
The only thing I could find that seemed wrong is that in the event
array, we list the latch's event before pgwin32_signal_event.  The
Microsoft documentation I looked at says that if more than one event
is ready, WaitForMultipleObjects reports the first such array member.
This means that if the latch is already set when control gets here,
signal handlers will not be serviced.  That doesn't match what would
happen on a Unix machine, so it seems like at least a violation of the
POLA.  Hence I think we oughta swap the order of those two array
elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
pgwin32_select.)  I do not however see a way that that would explain the
pgstat failures, because the stats collector's latch really shouldn't
ever get set during normal regression test runs.
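
The ordering problem can be shown with a small simulation. The sketch below is Python, not the Windows API: `first_signaled` is a hypothetical model of the documented rule that the first ready array member is the one reported, and the event names are labels for illustration only. It assumes the latch stays set across wakeups, which is the situation described above.

```python
def first_signaled(events, signaled):
    """Model of the reporting rule: return the first array member that is ready."""
    for name in events:
        if name in signaled:
            return name
    return None


def serviced_with_latch_stuck(events, rounds=3):
    """Wake up `rounds` times while the latch stays set and a signal is pending."""
    signaled = {"latch", "signal"}
    seen = []
    for _ in range(rounds):
        woke_on = first_signaled(events, signaled)
        seen.append(woke_on)
        if woke_on == "signal":
            signaled.discard("signal")  # the handler ran; the latch is still set
    return seen
```

With the latch listed first, the pending signal is never reported while the latch stays set. With the two elements swapped, the signal is handled on the first wakeup.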

regards, tom lane


[HACKERS] "pgstat wait timeout" just got a lot more common on Windows

2012-05-09 Thread Tom Lane

Last night I changed the stats collector process to use
WaitLatchOrSocket instead of a periodic forced wakeup to see whether
the postmaster has died.  This morning I observe that several Windows
buildfarm members are showing regression test failures caused by
unexpected "pgstat wait timeout" warnings.  Everybody else is fine.

This suggests that there is something broken in the Windows
implementation of WaitLatchOrSocket.  I wonder whether it also
tells us something we did not know about the underlying cause of
those messages.  Not sure what though.  Ideas?  Can anyone who
knows Windows take another look at WaitLatchOrSocket?

regards, tom lane


Re: [HACKERS] pgstat wait timeout

2012-01-31 Thread pratikchirania

Hi,

I disabled autovacuum and the warnings stopped appearing in the log.
After re-enabling autovacuum, the warnings started appearing again.

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5444033.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] pgstat wait timeout

2012-01-23 Thread pratikchirania

Hi,

Any ideas on this?

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5165651.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] pgstat wait timeout

2012-01-04 Thread pratikchirania

Thanks, I missed that.

After making these changes, here is what I observe:

1. The size of the pgstat file is 86KB. It was last modified when I moved the
file location to the RAM disk.
2. The issue persists. I am seeing continuous logs:

2012-01-05 00:00:06 JST WARNING:  pgstat wait timeout
2012-01-05 00:00:14 JST WARNING:  pgstat wait timeout
2012-01-05 00:00:26 JST WARNING:  pgstat wait timeout
.
.
.
2012-01-05 15:36:25 JST WARNING:  pgstat wait timeout
2012-01-05 15:36:37 JST WARNING:  pgstat wait timeout
2012-01-05 15:36:45 JST WARNING:  pgstat wait timeout


--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5121894.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] pgstat wait timeout

2012-01-04 Thread Tomas Vondra

On 4 January 2012, 13:17, pratikchirania wrote:
> I have installed RAMdisk and pointed the parameter:
>
> #stats_temp_directory = 'B:\pg_stat_tmp'
> I also tried #stats_temp_directory = 'B:/pg_stat_tmp'
>
> But, still there is no file created in the RAM disk.
> The previous stat file is touched even after the change is made. (I have
> restarted the service after effecting the change)

You have to remove the '#' at the beginning, this way it's commented out.
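
A quick way to confirm whether a setting is actually active is to scan the configuration file for an uncommented assignment. The helper below is a rough sketch: `active_setting` is a hypothetical name, and its parsing is deliberately simplistic. It assumes the `name = value` form and ignores include directives and the quoting subtleties that the real parser handles.

```python
def active_setting(conf_text, name):
    """Return the last uncommented value assigned to `name`, or None."""
    value = None
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented out: the server ignores this line
        key, sep, rest = line.partition("=")
        if sep and key.strip() == name:
            # Drop a trailing comment and the surrounding quotes.
            value = rest.split("#", 1)[0].strip().strip("'")
    return value
```

Run against the line quoted above, with its leading '#', this returns `None`, which matches the observation that the old stat file location was still in use.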

Tomas



Re: [HACKERS] pgstat wait timeout

2012-01-04 Thread pratikchirania

I have installed RAMdisk and pointed the parameter:

#stats_temp_directory = 'B:\pg_stat_tmp'
I also tried #stats_temp_directory = 'B:/pg_stat_tmp'

But, still there is no file created in the RAM disk.
The previous stat file is touched even after the change is made. (I have
restarted the service after effecting the change)

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5119436.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] pgstat wait timeout

2011-12-28 Thread Steve Crawford


On 12/28/2011 09:34 AM, Alvaro Herrera wrote:
> Excerpts from Steve Crawford's message of Wed Dec 28 13:24:37 -0300 2011:
> > On 12/28/2011 05:05 AM, Alvaro Herrera wrote:
> > > Excerpts from Steve Crawford's message of Tue Dec 27 22:51:06 -0300 2011:
> > >> I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is
> > >> currently in test/dev mode. I'm currently seeing the following messages
> > >> occurring every few seconds:
> > >>
> > >> ...
> > >> Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
> > >> Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait
> > >> timeout
> > >> Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
> > >> Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait
> > >> timeout
> > > Hm, so can you strace the stats collector to see what it's doing?  Maybe
> > > grab a backtrace with GDB from it before anything else.
> > >
> > > My guess is 27324 is the autovac launcher and the others are autovac
> > > workers just as they die.
> > >
> > You are correct. 27324 is the launcher and the others are autovac
> > workers. Here's the strace of the stats collector process:
> >
> > getppid()   = 27320
> > poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> > getppid()   = 27320
> > poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> > getppid()   = 27320
> > poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> > rinse...lather...repeat...ad nauseam...
>
> Weird ... even across more "pgstat wait timeout" messages?  It's like
> it's not getting the "inquiry" messages that would tell it to write the
> file ... something wrong with the UDP socket perhaps?


Bingo!

postgres  27325 postgres    8u  *IPv6* 5379428      0t0  UDP localhost:47204->localhost:47204


In working on diagnosing a network timeout issue over an IPv4 to IPv4 
VPN I disabled IPv6 via sysctl on this machine and pretty much forgot 
about it since we are still IPv4 internally. But PostgreSQL had already 
established a (now non-functional) IPv6 local connection. Re-enabling 
IPv6, as it was not related to the VPN timeouts, corrected the "pgstat 
wait timeout" issue.
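
A failure like this can be checked for without PostgreSQL at all by sending a datagram to oneself over the loopback address. The following sketch is an illustration under stated assumptions, not PostgreSQL code: `udp_loopback_works` is a hypothetical helper, and the half-second timeout is an arbitrary choice.

```python
import select
import socket


def udp_loopback_works(host, family, timeout=0.5):
    """Bind a UDP socket on `host`, send one byte to itself, and wait for it."""
    try:
        sock = socket.socket(family, socket.SOCK_DGRAM)
    except OSError:
        return False  # the address family is not available at all
    try:
        sock.bind((host, 0))              # port 0: let the kernel pick a port
        sock.connect(sock.getsockname())  # talk to our own address
        sock.send(b"x")
        readable, _, _ = select.select([sock], [], [], timeout)
        return bool(readable) and sock.recv(1) == b"x"
    except OSError:
        return False  # bind/connect/send failed, e.g. the address is disabled
    finally:
        sock.close()


if __name__ == "__main__":
    print("IPv4 loopback:", udp_loopback_works("127.0.0.1", socket.AF_INET))
    print("IPv6 loopback:", udp_loopback_works("::1", socket.AF_INET6))
```

On a machine with IPv6 disabled as described above, the IPv6 check would be expected to return False while the IPv4 one still succeeds.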


Cheers,
Steve



Re: [HACKERS] pgstat wait timeout

2011-12-28 Thread Alvaro Herrera


Excerpts from Steve Crawford's message of Wed Dec 28 13:24:37 -0300 2011:
> On 12/28/2011 05:05 AM, Alvaro Herrera wrote:
> > Excerpts from Steve Crawford's message of Tue Dec 27 22:51:06 -0300 2011:
> >> I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is
> >> currently in test/dev mode. I'm currently seeing the following messages
> >> occurring every few seconds:
> >>
> >> ...
> >> Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
> >> Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait
> >> timeout
> >> Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
> >> Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait
> >> timeout
> > Hm, so can you strace the stats collector to see what it's doing?  Maybe
> > grab a backtrace with GDB from it before anything else.
> >
> > My guess is 27324 is the autovac launcher and the others are autovac
> > workers just as they die.
> >
> You are correct. 27324 is the launcher and the others are autovac 
> workers. Here's the strace of the stats collector process:
> 
> getppid()   = 27320
> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> getppid()   = 27320
> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> getppid()   = 27320
> poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
> rinse...lather...repeat...ad nauseam...

Weird ... even across more "pgstat wait timeout" messages?  It's like
it's not getting the "inquiry" messages that would tell it to write the
file ... something wrong with the UDP socket perhaps?

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [HACKERS] pgstat wait timeout

2011-12-28 Thread Steve Crawford


On 12/28/2011 05:05 AM, Alvaro Herrera wrote:
> Excerpts from Steve Crawford's message of Tue Dec 27 22:51:06 -0300 2011:
> > I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is
> > currently in test/dev mode. I'm currently seeing the following messages
> > occurring every few seconds:
> >
> > ...
> > Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
> > Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait
> > timeout
> > Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
> > Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait
> > timeout
>
> Hm, so can you strace the stats collector to see what it's doing?  Maybe
> grab a backtrace with GDB from it before anything else.
>
> My guess is 27324 is the autovac launcher and the others are autovac
> workers just as they die.

You are correct. 27324 is the launcher and the others are autovac 
workers. Here's the strace of the stats collector process:


getppid()   = 27320
poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
getppid()   = 27320
poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
getppid()   = 27320
poll([{fd=8, events=POLLIN|POLLERR}], 1, 2000) = 0 (Timeout)
rinse...lather...repeat...ad nauseam...

And the backtrace:

#0  0x7ff4d2e80f58 in poll () from /lib/libc.so.6
#1  0x7ff4d4e6f465 in ?? ()
#2  0x7ff4d4e6fd83 in pgstat_start ()
#3  0x7ff4d4e73475 in ?? ()
#4 
#5  0x7ff4d2e85fd3 in select () from /lib/libc.so.6
#6  0x7ff4d4e71b93 in ?? ()
#7  0x7ff4d4e74b01 in PostmasterMain ()
#8  0x7ff4d4e193b3 in main ()

Cheers,
Steve



Re: [HACKERS] pgstat wait timeout

2011-12-28 Thread Alvaro Herrera


Excerpts from Steve Crawford's message of Tue Dec 27 22:51:06 -0300 2011:
> I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is 
> currently in test/dev mode. I'm currently seeing the following messages 
> occurring every few seconds:
> 
> ...
> Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
> Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait 
> timeout
> Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
> Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait 
> timeout

Hm, so can you strace the stats collector to see what it's doing?  Maybe
grab a backtrace with GDB from it before anything else.

My guess is 27324 is the autovac launcher and the others are autovac
workers just as they die.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


[HACKERS] pgstat wait timeout

2011-12-27 Thread Steve Crawford

I have a system (9.0.4 on Ubuntu Server 10.04 LTS x86_64) that is 
currently in test/dev mode. I'm currently seeing the following messages 
occurring every few seconds:


...
Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait timeout
Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait timeout
Dec 27 17:43:59 foo postgres[23697]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:04 foo postgres[27324]: [71402-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:09 foo postgres[23715]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:17 foo postgres[27324]: [71403-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:22 foo postgres[23716]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:27 foo postgres[27324]: [71404-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:33 foo postgres[23718]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:54 foo postgres[27324]: [71405-1] : WARNING:  pgstat wait timeout
Dec 27 17:44:59 foo postgres[23824]: [6-1] : WARNING:  pgstat wait timeout
Dec 27 17:45:04 foo postgres[27324]: [71406-1] : WARNING:  pgstat wait timeout


I can't correlate events exactly, but the messages seem to have started 
shortly after I dropped a pgbench user and database. My Googling turned 
up various requests for debugging info on "hackers". Since the system 
isn't live, I haven't touched it in case anyone wants me to collect 
debugging info.
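
To see which processes are emitting the warning, a small Python sketch along these lines could tally the messages by PID. It assumes syslog lines shaped like the ones in this message, with any wrapped lines already joined onto one line; the regular expression and the name count_timeouts_by_pid are illustrative, not part of any PostgreSQL tooling.

```python
import re
from collections import Counter

# Matches e.g. "... postgres[27324]: [71400-1] : WARNING:  pgstat wait timeout"
LINE_RE = re.compile(r"postgres\[(\d+)\]:.*WARNING:\s+pgstat wait timeout")


def count_timeouts_by_pid(lines):
    """Return a Counter mapping backend PID -> number of timeout warnings."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the log excerpt in this message.
sample = [
    "Dec 27 17:43:22 foo postgres[23693]: [6-1] : WARNING:  pgstat wait timeout",
    "Dec 27 17:43:27 foo postgres[27324]: [71400-1] : WARNING:  pgstat wait timeout",
    "Dec 27 17:43:33 foo postgres[23695]: [6-1] : WARNING:  pgstat wait timeout",
    "Dec 27 17:43:54 foo postgres[27324]: [71401-1] : WARNING:  pgstat wait timeout",
]

for pid, n in count_timeouts_by_pid(sample).most_common():
    print(pid, n)
```

A PID that keeps recurring while the others each appear once would fit a pattern of one long-lived process plus a series of short-lived ones.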


Otherwise, I plan on just blowing the install away and replacing it with 9.1

Cheers,
Steve



Re: [HACKERS] pgstat wait timeout

2011-12-20 Thread Andrew Dunstan




On 12/20/2011 05:13 AM, pratikchirania wrote:

>> Would this be alleviated by setting stats_temp_directory to point to a ramdisk?
>
> I am not aware how to do this. I am using a windows server OS.
> The conf file has the entry : #stats_temp_directory = 'pg_stat_tmp'
>
> What do I change it to? Please elucidate.



On Windows it appears you need third party software for a ramdisk. 
Search google for info.


cheers

andrew


Re: [HACKERS] pgstat wait timeout

2011-12-20 Thread pratikchirania

>Would this be alleviated by setting stats_temp_directory to point to a ramdisk?

I am not aware how to do this. I am using a windows server OS.
The conf file has the entry : #stats_temp_directory = 'pg_stat_tmp'

What do I change it to? Please elucidate.

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5088497.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] pgstat wait timeout

2011-12-19 Thread Andrew Dunstan




On 12/19/2011 11:45 AM, Robert Haas wrote:

> On Mon, Dec 19, 2011 at 10:02 AM, pratikchirania  wrote:
>> Version upgrade: hmm.. Is there any fix/change related to this issue in
>> 9.0.6?
>
> You could read the release notes for those minor version upgrades.
>
> Based on a quick look through the commit logs, and a quick grep of
> release-9-0.sgml, I don't think so.




Would this be alleviated by setting stats_temp_directory to point to a ramdisk?

cheers

andrew


Re: [HACKERS] pgstat wait timeout

2011-12-19 Thread Robert Haas

On Mon, Dec 19, 2011 at 10:02 AM, pratikchirania  wrote:
> Version upgrade: hmm.. Is there any fix/change related to this issue in
> 9.0.6?

You could read the release notes for those minor version upgrades.

Based on a quick look through the commit logs, and a quick grep of
release-9-0.sgml, I don't think so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] pgstat wait timeout

2011-12-19 Thread pratikchirania

OS: I am using Windows server 2003

Version upgrade: hmm.. Is there any fix/change related to this issue in
9.0.6?
If yes, I will upgrade in next scheduled downtime (I am using this as
production server)...

postgres queries are used only occasionally (a set of calls once every 30
minutes), so I guess I am not calling my DB component heavily.

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5086379.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] pgstat wait timeout

2011-12-15 Thread Tomas Vondra

On 15 December 2011, 19:42, pratikchirania wrote:
> Size of pgstat.stat file: 86KB

That's pretty small.

> I did not understand the second part. Where do I get the "iostat -x 1"
> message?
> (It's not present in any file in the pg_log folder)

iostat is not part of PostgreSQL, it's a tool used to display various I/O
metrics in Linux (and Unix in general). What OS are you using?

It seems the I/O subsystem is so busy it can't write the pgstat.stat on
time, so a warning is printed. You need to find out why the I/O is so
overutilized.

> I am using postgres 9.0.1

That's way too old. Upgrade to 9.0.6.

Tomas


Re: [HACKERS] pgstat wait timeout

2011-12-15 Thread pratikchirania

Size of pgstat.stat file: 86KB

I did not understand the second part. Where do I get the "iostat -x 1" message?
(It's not present in any file in the pg_log folder)


I am using postgres 9.0.1

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5078391.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] pgstat wait timeout

2011-12-15 Thread Tomas Vondra

On 15 December 2011, 18:19, Magnus Hagander wrote:
> On Thu, Dec 15, 2011 at 18:13, Tomas Vondra  wrote:
>> On 15 December 2011, 17:55, pratikchirania wrote:
>>> Hi,
>>>
>>> I am having a scenario where I get consistent warnings in the pglog
>>> folder:
>>>
>>> 2011-12-11 00:00:03 JST WARNING:  pgstat wait timeout
>>> 2011-12-11 00:00:14 JST WARNING:  pgstat wait timeout
>>> 2011-12-11 00:00:24 JST WARNING:  pgstat wait timeout
>>> 2011-12-11 00:00:31 JST WARNING:  pgstat wait timeout
>>> 2011-12-11 00:00:44 JST WARNING:  pgstat wait timeout
>>> 2011-12-11 00:00:52 JST WARNING:  pgstat wait timeout
>>> 2011-12-11 00:01:03 JST WARNING:  pgstat wait timeout
>>> 2011-12-11 00:01:11 JST WARNING:  pgstat wait timeout
>>>
>>> This is impacting database performance.
>>
>> It's rather a sign that the I/O is overloaded, although in some cases it
>> may actually be the cause.
>>
>>> The issue persists even when I use the database minimally.
>>
>> Yes, because the file is written periodically - twice a second IIRC. If
>> the file is large, this may be an issue. What is the pgstat.stat size
>> (should be in data/global).
>
> That was only true prior to 8.4. As of 8.4 it's only written when
> necessary, which is usually a lot less than twice / second.

Thanks for the correction. Nevertheless, it would be useful to know the
size of the file and the I/O usage.

Pratik, can you post the size of the pgstat.stat file and post a few lines
of "iostat -x 1" collected when the "pgstat wait timeout" happens?

Tomas



Re: [HACKERS] pgstat wait timeout

2011-12-15 Thread Magnus Hagander

On Thu, Dec 15, 2011 at 18:13, Tomas Vondra  wrote:
> On 15 December 2011, 17:55, pratikchirania wrote:
>> Hi,
>>
>> I am having a scenario where I get consistent warnings in the pglog
>> folder:
>>
>> 2011-12-11 00:00:03 JST WARNING:  pgstat wait timeout
>> 2011-12-11 00:00:14 JST WARNING:  pgstat wait timeout
>> 2011-12-11 00:00:24 JST WARNING:  pgstat wait timeout
>> 2011-12-11 00:00:31 JST WARNING:  pgstat wait timeout
>> 2011-12-11 00:00:44 JST WARNING:  pgstat wait timeout
>> 2011-12-11 00:00:52 JST WARNING:  pgstat wait timeout
>> 2011-12-11 00:01:03 JST WARNING:  pgstat wait timeout
>> 2011-12-11 00:01:11 JST WARNING:  pgstat wait timeout
>>
>> This is impacting database performance.
>
> It's rather a sign that the I/O is overloaded, although in some cases it
> may actually be the cause.
>
>> The issue persists even when I use the database minimally.
>
> Yes, because the file is written periodically - twice a second IIRC. If
> the file is large, this may be an issue. What is the pgstat.stat size
> (should be in data/global).

That was only true prior to 8.4. As of 8.4 it's only written when
necessary, which is usually a lot less than twice / second.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] pgstat wait timeout

2011-12-15 Thread Tomas Vondra

On 15 December 2011, 17:55, pratikchirania wrote:
> Hi,
>
> I am having a scenario where I get consistent warnings in the pglog
> folder:
>
> 2011-12-11 00:00:03 JST WARNING:  pgstat wait timeout
> 2011-12-11 00:00:14 JST WARNING:  pgstat wait timeout
> 2011-12-11 00:00:24 JST WARNING:  pgstat wait timeout
> 2011-12-11 00:00:31 JST WARNING:  pgstat wait timeout
> 2011-12-11 00:00:44 JST WARNING:  pgstat wait timeout
> 2011-12-11 00:00:52 JST WARNING:  pgstat wait timeout
> 2011-12-11 00:01:03 JST WARNING:  pgstat wait timeout
> 2011-12-11 00:01:11 JST WARNING:  pgstat wait timeout
>
> This is impacting database performance.

It's rather a sign that the I/O is overloaded, although in some cases it
may actually be the cause.

> The issue persists even when I use the database minimally.

Yes, because the file is written periodically - twice a second IIRC. If
the file is large, this may be an issue. What is the pgstat.stat size
(it should be in data/global)?

> I have tried fine-tuning the Auto-vacuum parameters:

Autovacuum has nothing to do with this.

> Any Ideas?

Move the file to a RAM drive - there's even a config parameter
'stats_temp_directory' to do that. See
http://www.postgresql.org/docs/9.1/static/runtime-config-statistics.html
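
After pointing stats_temp_directory at a RAM-backed location, one might want to confirm that the directory really sits on tmpfs. The following is a minimal Python sketch for Linux only, assuming the usual whitespace-separated /proc/mounts layout (device, mount point, filesystem type, ...); the function name and the example paths are hypothetical, and mount points containing escaped spaces are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path.

    mounts_text is the content of /proc/mounts (or text in the same layout).
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # The trailing "/" keeps /var/run from matching /var/runner.
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        # Hypothetical location; substitute the configured stats_temp_directory.
        print(filesystem_type("/var/run/postgresql/pg_stat_tmp", f.read()))
```

If the result is anything other than tmpfs (or another RAM-backed type), the stats file is still being written through the regular I/O path.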

Tomas


[HACKERS] pgstat wait timeout

2011-12-15 Thread pratikchirania

Hi,

I am having a scenario where I get consistent warnings in the pglog folder:

2011-12-11 00:00:03 JST WARNING:  pgstat wait timeout
2011-12-11 00:00:14 JST WARNING:  pgstat wait timeout
2011-12-11 00:00:24 JST WARNING:  pgstat wait timeout
2011-12-11 00:00:31 JST WARNING:  pgstat wait timeout
2011-12-11 00:00:44 JST WARNING:  pgstat wait timeout
2011-12-11 00:00:52 JST WARNING:  pgstat wait timeout
2011-12-11 00:01:03 JST WARNING:  pgstat wait timeout
2011-12-11 00:01:11 JST WARNING:  pgstat wait timeout
.
.
.

This is impacting database performance.

The issue persists even when I use the database minimally.

I have tried fine-tuning the Auto-vacuum parameters:

Change the parameter autovacuum_vacuum_cost_delay to 40ms ::: Issue is
reproduced 4 hours after this change

Change the parameter autovacuum_max_workers to 20 ::: I got the warning
message 2 times at times 17:20:12 and 17:20:20. After this, no warnings for
5 hours. Then I tried:

Change the parameter autovacuum_vacuum_cost_delay to: 60ms, Change the
parameter autovacuum_max_workers to: 10::: I got the warning message 2 times
at times 17:20:12 and 17:20:20. After this, no warnings for 5 hours

Any Ideas?

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/pgstat-wait-timeout-tp5078125p5078125.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


Re: [HACKERS] "pgstat wait timeout" warnings

2011-08-11 Thread Robert Haas

On Thu, Aug 11, 2011 at 10:30 AM, Tom Lane  wrote:
> Andres Freund  writes:
>>> --On 10. August 2011 21:54:06 +0300 Heikki Linnakangas
>>>  wrote:
>>>> So my theory is that if the I/O is really busy, write() on the stats file
>>>> blocks for more than 5 seconds, and you get the timeout.
>
>> Yes, I have seen it several times as well. I can actually reproduce it
>> without much problems, so if you have some idea to test...
>
> It doesn't surprise me that it's possible to reproduce it under extreme
> I/O load.  What I am wondering about is whether there's some bug/effect
> that allows it to happen without that.

I got it several times during a pgbench -i -s 5000 run this morning.
I guess that's a lot of I/O, but I'm not sure I'd refer to one process
filling a table with data as "extreme".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] "pgstat wait timeout" warnings

2011-08-11 Thread Tom Lane

Andres Freund  writes:
>> --On 10. August 2011 21:54:06 +0300 Heikki Linnakangas
>>  wrote:
>>> So my theory is that if the I/O is really busy, write() on the stats file
>>> blocks for more than 5 seconds, and you get the timeout.

> Yes, I have seen it several times as well. I can actually reproduce it
> without much problems, so if you have some idea to test...

It doesn't surprise me that it's possible to reproduce it under extreme
I/O load.  What I am wondering about is whether there's some bug/effect
that allows it to happen without that.

regards, tom lane


Re: [HACKERS] "pgstat wait timeout" warnings

2011-08-11 Thread Andres Freund

On Thursday, August 11, 2011 11:49:12 Bernd Helmle wrote:
> --On 10. August 2011 21:54:06 +0300 Heikki Linnakangas
>  wrote:
> > So my theory is that if the I/O is really busy, write() on the stats
> > file blocks for more than 5 seconds, and you get the timeout.
>
> I've seen it on customer instances with very high INSERT peak loads (several
> dozen backends INSERTing/UPDATEing data concurrently). We have been using a
> RAM disk for stats_temp_directory for a while now, and the timeout has never
> occurred again.

Yes, I have seen it several times as well. I can actually reproduce it without
much trouble, so if you have some idea to test...

I also routinely use stats_temp_directory + tmpfs to solve this (and related 
issues). I really think the stats file mechanism should be improved 
fundamentally.

Andres


Re: [HACKERS] "pgstat wait timeout" warnings

2011-08-11 Thread Bernd Helmle




--On 10. August 2011 21:54:06 +0300 Heikki Linnakangas
 wrote:

> So my theory is that if the I/O is really busy, write() on the stats file
> blocks for more than 5 seconds, and you get the timeout.

I've seen it on customer instances with very high INSERT peak loads (several
dozen backends INSERTing/UPDATEing data concurrently). We have been using a RAM
disk for stats_temp_directory for a while now, and the timeout has never
occurred again.


--
Thanks

Bernd


Re: [HACKERS] "pgstat wait timeout" warnings

2011-08-10 Thread Heikki Linnakangas


On 10.08.2011 21:45, Tom Lane wrote:

> We occasionally see $SUBJECT in the buildfarm, and I've also recently
> had reports of them from Red Hat customers.  The obvious theory is that
> these reflect high load preventing the stats collector from responding,
> but it would really take pretty crushing load to make that happen if
> there were not anything funny going on.
>
> It struck me just now while reviewing the latch code that pg_usleep
> could sleep for less than the expected time if a signal happened, and
> if that happened repeatedly for some reason, perhaps the loop could
> complete in much less than the nominal time.  I'm not sure I believe
> that idea either, but anyway I'm feeling motivated to try to gather more
> data.


I've also seen this on my laptop occasionally. The most recent case I 
remember was when I COPYed a lot of data, so that the harddisk was 
really busy. The system was a bit unresponsive anyway, because of all 
the I/O happening.


So my theory is that if the I/O is really busy, write() on the stats 
file blocks for more than 5 seconds, and you get the timeout.
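
A rough way to probe this theory on a given machine, sketched in Python under stated assumptions: the function below times writing a payload to a temporary file and renaming it into place, then returns the elapsed seconds. This only mimics a general write-then-rename pattern; it is not the stats collector's code, and the function name, default file name and payload size are illustrative.

```python
import os
import tempfile
import time


def timed_write_and_rename(directory, payload, final_name="pgstat.stat"):
    """Write payload to a temp file in directory, rename it to final_name,
    and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        # Atomic replacement on the same filesystem.
        os.replace(tmp_path, os.path.join(directory, final_name))
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return time.monotonic() - start


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()  # use a directory on the disk under test
    seconds = timed_write_and_rename(scratch, b"\0" * (100 * 1024))
    print("write+rename took %.3f s" % seconds)
```

Running this repeatedly while the disk is saturated, and comparing the result against the 5 second limit, would give a crude indication of whether plain write latency could account for the warning.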



> Does anyone have a problem with sticking a lot of debugging printout
> into backend_read_statsfile() in HEAD only?  I'm envisioning it starting
> to dump assorted information including elapsed time, errno values, etc
> once the loop counter is more than halfway to expiration, which is
> already a situation that we shouldn't see under normal conditions.


No objections here.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


[HACKERS] "pgstat wait timeout" warnings

2011-08-10 Thread Tom Lane

We occasionally see $SUBJECT in the buildfarm, and I've also recently
had reports of them from Red Hat customers.  The obvious theory is that
these reflect high load preventing the stats collector from responding,
but it would really take pretty crushing load to make that happen if
there were not anything funny going on.

It struck me just now while reviewing the latch code that pg_usleep
could sleep for less than the expected time if a signal happened, and
if that happened repeatedly for some reason, perhaps the loop could
complete in much less than the nominal time.  I'm not sure I believe
that idea either, but anyway I'm feeling motivated to try to gather more
data.

Does anyone have a problem with sticking a lot of debugging printout
into backend_read_statsfile() in HEAD only?  I'm envisioning it starting
to dump assorted information including elapsed time, errno values, etc
once the loop counter is more than halfway to expiration, which is
already a situation that we shouldn't see under normal conditions.
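
The kind of instrumentation described here can be pictured with a rough Python sketch; it is not the backend_read_statsfile() code. It shows a polling loop bounded by an iteration count that also tracks real elapsed time and starts logging once the counter is more than halfway to expiration. The name wait_for_fresh_stats, the is_fresh callback, the 10 ms retry delay and the injectable sleep/clock/log parameters are assumptions for illustration; only the 5 second budget is taken from this thread.

```python
import time

MAX_WAIT_MS = 5000    # total budget discussed in this thread
RETRY_DELAY_MS = 10   # assumed delay between polls, illustrative only
LOOP_COUNT = MAX_WAIT_MS // RETRY_DELAY_MS


def wait_for_fresh_stats(is_fresh, sleep=time.sleep, clock=time.monotonic,
                         log=print):
    """Poll is_fresh() up to LOOP_COUNT times.

    Returns (success, elapsed_seconds). Once the loop counter passes the
    halfway mark, each iteration logs the counter and the real elapsed time,
    so a loop that expires early can be told apart from a slow one.
    """
    start = clock()
    for count in range(LOOP_COUNT):
        if is_fresh():
            return True, clock() - start
        if count > LOOP_COUNT // 2:
            log("still waiting: iteration %d, elapsed %.3f s"
                % (count, clock() - start))
        sleep(RETRY_DELAY_MS / 1000.0)
    elapsed = clock() - start
    log("wait timeout after %d iterations, elapsed %.3f s"
        % (LOOP_COUNT, elapsed))
    return False, elapsed
```

If the sleep call keeps returning early, for instance because it is interrupted, the counter runs out well before 5 seconds of wall time have passed. Logging elapsed time next to the counter would make that case visible.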

regards, tom lane

