Re: [HACKERS] Move unused buffers to freelist

2013-05-23 Thread Amit Kapila
On Friday, May 24, 2013 2:47 AM Jim Nasby wrote:
> On 5/14/13 2:13 PM, Greg Smith wrote:
> > It is possible that we are told to put something in the freelist that
> > is already in it; don't screw up the list if so.
> >
> > I don't see where the code does anything to handle that though.  What
> was your intention here?
> 
> IIRC, the code that pulls from the freelist already deals with the
> possibility that a block was on the freelist but has since been put to
> use. 

You are right; the check exists in StrategyGetBuffer().

> If that's the case then there shouldn't be much penalty to adding
> a block multiple times (at least within reason...)

There is a check in StrategyFreeBuffer() that prevents putting a buffer
on the freelist multiple times; I just used the same check in the new
function.
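To illustrate why duplicate freelist entries are mostly harmless, here is a toy standalone sketch of the consumer-side pattern being discussed: entries that turned out to be in use are simply skipped, so a duplicate costs only one extra loop iteration. This is not the actual StrategyGetBuffer() code; the struct and names are invented for the illustration.

```c
/* Simplified stand-in for a buffer descriptor; the real code works on
 * shared BufferDesc entries protected by BufFreelistLock. */
typedef struct Buf
{
    int refcount;       /* pins held by backends */
    int usage_count;    /* clock-sweep usage counter */
    int next_free;      /* index of next freelist entry, -1 = end */
} Buf;

/* Pop entries off the freelist until a genuinely unused buffer is
 * found.  A buffer that was put back to use after being added to the
 * list (nonzero refcount or usage_count) is skipped, which is why
 * adding a buffer more than once does not corrupt anything. */
static int
get_free_buffer(Buf *bufs, int *freelist_head)
{
    while (*freelist_head >= 0)
    {
        int  id = *freelist_head;
        Buf *b = &bufs[id];

        *freelist_head = b->next_free;
        b->next_free = -1;

        if (b->refcount == 0 && b->usage_count == 0)
            return id;          /* usable victim buffer */
        /* otherwise it was reused since being freed; skip it */
    }
    return -1;                  /* freelist empty: caller falls back to clock sweep */
}
```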

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-05-23 Thread Amit Kapila
On Thursday, May 23, 2013 8:45 PM Robert Haas wrote:
> On Tue, May 21, 2013 at 3:06 AM, Amit Kapila 
> wrote:
> >> Here are the results.  The first field in each line is the number of
> >> clients. The second number is the scale factor.  The numbers after
> >> "master" and "patched" are the median of three runs.
> 
> >> but overall, on both the read-only and read-write tests, I'm not
> >> seeing anything that resembles the big gains you reported.
> >
> > I have not generated numbers for read-write tests; I will check that.
> > For read-only tests, the performance increase is minor and different
> > from what I saw.
> > A few points I could think of for the difference in data:
> >
> > 1. In my tests I always observed the best numbers when the number of
> > clients/threads equals the number of cores, which in your case should
> > be at 16.
> 
> Sure, but you also showed substantial performance increases across a
> variety of connection counts, whereas I'm seeing basically no change
> at any connection count.
> > 2. I think for scale factors 100 and 300 there should not be much
> > performance increase, as they should mostly get buffers from the
> > freelist regardless of whether the bgwriter adds to it or not.
> 
> I agree.
> 
> > 3. In my tests the variation is in shared buffers; database size is
> > always less than RAM (scale factor 1200, approx db size 16~17GB,
> > RAM 24GB), but varying shared buffers can lead to I/O.
> 
> Not sure I understand this.

What I wanted to say is that all your tests were run with the same
shared buffers configuration (8GB), whereas in my tests I was varying
shared buffers as well.
However this is not an important point, as the patch should show a
performance gain on the configuration you ran if it has any real
benefit.
 
> > 4. Each run is of 20 minutes, not sure if this has any difference.
> 
> I've found that 5-minute tests are normally adequate to identify
> performance changes on the pgbench SELECT-only workload.
> 
> >> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to
> >> the PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel
> >> 3.2.6.
> >
> > To think about the difference between your runs and mine, could you
> > please tell me about the points below:
> > 1. What is the RAM in the machine?
> 
> 64GB
> 
> > 2. Is the number of threads equal to the number of clients?
> 
> Yes.
> 
> > 3. Before starting tests I have always done pre-warming of buffers
> > (using pg_prewarm, written by you last year); is it the same for the
> > above read-only tests?
> 
> No, I did not use pg_prewarm.  But I don't think that should matter
> very much.  First, the data was all in the OS cache.  Second, on the
> small scale factors, everything should end up in cache pretty quickly
> anyway.  And on the large scale factors, well, you're going to be
> churning shared_buffers anyway, so pg_prewarm is only going to affect
> the very beginning of the test.
> 
> > 4. Can you please run once more only the test where you saw variation
> > (8 clients @ scale factor 1000 on master)? I have also seen that the
> > performance difference is very good for certain configurations
> > (scale factor, RAM, shared buffers).
> 
> I can do this if I get a chance, but I don't really see where that's
> going to get us.  It seems pretty clear to me that there's no benefit
> on these tests from this patch.  So either one of us is doing the
> benchmarking incorrectly, or there's some difference in our test
> environments that is significant, but none of the proposals you've
> made so far seem to me to explain the difference.

Sorry for asking you to run again without any concrete reason.
After reading the data you posted more carefully, I realized that my
earlier reading was just some machine problem or something else, and
that there is actually no gain.
After your post, I tried various configurations on different machines,
but so far I have not been able to see the performance gain shown in my
initial mail.
In fact I tried on the same machine as well; it sometimes gives good
numbers. I will update you if I find any concrete reason and results.

> > Apart from the above, I had one more observation during my
> > investigation of why there is a small dip in some cases:
> > 1. Many times, it finds that a buffer in the freelist is not usable,
> > meaning its refcount or usage count is not zero, due to which it has
> > to spend more time under BufFreelistLock.
> > I have not run any further experiments on this finding, e.g. whether
> > it really adds any overhead.
> >
> > Currently I am trying to find reasons for the small dip in
> > performance and see if I can do something to avoid it. I will also
> > run tests with various configurations.
> >
> > Any other suggestions?
> >
> > Any other suggestions?
> 
> Well, I think that the code in SyncOneBuffer is not really optimal.
> In some cases you actually lock and unlock the buffer header an extra
> time, which seems like a whole lotta extra overhead.  In fact, I don't
> think you should be modifying S

Re: [HACKERS] PostgreSQL 9.3 beta breaks some extensions "make install"

2013-05-23 Thread Christoph Berg
Re: Cédric Villemain 2013-05-17 <201305171642.59241.ced...@2ndquadrant.com>
> If it seems to be on the right way, I'll keep fixing EXTENSION building with 
> VPATH.

I haven't tried the patch, but let me just say that Debian (and
apt.postgresql.org) would very much like to see the VPATH situation
improved. At the moment we seem to have to invent a new build recipe
for every extension around.

Thanks for working on this.

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] getting rid of freezing

2013-05-23 Thread Hannu Krosing
On 05/23/2013 10:03 PM, Andres Freund wrote:
> On 2013-05-23 19:51:48 +0200, Andres Freund wrote:
>> We currently need to make sure we scanned the whole relation and have
>> frozen everything to have a sensible relfrozenxid for a relation.
>>
>> So, what I propose instead is basically:
>> 1) only vacuum non-all-visible pages, even when doing it for
>>anti-wraparound
>> 2) When we can set all-visible guarantee that all tuples on the page are
>>fully hinted. During recovery do the same, so we don't need to log
>>all hint bits.
>>We can do this with only an exclusive lock on the buffer, we don't
>>need a cleanup lock.
>> 3) When we cannot mark a page all-visible or we cannot get the cleanup
>>lock, remember the oldest xmin on that page. We could set all visible
>>in the former case, but we want the page to be cleaned up sometime
>>soonish.
>> 4) If we can get the cleanup lock, purge dead tuples from the page and
>>the indexes, just as today. Set the page as all-visible.
>>
>> That way we know that any page that is all-visible doesn't ever need to
>> look at xmin/xmax since we are sure to have set all relevant hint
>> bits.
> Heikki noticed that I made quite the omission here which is that you
> would need to mark tuples as all visible as well. I was thinking about
> using HEAP_MOVED_OFF | HEAP_MOVED_IN as a hint for that.
We could have a "vacuum_less=true" mode where, instead of marking tuples
all-visible here, you actually freeze them, that is, set the xid to
frozen. You get less forensic capability in exchange for less vacuuming.

Maybe also add an "early_freeze" hint bit to mark this situation.

Or maybe set the tuple's frozen xid when un-marking the page as
all-visible, to delay the effects a little?

Hannu
>
> Greetings,
>
> Andres Freund
>





Re: [HACKERS] Patch to .gitignore

2013-05-23 Thread Fabrízio de Royes Mello
On Fri, May 24, 2013 at 12:04 AM, Christopher Browne wrote:

> There hasn't been general agreement on the merits of particular .gitignore
> rules of this sort.
>
> You could hide your own favorite patterns by putting this into your
> ~/.gitignore that isn't part of the repo, configuring this globally, thus:
> git config --global core.excludesfile '~/.gitignore'
>
>
Yes... I know that...


> That has the consequence that you can hide whatever things your own tools
> like to create, and not worry about others' preferences.
>
> Us Emacs users can put things like *~, #*#, and such into our own "ignore"
> configuration; that doesn't need to bother you, and vice-versa for your
> vim-oriented patterns.
>

I agree with you about vim-oriented patterns, because vim is a
particular tool, but "ctags" and "etags" are part of the postgres source
tree and generate some output inside it, so I think we must ignore that
output.

IMHO, all output generated by tools inside the source tree that will not
be committed must be added to .gitignore.

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] Patch to .gitignore

2013-05-23 Thread Christopher Browne
There hasn't been general agreement on the merits of particular .gitignore
rules of this sort.

You could hide your own favorite patterns by putting this into your
~/.gitignore that isn't part of the repo, configuring this globally, thus:
git config --global core.excludesfile '~/.gitignore'

That has the consequence that you can hide whatever things your own tools
like to create, and not worry about others' preferences.

Us Emacs users can put things like *~, #*#, and such into our own "ignore"
configuration; that doesn't need to bother you, and vice-versa for your
vim-oriented patterns.


[HACKERS] Patch to .gitignore

2013-05-23 Thread Fabrízio de Royes Mello
Hi all,

The proposed patch adds some files to ignore in .gitignore:
- tags (produced by src/tools/make_ctags)
- TAGS (produced by src/tools/make_etags)
- .*.swp (may appear in the source tree if vi/vim was killed for some reason)
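Judging from that description (the attachment itself isn't visible here), the .gitignore additions would presumably look something like this — reconstructed from the list above, not taken from the actual patch:

```
# generated by src/tools/make_ctags and src/tools/make_etags
tags
TAGS
# vim swap files left behind if the editor is killed
.*.swp
```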

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


gitignore_ctags_etags_vi.patch
Description: Binary data



Re: [HACKERS] getting rid of freezing

2013-05-23 Thread Robert Haas
On Thu, May 23, 2013 at 1:51 PM, Andres Freund  wrote:
> So, what I propose instead is basically:
> 1) only vacuum non-all-visible pages, even when doing it for
>anti-wraparound

Check.  We might want an option to force a scan of the whole relation.

> 2) When we can set all-visible guarantee that all tuples on the page are
>fully hinted. During recovery do the same, so we don't need to log
>all hint bits.
>We can do this with only an exclusive lock on the buffer, we don't
>need a cleanup lock.

I don't think this works.  Emitting XLOG_HEAP_VISIBLE for a heap page
does not emit an FPI for the heap page, only (if needed) for the
visibility map page.  So a subsequent crash that tears the page could
keep XLOG_HEAP_VISIBLE but lose other changes on the page - i.e. the
hint bits.

> 3) When we cannot mark a page all-visible or we cannot get the cleanup
>lock, remember the oldest xmin on that page. We could set all visible
>in the former case, but we want the page to be cleaned up sometime
>soonish.

I think you mean "in the latter case" not "in the former case".  If
not, then I'm confused.

> 4) If we can get the cleanup lock, purge dead tuples from the page and
>the indexes, just as today. Set the page as all-visible.
>
> That way we know that any page that is all-visible doesn't ever need to
> look at xmin/xmax since we are sure to have set all relevant hint
> bits.
>
> We don't even necessarily need to log the hint bits for all items since
> the redo for all_visible could make sure all items are hinted. The only
> problem is knowing up to where we can truncate pg_clog...

The redo for all_visible cannot make sure all items are hinted.
Again, there's no FPI on the heap page.  The heap page could in fact
contain dead tuples at the time we mark it all-visible.  Consider, for
example:

0. Checkpoint.
1. The buffer becomes all visible.
2. A tuple is inserted, making the buffer not-all-visible.
3. The page is written by the OS.
4. Crash.

Now, recovery will first find the record marking the buffer
all-visible, and will mark it all-visible.  Now the all-visible bit on
the page is flat-out wrong, but it doesn't matter because we haven't
reached consistency.  Next we'll find the heap-insert record, which
will have an FPI, since it's the first WAL-logged change to the buffer
since the last checkpoint.  Now the FPI fixes everything and we're
back in a sane state.

Now in this particular case it wouldn't hurt anything if the redo
routine that set the all-visible bit also hinted all the tuples,
because the FPI is going to overwrite it anyway.  But suppose in lieu
of steps (3) and (4) we write half of the page and then crash, leaving
behind a torn page.  Now it's pretty crazy to think about trying to
hint tuples; the page may be in a completely insane state.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Block write statistics WIP

2013-05-23 Thread Heikki Linnakangas

On 23.05.2013 19:10, Greg Smith wrote:

On 5/20/13 7:51 AM, Heikki Linnakangas wrote:

The way that MarkDirty requires this specific underlying storage
manager behavior to work properly strikes me as a bit of a
layering violation too. I'd like the read and write paths to have
a similar API, but here they don't even operate on the same type
of inputs. Addressing that is probably harder than just throwing
a hack on the existing code though.


To be honest, I don't understand what you mean by that.


Let's say you were designing a storage layer API from scratch for
what Postgres does. That might take a relation as its input, like
ReadBuffer does. Hiding the details of how that turns into a physical
file operation would be a useful goal of such a layer. I'd then
consider it a problem if that exposed things like the actual mapping
of relations into files to callers.


Ok, got it.


What we actually have right now is this MarkDirty function that
operates on BufferTag data, which points directly to the underlying
file. I see that as cutting the storage API in half and calling a
function in the middle of the implementation.


Well, no, the BufferTag struct is internal to the buffer manager 
implementation. It's not part of the API; it's an implementation detail 
of the buffer manager.



It strikes me as kind of weird that the read side and write side are
not more symmetrical.


It might seem weird if you expect the API to be similar to POSIX read() 
and write(), where you can read() an arbitrary block at any location, 
and write() an arbitrary block at any location. A better comparison 
would be e.g open() and close(). open() returns a file descriptor, which 
you pass to other functions to operate on the file. When you're done, 
you call close(fd). The file descriptor is completely opaque to the user 
program, you do all the operations through the functions that take the 
fd as argument. Similarly, ReadBuffer() returns a Buffer, which is 
completely opaque to the caller, and you do all the operations through 
various functions and macros that operate on the Buffer. When you're 
done, you release the buffer with ReleaseBuffer().
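The open()/close() analogy can be made concrete with plain POSIX calls: the fd is an opaque handle, and every operation goes through functions taking it, just as a Buffer is only touched through the buffer-manager functions. The scratch path and helper name below are invented for the demo.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write and read back a short string purely through the opaque file
 * descriptor, the way a backend only touches a page through its opaque
 * Buffer.  Returns 0 on success, -1 on failure. */
static int
handle_roundtrip(const char *path)
{
    /* open() hands back an opaque handle: the "ReadBuffer()" step */
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    char buf[6] = {0};
    int ok = write(fd, "hello", 5) == 5          /* all operations go      */
          && lseek(fd, 0, SEEK_SET) == 0         /* through the handle;    */
          && read(fd, buf, 5) == 5               /* the caller never sees  */
          && strcmp(buf, "hello") == 0;          /* the kernel file table  */

    close(fd);                                   /* the "ReleaseBuffer()" step */
    unlink(path);
    return ok ? 0 : -1;
}
```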


(sorry for the digression from the original topic; I don't have any 
problem with adding an optional Relation argument to MarkBuffer if you 
need that :-) )


- Heikki




Re: [HACKERS] Cost limited statements RFC

2013-05-23 Thread Greg Smith

On 5/23/13 7:56 PM, Claudio Freire wrote:

Besides the obvious option of making the check lighter (it doesn't have
to be 100% precise), wouldn't this check be done when it would otherwise
sleep? Is it still that heavy in that context?


A commit to typical 7200RPM disk is about 10ms, while 
autovacuum_vacuum_cost_delay is 20ms.  If the statement cost limit logic 
were no more complicated than commit_delay, it would be feasible to do 
something similar each time a statement was being put to sleep.
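The 10 ms figure can be sanity-checked with simple arithmetic (my arithmetic, not from the thread): on a dedicated WAL disk a synchronous commit costs on the order of one platter rotation, and a 7200 RPM drive completes one rotation in 60,000 / 7200 ~= 8.3 ms, comfortably under the 20 ms default autovacuum_vacuum_cost_delay.

```c
/* Milliseconds per full platter rotation at a given spindle speed.
 * rotation_ms(7200.0) ~= 8.33, the rough lower bound for one
 * synchronous WAL flush on such a disk. */
static double
rotation_ms(double rpm)
{
    return 60.0 * 1000.0 / rpm;
}
```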


I suspect that the cheapest useful thing will be more expensive than 
commit_delay's test.  That's a guess though.  I'll have to think about 
this more when I circle back toward usability.  Thanks for the 
implementation idea.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-05-23 Thread Claudio Freire
On Thu, May 23, 2013 at 8:46 PM, Greg Smith  wrote:
> On 5/23/13 7:34 PM, Claudio Freire wrote:
>>
>> Why not make the delay conditional on the amount of concurrency, kinda
>> like the commit_delay? Although in this case, it should only count
>> unwaiting connections.
>
>
> The test run by commit_delay is way too heavy to run after every block is
> processed.  That code is only hit when there's a commit, which already
> assumes a lot of overhead is going on--the disk flush to WAL--so burning
> some processing/lock acquisition time isn't a big deal.  The spot where
> statement delay is going is so performance sensitive that everything it
> touches needs to be local to the backend.

Besides the obvious option of making the check lighter (it doesn't have
to be 100% precise), wouldn't this check be done when it would otherwise
sleep? Is it still that heavy in that context?




Re: [HACKERS] Cost limited statements RFC

2013-05-23 Thread Greg Smith

On 5/23/13 7:34 PM, Claudio Freire wrote:

Why not make the delay conditional on the amount of concurrency, kinda
like the commit_delay? Although in this case, it should only count
unwaiting connections.


The test run by commit_delay is way too heavy to run after every block 
is processed.  That code is only hit when there's a commit, which 
already assumes a lot of overhead is going on--the disk flush to WAL--so 
burning some processing/lock acquisition time isn't a big deal.  The 
spot where statement delay is going is so performance sensitive that 
everything it touches needs to be local to the backend.


For finding cost delayed statements that are causing trouble because 
they are holding locks, the only place I've thought of that runs 
infrequently and is poking at the right data is the deadlock detector. 
Turning that into a more general mechanism for finding priority 
inversion issues is an interesting idea.  It's a bit down the road from 
what I'm staring at now though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-05-23 Thread Claudio Freire
On Thu, May 23, 2013 at 8:27 PM, Greg Smith  wrote:
> The main unintended consequences issue I've found so far is when a cost
> delayed statement holds a heavy lock.  Autovacuum has some protection
> against letting processes with an exclusive lock on a table go to sleep.  It
> won't be easy to do that with arbitrary statements.  There's a certain
> amount of allowing the user to shoot themselves in the foot here that will
> be time consuming (if not impossible) to eliminate.  The person who runs an
> exclusive CLUSTER that's limited by statement_cost_delay may suffer from
> holding the lock too long.  But that might be their intention with setting
> the value.  Hard to idiot proof this without eliminating useful options too.


Why not make the delay conditional on the amount of concurrency, kinda
like the commit_delay? Although in this case, it should only count
unwaiting connections.

That way, if there's a "delay deadlock", the delay gets out of the way.




[HACKERS] Cost limited statements RFC

2013-05-23 Thread Greg Smith
I'm working on a new project here that I wanted to announce, just to 
keep from duplicating effort in this area.  I've started to add a cost 
limit delay for regular statements.  The idea is that you set a new 
statement_cost_delay setting before running something, and it will 
restrict total resources the same way autovacuum does.  I'll be happy 
with it when it's good enough to throttle I/O on SELECT and CREATE INDEX 
CONCURRENTLY.


Modifying the buffer manager to account for statement-based cost 
accumulation isn't difficult.  The tricky part here is finding the right 
spot to put the delay at.  In the vacuum case, it's easy to insert a 
call to check for a delay after every block of I/O.  It should be 
possible to find a single or small number of spots to put a delay check 
in the executor.  But I expect that every utility command may need to be 
modified individually to find a useful delay point.  This is starting to 
remind me of the SEPostgres refactoring, because all of the per-command 
uniqueness ends up requiring a lot of work to modify in a unified way.
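For readers unfamiliar with the mechanism being borrowed, the autovacuum-style accounting can be sketched roughly like this. It is a standalone toy: CostState, charge_block, and the numbers are invented for the illustration, and the hypothetical statement_cost_delay setting would control the sleep length.

```c
/* Illustrative cost accounting in the style of autovacuum's
 * vacuum_delay_point(): each block access adds to a balance, and once
 * the balance reaches the limit the backend naps and resets it. */
typedef struct CostState
{
    int cost_balance;   /* accumulated cost since the last sleep */
    int cost_limit;     /* sleep once the balance reaches this */
    int sleeps;         /* naps taken so far (tracked for the demo) */
} CostState;

static void
charge_block(CostState *cs, int block_cost)
{
    cs->cost_balance += block_cost;
    if (cs->cost_balance >= cs->cost_limit)
    {
        /* real code would pg_usleep(statement_cost_delay * 1000) here */
        cs->sleeps++;
        cs->cost_balance = 0;
    }
}
```

The bookkeeping itself is the easy part; as the paragraph above says, the hard part is deciding where calls like charge_block() can safely live in the executor and in each utility command.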


The main unintended consequences issue I've found so far is when a cost 
delayed statement holds a heavy lock.  Autovacuum has some protection 
against letting processes with an exclusive lock on a table go to sleep. 
 It won't be easy to do that with arbitrary statements.  There's a 
certain amount of allowing the user to shoot themselves in the foot here 
that will be time consuming (if not impossible) to eliminate.  The 
person who runs an exclusive CLUSTER that's limited by 
statement_cost_delay may suffer from holding the lock too long.  But 
that might be their intention with setting the value.  Hard to idiot 
proof this without eliminating useful options too.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] Add tests for LOCK TABLE

2013-05-23 Thread Robins Tharakan
Hi,

Please find attached a patch to take code-coverage of LOCK TABLE (
src/backend/commands/lockcmds.c) from 57% to 84%.

Any and all feedback is welcome.
--
Robins Tharakan


regress_lock.patch
Description: Binary data



Re: [HACKERS] Block write statistics WIP

2013-05-23 Thread Greg Smith

On 5/20/13 7:51 AM, Heikki Linnakangas wrote:

The way that MarkDirty requires this specific underlying storage manager 
behavior to work properly strikes me as a bit of a layering violation 
too. I'd like the read and write paths to have a similar API, but here 
they don't even operate on the same type of inputs. Addressing that is 
probably harder than just throwing a hack on the existing code though.


To be honest, I don't understand what you mean by that.


Let's say you were designing a storage layer API from scratch for what 
Postgres does.  That might take a relation as its input, like ReadBuffer 
does.  Hiding the details of how that turns into a physical file 
operation would be a useful goal of such a layer.  I'd then consider it 
a problem if that exposed things like the actual mapping of relations 
into files to callers.


What we actually have right now is this MarkDirty function that operates 
on BufferTag data, which points directly to the underlying file.  I see 
that as cutting the storage API in half and calling a function in the 
middle of the implementation.  It strikes me as kind of weird that the 
read side and write side are not more symmetrical.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] WARNING : pgstat wait timeout - Postgres 9.1

2013-05-23 Thread Michael Paquier
On Thu, May 23, 2013 at 9:31 PM, Mathieu Guerin
wrote:

> What are the consequences? Because this file will be removed if the
> server reboots.
>
Those temporary statistics are stored in the global directory when the
server shuts down, so the risk here would be losing a portion of this
data in the case of a crash, either at the PG or the OS level.


> If we change the parameter stats_temp_directory, is it necessary to
> restart the server?
>
No, sending SIGHUP to the server is enough.


> When I launch a SHOW ALL; command, the parameter stats_temp_directory
> is not there.
>
You should get it...
 stats_temp_directory | pg_stat_tmp | Writes temporary statistics files to the specified directory.
-- 
Michael


[HACKERS] Add more regression tests for ALTER OPERATOR FAMILY.. ADD / DROP

2013-05-23 Thread Robins
Hi,

Please find attached a patch to take code-coverage of ALTER OPERATOR
FAMILY.. ADD / DROP (src/backend/commands/opclasscmds.c) from 50% to 87%.

Any and all feedback is welcome.
--
Robins Tharakan


regress_opclass.patch
Description: Binary data



Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-23 Thread Sergey Konoplev
On Thu, May 23, 2013 at 6:18 AM, German Becker  wrote:
> Let me describe the process I follow to get to this. What I am doing is
> testing a migration from 8.3 to 9.1. The way I plan to do it is the
> following:
> 1) Create the schema
> 2) Import the biggest tables, which are not updated, only growing, with
> COPY (this is about 35GB of data)
> 3) Import the small, changing part of the data
>
>
> The target system is 9.1 with streaming replication.
> For steps 1 and 2, I set a "restore" configuration that, among other
> things like more work_mem, sets archive_mode=off and wal_level=minimal
> (attached the difference between restore and normal).
> The archive_command is just a cp wrapped in a shell script in case I
> need to change it.

You cannot migrate between major versions with WAL-based or streaming
replication.

Use either a full dump/restore, or a schema-only dump/restore plus
trigger-based replication (Londiste, Slony), to migrate the data.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray...@gmail.com




Re: [HACKERS] Move unused buffers to freelist

2013-05-23 Thread Jim Nasby

On 5/14/13 2:13 PM, Greg Smith wrote:

It is possible that we are told to put something in the freelist that
is already in it; don't screw up the list if so.

I don't see where the code does anything to handle that though.  What was your 
intention here?


IIRC, the code that pulls from the freelist already deals with the possibility 
that a block was on the freelist but has since been put to use. If that's the 
case then there shouldn't be much penalty to adding a block multiple times (at 
least within reason...)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-23 Thread David Powers
Thanks for the response.

I have some evidence against an issue in the backup procedure (though I'm
not ruling it out).  We moved back to taking the backup off of the primary
and all errors for all three clusters went away.  All of the hardware is
the same, OS and postgres versions are largely the same (9.2.3 vs. 9.2.4 in
some cases, various patch levels of Cent 6.3 for the OS).  The backup code
is exactly the same, just pointed at a different set of boxes.

Currently I'm just running for a couple of days to ensure that we have
viable static backups.  After that I'll redo one of the restores from a
suspected backup and will post the logs.

-David


On Thu, May 23, 2013 at 11:26 AM, Robert Haas  wrote:

> On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
>  wrote:
> > We are seeing these errors on a regular basis on the testing box now.
> > We have even changed the backup script to shut down the hot standby,
> > take an lvm snapshot, restart the hot standby, and rsync the lvm
> > snapshot.  It still happens.
> >
> > We have never seen this before we introduced the hot standby.  So we
> > will now revert to taking the backups from lvm snapshots on the
> > production database.  If you have ideas of what else we should try /
> > what information we can give you to debug this, let us know and we
> > will try to do so.
> >
> > Until then we will sadly operate on the assumption that the
> > combination of hot standby and "frozen snapshot" backup of it is not
> > production ready.
>
> I'm pretty suspicious that your backup procedure is messed up in some
> way.  The fact that you got invalid page headers is really difficult
> to attribute to a PostgreSQL bug.  A number of the other messages that
> you have posted also tend to indicate either corruption, or that WAL
> replay has stopped early.  It would be interesting to see the logs
> from when the clone was first started up, juxtaposed against the later
> WAL flush error messages.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] getting rid of freezing

2013-05-23 Thread Andres Freund
On 2013-05-23 19:51:48 +0200, Andres Freund wrote:
> We currently need to make sure we scanned the whole relation and have
> frozen everything to have a sensible relfrozenxid for a relation.
> 
> So, what I propose instead is basically:
> 1) only vacuum non-all-visible pages, even when doing it for
>anti-wraparound
> 2) When we can set all-visible guarantee that all tuples on the page are
>fully hinted. During recovery do the same, so we don't need to log
>all hint bits.
>We can do this with only an exclusive lock on the buffer, we don't
>need a cleanup lock.
> 3) When we cannot mark a page all-visible or we cannot get the cleanup
>lock, remember the oldest xmin on that page. We could set all visible
>in the former case, but we want the page to be cleaned up sometime
>soonish.
> 4) If we can get the cleanup lock, purge dead tuples from the page and
>the indexes, just as today. Set the page as all-visible.
> 
> That way we know that any page that is all-visible doesn't ever need to
> look at xmin/xmax since we are sure to have set all relevant hint
> bits.

Heikki noticed that I made quite an omission here, which is that you
would need to mark tuples as all visible as well. I was thinking about
using HEAP_MOVED_OFF | HEAP_MOVED_IN as a hint for that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WARNING : pgstat wait timeout - stats_temp_directory - postgres 9.1

2013-05-23 Thread Alvaro Herrera
Mathieu Guerin escribió:
> Hello,
> 
> I am facing a problem with pgstat, as my subject says. I know some topics
> are open about that, but I would like to go deeper.
> 
> Someone said that the best way to avoid this message is to configure
> pgstat.stat to be loaded into RAM with a tmpfs mount point.
> 
> What are the consequences? This file will be removed if the server
> reboots.

There are two separate files, one is the temp file which is used while
the server is running and is written very frequently.  You put that one
on volatile storage (stats_temp_directory) and immediately see a
performance benefit.

The other one is the permanent file, which is written only once when the
system is shutting down.  This is not put in stats_temp_directory, so
it's safe.

In case of a crash (the server didn't have the chance to write the
permanent file), stats would be reset anyway at restart, so there's no
conceptual problem with the permanent file not being written.

> If we change the parameter stats_temp_directory, is it necessary to
> restart the server?

No, a reload (pg_ctl reload) is sufficient.
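For the record, a minimal setup along these lines might look like the following (the mount point and tmpfs size here are illustrative, not recommendations):

```
# /etc/fstab — a small RAM-backed filesystem for the temp stats file
tmpfs  /var/lib/pgsql/pg_stat_tmp  tmpfs  size=64M,uid=postgres,gid=postgres,mode=0700  0 0

# postgresql.conf
stats_temp_directory = '/var/lib/pgsql/pg_stat_tmp'
```

After editing postgresql.conf, a pg_ctl reload picks up the change without a restart.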

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] gemulon.postgresql.org/gitmaster.postgresql.org

2013-05-23 Thread Stefan Kaltenbrunner

On 05/23/2013 12:20 PM, Stefan Kaltenbrunner wrote:
> Hi All!
> 
> 
> We will be upgrading gemulon.postgresql.org during the next few
> hours to the current release of debian (wheezy/7.0) as discussed
> with various people. To prevent any kind of issues we will be
> locking out committers for a brief amount of time so don't be
> surprised if you get an error message.

all done - happy committing



Stefan




Re: [HACKERS] getting rid of freezing

2013-05-23 Thread Andres Freund
On 2013-05-23 19:51:48 +0200, Andres Freund wrote:
> I think that the existence of hint bits and the crash safe visibility
> maps should provide sufficient tooling to make freezing unnecessary
> without losing much information for debugging if we modify the way
> vacuum works a bit.

> That way we know that any page that is all-visible doesn't ever need to
> look at xmin/xmax since we are sure to have set all relevant hint
> bits.

One case that would make this problematic is row level locks on
tuples. We would need to unset all visible for them, otherwise we might
do the wrong thing when looking at xmax...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Pavan Deolasee
On Thu, May 23, 2013 at 11:10 PM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:

> On 23.05.2013 07:55, Robert Haas wrote:
>
>> On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas
>>   wrote:
>>
>>> 1. Scan the WAL log of the old cluster, starting from the point where
>>> the new cluster's timeline history forked off from the old cluster. For
>>> each
>>> WAL record, make a note of the data blocks that are touched. This yields
>>> a
>>> list of all the data blocks that were changed in the old cluster, after
>>> the
>>> new cluster forked off.
>>>
>>
>> Suppose that a transaction is open and has written tuples at the point
>> where WAL forks.  After WAL forks, the transaction commits.  Then, it
>> hints some of the tuples that it wrote.  There is no record in WAL
>> that those blocks are changed, but failing to revert them leads to
>> data corruption.
>>
>
> Bummer, you're right. Hmm, if you have checksums enabled, however, we'll
> WAL log a full-page every time a page is dirtied for setting a hint bit,
> which fixes the problem. So, there's a caveat with pg_rewind; you must have
> checksums enabled.
>
>
I was quite impressed with the idea, but hint bits are indeed a problem. I
realised the same issue also applies to the other idea that Fujii-san and
others have suggested, about waiting for dirty buffers to be written until
the WAL is received at the standby. But since that idea would anyway need
to be implemented in the core, we could teach SetHintBits() to return false
unless the corresponding commit WAL records are written to the standby
first.


Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


[HACKERS] getting rid of freezing

2013-05-23 Thread Andres Freund
Hi,

after having discussed $subject shortly over dinner yesterday, while I
should have been preparing the slides for my talk I noticed that there
might be a rather easy way to get rid of freezing.

I think that the existence of hint bits and the crash safe visibility
maps should provide sufficient tooling to make freezing unnecessary
without losing much information for debugging if we modify the way
vacuum works a bit.

Currently, aside from recovery, we only set all visible in vacuum.

vacuumlazy.c's lazy_scan_heap currently works like:

for (blkno = 0; blkno < nblocks; blkno++)
{
    if (!scan_all && invisible)
        continue;

    /* cannot lock buffer immediately */
    if (!ConditionalLockBufferForCleanup(buf))
    {
        if (!scan_all)
            continue;

        /* don't block if we don't need freezing */
        if (!lazy_check_needs_freeze(buf))
            continue;

        /* now wait for cleanup lock */
        LockBufferForCleanup(buf);
    }

    for (tuple in all_tuples)
    {
        cleanup_tuple();
    }

    if (nfrozen > 0)
        log_heap_freeze();

    if (all_visible)
    {
        PageSetAllVisible(page);
        visibilitymap_set(page);
    }
}

In other words, if we don't need to make sure there aren't any old
tuples, we only scan visible parts of the relation. If we are making a
freeze vacuum we scan the whole relation, waiting for a cleanup lock on
the relation if necessary.

We currently need to make sure we scanned the whole relation and have
frozen everything to have a sensible relfrozenxid for a relation.

So, what I propose instead is basically:
1) only vacuum non-all-visible pages, even when doing it for
   anti-wraparound
2) When we can set all-visible guarantee that all tuples on the page are
   fully hinted. During recovery do the same, so we don't need to log
   all hint bits.
   We can do this with only an exclusive lock on the buffer, we don't
   need a cleanup lock.
3) When we cannot mark a page all-visible or we cannot get the cleanup
   lock, remember the oldest xmin on that page. We could set all visible
   in the former case, but we want the page to be cleaned up sometime
   soonish.
4) If we can get the cleanup lock, purge dead tuples from the page and
   the indexes, just as today. Set the page as all-visible.

That way we know that any page that is all-visible doesn't ever need to
look at xmin/xmax since we are sure to have set all relevant hint
bits.

We don't even necessarily need to log the hint bits for all items since
the redo for all_visible could make sure all items are hinted. The only
problem is knowing up to where we can truncate pg_clog...
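In the same pseudocode style as above, my reading of steps 1-4 would turn the loop into roughly the following (helper names like remember_oldest_xmin() and hint_all_tuples() are placeholders, not existing functions):

```
for (blkno = 0; blkno < nblocks; blkno++)
{
    /* 1) skip all-visible pages, even for anti-wraparound vacuum */
    if (all_visible)
        continue;

    if (!ConditionalLockBufferForCleanup(buf))
    {
        /* 3) no cleanup lock: remember the oldest xmin, retry later */
        remember_oldest_xmin(page);
        continue;
    }

    /* 4) cleanup lock held: prune dead tuples and index entries */
    for (tuple in all_tuples)
        cleanup_tuple();

    if (all_visible)
    {
        /* 2) fully hint every tuple before setting all-visible */
        hint_all_tuples(page);
        PageSetAllVisible(page);
        visibilitymap_set(page);
    }
}
```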

Makes sense?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Heikki Linnakangas

On 23.05.2013 08:03, Simon Riggs wrote:

On 23 May 2013 12:10, Heikki Linnakangas  wrote:


Please take a look: https://github.com/vmware/pg_rewind


The COPYRIGHT file shows that VMware is claiming copyright on unstated
parts of the code for this. As such, its not a normal submission to
the PostgreSQL project, which involves placing copyright with the
PGDG.


We have a lot of code in PostgreSQL source tree with different copyright 
notices, and there's no problem with that as long as the code is licensed 
under the PostgreSQL license. For patches that add or modify code in 
PostgreSQL, we generally have copyright notices with just PGDG, to avoid 
having a long list of copyright notices of a lot of companies and 
individuals on every file. I'm no lawyer, but I believe there's no 
difference from the legal point of view.



As a result, while it sounds interesting, people should be aware of
that and I suggest we shouldn't discuss that code on this list, to
avoid any disputes should we decide to include a similar facility in
core Postgres in the future.


That's just paranoia. There are a lot of tools out there on pgfoundry, 
with various copyright holders and even difference licenses, and it's 
fine to talk about all those on this list. Besides, the code is licensed 
under the PostgreSQL license, so if someone decides we should have this 
e.g in contrib, you can just grab the sources and commit. Thirdly, 
there's no reason to refrain from even discussing this, even if someone 
would include a similar facility in core Postgres - this is about 
copyrights, not patents (and yes, this contribution has been cleared by 
VMware legal department; VMware doesn't hold any patents on this).


- Heikki




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Heikki Linnakangas

On 23.05.2013 07:55, Robert Haas wrote:

On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas
  wrote:

1. Scan the WAL log of the old cluster, starting from the point where
the new cluster's timeline history forked off from the old cluster. For each
WAL record, make a note of the data blocks that are touched. This yields a
list of all the data blocks that were changed in the old cluster, after the
new cluster forked off.


Suppose that a transaction is open and has written tuples at the point
where WAL forks.  After WAL forks, the transaction commits.  Then, it
hints some of the tuples that it wrote.  There is no record in WAL
that those blocks are changed, but failing to revert them leads to
data corruption.


Bummer, you're right. Hmm, if you have checksums enabled, however, we'll 
WAL log a full-page every time a page is dirtied for setting a hint bit, 
which fixes the problem. So, there's a caveat with pg_rewind; you must 
have checksums enabled.


- Heikki




[HACKERS] gemulon.postgresql.org/gitmaster.postgresql.org

2013-05-23 Thread Stefan Kaltenbrunner

Hi All!


We will be upgrading gemulon.postgresql.org during the next few hours
to the current release of debian (wheezy/7.0) as discussed with
various people.
To prevent any kind of issues we will be locking out committers for a
brief amount of time so don't be surprised if you get an error message.


Stefan




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Robert Haas
On Thu, May 23, 2013 at 11:34 AM, Fujii Masao  wrote:
> On Thu, May 23, 2013 at 8:55 PM, Robert Haas  wrote:
>> On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas
>>  wrote:
>>> 1. Scan the WAL log of the old cluster, starting from the point where
>>> the new cluster's timeline history forked off from the old cluster. For each
>>> WAL record, make a note of the data blocks that are touched. This yields a
>>> list of all the data blocks that were changed in the old cluster, after the
>>> new cluster forked off.
>>
>> Suppose that a transaction is open and has written tuples at the point
>> where WAL forks.  After WAL forks, the transaction commits.  Then, it
>> hints some of the tuples that it wrote.  There is no record in WAL
>> that those blocks are changed, but failing to revert them leads to
>> data corruption.
>
> Yes, in the asynchronous replication case. But in the synchronous
> replication case, after WAL forks, hint bits would not be set if their
> corresponding commit record is not replicated to the standby. The
> transaction commit keeps waiting for the reply from the standby before
> updating clog. So, this data corruption would not happen in the sync case.

Not necessarily.  SyncRepWaitForLSN() can be interrupted via a cancel signal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
> Right, the actual signal handler will only kill the query immediately
> if the backend is in a safe state (eg, while it's waiting for a
> heavyweight lock).  Otherwise it just sets a flag that's checked by
> CHECK_FOR_INTERRUPTS.  See StatementCancelHandler in postgres.c.

Roger that, I will definitely have a look and ask if I have any questions.

Thanks a ton!

Regards,

Atri

--
Regards,

Atri
l'apprenant




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Tom Lane
Atri Sharma  writes:
> Right. I believe this is part of the standard way in which we handle
> interrupts, right? Making sure that we cancel a query when the backend
> is in a state to do so, not when the interrupt actually comes in, right?

Right, the actual signal handler will only kill the query immediately
if the backend is in a safe state (eg, while it's waiting for a
heavyweight lock).  Otherwise it just sets a flag that's checked by
CHECK_FOR_INTERRUPTS.  See StatementCancelHandler in postgres.c.
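As an illustration only, the shape of that mechanism — a handler that merely sets a flag, plus a polling point that acts on it — can be sketched like this (a hypothetical miniature; the names and signatures are not the real PostgreSQL ones):

```c
#include <signal.h>

/* A toy version of the flag-then-poll pattern: the handler (cf.
 * StatementCancelHandler) does no real work, it only records that a
 * cancel was requested; long-running code polls the flag at safe
 * points (cf. CHECK_FOR_INTERRUPTS). */
static volatile sig_atomic_t QueryCancelPending = 0;

static void cancel_handler(int signo)
{
    (void) signo;
    QueryCancelPending = 1;     /* just note the request */
}

/* Returns 1 if the "query" was cancelled, 0 if it ran to completion. */
static int run_long_query(long iterations)
{
    for (long i = 0; i < iterations; i++)
    {
        /* the safe point: only here do we act on a pending cancel */
        if (QueryCancelPending)
        {
            QueryCancelPending = 0;
            return 1;
        }
        if (i == 10)
            raise(SIGINT);      /* simulate a cancel arriving mid-query */
    }
    return 0;
}
```

Installed with signal(SIGINT, cancel_handler), a call to run_long_query() returns 1 shortly after the simulated cancel — whereas a long-running loop with no safe point would never notice it, which is exactly the failure mode of loops that lack a CHECK_FOR_INTERRUPTS call.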

regards, tom lane




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Merlin Moncure
On Thu, May 23, 2013 at 10:43 AM, Atri Sharma  wrote:
>>
>> A little bit --- the timeout won't actually kill the query until the
>> next time control reaches a CHECK_FOR_INTERRUPTS macro that's not inside
>> a critical section.  We've had issues in the past with particular code
>> paths that failed to include such a check in a long-running loop, and
>> there might still be some left.  But by and large, it won't take very
>> long for the query to notice the interrupt.
>
>
> Right. I believe this is part of the standard way in which we handle
> interrupts, right? Making sure that we cancel a query when the backend
> is in a state to do so, not when the interrupt actually comes in, right?

yes. all non trivial parts of the code (in terms of time) must run the
interrupt check.  it basically just looks at the signal flag set by the
signal handler.

merlin




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
>
> A little bit --- the timeout won't actually kill the query until the
> next time control reaches a CHECK_FOR_INTERRUPTS macro that's not inside
> a critical section.  We've had issues in the past with particular code
> paths that failed to include such a check in a long-running loop, and
> there might still be some left.  But by and large, it won't take very
> long for the query to notice the interrupt.


Right. I believe this is part of the standard way in which we handle
interrupts, right? Making sure that we cancel a query when the backend
is in a state to do so, not when the interrupt actually comes in, right?

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Tom Lane
Atri Sharma  writes:
I am not sure, but does statement_timeout depend on *what* the query
is doing internally (i.e. if it is holding lots of locks, pins, etc.)?

A little bit --- the timeout won't actually kill the query until the
next time control reaches a CHECK_FOR_INTERRUPTS macro that's not inside
a critical section.  We've had issues in the past with particular code
paths that failed to include such a check in a long-running loop, and
there might still be some left.  But by and large, it won't take very
long for the query to notice the interrupt.

regards, tom lane




Re: [HACKERS] Small typo in syncrep.h

2013-05-23 Thread Robert Haas
On Thu, May 23, 2013 at 4:44 AM, Pavan Deolasee
 wrote:
> While reading code, I noticed a small typo in syncrep.h. It says
> SyncRepUpdateSyncStandbysDefined() is called by wal writer whereas its
> called by the checkpointer process, at least in the HEAD. Previously, it was
> being called by the bgwriter process.
>
> Attached patches fix this in the relevant releases.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Fujii Masao
On Thu, May 23, 2013 at 8:55 PM, Robert Haas  wrote:
> On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas
>  wrote:
>> 1. Scan the WAL log of the old cluster, starting from the point where
>> the new cluster's timeline history forked off from the old cluster. For each
>> WAL record, make a note of the data blocks that are touched. This yields a
>> list of all the data blocks that were changed in the old cluster, after the
>> new cluster forked off.
>
> Suppose that a transaction is open and has written tuples at the point
> where WAL forks.  After WAL forks, the transaction commits.  Then, it
> hints some of the tuples that it wrote.  There is no record in WAL
> that those blocks are changed, but failing to revert them leads to
> data corruption.

Yes, in the asynchronous replication case. But in the synchronous
replication case, after WAL forks, hint bits would not be set if their
corresponding commit record is not replicated to the standby. The
transaction commit keeps waiting for the reply from the standby before
updating clog. So, this data corruption would not happen in the sync case.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
On Thu, May 23, 2013 at 8:52 PM, Amit Langote  wrote:
>>
>> If you let an uncooperative user issue arbitrary SQL queries, he can
>> do any number of things to put server performance into the tank.
>> For instance, take out exclusive locks on all your tables and just
>> go to sleep (although I think this is limited by table permissions in
>> recent PG versions).  Or start up an unconstrained join on some giant
>> tables.  etc. etc.  This isn't an area that people have felt deserved
>> adding a lot of overhead to control.
>
> In such a case, would statement_timeout apply? If using
> statement_timeout, would the longest a client can stall server be
> limited to statement_timeout amount of time?
>

I am not sure, but does statement_timeout depend on *what* the query
is doing internally (i.e. if it is holding lots of locks, pins, etc.)?

Regards,

Atri



--
Regards,

Atri
l'apprenant




Re: [HACKERS] streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-23 Thread Robert Haas
On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
 wrote:
> We are seeing these errors on a regular basis on the testing box now.  We
> have even changed the backup script to
> shutdown the hot standby, take lvm snapshot, restart the hot standby, rsync
> the lvm snapshot.  It still happens.
>
> We have never seen this before we introduced the hot standby.  So we will
> now revert to taking the backups from lvm snapshots on the production
> database.  If you have ideas of what else we should try / what information
> we can give you to debug this let us know and we will try to so.
>
> Until then we will sadly operate on the assumption that the combination of
> hot standby and "frozen snapshot" backup of it is not production ready.

I'm pretty suspicious that your backup procedure is messed up in some
way.  The fact that you got invalid page headers is really difficult
to attribute to a PostgreSQL bug.  A number of the other messages that
you have posted also tend to indicate either corruption, or that WAL
replay has stopped early.  It would be interesting to see the logs
from when the clone was first started up, juxtaposed against the later
WAL flush error messages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
> No, you're not following.  These would be table-level heavyweight locks,
> and there already are timeout mechanisms that work at that level.
>

Oh, right. Sorry, I missed that. I will go and research heavyweight locks now.

Thanks a ton!

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Amit Langote
>
> If you let an uncooperative user issue arbitrary SQL queries, he can
> do any number of things to put server performance into the tank.
> For instance, take out exclusive locks on all your tables and just
> go to sleep (although I think this is limited by table permissions in
> recent PG versions).  Or start up an unconstrained join on some giant
> tables.  etc. etc.  This isn't an area that people have felt deserved
> adding a lot of overhead to control.

In such a case, would statement_timeout apply? If using
statement_timeout, would the longest a client can stall server be
limited to statement_timeout amount of time?


--
Amit Langote




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Tom Lane
Atri Sharma  writes:
>> For instance, take out exclusive locks on all your tables and just
>> go to sleep (although I think this is limited by table permissions in
>> recent PG versions).

> This is what I have been worried about. The locks (exclusive locks or
> content locks in buffers) can be held and stalled. Can't we do anything
> about it? This is why I contemplated the timeout part.

No, you're not following.  These would be table-level heavyweight locks,
and there already are timeout mechanisms that work at that level.

regards, tom lane




Re: [HACKERS] Move unused buffers to freelist

2013-05-23 Thread Robert Haas
On Tue, May 21, 2013 at 3:06 AM, Amit Kapila  wrote:
>> Here are the results.  The first field in each line is the number of
>> clients. The second number is the scale factor.  The numbers after
>> "master" and "patched" are the median of three runs.
>>
>> 01 100 master 1433.297699 patched 1420.306088
>> 01 300 master 1371.286876 patched 1368.910732
>> 01 1000 master 1056.891901 patched 1067.341658
>> 01 3000 master 637.312651 patched 685.205011
>> 08 100 master 10575.017704 patched 11456.043638
>> 08 300 master 9262.601107 patched 9120.925071
>> 08 1000 master 1721.807658 patched 1800.733257
>> 08 3000 master 819.694049 patched 854.333830
>> 32 100 master 26981.677368 patched 27024.507600
>> 32 300 master 14554.870871 patched 14778.285400
>> 32 1000 master 1941.733251 patched 1990.248137
>> 32 3000 master 846.654654 patched 892.554222
>
> Is the above test for tpc-b?
> In the above tests, there is performance increase from 1~8% and decrease
> from 0.2~1.5%

It's just the default pgbench workload.
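For anyone trying to reproduce these runs: the setup described here corresponds roughly to pgbench invocations like the following (client and scale values taken from the table above; the exact flags are my reconstruction, not taken from the thread):

```
# initialize at a given scale factor, e.g. 300
pgbench -i -s 300 bench

# default (TPC-B-like) read-write run: 8 clients, 8 threads
pgbench -c 8 -j 8 -T 300 bench

# SELECT-only (read-only) variant, -T 300 = 5 minutes
pgbench -c 8 -j 8 -T 300 -S bench
```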

>> And here's the same results for 5-minute, read-only tests:
>>
>> 01 100 master 9361.073952 patched 9049.553997
>> 01 300 master 8640.235680 patched 8646.590739
>> 01 1000 master 8339.364026 patched 8342.799468
>> 01 3000 master 7968.428287 patched 7882.121547
>> 08 100 master 71311.491773 patched 71812.899492
>> 08 300 master 69238.839225 patched 70063.632081
>> 08 1000 master 34794.778567 patched 65998.468775
>> 08 3000 master 60834.509571 patched 61165.998080
>> 32 100 master 203168.264456 patched 205258.283852
>> 32 300 master 199137.276025 patched 200391.633074
>> 32 1000 master 177996.853496 patched 176365.732087
>> 32 3000 master 149891.147442 patched 148683.269107
>>
>> Something appears to have screwed up my results for 8 clients @ scale
>> factor 300 on master,
>
>   Do you mean the reading at scale factor 1000?

Yes.

>>but overall, on both the read-only and
>> read-write tests, I'm not seeing anything that resembles the big gains
>> you reported.
>
> I have not generated numbers for read-write tests, I will check that once.
> For read-only tests, the performance increase is minor and different from
> what I saw.
> Few points which I could think of for difference in data:
>
> 1. In my tests I always observed the best numbers when the number of
> clients/threads equals the number of cores, which in your case should be 16.

Sure, but you also showed substantial performance increases across a
variety of connection counts, whereas I'm seeing basically no change
at any connection count.

> 2. I think for scale factor 100 and 300, there should not be much
> performance increase, as they should mostly get buffers from the freelist
> regardless of whether the bgwriter adds to the freelist or not.

I agree.

> 3. In my tests variance is for shared buffers, database size is always less
> than RAM (Scale Factor -1200, approx db size 16~17GB, RAM -24 GB), but due
> to variance in shared buffers, it can lead to I/O.

Not sure I understand this.

> 4. Each run is of 20 minutes, not sure if this has any difference.

I've found that 5-minute tests are normally adequate to identify
performance changes on the pgbench SELECT-only workload.

>> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
>> PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.
>
> To think about the difference in your and my runs, could you please tell me
> about below points
> 1. What is RAM in machine.

64GB

> 2. Are number of threads equal to number of clients.

Yes.

> 3. Before starting tests I have always done pre-warming of buffers (used
> pg_prewarm written by you last year); is it the same for the above read-only tests?

No, I did not use pg_prewarm.  But I don't think that should matter
very much.  First, the data was all in the OS cache.  Second, on the
small scale factors, everything should end up in cache pretty quickly
anyway.  And on the large scale factors, well, you're going to be
churning shared_buffers anyway, so pg_prewarm is only going to affect
the very beginning of the test.

> 4. Can you please once again run only the test where you saw variation (8
> clients @ scale factor 1000 on master), because I have also seen that
> performance difference is very good for certain
> configurations (Scale Factor, RAM, Shared Buffers)

I can do this if I get a chance, but I don't really see where that's
going to get us.  It seems pretty clear to me that there's no benefit
on these tests from this patch.  So either one of us is doing the
benchmarking incorrectly, or there's some difference in our test
environments that is significant, but none of the proposals you've
made so far seem to me to explain the difference.

> Apart from above, I had one more observation during my investigation to find
> why in some cases, there is a small dip:
> 1. Many times, it finds that the buffer in the freelist is not usable, meaning
> its refcount or usage count is not zero, due to which it had to spend more time
> under BufFreelistLock.
> 
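The freelist check being described can be sketched roughly like this (a Python illustration with hypothetical names mirroring StrategyFreeBuffer()/StrategyGetBuffer(); not the server's code):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class BufferDesc:
    buf_id: int
    refcount: int = 0
    usage_count: int = 0
    on_freelist: bool = False

def strategy_free_buffer(freelist, buf):
    # Mirrors the guard in StrategyFreeBuffer(): a buffer already on the
    # freelist is not linked a second time.
    if not buf.on_freelist:
        buf.on_freelist = True
        freelist.append(buf)

def strategy_get_buffer(freelist):
    # Pop until an unused buffer is found; a buffer may have been pinned
    # (refcount > 0) or touched (usage_count > 0) after it was added, in
    # which case it is discarded here -- time spent under BufFreelistLock.
    while freelist:
        buf = freelist.popleft()
        buf.on_freelist = False
        if buf.refcount == 0 and buf.usage_count == 0:
            return buf
    return None  # the real code falls back to the clock sweep

bufs = [BufferDesc(i) for i in range(3)]
fl = deque()
for b in bufs:
    strategy_free_buffer(fl, b)
strategy_free_buffer(fl, bufs[0])  # duplicate add is a no-op
bufs[0].refcount = 1               # buffer 0 gets pinned while listed
victim = strategy_get_buffer(fl)
print(victim.buf_id)               # 1 -- buffer 0 was skipped
```

The "small dip" corresponds to the skipped pop: the unusable buffer still has to be examined and discarded while the lock is held.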

Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
> For instance, take out exclusive locks on all your tables and just
> go to sleep (although I think this is limited by table permissions in
> recent PG versions).

This is what I have been worried about. The locks (exclusive locks or
content locks on buffers) can be held and cause stalls. Can't we do
anything about it? This is why I contemplated the timeout part.

Regards,

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Tom Lane
Atri Sharma  writes:
> BTW, what is your opinion on a rogue client's damaging capabilities?
> Theoretically, what can a client which tries to stall the backend
> target?

If you let an uncooperative user issue arbitrary SQL queries, he can
do any number of things to put server performance into the tank.
For instance, take out exclusive locks on all your tables and just
go to sleep (although I think this is limited by table permissions in
recent PG versions).  Or start up an unconstrained join on some giant
tables.  etc. etc.  This isn't an area that people have felt deserved
adding a lot of overhead to control.

regards, tom lane




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
On Thu, May 23, 2013 at 8:18 PM, Tom Lane  wrote:
> Atri Sharma  writes:
>> I was musing over a possible condition where a rogue client gets the
>> backend to process queries which take a *lot* of time(note, this is
>> only in my head atm.I may be completely wrong here).
>
>> Wouldnt something on the lines of a timeout help here?
>
> You can already set statement_timeout for that.  I don't think worrying
> about it at the level of buffer content locks would be terribly helpful,
> since those locks are generally held only for long enough to read or
> write the page or to verify the visibility of rows on it.  Even if the
> client is rogue, it can't affect those timings too much.

Right, I think I understand this now.

BTW, what is your opinion of a rogue client's damaging capabilities?
Theoretically, what can a client that tries to stall the backend
actually target?

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Tom Lane
Atri Sharma  writes:
> I was musing over a possible condition where a rogue client gets the
> backend to process queries which take a *lot* of time(note, this is
> only in my head atm.I may be completely wrong here).

> Wouldnt something on the lines of a timeout help here?

You can already set statement_timeout for that.  I don't think worrying
about it at the level of buffer content locks would be terribly helpful,
since those locks are generally held only for long enough to read or
write the page or to verify the visibility of rows on it.  Even if the
client is rogue, it can't affect those timings too much.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
On Thu, May 23, 2013 at 8:01 PM, Tom Lane  wrote:
> Atri Sharma  writes:
>> On a different note, shouldn't we have a time out for a content lock
>> in buffer cache?
>
> No; the overhead of setting up and canceling such a timeout would
> greatly outweigh any possible benefit.
>
> Generally speaking, LWLocks are not meant to be used in situations where
> the lock hold time might be long enough to justify worrying about
> timeouts.  If you need that kind of behavior, use a heavyweight lock.

Right, the overheads, especially in the case of interruptions, would be high.

I was musing over a possible condition where a rogue client gets the
backend to process queries which take a *lot* of time (note: this is
only in my head at the moment; I may be completely wrong).

Wouldn't something along the lines of a timeout help here?

Regards,

Atri

--
Regards,

Atri
l'apprenant




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Tom Lane
Atri Sharma  writes:
> On a different note, shouldn't we have a time out for a content lock
> in buffer cache?

No; the overhead of setting up and canceling such a timeout would
greatly outweigh any possible benefit.

Generally speaking, LWLocks are not meant to be used in situations where
the lock hold time might be long enough to justify worrying about
timeouts.  If you need that kind of behavior, use a heavyweight lock.
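As a rough illustration of the distinction (using Python's threading module as a stand-in, not PostgreSQL internals): an LWLock-style acquire has no timeout machinery at all, while a heavyweight-style acquire can afford one because long hold times are expected:

```python
import threading

lock = threading.Lock()

def lwlock_acquire(l):
    # LWLock-style: no timeout; hold times are assumed to be far too
    # short for per-acquisition timer setup/teardown to pay off.
    l.acquire()

def heavyweight_acquire(l, timeout_s):
    # Heavyweight-style: holds may be long, so a timeout is worth the cost.
    return l.acquire(timeout=timeout_s)

lock.acquire()                        # simulate another backend holding it
got = heavyweight_acquire(lock, 0.05)
print(got)                            # False: gave up instead of waiting
lock.release()
lwlock_acquire(lock)                  # uncontended: returns immediately
lock.release()
```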

regards, tom lane




[HACKERS] ASYNC Privileges proposal

2013-05-23 Thread Chris Farmiloe
Hey all,

I find the current LISTEN / NOTIFY rather limited in the context of
databases with multiple roles. As it stands it is not possible to restrict
the use of LISTEN or NOTIFY to specific roles, and therefore notifications
(and their payloads) cannot really be trusted as coming from any particular
source.

If the payloads of notifications could be trusted, then applications could
make better use of them, without fear of leaking any sensitive information
to anyone who shouldn't be able to see it.

I'd like to propose a new ASYNC database privilege that would control
whether a role can use LISTEN, NOTIFY and UNLISTEN statements and the
associated pg_notify function.

ie:
GRANT ASYNC ON DATABASE  TO bob;
REVOKE ASYNC ON DATABASE  FROM bob;

SECURITY DEFINER functions could then be used anywhere that a finer grained
access control was required.

I had a quick play to see what might be involved [attached], and would like
to hear people's thoughts: good idea, bad idea, not like that! etc.

Chris.


async_privileges_r0.patch
Description: Binary data



Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Thom Brown
On 23 May 2013 07:10, Heikki Linnakangas  wrote:
> Hi,
>
> I've been hacking on a tool to allow resynchronizing an old master server
> after failover. The need to do a full backup/restore has been a common
> complaint ever since we've had streaming replication. I saw on the wiki that
> this was discussed in the dev meeting; too bad I couldn't make it.
>
> In a nutshell, the idea is to copy everything that has changed between
> the clusters, like rsync does, but instead of reading through all files, use
> the WAL to determine what has changed. Here's a somewhat more detailed
> explanation, from the README:
>
> Theory of operation
> ---
>
> The basic idea is to copy everything from the new cluster to old, except for
> the blocks that we know to be the same.
>
> 1. Scan the WAL log of the old cluster, starting from the point where
> the new cluster's timeline history forked off from the old cluster. For each
> WAL record, make a note of the data blocks that are touched. This yields a
> list of all the data blocks that were changed in the old cluster, after the
> new cluster forked off.
>
> 2. Copy all those changed blocks from the new master to the old master.
>
> 3. Copy all other files like clog, conf files etc. from the new cluster
> to old. Everything except the relation files.
>
> 4. Apply the WAL from the new master, starting from the checkpoint
> created at failover. (pg_rewind doesn't actually apply the WAL, it just
> creates a backup label file indicating that when PostgreSQL is started, it
> will start replay from that checkpoint and apply all the required WAL)
>
>
> Please take a look: https://github.com/vmware/pg_rewind

6 instances set up:

[Primary (5530)]
|
---[Standby 1 (5531)]
|
---[Standby 2 (5532)]
|
---[Standby 3 (5533)]
|
---[Standby 4 (5534)]
|
---[Standby 5 (5535)]

1) Created a table on the primary with some data.
2) Promoted Standby 1
3) Cleanly shut down Primary
4) pg_rewind --target-pgdata=/tmp/primary
--source-server='host=localhost port=5531 dbname=postgres'

Last common WAL position: 0/30227F8 on timeline 1
Last common checkpoint at 0/30227F8 on timeline 1
error reading xlog record: record with zero length at 0/3022860
Done!

Contents of pg_xlog directory in Primary and Standby 1:
thom@swift /tmp $ ls -l primary/pg_xlog/
total 49156
-rw--- 1 thom users 16777216 May 23 09:52 00010002
-rw--- 1 thom users 16777216 May 23 09:52 00010003
-rw--- 1 thom users 16777216 May 23 09:52 00020003
-rw--- 1 thom users   41 May 23 09:52 0002.history
drwx-- 2 thom users   80 May 23 09:52 archive_status
thom@swift /tmp $ ls -l standby1/pg_xlog/
total 49156
-rw--- 1 thom users 16777216 May 23 09:49 00010002
-rw--- 1 thom users 16777216 May 23 09:50 00010003
-rw--- 1 thom users 16777216 May 23 09:52 00020003
-rw--- 1 thom users   41 May 23 09:50 0002.history
drwx-- 2 thom users   80 May 23 09:50 archive_status

5) Changed recovery.done in primary to point its primary_conninfo port
to 5531 (that of Standby 1).
6) Renamed it to .conf.
7) Changed postgresql.conf to set the port back to its original one
(as pg_rewind has caused it to match that of Standby 1)
8) Start Primary

Latest log in primary reads:

LOG:  database system was interrupted; last known up at 2013-05-23 09:50:34 EDT
LOG:  entering standby mode
LOG:  invalid xl_info in checkpoint record
FATAL:  could not locate required checkpoint record
HINT:  If you are not restoring from a backup, try removing the file
"/tmp/primary/backup_label".
LOG:  startup process (PID 31503) exited with exit code 1
LOG:  aborting startup due to startup process failure

9) Okay, so I'll delete that label and try again.  Now all is well:

LOG:  database system was interrupted; last known up at 2013-05-23 09:50:34 EDT
LOG:  entering standby mode
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/3022828
LOG:  record with zero length at 0/3041A60
LOG:  consistent recovery state reached at 0/3041A60
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/300 on timeline 2

10) Connect to Standby 1 and insert more rows into our original table.
11) Connect to Primary and those rows are appearing.

--
Thom




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-23 Thread German Becker
On Thu, May 23, 2013 at 5:29 AM, Sergey Konoplev  wrote:

> On Thu, May 23, 2013 at 1:25 AM, Amit Langote 
> wrote:
> > Okay, now I understand. Also, looking at his "ls -l pg_xlog", I could
> > find that modified timestamps of all those pre-allocated segments are
> > about similar (around 12:10), whereas the latest modified time (15:37)
> > is of segment 0001000E00A7.
> >
> > Wonder if whatever configuration he is using is sub-optimal that these
> > many WAL segments can be re-cycled upon a checkpoint? Or is this okay?
>
> Is archive_mode=on?
> What is archive_command?
> Is the server in the recovery mode?
>
> --
> Kind regards,
> Sergey Konoplev
> PostgreSQL Consultant and DBA
>
> Profile: http://www.linkedin.com/in/grayhemp
> Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
> Skype: gray-hemp
> Jabber: gray...@gmail.com
>


Hi Sergey and all,
Let me describe the process I follow to get to this. What I am doing is
testing a migration from 8.3 to 9.1. The way I plan to do it is the
following:
1) Create the schema.
2) Import the biggest tables, which are not updated, only growing, with COPY
(this is about 35 GB of data).
3) Import the small, changing part of the data.


The target system is 9.1 with streaming replication.
For steps 1 and 2, I set a "restore" configuration that, among other
things like more work_mem, sets archive_mode=off and wal_level=minimal
(attached is the difference between restore and normal).
The archive_command is just a cp wrapped in a shell script, in case I need
to change it.

Let me know if you need any more info


restorevsnormal.diff
Description: Binary data



Re: [HACKERS] MVCC catalog access

2013-05-23 Thread Robert Haas
On Wed, May 22, 2013 at 11:11 PM, Andres Freund  wrote:
> Make that actually having acquired an xid. We skip a large part of the
> work if a transaction doesn't yet have one afair. I don't think the mere
> presence of 600 idle connections without an xid in contrast to just
> having max_connection at 600 should actually make a difference in the
> cost of acquiring a snapshot?

Attached is a slightly updated version of the patch I'm using for
testing, and an updated version of the pg_cxn source that I'm using to
open lotsa connections.  With this version, I can do this:

./pg_cxn -n 600 -c BEGIN -c 'SELECT txid_current()'

...which I think is sufficient to make sure all those transactions
have XIDs.  Then I reran the "depend" test case (create a schema with
1000,000 functions and then drop the schema with CASCADE) that I
mentioned in my original posting.  Here are the results:

MVCC Off: Create 8685.662 ms, Drop 9973.233 ms
MVCC On: Create 7931.039 ms, Drop 10189.189 ms
MVCC Off: Create 7810.084 ms, Drop 9594.580 ms
MVCC On: Create 8854.577 ms, Drop 10240.024 ms

OK, let's try the rebuild-the-relcache test using the same pg_cxn
scenario (600 transactions that have started a transaction and
selected txid_current()).

[rhaas ~]$ time for s in `seq 1 1000`; do rm -f
pgdata/global/pg_internal.init && psql -c 'SELECT 2+2' >/dev/null;
done

MVCC catalog access on:
real0m11.006s
user0m2.746s
sys 0m2.664s

MVCC catalog access off:
real0m10.583s
user0m2.745s
sys 0m2.661s

MVCC catalog access on:
real0m10.646s
user0m2.750s
sys 0m2.661s

MVCC catalog access off:
real0m10.823s
user0m2.756s
sys 0m2.681s

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


mvcc-catalog-access-v2.patch
Description: Binary data
/*
 * pg_cxn.c
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <getopt.h>

#include "libpq-fe.h"

struct cmd_list;
typedef struct cmd_list cmd_list;

struct cmd_list
{
	char   *cmd;
	cmd_list   *next;
};

static void pg_connect(const char *conninfo, cmd_list *);
static cmd_list *add_command(cmd_list *, char *);
static void usage(void);

int
main(int argc, char **argv)
{
	int		c;
	int		n = 1;
	int		optindex;
	int		i;
	const char *conninfo;
	cmd_list   *cmds = NULL;	/* must start empty; add_command() links onto it */

	while ((c = getopt_long(argc, argv, "n:c:", NULL, &optindex)) != -1)
	{
		switch (c)
		{
			case 'n':
n = atoi(optarg);
break;
			case 'c':
cmds = add_command(cmds, optarg);
break;
			default:
usage();
break;
		}
	}
	argv += optind;
	argc -= optind;

	if (argc > 0)
		conninfo = argv[0];
	else
		conninfo = "";

	for (i = 0; i < n; ++i)
		pg_connect(conninfo, cmds);

	printf("Established %d connections.\n", n);

	while (1)
		sleep(3600);

	return 0;
}

static cmd_list *
add_command(cmd_list *cmds, char *cmd)
{
	cmd_list   *newnode;

	newnode = malloc(sizeof(cmd_list));
	if (newnode == NULL)
	{
		perror("malloc");
		exit(1);
	}
	newnode->cmd = cmd;
	newnode->next = cmds;
	return newnode;
}

static void
pg_connect(const char *conninfo, cmd_list *cmds)
{
	PGconn	   *conn;

	/* Make a connection to the database */
	conn = PQconnectdb(conninfo);

	/* Check to see that the backend connection was successfully made */
	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		exit(1);
	}

	/* Execute commands. */
	while (cmds != NULL)
	{
		PGresult *res;

		res = PQexec(conn, cmds->cmd);
		if (PQresultStatus(res) == PGRES_FATAL_ERROR)
		{
			fprintf(stderr, "%s", PQresultErrorMessage(res));
			exit(1);
		}
		if (PQresultStatus(res) != PGRES_COMMAND_OK
			&& PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "unexpected result status: %s\n",
	PQresStatus(PQresultStatus(res)));
			exit(1);
		}
		PQclear(res);
		cmds = cmds->next;
	}
}

static void
usage()
{
	fprintf(stderr, "Usage: pg_cxn [OPTION] [CONNECTION-STRING]\n\n"
		"Options:\n"
		"  -n NUM		Number of connections to open.\n"
		"  -c SQL		SQL to execute on each connection.\n"
		"(You can repeat this option more than once.)\n");
	exit(1);
}



[HACKERS] WARNING : pgstat wait timeout - Postgres 9.1

2013-05-23 Thread Mathieu Guerin
Hello,

I am facing a problem with pgstat, as my subject says. I know some topics
are already open about it, but I would like to go deeper.

Some people have said that the best way to stop getting this message is
to configure pgstat.stat to be loaded into RAM with a tmpfs mount point.

What are the consequences? This file will be removed if the server
reboots.

If we change the parameter stats_temp_directory, is it necessary to restart
the server?

When I launch a SHOW ALL; command, the parameter stats_temp_directory is
not listed.

For information, my pgstat.stat file is up to 1.3 MB.

Thank you for your help.
Math


Re: [HACKERS] Heap truncation without AccessExclusiveLock (9.4)

2013-05-23 Thread Simon Riggs
On 15 May 2013 16:35, Heikki Linnakangas  wrote:

> Shared memory space is limited, but we only need the watermarks for any
> in-progress truncations. Let's keep them in shared memory, in a small
> fixed-size array. That limits the number of concurrent truncations that can
> be in-progress, but that should be ok. To not slow down common backend
> operations, the values (or lack thereof) are cached in relcache. To sync the
> relcache when the values change, there will be a new shared cache
> invalidation event to force backends to refresh the cached watermark values.
> A backend (vacuum) can ensure that all backends see the new value by first
> updating the value in shared memory, sending the sinval message, and waiting
> until everyone has received it.

I think we could use a similar scheme for 2 other use cases.

1. Unlogged tables. It would be useful to have a persistent "safe high
watermark" for an unlogged table. So in the event of a crash, we
truncate back to the safe high watermark, not truncate the whole
table. That would get updated at each checkpoint. Unlogged tables will
get much more useful with that change. (Issues with indexes would
need to be resolved also.)

2. Table extension during COPY operations is difficult. We need to be
able to extend in larger chunks, so we would need to change the
algorithm for how extension works. I'm thinking there's a
relationship there with watermarks.

Can we look at the needs of multiple areas at once, so we come up with
a more useful design that covers more than just one use case, please?
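A minimal sketch of the watermark scheme under discussion (hypothetical names; the real design would keep the array in shared memory, cache lookups in the relcache, and refresh them via sinval messages):

```python
# Small fixed-size shared array of in-progress truncation watermarks;
# backends consult it before touching blocks at or past the watermark.
MAX_CONCURRENT_TRUNCATIONS = 8

class WatermarkArray:
    def __init__(self):
        self.slots = [None] * MAX_CONCURRENT_TRUNCATIONS  # (relid, watermark)

    def register(self, relid, watermark):
        for i, s in enumerate(self.slots):
            if s is None:
                self.slots[i] = (relid, watermark)
                return i
        raise RuntimeError("too many concurrent truncations")

    def lookup(self, relid):
        # In the real design this value would be cached in the relcache.
        for s in self.slots:
            if s is not None and s[0] == relid:
                return s[1]
        return None  # no truncation in progress for this relation

    def clear(self, slot):
        self.slots[slot] = None

wm = WatermarkArray()
slot = wm.register(relid=16384, watermark=1000)
print(wm.lookup(16384))   # 1000: don't use blocks >= 1000
wm.clear(slot)
print(wm.lookup(16384))   # None
```

The fixed-size array is the point Heikki raised: it bounds shared-memory use at the cost of limiting how many truncations can be in flight at once.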

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Simon Riggs
On 23 May 2013 12:10, Heikki Linnakangas  wrote:

> Please take a look: https://github.com/vmware/pg_rewind

The COPYRIGHT file shows that VMware is claiming copyright on unstated
parts of the code for this. As such, it's not a normal submission to
the PostgreSQL project, which involves placing copyright with the
PGDG.

As a result, while it sounds interesting, people should be aware of
that and I suggest we shouldn't discuss that code on this list, to
avoid any disputes should we decide to include a similar facility in
core Postgres in the future.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Robert Haas
On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas
 wrote:
> 1. Scan the WAL log of the old cluster, starting from the point where
> the new cluster's timeline history forked off from the old cluster. For each
> WAL record, make a note of the data blocks that are touched. This yields a
> list of all the data blocks that were changed in the old cluster, after the
> new cluster forked off.

Suppose that a transaction is open and has written tuples at the point
where WAL forks.  After WAL forks, the transaction commits.  Then, it
hints some of the tuples that it wrote.  There is no record in WAL
that those blocks are changed, but failing to revert them leads to
data corruption.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma


Sent from my iPad

On 23-May-2013, at 16:27, Amit Kapila  wrote:

> On Thursday, May 23, 2013 3:35 PM Atri Sharma wrote:
>> Hi all,
>> 
>> I was searching for an existing mechanism which ensures that a crashed
>> process releases a content lock held on a page in buffer cache. Also,
>> a similar mechanism for refcount of the page.
>> 
>> I seem to have missed it. Do we have a mechanism,and if yes,where can I
>> find it?
> 
> On any process (backend, bgwriter,checkpointer, ..) crash, it reinitialize
> the shared memory which will clear any locks held.
> Please refer functions HandleChildCrash() and PostmasterStateMachine().
> 
> 


Thanks a ton.

On a different note, shouldn't we have a timeout for a content lock in the
buffer cache? There is a limit to the time any sane process would want to
hold a lock, especially when many tasks can be accomplished with only a pin
on the page.

I am just musing here, but it seems that we could potentially analyse this.
I may be completely wrong.

Regards,

Atri



[HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-23 Thread Heikki Linnakangas

Hi,

I've been hacking on a tool to allow resynchronizing an old master 
server after failover. The need to do a full backup/restore has been a 
common complaint ever since we've had streaming replication. I saw on 
the wiki that this was discussed in the dev meeting; too bad I couldn't 
make it.


In a nutshell, the idea is to copy everything that has changed
between the clusters, like rsync does, but instead of reading through all
files, use the WAL to determine what has changed. Here's a somewhat more
detailed explanation, from the README:


Theory of operation
---

The basic idea is to copy everything from the new cluster to old, except 
for the blocks that we know to be the same.


1. Scan the WAL log of the old cluster, starting from the point where
the new cluster's timeline history forked off from the old cluster. For 
each WAL record, make a note of the data blocks that are touched. This 
yields a list of all the data blocks that were changed in the old 
cluster, after the new cluster forked off.


2. Copy all those changed blocks from the new master to the old master.

3. Copy all other files like clog, conf files etc. from the new cluster
to old. Everything except the relation files.

4. Apply the WAL from the new master, starting from the checkpoint
created at failover. (pg_rewind doesn't actually apply the WAL, it just 
creates a backup label file indicating that when PostgreSQL is started, 
it will start replay from that checkpoint and apply all the required WAL)
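The four steps above can be condensed into a toy model (a Python illustration only; pg_rewind operates on real WAL records and relation files):

```python
# Step 1: scan WAL of the old cluster from the fork point and collect
# every block touched after the timelines diverged.
def changed_blocks(wal_records, fork_lsn):
    """wal_records: iterable of (lsn, relfilenode, blockno)."""
    touched = set()
    for lsn, relfilenode, blockno in wal_records:
        if lsn >= fork_lsn:
            touched.add((relfilenode, blockno))
    return touched

# Step 2: copy exactly those blocks from the new master to the old one;
# everything else is left alone (steps 3-4 handle non-relation files
# and WAL replay).
def rewind(old_blocks, new_blocks, touched):
    for key in touched:
        old_blocks[key] = new_blocks[key]

old = {(16384, 0): "old-A", (16384, 1): "old-B"}
new = {(16384, 0): "new-A", (16384, 1): "old-B"}
wal = [(90, 16384, 0)]                    # block 0 changed after the fork
rewind(old, new, changed_blocks(wal, fork_lsn=80))
print(old[(16384, 0)], old[(16384, 1)])   # new-A old-B
```

The saving over rsync comes from never reading unchanged relation blocks: the WAL scan alone determines the copy set.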



Please take a look: https://github.com/vmware/pg_rewind

- Heikki




Re: [HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Amit Kapila
On Thursday, May 23, 2013 3:35 PM Atri Sharma wrote:
> Hi all,
> 
> I was searching for an existing mechanism which ensures that a crashed
> process releases a content lock held on a page in buffer cache. Also,
> a similar mechanism for refcount of the page.
> 
> I seem to have missed it. Do we have a mechanism,and if yes,where can I
> find it?

On any process (backend, bgwriter, checkpointer, ...) crash, the postmaster
reinitializes shared memory, which clears any locks held.
Please refer to the functions HandleChildCrash() and PostmasterStateMachine().


With Regards,
Amit Kapila.





[HACKERS] WARNING : pgstat wait timeout - stats_temp_directory - postgres 9.1

2013-05-23 Thread Mathieu Guerin
Hello,

I am facing a problem with pgstat, as my subject says. I know some topics
are already open about it, but I would like to go deeper.

Some people have said that the best way to stop getting this message is
to configure pgstat.stat to be loaded into RAM with a tmpfs mount point.

What are the consequences? This file will be removed if the server
reboots.

If we change the parameter stats_temp_directory, is it necessary to restart
the server?

When I launch a SHOW ALL; command, the parameter stats_temp_directory is
not listed.

For information, my pgstat.stat file is up to 1.3 MB.

Thank you for your help.
Math


[HACKERS] Time limit for a process to hold Content lock in Buffer Cache

2013-05-23 Thread Atri Sharma
Hi all,

I was searching for an existing mechanism which ensures that a crashed
process releases a content lock held on a page in the buffer cache. Also,
a similar mechanism for the refcount of the page.

I seem to have missed it. Do we have such a mechanism, and if yes, where can I find it?

Regards,

Atri

--
Regards,

Atri
l'apprenant




[HACKERS] Small typo in syncrep.h

2013-05-23 Thread Pavan Deolasee
Hello,

While reading code, I noticed a small typo in syncrep.h. It
says SyncRepUpdateSyncStandbysDefined() is called by the WAL writer, whereas
it's called by the checkpointer process, at least in HEAD. Previously, it
was called by the bgwriter process.

Attached patches fix this in the relevant releases.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


syncrep-typo-91.patch
Description: Binary data


syncrep-typo-head.patch
Description: Binary data



Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-23 Thread Sergey Konoplev
On Thu, May 23, 2013 at 1:25 AM, Amit Langote  wrote:
> Okay, now I understand. Also, looking at his "ls -l pg_xlog", I could
> find that modified timestamps of all those pre-allocated segments are
> about similar (around 12:10), whereas the latest modified time (15:37)
> is of segment 0001000E00A7.
>
> Wonder if whatever configuration he is using is sub-optimal that these
> many WAL segments can be re-cycled upon a checkpoint? Or is this okay?

Is archive_mode=on?
What is archive_command?
Is the server in the recovery mode?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray...@gmail.com




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-23 Thread Amit Langote
>> Can pre-allocation go that further? for example, assuming
>> 0001000E0080 is currently being used, then is it possible
>> that a segment named/numbered 00010010007E (which does
>> exist in his pg_xlog as he reported in pgsql-admin thread) is
>> pre-allocated already?
>
> Yes, if it's so old that it's no longer required for the crash recovery.
>
> WAL recycling is performed by checkpoint. Checkpoint always checks
> whether there are
> WAL files no longer required for crash recovery, IOW, WAL files which
> were generated
> before the prior checkpoint happened, and then if they are found,
> checkpoint tries to recycle
> them.
>

Okay, now I understand. Also, looking at his "ls -l pg_xlog", I could
see that the modification timestamps of all those pre-allocated segments are
roughly the same (around 12:10), whereas the latest modification time (15:37)
is that of segment 0001000E00A7.

I wonder if the configuration he is using is sub-optimal, such that this
many WAL segments can be recycled upon a checkpoint? Or is this okay?



--
Amit Langote




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-23 Thread Fujii Masao
On Thu, May 23, 2013 at 5:01 PM, Amit Langote  wrote:
>> I think these are the WAL files that were preallocated by WAL
>> recycling but have not
>> been used yet.
>>
>>> # WAL after wal_level changed from 'minimal' to 'hot_standby'
>>>
>>> -rw--- 1 postgres postgres 16777216 May 21 12:27 
>>> 0001000E007B
>>> -rw--- 1 postgres postgres 16777216 May 21 12:35 
>>> 0001000E007C
>>> -rw--- 1 postgres postgres  276 May 21 12:35
>>> 0001000E007C.0020.backup
>>> -rw--- 1 postgres postgres 16777216 May 21 14:53 
>>> 0001000E007D
>>> -rw--- 1 postgres postgres 16777216 May 21 14:53 
>>> 0001000E007E
>>> -rw--- 1 postgres postgres 16777216 May 21 14:53 
>>> 0001000E007F
>>> -rw--- 1 postgres postgres 16777216 May 21 14:53 
>>> 0001000E0080
>>
>> These are the WAL files that you now used. So I don't think that WAL
>> file sequence rewound
>> in this case.
>>
>
> Can pre-allocation go that further? for example, assuming
> 0001000E0080 is currently being used, then is it possible
> that a segment named/numbered 00010010007E (which does
> exist in his pg_xlog as he reported in pgsql-admin thread) is
> pre-allocated already?

Yes, if it's so old that it's no longer required for the crash recovery.

WAL recycling is performed by checkpoint. Checkpoint always checks
whether there are WAL files no longer required for crash recovery, IOW,
WAL files which were generated before the prior checkpoint happened, and
then, if they are found, checkpoint tries to recycle them.
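A toy model of that recycling (illustrative only, with made-up helper names): at checkpoint, segments older than the prior checkpoint's redo segment are renamed to segment numbers ahead of the current insert position, which is why "future" file names can carry old modification times:

```python
def recycle(segments, redo_seg, insert_seg):
    # Segments below the prior checkpoint's redo segment are no longer
    # needed for crash recovery and become recyclable.
    recyclable = sorted(s for s in segments if s < redo_seg)
    next_name = max(max(segments), insert_seg) + 1
    renames = {}
    for old in recyclable:
        renames[old] = next_name   # file keeps its mtime, gets a new name
        next_name += 1
    return renames

segs = list(range(0x70, 0x81))     # segments 0x70 .. 0x80 exist in pg_xlog
r = recycle(segs, redo_seg=0x7D, insert_seg=0x80)
print(len(r))                      # 13 segments renamed: 0x70->0x81 ... 0x7C->0x8D
```

This matches the observation in the thread: pre-allocated segments numbered well past the one currently in use, all sharing an older timestamp.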

> I think we could ask the user the latest value of "select
> pg_xlogfile_name(pg_xlog_current_location())".

Yep.

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-23 Thread Amit Langote
> I think these are the WAL files that were preallocated by WAL
> recycling but have not
> been used yet.
>
>> # WAL after wal_level changed from 'minimal' to 'hot_standby'
>>
>> -rw------- 1 postgres postgres 16777216 May 21 12:27 0001000E007B
>> -rw------- 1 postgres postgres 16777216 May 21 12:35 0001000E007C
>> -rw------- 1 postgres postgres      276 May 21 12:35 0001000E007C.0020.backup
>> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E007D
>> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E007E
>> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E007F
>> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E0080
>
> These are the WAL files that are now in use. So I don't think the
> WAL file sequence rewound in this case.
>

Can pre-allocation go that far? For example, assuming
0001000E0080 is currently in use, is it possible that a
segment named/numbered 00010010007E (which does exist in his
pg_xlog, as he reported in the pgsql-admin thread) has already been
pre-allocated?

I think we could ask the user for the latest value of "select
pg_xlogfile_name(pg_current_xlog_location())".
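For reference, the mapping that pg_xlogfile_name() performs from a WAL
byte position to a 24-hex-digit file name can be sketched like this (a
simplified sketch assuming the default 16MB segment size and continuous
segment numbering; the authoritative definition is the XLogFileName
macro in src/include/access/xlog_internal.h):

```python
# Sketch of LSN -> WAL segment file name: TimeLineID, "xlogid", and
# segment-within-xlogid, each printed as 8 uppercase hex digits.

XLOG_SEG_SIZE = 16 * 1024 * 1024                 # default 16MB segments
SEGS_PER_XLOGID = 0x100000000 // XLOG_SEG_SIZE   # 256 segments per xlogid

def xlog_file_name(tli, lsn_bytes):
    """Name of the WAL segment file containing byte position lsn_bytes."""
    segno = lsn_bytes // XLOG_SEG_SIZE
    return "%08X%08X%08X" % (tli,
                             segno // SEGS_PER_XLOGID,
                             segno % SEGS_PER_XLOGID)

# Timeline 1, segment number 0xE7B yields 000000010000000E0000007B.
name = xlog_file_name(1, 0xE7B * XLOG_SEG_SIZE)
```

Comparing the query's output against the names on disk would show
immediately whether the "extra" files are ahead of or behind the
current insert position.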



--
Amit Langote




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-23 Thread Fujii Masao
On Thu, May 23, 2013 at 10:10 AM, Amit Langote  wrote:
> A PostgreSQL user recently reported an issue on pgsql-admin: when he
> changed wal_level from 'minimal' to 'hot_standby', the WAL segment
> sequence rewound, that is, it started using old names. A snippet of
> his "ls -lrt pg_xlog":
>
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00B3
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001001A
> -rw------- 1 postgres postgres 16777216 May 21 12:13 000100100036
> -rw------- 1 postgres postgres 16777216 May 21 12:13 00010018
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00F6
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00E4
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00F2
> -rw------- 1 postgres postgres 16777216 May 21 12:13 00010013
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00D5
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00A9
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00D6
> -rw------- 1 postgres postgres 16777216 May 21 12:13 00010010004E
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00D8
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00B4
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00BA
> -rw------- 1 postgres postgres 16777216 May 21 12:13 0001000F00C3
> -rw------- 1 postgres postgres 16777216 May 21 12:13 000100100071
> -rw------- 1 postgres postgres 16777216 May 21 12:13 000100100047
> -rw------- 1 postgres postgres 16777216 May 21 12:13 000100100026

I think these are the WAL files that were preallocated by WAL
recycling but have not
been used yet.

> # WAL after wal_level changed from 'minimal' to 'hot_standby'
>
> -rw------- 1 postgres postgres 16777216 May 21 12:27 0001000E007B
> -rw------- 1 postgres postgres 16777216 May 21 12:35 0001000E007C
> -rw------- 1 postgres postgres      276 May 21 12:35 0001000E007C.0020.backup
> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E007D
> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E007E
> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E007F
> -rw------- 1 postgres postgres 16777216 May 21 14:53 0001000E0080

These are the WAL files that are now in use. So I don't think the WAL
file sequence rewound in this case.

Regards,

-- 
Fujii Masao

