Re: [HACKERS] ExclusiveLock on extension of relation with huge shared_buffers

2014-12-28 Thread Borodin Vladimir

25 окт. 2014 г., в 4:31, Jim Nasby написал(а):

 Please don't top-post.
 On 10/24/14, 3:40 AM, Borodin Vladimir wrote:
 I have taken some backtraces (they are attached to the letter) of two 
 processes with such command:
 pid=17981; while true; do date; gdb -batch -e back 
 /usr/pgsql-9.4/bin/postgres $pid; echo; echo; echo; echo; sleep 0.1; done
 Process 17981 was holding the lock for a long time -
 And process 13886 was waiting for lock (in different time and from different 
 blocker actually but I don’t think it is really important) -
 As I can see, 17981 is actually waiting for LWLock on BufFreelistLock in 
 StrategyGetBuffer function, freelist.c:134 while holding exclusive lock on 
 relation. I will try to increase NUM_BUFFER_PARTITIONS (on read-only load it 
 also gave us some performance boost) and write the result in this thread.
 BufFreelistLock becomes very contended when shared buffers are under a lot of 
 Here's what I believe is happening:
 If RelationGetBufferForTuple() decides it needs to extend, this happens:
   LockRelationForExtension(relation, ExclusiveLock);
   buffer = ReadBufferBI(relation, P_NEW, bistate);
 Assuming bistate is false (I didn't check the bulk case), ReadBufferBI() ends 
 up at ReadBuffer_common(), which calls BufferAlloc(). In the normal case, 
 BufferAlloc() won't find the necessary buffer, so it will call 
 StrategyGetBuffer(), which will end up getting the freelist lock. Currently 
 the free list is normally empty, which means we now need to run the clock 
 sweep to find a victim buffer. The clock sweep will keep running until it 
 finds a buffer that is not pinned and has usage_count = 0. If shared buffers 
 are under heavy pressure, you can have a huge number of them with usage_count 
 = 5, which for 100GB shared buffers and an 8K BLKSZ, you could have to check 
 buffers *52 million* times (assuming you finally find a buffer on the start 
 of the 5th loop) before you find a victim.
 Keep in mind that's all happening while you're holding both the extension 
 lock *and the freelist lock*, which basically means no one else in the entire 
 system can allocate a new buffer.

I’ll try the same workload with recent patch from Andres Freund [0].

 This is one reason why a large shared_buffers setting is usually 
 counter-productive. Experience with older versions is that setting it higher 
 than about 8GB is more likely to hurt than to help. Newer versions are 
 probably better, but I think you'll be hard-pressed to find a workload where 
 100GB makes sense. It might if your entire database fits in shared_buffers 
 (though, even then there's probably a number of O(n) or worse operations that 
 will hurt you), but if your database is  shared_buffers you're probably in 
 I suggest cutting shared_buffers *way* down. Old-school advice for this 
 machine would be 8G (since 25% of 128G would be too big). You might be able 
 to do better than 8G, but I recommend not even trying unless you've got a 
 good way to test your performance.
 If you can test performance and find an optimal setting for shared_buffers, 
 please do share your test data and findings. :)

Of course, it works well with shared_buffers = 8GB. But we have seen that on 
read-only load when data set fits in RAM with =8GB shared_buffers we hit 
BufFreelistLock LWLock while moving pages between shared buffers and page 
cache. Increasing shared_buffers size to the size of data set improves 
performance up to 2,5X faster on this read-only load. So we started testing 
configuration with huge shared_buffers under writing load and that’s why I 
started this thread.

Since StrategyGetBuffer() does not use BufFreelistLock LWLock any more [1] I’ll 
also re-run tests with read-only load and small shared_buffers.


 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble!


Re: [HACKERS] Streaming replication and WAL archive interactions

2014-12-16 Thread Borodin Vladimir

12 дек. 2014 г., в 16:46, Heikki Linnakangas 

 There have been a few threads on the behavior of WAL archiving, after a 
 standby server is promoted [1] [2]. In short, it doesn't work as you might 
 expect. The standby will start archiving after it's promoted, but it will not 
 archive files that were replicated from the old master via streaming 
 replication. If those files were not already archived in the master before 
 the promotion, they are not archived at all. That's not good if you wanted to 
 restore from a base backup + the WAL archive later.
 The basic setup is a master server, a standby, a WAL archive that's shared by 
 both, and streaming replication between the master and standby. This should 
 be a very common setup in the field, so how are people doing it in practice? 
 Just live with the wisk that you might miss some files in the archive if you 
 promote? Don't even realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared archive between 
master and replicas) and don’t even realize there’s a problem :( And I think 
I’m not the only one. Maybe at least a note should be added to the 

 And how would we like it to work?
 There was some discussion in August on enabling WAL archiving in the standby, 
 always [3]. That's a related idea, but it assumes that you have a separate 
 archive in the master and the standby. The problem at promotion happens when 
 you have a shared archive between the master and standby.

AFAIK most people use the scheme with shared archive.

 - Heikki
 Sent via pgsql-hackers mailing list (
 To make changes to your subscription:


Re: [HACKERS] Getting references for OID

2014-12-02 Thread Borodin Vladimir

 How can I get all depend objects for oid of object? For example, I have oid 
 of db db_id and I want to get all oids of namespaces, which contains this db. 
 Thank you!

Hello, Dmitry.

Actually, this list is for developers (as described here [0]). You should ask 
this question on pgsql-general@, for example, or you can do it on 
pgsql-ru-general@, in Russian.

And if I understood you right you can use oid2name tool which is part of 
contrib. More info about it can be found here [1].


 С уважением, Дмитрий Воронин