Re: [HACKERS] Performance degradation in commit 6150a1b0
On Sat, Feb 27, 2016 at 12:41 AM, Andres Freund wrote:
> Hi,
>
> On 2016-02-25 12:56:39 +0530, Amit Kapila wrote:
> > For the past few weeks, we have been facing some performance degradation in the
> > read-only performance benchmarks on high-end machines. My colleague
> > Mithun tried reverting commit ac1d794, which seems to degrade the
> > performance in HEAD on high-end m/c's as reported previously[1], but we were
> > still getting degradation, so we did some profiling to see what
> > caused it, and we found that it's mainly caused by the spinlock when
> > called via pin/unpin buffer. We then tried reverting commit 6150a1b0,
> > which has recently changed the structures in that area, and it turns out
> > that after reverting that patch, we don't see any degradation in performance.
> > The important point to note is that the performance degradation doesn't
> > occur every time, but if the tests are repeated twice or thrice, it
> > is easily visible.
> >
> > m/c details
> > IBM POWER-8
> > 24 cores, 192 hardware threads
> > RAM - 492GB
> >
> > Non-default postgresql.conf settings -
> > shared_buffers=16GB
> > max_connections=200
> > min_wal_size=15GB
> > max_wal_size=20GB
> > checkpoint_timeout=900
> > maintenance_work_mem=1GB
> > checkpoint_completion_target=0.9
> >
> > scale_factor - 300
> >
> > Performance at commit 43cd468cf01007f39312af05c4c92ceb6de8afd8 is 469002 at
> > 64-client count, and then at 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 it
> > went down to 200807. These performance numbers are the median of three 15-min
> > pgbench read-only tests. Similar data is seen even when we revert the
> > patch on the latest commit. We have yet to perform a detailed analysis of why
> > commit 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 leads to the degradation,
> > but any ideas are welcome.
>
> Ugh. Especially the varying performance is odd. Does it vary between
> restarts, or is it just happenstance? If it's the former, we might be
> dealing with some alignment issues.
> It varies between restarts.
>
> If not, I wonder if the issue is massive buffer header contention. On a
> LL/SC architecture, acquiring the content lock might interrupt buffer
> spinlock acquisition and vice versa.
>
> Does applying the patch from http://archives.postgresql.org/message-id/CAPpHfdu77FUi5eiNb%2BjRPFh5S%2B1U%2B8ax4Zw%3DAUYgt%2BCPsKiyWw%40mail.gmail.com
> change the picture?

Not tried, but if this is an alignment issue as you are suspecting above, then does it make sense to try this out?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Feb 23, 2016 at 7:06 PM, Robert Haas wrote:
> On Sun, Feb 21, 2016 at 7:45 PM, Amit Kapila wrote:
>>> I mean, my basic feeling is that I would not accept a 2-3% regression in
>>> the single client case to get a 10% speedup in the case where we have 128
>>> clients.
>>
>> When I tried running the pgbench first with the patch and then with HEAD, I see a 1.2% performance increase with the patch. TPS with the patch is 976 and with HEAD it is 964. For the three 30-min TPS runs, refer to "Patch - group_clog_update_v5" and before that "HEAD - Commit 481725c0" in perf_write_clogcontrollock_data_v6.ods attached with this mail. Nonetheless, I have observed that the below new check has been added by the patch, which can affect single-client performance. So I have changed it such that the new check is done only when there is actually a need for a group update, which means when multiple clients try to update the clog at a time.

+ if (!InRecovery &&
+     all_trans_same_page &&
+     nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+     !IsGXactActive())

>> I understand your point. I think to verify whether it is run-to-run
>> variation or an actual regression, I will re-run these tests on a single
>> client multiple times and post the result.
>
> Perhaps you could also try it on a couple of different machines (e.g.
> MacBook Pro and a couple of different large servers).

Okay, I have tried the latest patch (group_update_clog_v6.patch) on 2 different big servers and then on a Mac-Pro. The detailed data for various runs can be found in the attached document perf_write_clogcontrollock_data_v6.ods.
I have taken the performance data for higher client counts with a somewhat larger scale factor (1000), and the median data for the same is as below:

M/c configuration -
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)

Non-default parameters
max_connections = 1000
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout=35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Client_Count/Patch_ver      1      8     64    128    256
HEAD                      871   5090  17760  17616  13907
PATCH                     900   5110  18331  20277  19263

Here, we can see that there is a gain of ~15% to ~38% at higher client counts.

The attached document (perf_write_clogcontrollock_data_v6.ods) contains data mainly focusing on single-client performance. The data is for multiple runs on different machines, so I thought it better to present it in the form of a document rather than dumping everything in the e-mail. Do let me know if there is any confusion in understanding/interpreting the data.

Thanks to Dilip Kumar for helping me in conducting the tests of this patch on the MacBook-Pro.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

group_update_clog_v6.patch Description: Binary data
perf_write_clogcontrollock_data_v6.ods Description: application/vnd.oasis.opendocument.spreadsheet

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila wrote: > > > Here, we can see that there is a gain of ~15% to ~38% at higher client > count. > > The attached document (perf_write_clogcontrollock_data_v6.ods) contains > data, mainly focussing on single client performance. The data is for > multiple runs on different machines, so I thought it is better to present > in form of document rather than dumping everything in e-mail. Do let me > know if there is any confusion in understanding/interpreting the data. > > Forgot to mention that all these tests have been done by reverting commit-ac1d794. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Feb 28, 2016 at 9:05 PM, Dilip Kumar wrote:
> On Sat, Feb 27, 2016 at 5:14 AM, Andres Freund wrote:
>>>>> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
>>>>>
>>>>> Client   Base     Patch
>>>>> 1        17169    16454
>>>>> 8        108547   105559
>>>>> 32       241619   262818
>>>>> 64       206868   233606
>>>>> 128      137084   217013
>>>>
>>>> So, there's a small regression on low client counts. That's worth
>>>> addressing.
>>>
>>> Interesting. I'll try to reproduce it.
>>
>> Any progress here?
>
> On a multi-socket machine with 8 sockets and 64 cores, I have seen more regression compared to my previous run on POWER8 with 2 sockets. Currently I tested a read-only workload with a 5-min run; when I get time, I will run for a longer time and confirm again.

Have you tried reverting the commits 6150a1b0 and ac1d794, which I think affect read-only performance and sometimes create variation in TPS across different runs? Here the second might have less impact, but the first one could impact performance. Is it possible for you to get perf data with and without the patch and share it with others?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas wrote:
> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila wrote:
>>> I wouldn't bother tinkering with it at this point. The value isn't
>>> going to be recorded on disk anywhere, so it will be easy to change
>>> the way it's computed in the future if we ever need to do that.
>>
>> Okay. Find the rebased patch attached with this mail. I have moved
>> this patch to the upcoming CF.
>
> I would call the functions pgstat_report_wait_start() and
> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
> pgstat_report_end_waiting().
>
> I think pgstat_get_wait_event_type should not return HWLock, a term
> that appears nowhere in our source tree at present. How about just
> "Lock"?
>
> I think the wait event types should be documented - and the wait
> events too, perhaps.

By the above, do you mean to say that we should document the name of each wait event type and wait event? Documenting wait event names is okay, but we have approximately 65~70 wait events (considering individual LWLocks, Tranches, Locks, etc.); if we want to document all the events, then I think we can have a separate table with columns (wait event name, description) just below pg_stat_activity and have a link to that table in the wait_event row of the pg_stat_activity table. Does that match your thoughts, or do you have something else in mind?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Thu, Feb 25, 2016 at 2:54 AM, Peter Eisentraut wrote:
> Could you enhance the documentation about the difference between "wait
> event type name" and "wait event name" (examples?)?

I am planning to add the possible values for each wait event type and wait event, and will add a few examples as well. Let me know if you want to see something else covered in the documentation.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas wrote:
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila wrote:
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila wrote:
>>> Here, we can see that there is a gain of ~15% to ~38% at higher client
>>> count.
>>>
>>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
>>> data, mainly focusing on single client performance. The data is for
>>> multiple runs on different machines, so I thought it is better to present in
>>> form of document rather than dumping everything in e-mail. Do let me know
>>> if there is any confusion in understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by reverting
>> commit-ac1d794.
>
> OK, that seems better. But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try? If we didn't try, we wouldn't need the
> all_trans_same_page flag, which would reduce the amount of code
> change.

I am not sure I understood your question. Do you want to know why, in the first place, transactions spanning more than one page call the function TransactionIdSetPageStatus()? If we want to avoid attempting the transaction status update when the transactions are on different pages, then I think we need some major changes in TransactionIdSetTreeStatus().

> Would that hurt anything? Taking it even further, we could
> remove the check from TransactionGroupUpdateXidStatus too. I'd be
> curious to know whether that set of changes would improve performance
> or regress it. Or maybe it does nothing, in which case perhaps
> simpler is better.
>
> All things being equal, it's probably better if the cases where
> transactions from different pages get into the list together is
> something that is more or less expected rather than a
> once-in-a-blue-moon scenario - that way, if any bugs exist, we'll find
> them.
> The downside of that is that we could increase latency for the
> leader that way - doing other work on the same page shouldn't hurt
> much but different pages is a bigger hit. But that hit might be
> trivial enough not to be worth worrying about.

In my tests, the check related to the same page in TransactionGroupUpdateXidStatus() doesn't impact performance in any way, and I think the reason is that it happens rarely that a group contains multiple pages, and even when it occurs, there is hardly much impact. So I will remove that check, and I think that's what you also want for now.

> + /*
> +  * Now that we've released the lock, go back and wake everybody up. We
> +  * don't do this under the lock so as to keep lock hold times to a
> +  * minimum. The system calls we need to perform to wake other processes
> +  * up are probably much slower than the simple memory writes we did while
> +  * holding the lock.
> +  */
>
> This comment was true in the place that you cut-and-pasted it from,
> but it's not true here, since we potentially need to read from disk.

Okay, will change.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
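The wake-after-release ordering discussed in the quoted comment can be sketched as below. This is a hedged illustration of the pattern only: the `Member` struct, the `lock_held` flag, and the counter are stand-ins invented for this sketch, not PostgreSQL's actual LWLock, PGPROC, or TransactionGroupUpdateXidStatus code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a group member waiting for the leader (illustrative). */
typedef struct Member
{
    struct Member *next;
    bool           done;    /* status written while holding the lock */
    bool           woken;   /* wakeup issued after the lock is released */
} Member;

static bool lock_held = false;          /* stand-in for CLogControlLock */
static int  wakeups_under_lock = 0;     /* sanity counter, must stay 0 */

static void
leader_update_and_wake(Member *head)
{
    Member *m;

    lock_held = true;                   /* stand-in for LWLockAcquire */
    for (m = head; m != NULL; m = m->next)
        m->done = true;                 /* cheap memory writes under the lock */
    lock_held = false;                  /* stand-in for LWLockRelease */

    /* Wake everybody only after the lock is dropped: each wakeup may
     * involve a system call, far slower than the writes above. */
    for (m = head; m != NULL; m = m->next)
    {
        if (lock_held)
            wakeups_under_lock++;       /* would indicate a broken protocol */
        m->woken = true;                /* stand-in for semaphore unlock */
    }
}
```

The point Robert raises is that this ordering keeps lock hold times minimal only when the work under the lock really is cheap memory writes; if a page read from disk can happen there, the premise of the comment no longer holds.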
Re: [HACKERS] Relation extension scalability
On Mon, Feb 29, 2016 at 3:37 PM, Dilip Kumar wrote:
> On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar wrote:
>
> Test2: Identify whether the improvement in the multi-extend case is because of avoiding the context switch or some other factor, like reusing blocks between backends by putting them in the FSM.
>
> 1. Test by just extending multiple blocks and reusing them in the same backend (don't put them in the FSM).
> Insert 1K records; data doesn't fit in shared buffers; 512MB shared buffers.
>
> Client   Base   Extend 800 block self use   Extend 1000 Block
> 1        117    131                         118
> 2        111    203                         140
> 3         51    242                         178
> 4         51    231                         190
> 5         52    259                         224
> 6         51    263                         243
> 7         43    253                         254
> 8         43    240                         254
> 16        40    190                         243
>
> We can see the same improvement in the case of the backend using the blocks itself also. It shows that sharing the blocks between the backends was not the win; avoiding the context switch was the major win.

One thing that is slightly unclear is whether there is any overhead due to buffer eviction, especially when the buffer to be evicted is already dirty and needs XLogFlush(). One reason why it might not hurt is that by the time we try to evict the buffer, the corresponding WAL is already flushed by the WALWriter; another possibility could be that even if it is getting done during buffer eviction, the impact of the same is much smaller. Can we try to measure the number of flush calls which happen during buffer eviction?

> 2. Tested the number of ProcSleeps during the run.
> This is a simple script of copying 1 record in one transaction of size 4 bytes.
>
>                    BASE CODE                      PATCH MULTI EXTEND
> Client   Base_TPS   ProcSleep Count   Extend-by-10-Block TPS   ProcSleep Count
> 2        280        457,506           311                      62,641
> 3        235        1,098,701         358                      141,624
> 4        216        1,155,735         368                      188,173
>
> What we can see in the above test is that, in the base code, performance is degrading after 2 threads, while the ProcSleep count is increasing by a huge amount.
>
> Compared to that, in the patch, with extending 10 blocks at a time, ProcSleep reduces to ~1/8 and we can see it is constantly scaling.
> Proc Sleep test for the insert test, when data doesn't fit in shared buffers and we are inserting a big record of 1024 bytes, is currently running;
> once I get the data I will post the same.

Okay. However, if the performance data for the case when data doesn't fit into shared buffers also shows a similar trend, then it might be worth trying to extend with respect to the load on the system. I mean to say we can batch the extension requests (as we have done in ProcArrayGroupClearXid) and extend accordingly; if that works out, then the benefit could be that we don't need any configurable knob for the same.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
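The batching idea borrowed from ProcArrayGroupClearXid can be sketched roughly as follows. This is a hedged, single-threaded illustration of the lock-free group-request pattern only: the `ExtendRequest` type and function names are invented for this sketch, and the real PostgreSQL code additionally parks non-leader backends on semaphores and distributes the extended blocks back to them.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each backend pushes its extension request onto a lock-free list; the
 * backend that finds the list empty becomes the leader and extends the
 * relation once for the whole batch. */
typedef struct ExtendRequest
{
    struct ExtendRequest *next;
    int                   nblocks;    /* blocks this backend asked for */
} ExtendRequest;

static _Atomic(ExtendRequest *) extend_head = NULL;

/* Push a request; returns true if the caller became the group leader. */
static bool
extend_group_join(ExtendRequest *req)
{
    ExtendRequest *head = atomic_load(&extend_head);

    do
        req->next = head;
    while (!atomic_compare_exchange_weak(&extend_head, &head, req));

    return head == NULL;    /* list was empty => we lead the group */
}

/* Leader: detach the whole list and compute the batched demand, i.e.
 * how many blocks to extend by in one go. */
static int
extend_group_collect(void)
{
    ExtendRequest *req = atomic_exchange(&extend_head, NULL);
    int total = 0;

    for (; req != NULL; req = req->next)
        total += req->nblocks;
    return total;
}
```

The attraction Amit notes is that the extension amount then tracks actual concurrent demand, so no configurable multi-extend knob is needed.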
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Fri, Feb 12, 2016 at 9:04 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:
> Hello, Robert
>
>> It also strikes me that it's probably quite likely that slock_t
>> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
>> For example, on a system where slock_t is just one byte, most likely
>> all of those mutexes are going to be in the same cache line, which
>> means you're going to get a LOT of false sharing. It seems like it
>> would be sensible to define a new struct that contains an slock_t, a
>> long, and a HASHELEMENT *, and then make an array of those structs.
>> That wouldn't completely eliminate false sharing, but it would reduce
>> it quite a bit. My guess is that if you did that, you could reduce
>> the number of freelists to 8 or less and get pretty much the same
>> performance benefit that you're getting right now with 32. And that,
>> too, seems likely to be good for single-client performance.
>
> I experimented for a while trying to fit every spinlock in a separate
> cache line. Indeed we can gain some speedup this way. Here are
> benchmark results on a 12-core server for NUM_LOCK_PARTITIONS = 32 (in
> this case the difference is more notable):
>
> | FREELISTS | SIZE = 32 | SIZE = 64 | SIZE = 128 | SIZE = 256 |
> |-----------|-----------|-----------|------------|------------|
> |     4     |  +25.4%   |  +28.7%   |   +28.4%   |   +28.3%   |
> |     8     |  +28.2%   |  +29.4%   |   +31.7%   |   +31.4%   |
> |    16     |  +29.9%   |  +32.6%   |   +31.6%   |   +30.8%   |
> |    32     |  +33.7%   |  +34.2%   |   +33.6%   |   +32.6%   |
>
> Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
> align the FREELIST structure, see the attached patch).

I am not sure if this is exactly what has been suggested by Robert, so it is not straightforward to see if his suggestion can allow us to use NUM_FREELISTS as 8 rather than 32. I think instead of trying to use FREELISTBUFF, why not do it as Robert has suggested and try with different values of NUM_FREELISTS?
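Robert's struct-per-freelist layout can be sketched as below. This is a hedged illustration under stated assumptions: `CACHE_LINE_SIZE`, the union-based padding, and the member types are choices made for this sketch; PostgreSQL's real `slock_t` and `HASHELEMENT` types differ, and the actual line size is platform-dependent.

```c
#include <stddef.h>

/* Bundle each freelist's spinlock, entry counter, and list head into
 * one struct, then pad every element to a full cache line so that
 * adjacent freelists never share a line (reducing false sharing). */
#define CACHE_LINE_SIZE 64          /* assumed line size */

typedef struct FreeListData
{
    volatile int  mutex;            /* stand-in for slock_t */
    long          nentries;         /* number of entries in this freelist */
    void         *freeList;         /* stand-in for HASHELEMENT * chain head */
} FreeListData;

/* The union trick forces sizeof(FreeListPadded) up to the line size
 * without compiler-specific alignment attributes. */
typedef union FreeListPadded
{
    FreeListData data;
    char         pad[CACHE_LINE_SIZE];
} FreeListPadded;
```

With `FreeListPadded freelists[NUM_FREELISTS]`, each element occupies its own line, so one backend spinning on freelist i no longer invalidates the line holding freelist i+1.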
>> I am however wondering if it makes sense to set the freelist affinity based on
>> something other than the hash value, like say the PID, so that the
>> same process doesn't keep switching to a different freelist for every
>> buffer eviction.
>
> Also I tried to use the PID to determine the freelist number:
>
> ```
> #include <sys/types.h>
> #include <unistd.h>
>
> ...
>
> #define FREELIST_IDX(hctl, hashcode) \
>     (IS_PARTITIONED(hctl) ? ((uint32) getpid()) % NUM_FREELISTS : 0)
>
> ...
>
> // now nentries could be negative in this case
> // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
> ```

In which case do you think entries can go negative? I think nentries is incremented and decremented in the same way as without the patch, so I am not getting what can make it go negative.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
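The two mappings being debated can be put side by side. This is a hedged sketch, not PostgreSQL's dynahash code: the function names are invented here, and the point it illustrates is only that the PID-based index depends on the caller rather than on the entry, so the list charged at remove time need not be the one charged at insert time.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_FREELISTS 32

/* Hash-based mapping: whoever adds or removes a given entry computes
 * the same index from the entry's hash value, so each list's nentries
 * counter stays consistent. */
static int
freelist_idx_by_hash(uint32_t hashcode)
{
    return (int) (hashcode % NUM_FREELISTS);
}

/* PID-based mapping: the index depends on the calling process, so in a
 * shared hash table the backend removing an entry may decrement a
 * different list's nentries than the one incremented when some other
 * backend inserted it. */
static int
freelist_idx_by_pid(void)
{
    return (int) ((uint32_t) getpid() % NUM_FREELISTS);
}
```

Within a single process the PID-based index is stable, which is presumably why the counters look balanced in a one-backend test; the drift only shows up when different backends insert and remove the same entries.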
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas wrote:
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila wrote:
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila wrote:
>>> Here, we can see that there is a gain of ~15% to ~38% at higher client
>>> count.
>>>
>>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
>>> data, mainly focusing on single client performance. The data is for
>>> multiple runs on different machines, so I thought it is better to present in
>>> form of document rather than dumping everything in e-mail. Do let me know
>>> if there is any confusion in understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by reverting
>> commit-ac1d794.
>
> OK, that seems better. But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try?

This is to handle the case when sub-transactions belonging to a transaction are on different pages, and the reason for the same is that currently I am using the XidCache as stored in each proc to pass the information about subtransactions to TransactionIdSetPageStatusInternal(). Now, if we allow subtransactions from different pages, then I need to extract the subxids from that cache which belong to the page on which we are trying to update the status. This will add a few more cycles in the code path under the ExclusiveLock without any clear benefit; that's why I have not implemented it.
I have explained the same in the code comments as well:

+ * This optimization is only applicable if the transaction and
+ * all child sub-transactions belong to same page which we presume to be the
+ * most common case, we might be able to apply this when they are not on same
+ * page, but that needs us to map sub-transactions in proc's XidCache based
+ * on pageno for which each time Group leader needs to set the transaction
+ * status and that can lead to some performance penalty as well because it
+ * needs to be done after acquiring CLogControlLock, so let's leave that
+ * case for now.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On Fri, Feb 26, 2016 at 2:55 PM, Robert Haas wrote:
> On Thu, Feb 25, 2016 at 12:57 PM, Mithun Cy wrote:
> > I have fixed all of the issues reported by the regress test. Also now when
> > a backend tries to cache the snapshot, we also try to store the self-xid and sub
> > xids, so other backends can use them.

+ /* Add my own XID to snapshot. */
+ csnap->xip[count] = MyPgXact->xid;
+ csnap->xcnt = count + 1;

Don't we need to add this only when the xid of the current transaction is valid? Also, I think it will be better if we can explain why we need to add our own transaction id while caching the snapshot.

> > I also did some read-only perf tests.
>
> I'm not sure what you are doing that keeps breaking threading for
> Gmail, but this thread is getting split up for me into multiple
> threads with only a few messages in each. The same seems to be
> happening in the community archives. Please try to figure out what is
> causing that and stop doing it.
>
> I notice you seem not to have implemented this suggestion by Andres:
>
> http://www.postgresql.org//message-id/20160104085845.m5nrypvmmpea5...@alap3.anarazel.de

As far as I can understand by reading the patch, it has kind of already implemented the first suggestion by Andres, which is to use a try lock; the patch is now using atomic ops to implement the same instead of using an lwlock, but I think it should show the same impact. Do you see any problem with the same? Now, talking about the second suggestion, which is to use some form of 'snapshot slots' to reduce the contention further, it seems that could be beneficial if we see any gain with the try lock. If you think that can be a benefit over the current approach taken in the patch, then we can discuss how to implement the same.

> Also, you should test this on a machine with more than 2 cores.
> Andres's original test seems to have been on a 4-core system, where
> this would be more likely to work out.
> Also, Andres suggested testing this on an 80-20 write mix, whereas
> you tested it on 100% read-only. In that case there is no blocking
> ProcArrayLock, which reduces the chances that the patch will benefit.

+1

> Also, you could consider repeating the LWLOCK_STATS testing that Amit
> did in his original reply to Andres. That would tell you whether the
> performance is not increasing because the patch doesn't reduce
> ProcArrayLock contention, or whether the performance is not increasing
> because the patch DOES reduce ProcArrayLock contention but then the
> contention shifts to CLogControlLock or elsewhere.

+1

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
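The xip/xcnt fragment quoted earlier in this message, together with the review comment on it, can be sketched as below. This is a hedged illustration: the `CachedSnapshot` type, the array size, and the helper name are invented for this sketch; PostgreSQL's real snapshot structures and `MyPgXact` handling differ.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define MAX_CACHED_XIDS 64          /* illustrative bound */

typedef struct CachedSnapshot
{
    TransactionId xip[MAX_CACHED_XIDS];
    int           xcnt;
} CachedSnapshot;

/* When a backend publishes a cached snapshot, it appends its own XID:
 * GetSnapshotData normally skips the caller's own entry, but other
 * backends reusing the cache must still see that transaction as
 * in progress.  Per the review comment, the XID is appended only when
 * it is actually assigned (valid). */
static void
cache_snapshot(CachedSnapshot *csnap,
               const TransactionId *xip, int count,
               TransactionId my_xid)
{
    memcpy(csnap->xip, xip, count * sizeof(TransactionId));

    if (my_xid != InvalidTransactionId)
        csnap->xip[count++] = my_xid;   /* add my own XID to the snapshot */

    csnap->xcnt = count;
}
```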
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Tue, Mar 1, 2016 at 8:13 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:
> Hello, Amit
>
>> I am not sure if this is exactly what has been suggested by Robert,
>> so it is not straightforward to see if his suggestion can allow us to
>> use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
>> use FREELISTBUFF, why not do it as Robert has suggested and try with
>> different values of NUM_FREELISTS?
>
> Well, what I did is in fact a bit more general solution than Robert
> suggested. At first I just joined the freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see the FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup compared with the previous approach.

Okay, now I can understand the data better, and I think your tests indicate that it is better to keep NUM_FREELISTS as 32.

> Then I tried to manually control the sizeof every array item (see
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.
>
> Also I don't believe that +0.3% TPS according to a synthetic benchmark
> that doesn't represent any possible workload is worth it considering the
> additional code complexity.

I think the data structure HASHHDR looks simpler; however, the code looks more complex, especially with the addition of FREELISTBUFF for cache-line purposes. I think you might want to try exactly what has been suggested and see if the code looks better that way, but if you are not convinced, then let's leave it to the committer unless somebody else wants to try that suggestion.

>> In which case, do you think entries can go negative?
>> I think the
>> nentries is incremented and decremented in the way as without patch,
>> so I am not getting what can make it go negative.
>
> In this context we are discussing a quick and dirty fix "what would
> happen if FREELIST_IDX would be implemented as getpid() % NUM_FREELISTS
> instead of hash % NUM_FREELISTS". The code is:
>
> int freelist_idx = FREELIST_IDX(hctl, hashvalue);
>
> // ...
>
> switch (action)
> {
>     // ...
>
>     case HASH_REMOVE:
>         if (currBucket != NULL)
>         {
>             if (IS_PARTITIONED(hctl))
>                 SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));
>
>             // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
>             FREELIST(hctl, freelist_idx).nentries--;
>
>             /* remove record from hash bucket's chain. */
>             *prevBucketPtr = currBucket->link;
>
> // ...
>
> No matter what freelist was used when the element was added to the hash table,
> we always try to return free memory to the same freelist number getpid()
> % NUM_FREELISTS and decrease the number of elements in the hash table using
> the corresponding nentries field.

Won't it always use the same freelist to add and remove the entry, since for both cases it will calculate freelist_idx in the same way?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas wrote:
> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila wrote:
>>> I wouldn't bother tinkering with it at this point. The value isn't
>>> going to be recorded on disk anywhere, so it will be easy to change
>>> the way it's computed in the future if we ever need to do that.
>>
>> Okay. Find the rebased patch attached with this mail. I have moved
>> this patch to the upcoming CF.
>
> I would call the functions pgstat_report_wait_start() and
> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
> pgstat_report_end_waiting().

Changed as per suggestion and made these functions inline.

> I think pgstat_get_wait_event_type should not return HWLock, a term
> that appears nowhere in our source tree at present. How about just
> "Lock"?

Changed as per suggestion.

> I think the wait event types should be documented - and the wait
> events too, perhaps.

As discussed upthread, I have added documentation for all the possible wait events and an example. Some of the LWLocks, like AsyncQueueLock and AsyncCtlLock, are used for quite similar purposes, so I have kept their explanations the same.

> Maybe it's worth having separate wait event type names for lwlocks and
> lwlock tranches. We could report LWLockNamed and LWLockTranche and
> document the difference: "LWLockNamed indicates that the backend is
> waiting for a specific, named LWLock. The event name is the name of
> that lock. LWLockTranche indicates that the backend is waiting for
> any one of a group of locks with similar function. The event name
> identifies the general purpose of locks in that group."

Changed as per suggestion.

> There's no requirement that every session have every tranche
> registered. I think we should consider displaying "extension" for any
> tranche that's not built-in, or at least for tranches that are not
> registered (rather than "unknown wait event").
> + if (lock->tranche == 0 && lockId < NUM_INDIVIDUAL_LWLOCKS)
>
> Isn't the second part automatically true at this point?

Fixed.

> The changes to LockBufferForCleanup() don't look right to me. Waiting
> for a buffer pin might be a reasonable thing to define as a wait
> event, but it shouldn't be reported as if we were waiting on the LWLock
> itself.

As discussed upthread, added a new wait event BufferPin for this case.

> What happens if an error is thrown while we're in a wait?

As discussed upthread, added in AbortTransaction and from wherever LWLockReleaseAll() gets called; the point to note is that we can call this function only when we are sure there is no further possibility of a wait on an LWLock.

> Does this patch hurt performance?

Performance tests are underway.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

extend_pg_stat_activity_v11.patch Description: Binary data
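The reporting protocol under discussion, including the error-path concern Robert raised, can be sketched as below. This is a hedged model only: the real patch writes the wait information into shared memory (the backend's status slot), whereas a plain variable stands in here, and `wait_state_abort_cleanup` is an invented name for the cleanup that the abort path must perform.

```c
#include <stdint.h>

static uint32_t current_wait_event = 0;     /* 0 = not waiting */

/* Advertise the wait event just before blocking. */
static inline void
pgstat_report_wait_start(uint32_t wait_event_info)
{
    current_wait_event = wait_event_info;
}

/* Clear the advertised state once the wait is over. */
static inline void
pgstat_report_wait_end(void)
{
    current_wait_event = 0;
}

/* If an error is thrown mid-wait, pgstat_report_wait_end() is never
 * reached on the normal path, so the abort path must clear the state;
 * per the message above, this is safe only once no further LWLock wait
 * is possible (i.e. after LWLockReleaseAll). */
static void
wait_state_abort_cleanup(void)
{
    pgstat_report_wait_end();
}
```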
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Thu, Mar 3, 2016 at 1:50 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:
>> Won't it always use the same freelist to remove and add the entry from
>> freelist as for both cases it will calculate the freelist_idx in same
>> way?
>
> No. If "our" freelist is empty when we try to remove an item from it, we
> borrow an item from another freelist.

I think the way the patch works is that if the indicated freelist is empty, then it tries to allocate new elements in that list, and if it can't allocate, then it tries to borrow from another freelist; in both cases the element to be removed from the freelist is considered to be an element of the indicated freelist (basically it always increments nentries[freelist_idx]).

> Then this borrowed item will be
> returned to "our" freelist instead of the original. Without some sort of
> additional logic there is no way to figure out which freelist was
> the original.

As the patch always operates on nentries[freelist_idx], it seems to me that in both cases (remove, add) the element is considered to belong to the same freelist. Have you faced this problem of negative entries in any of your tests? If so, please share the test, so that I can also understand the scenario.

> Generally speaking, a negative nentries value is not something that
> couldn't be solved. But I would like to remind that in this context we
> are discussing a quick and dirty solution created for benchmark
> purposes in the first place.

I thought negative entries could be a problem for the patch, as indicated by Robert[1], but maybe I am missing something.

>> If you are not convinced, then let's leave it to the committer unless
>> somebody else wants to try that suggestion.
>
> Agree. Frankly, I'm tired of rewriting this patch over and over and over
> again. So I would like to avoid rewriting it once again unless there is
> a clear indication that this way we would gain something. Benchmarks
> show that this is not the case, thus it's only a matter of taste and
> intuition.
I can understand your feeling here and agree with you that it is a matter of taste. However, many times, if we change the patch according to the committer's taste, the chances of the patch getting accepted early are better; but I am not sure if that applies here, so feel free to go in the way you think is better.

[1] - http://www.postgresql.org/message-id/CA+TgmoacVsdcY=qn0do7nok7w2-ssqz3kr2y84bavifckqd...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ExecGather() + nworkers
On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi wrote: > On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila > wrote: > >> > > > > Changed the code such that nworkers_launched gets used wherever > > appropriate instead of nworkers. This includes places other than > > pointed out above. > > The changes of the patch are simple optimizations that are trivial. > I didn't find any problem regarding the changes. I think the same > optimization is required in "ExecParallelFinish" function also. > > There is already one such change for ExecParallelFinish() in the patch:

@@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
  WaitForParallelWorkersToFinish(pei->pcxt);

  /* Next, accumulate buffer usage. */
- for (i = 0; i < pei->pcxt->nworkers; ++i)
+ for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
  InstrAccumParallelQuery(&pei->buffer_usage[i]);

Can you be slightly more specific about where exactly you are expecting more changes? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
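The point of the one-line change quoted above can be sketched in isolation (hypothetical, simplified types; not the actual executor code): per-worker instrumentation slots are only meaningful for workers that actually launched, so accumulation loops must stop at nworkers_launched rather than nworkers:

```c
#include <assert.h>

#define MAX_WORKERS 8

typedef struct
{
    int  nworkers;          /* workers planned */
    int  nworkers_launched; /* workers actually started */
    long buffer_usage[MAX_WORKERS]; /* valid only up to nworkers_launched */
} FakeParallelContext;

/* Sum buffer usage over launched workers only; reading slots beyond
 * nworkers_launched would pick up uninitialized or stale counters. */
static long accum_buffer_usage(const FakeParallelContext *ctx)
{
    long total = 0;
    for (int i = 0; i < ctx->nworkers_launched; ++i)
        total += ctx->buffer_usage[i];
    return total;
}
```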
Re: [HACKERS] ExecGather() + nworkers
On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi wrote: > On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila > wrote: > > On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi < > kommi.harib...@gmail.com> > > wrote: > >> > >> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila > >> wrote: > >> >> > >> > > >> > Changed the code such that nworkers_launched gets used wherever > >> > appropriate instead of nworkers. This includes places other than > >> > pointed out above. > >> > >> The changes of the patch are simple optimizations that are trivial. > >> I didn't find any problem regarding the changes. I think the same > >> optimization is required in "ExecParallelFinish" function also. > >> > > > > There is already one change as below for ExecParallelFinish() in patch. > > > > @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei) > > > > WaitForParallelWorkersToFinish(pei->pcxt); > > > > > > > > /* Next, accumulate buffer usage. */ > > > > - for (i = 0; i < pei->pcxt->nworkers; ++i) > > > > + for (i = 0; i < pei->pcxt->nworkers_launched; ++i) > > > > InstrAccumParallelQuery(&pei->buffer_usage[i]); > > > > > > Can you be slightly more specific, where exactly you are expecting more > > changes? > > I missed it during the comparison with existing code and patch. > Everything is fine with the patch. I marked the patch as ready for > committer. > > Thanks! With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 4, 2016 at 4:01 PM, Alexander Korotkov wrote: > On Fri, Mar 4, 2016 at 7:05 AM, Amit Kapila > wrote: > >> > I think the wait event types should be documented - and the wait >> > events too, perhaps. >> > >> >> As discussed upthread, I have added documentation for all the possible >> wait events and an example. Some of the LWLocks like AsyncQueueLock and >> AsyncCtlLock are used for quite similar purposes, so I have kept their >> explanations the same. >> > > Do you think it worth grouping rows in "wait_event Description" table by > wait event type? > They are already grouped (implicitly). Do you mean that we should add the wait event type name in that table as well? If yes, then the only slight worry is that there will be a lot of repetition in the wait_event_type column; otherwise it is okay. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ExecGather() + nworkers
On Fri, Mar 4, 2016 at 11:41 PM, Robert Haas wrote: > On Fri, Mar 4, 2016 at 6:55 AM, Amit Kapila > wrote: > > On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi > > > wrote: > >> > >> On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila > >> wrote: > >> > On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi > >> > > >> > wrote: > >> >> > >> >> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila < > amit.kapil...@gmail.com> > >> >> wrote: > >> >> >> > >> >> > > >> >> > Changed the code such that nworkers_launched gets used wherever > >> >> > appropriate instead of nworkers. This includes places other than > >> >> > pointed out above. > >> >> > >> >> The changes of the patch are simple optimizations that are trivial. > >> >> I didn't find any problem regarding the changes. I think the same > >> >> optimization is required in "ExecParallelFinish" function also. > >> >> > >> > > >> > There is already one change as below for ExecParallelFinish() in > patch. > >> > > >> > @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei) > >> > > >> > WaitForParallelWorkersToFinish(pei->pcxt); > >> > > >> > > >> > > >> > /* Next, accumulate buffer usage. */ > >> > > >> > - for (i = 0; i < pei->pcxt->nworkers; ++i) > >> > > >> > + for (i = 0; i < pei->pcxt->nworkers_launched; ++i) > >> > > >> > InstrAccumParallelQuery(&pei->buffer_usage[i]); > >> > > >> > > >> > Can you be slightly more specific, where exactly you are expecting > more > >> > changes? > >> > >> I missed it during the comparison with existing code and patch. > >> Everything is fine with the patch. I marked the patch as ready for > >> committer. > >> > > > > Thanks! > > OK, committed. > > Thanks. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 4, 2016 at 7:23 PM, Thom Brown wrote: > > On 4 March 2016 at 13:41, Alexander Korotkov wrote: > > > >> > >> If yes, then the only slight worry is that there will lot of repetition in > >> wait_event_type column, otherwise it is okay. > > > > > > There is morerows attribute of entry tag in Docbook SGML, it behaves like > > rowspan in HTML. > > +1 > I will try to use morerows in the documentation. > Yes, we do this elsewhere in the docs. And it is difficult to look > through the wait event names at the moment. > > I'm also not keen on all the "A server process is waiting" all the way > down the list. > How about naming the column "Wait For" instead of "Description" and then using text like "finding or allocating space in shared memory"? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
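For reference, a hypothetical DocBook fragment (the wait event names and wording here are illustrative only, not the patch's actual table) showing how the morerows attribute lets one wait_event_type cell span several wait_event rows:

```xml
<row>
 <!-- this cell spans this row and the two rows that follow -->
 <entry morerows="2"><literal>LWLockNamed</literal></entry>
 <entry><literal>ShmemIndexLock</literal></entry>
 <entry>Waiting to find or allocate space in shared memory.</entry>
</row>
<row>
 <entry><literal>OidGenLock</literal></entry>
 <entry>Waiting to allocate or assign an OID.</entry>
</row>
<row>
 <entry><literal>XidGenLock</literal></entry>
 <entry>Waiting to allocate or assign a transaction id.</entry>
</row>
```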
Re: [HACKERS] Relation extension scalability
On Fri, Mar 4, 2016 at 9:59 PM, Robert Haas wrote: > > On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar wrote: > > I have tried the approach of group extend, > > > > 1. We convert the extension lock to TryLock and if we get the lock then > > extend by one block. > > 2. If we don't get the lock then use the group leader concept, where only one > > process will extend for all. A slight change from ProcArrayGroupClear is that > > here, other than satisfying the requested backend, we add some extra blocks in > > FSM, say GroupSize*10. > > 3. So we cannot measure the exact load, but we still have some factor like > > group size that tells us the contention size, and we extend in multiples of > > that. > > This approach seems good to me, and the performance results look very > positive. The nice thing about this is that there is not a > user-configurable knob; the system automatically determines when > larger extensions are needed, which will mean that real-world users > are much more likely to benefit from this. > I think one thing which needs more thought about this approach is that we need to maintain some number of slots so that group extend for different relations can happen in parallel. Do we want to provide simultaneous extension for 1, 2, 3, 4, 5, or more relations? I think providing it for three or four relations should be okay, as the higher the number we want to provide, the bigger the PGPROC structure will be.

+GroupExtendRelation(PGPROC *proc, Relation relation, BulkInsertState bistate)
+{
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ uint32 nextidx;
+ uint32 wakeidx;
+ int extraWaits = -1;
+ BlockNumber targetBlock;
+ int count = 0;
+
+ /* Add ourselves to the list of processes needing a group extend. */
+ proc->groupExtendMember = true;
..
..
+ /* We are the leader. Acquire the lock on behalf of everyone. */
+ LockRelationForExtension(relation, ExclusiveLock);

To provide it for multiple relations, I think you need to advertise the reloid of the relation in each proc and then get the relation descriptor for the relation extension lock. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Fri, Mar 4, 2016 at 11:31 PM, Tom Lane wrote: > > OK, here is a version that I think addresses all of the recent comments: > > * Fixed handling of parallel-query fields in new path node types. > (BTW, I found what seemed to be a couple of pre-existing bugs of > the same kind, eg create_mergejoin_path was different from the > other two kinds of join as to setting parallel_degree.) > I think the reason for keeping parallel_degree as zero for the mergejoin path is that currently it can't participate in parallelism.

*** create_unique_path(PlannerInfo *root, Re
*** 1440,1446 ****
  pathnode->path.param_info = subpath->param_info;
  pathnode->path.parallel_aware = false;
  pathnode->path.parallel_safe = subpath->parallel_safe;
! pathnode->path.parallel_degree = 0;

  /*
   * Assume the output is unsorted, since we don't necessarily have pathkeys
--- 1445,1451 ----
  pathnode->path.param_info = subpath->param_info;
  pathnode->path.parallel_aware = false;
  pathnode->path.parallel_safe = subpath->parallel_safe;
! pathnode->path.parallel_degree = subpath->parallel_degree;

For the same reason as the merge join path, I think this should also be set to 0. Similarly for LimitPath, parallel_degree should be set to 0.

+ RecursiveUnionPath *
+ create_recursiveunion_path(PlannerInfo *root,
+                            RelOptInfo *rel,
+                            Path *leftpath,
+                            Path *rightpath,
+                            PathTarget *target,
+                            List *distinctList,
+                            int wtParam,
+                            double numGroups)
+ {
+     RecursiveUnionPath *pathnode = makeNode(RecursiveUnionPath);
+
+     pathnode->path.pathtype = T_RecursiveUnion;
+     pathnode->path.parent = rel;
+     pathnode->path.pathtarget = target;
+     /* For now, assume we are above any joins, so no parameterization */
+     pathnode->path.param_info = NULL;
+     pathnode->path.parallel_aware = false;
+     pathnode->path.parallel_safe =
+         leftpath->parallel_safe && rightpath->parallel_safe;

I think here we should use rel->consider_parallel to set parallel_safe, as is done in create_mergejoin_path.

+ * It's only needed atop a node that doesn't support projection

"needed atop a node" seems unclear to me; typo? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Sat, Mar 5, 2016 at 4:51 PM, Amit Kapila wrote: > > On Fri, Mar 4, 2016 at 11:31 PM, Tom Lane wrote: > > > > OK, here is a version that I think addresses all of the recent comments: > > > > * Fixed handling of parallel-query fields in new path node types. > > (BTW, I found what seemed to be a couple of pre-existing bugs of > > the same kind, eg create_mergejoin_path was different from the > > other two kinds of join as to setting parallel_degree.) > > > > I think the reason for keeping parallel_degree as zero for mergejoin path is that currently it can't participate in parallelism. > > > *** create_unique_path(PlannerInfo *root, Re > *** 1440,1446 > pathnode->path.param_info = subpath->param_info; > pathnode->path.parallel_aware = false; > pathnode->path.parallel_safe = subpath->parallel_safe; > ! pathnode->path.parallel_degree = 0; > > /* > * Assume the output is unsorted, since we don't necessarily have pathkeys > --- 1445,1451 > pathnode->path.param_info = subpath->param_info; > pathnode->path.parallel_aware = false; > pathnode->path.parallel_safe = subpath->parallel_safe; > ! pathnode->path.parallel_degree = subpath->parallel_degree; > > Similar to reason for merge join path, I think this should also be set to 0. > > Similarly for LimitPath, parallel_degree should be set to 0. > Oops, it seems Robert commented upthread that we should set parallel_degree from the subpath except for write paths, so I think the above comment can be ignored. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Sat, Mar 5, 2016 at 10:11 PM, Tom Lane wrote: > > Amit Kapila writes: > > On Fri, Mar 4, 2016 at 11:31 PM, Tom Lane wrote: > >> (BTW, I found what seemed to be a couple of pre-existing bugs of > >> the same kind, eg create_mergejoin_path was different from the > >> other two kinds of join as to setting parallel_degree.) > > > I think the reason for keeping parallel_degree as zero for mergejoin path > > is that currently it can't participate in parallelism. > > Is there some reason why hash and nestloop are safe but merge isn't? > I think it is because we consider pushing hash and nestloop down to workers, but not merge join; the reason for not pushing merge join is that we currently don't have executor support for it, so more work is needed there. I think setting parallel_degree for the merge join path as you are doing in the patch is harmless, but ideally there is no need to set it, given what we support today in terms of parallelism. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Sun, Mar 6, 2016 at 9:02 PM, Tom Lane wrote: > > Amit Kapila writes: > > On Sat, Mar 5, 2016 at 10:11 PM, Tom Lane wrote: > >> Is there some reason why hash and nestloop are safe but merge isn't? > > > I think it is because we consider to pushdown hash and nestloop to workers, > > but not merge join and the reason for not pushing mergejoin is that > > currently we don't have executor support for same, more work is needed > > there. > > If that's true, then mergejoin paths ought to be marked parallel-unsafe > explicitly (with a comment as to why), not just silently reduced to degree > zero in a manner that looks more like an oversight than anything > intentional. > > I also note that the regression tests pass with this patch and parallel > mode forced, which seems unlikely if allowing a parallel worker to execute > a join works for only two out of the three join types. And checking the > git history for nodeHashjoin.c, nodeHash.c, and nodeNestloop.c shows no > evidence that any of those files have been touched for parallel query, > so it's pretty hard to see a reason why those would work in parallel > queries but nodeMergejoin.c not. > To make hash and nestloop work in parallel queries, we just push those nodes below the Gather node. Refer to the code paths match_unsorted_outer()->consider_parallel_nestloop() and hash_inner_and_outer()->try_partial_hashjoin_path(). Once the partial paths for hash and nestloop are generated in the above code paths, we generate a Gather path in add_paths_to_joinrel()->generate_gather_paths(). You won't find code to generate a partial path for merge join. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Mon, Mar 7, 2016 at 9:13 AM, Tom Lane wrote: > > Amit Kapila writes: > >>>> Is there some reason why hash and nestloop are safe but merge isn't? > > > To make hash and nestloop work in parallel queries, we just push those > > nodes below gather node. Refer code > > paths match_unsorted_outer()->consider_parallel_nestloop() > > and hash_inner_and_outer()->try_partial_hashjoin_path(). > > AFAICS, those are about generating partial paths, which is a completely > different thing from whether a regular path is parallel-safe or not. > Okay, but the main point I wanted to convey is that setting parallel_degree = 0 in the mergejoin path is not necessarily a copy-paste error. If you look at other path generation code like create_index_path(), create_samplescan_path(), etc., we set parallel_degree to zero there even though the parallel-safety is determined based on the RelOptInfo. And I don't see any use in setting parallel_degree for a path which can't be pushed beneath a Gather (i.e., executed by multiple workers). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ExecGather() + nworkers
On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan wrote: > > On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas wrote: > > Now, you might wonder why it is that the leader cannot also sort runs, > just as a worker would. It's possible, but it isn't exactly > straightforward. You have to have special cases in several places, > even though it probably is going to be uncommon to only have one > BackgroundWorkerSlot available in practice. It's simpler to just > opt-out, and seems better given that max_parallel_degree is a way of > resource limiting based on available cores (it's certainly not about > the availability of shared memory for the BackgroundWorkerSlot array). > > More importantly, I have other, entirely general concerns. Other major > RDBMSs have settings that are very similar to max_parallel_degree, > with a setting of 1 effectively disabling all parallelism. Both Oracle > and SQL Server have settings that they both call the "maximum degree > of parallelism". I think it's a bit odd that with Postgres, > max_parallel_degree = 1 can still use parallelism at all. I have to > wonder: are we conflating controlling the resources used by parallel > operations with how shared memory is doled out? > Your point is genuine, but OTOH, if max_parallel_degree = 1 meant that parallelism is disabled, then setting max_parallel_degree = 2 would mean only one worker process can be used for a parallel query, which looks somewhat odd to me. Also, a parallel query would then be able to get parallel workers up to max_worker_processes + 1, which seems less intuitive than the current behaviour. On your point about other databases, I checked and some of them, like Sybase [1], also provide a similar parameter where a value of 1 means serial execution, so we might want to consider treating it similarly; to me, though, the current setting seems quite intuitive. However, if more people feel the same as you, then we should change it. 
[1] - http://infocenter.sybase.com/archive/index.jsp?topic=/com.sybase.help.ase_15.0.sag1/html/sag1/sag1234.htm With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Mon, Mar 7, 2016 at 11:52 AM, Peter Geoghegan wrote: > > On Sun, Mar 6, 2016 at 9:59 PM, Tom Lane wrote: > > Perhaps it was intentional when written, but if Robert's advice is correct > > that the new upper-planner path nodes should copy up parallel_degree from > > their children, then it cannot be the case that parallel_degree>0 in a > > node above the scan level implies that that node type has any special > > behavior for parallelism. > > Right. > > > I continue to bemoan the lack of documentation about what these fields > > mean. > > Point taken; if Robert doesn't feel otherwise, I can try to write a patch to explain the newly added fields. > > As far as I can find, the sum total of the documentation about > > this field is > > > > int parallel_degree; /* desired parallel degree; 0 = not parallel */ > > While it doesn't particularly relate to parallel joins, I've expressed > a general concern about the max_parallel_degree GUC that I think is > worth considering again: > > http://www.postgresql.org/message-id/cam3swzrs1mtvrkkasy1xbshgzxkd6-hnxx3gq7x-p-dz0zt...@mail.gmail.com > > In summary, I think it's surprising that a max_parallel_degree of 1 > doesn't disable parallel workers entirely. > I have responded with my thoughts on the thread where you raised that point; discussing it here would dilute the purpose of this thread. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 7, 2016 at 8:34 PM, Robert Haas wrote: > > On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila wrote: > > I think one thing which needs more thoughts about this approach is that we > > need to maintain some number of slots so that group extend for different > > relations can happen in parallel. Do we want to provide simultaneous > > extension for 1, 2, 3, 4, 5 or more number of relations? I think providing > > it for three or four relations should be okay as higher the number we want > > to provide, bigger the size of PGPROC structure will be. > > Hmm. Can we drive this off of the heavyweight lock manager's idea of > how big the relation extension lock wait queue is, instead of adding > more stuff to PGPROC? > One idea to make it work without adding additional stuff in PGPROC: after acquiring the relation extension lock, check whether there is any available block in the FSM. If a block is found, release the lock and proceed; otherwise, extend the relation by one block, then check the lock's wait queue size or the number of lock requests (nRequested), extend the relation further in proportion to the wait queue size, and then release the lock and proceed. I think we could even check the wait queue size before extending the relation by the first block. The benefit of doing it with PGPROC is that there will be relatively fewer LockAcquire calls than with the heavyweight lock approach, but I think that should not matter much because we are planning to extend the relation in proportion to the wait queue size (probably wait queue size * 10). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
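As a minimal sketch of the batching heuristic described above (the function name and the multiplier of 10 are assumptions taken from this discussion, not actual patch code): once the backend holds the extension lock and the FSM turns up empty, it extends by one block for itself plus a batch proportional to the lock's wait queue size:

```c
#include <assert.h>

/* Hypothetical multiplier, from the "wait queue size * 10" suggestion
 * in this thread. */
#define EXTRA_BLOCKS_PER_WAITER 10

/* Compute how many blocks to add: one for the extending backend itself,
 * plus extra blocks in proportion to the number of backends queued
 * behind the relation extension lock. */
static int blocks_to_extend(int lock_waiters)
{
    return 1 + lock_waiters * EXTRA_BLOCKS_PER_WAITER;
}
```

With no waiters this degenerates to the current one-block-at-a-time behaviour, and under contention the batch grows automatically, which is the self-tuning property Robert liked about the group-extend approach.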
Re: [HACKERS] Relation extension scalability
On Tue, Mar 8, 2016 at 7:23 PM, Robert Haas wrote: > > On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila wrote: > >> Hmm. Can we drive this off of the heavyweight lock manager's idea of > >> how big the relation extension lock wait queue is, instead of adding > >> more stuff to PGPROC? > > > > One idea to make it work without adding additional stuff in PGPROC is that > > after acquiring relation extension lock, check if there is any available > > block in fsm, if it founds any block, then release the lock and proceed, > > else extend the relation by one block and then check lock's wait queue size > > or number of lock requests (nRequested) and extend the relation further in > > proportion to wait queue size and then release the lock and proceed. Here, > > I think we can check for wait queue size even before extending the relation > > by one block. > > > > The benefit of doing it with PGPROC is that there will be relatively less > > number LockAcquire calls as compare to heavyweight lock approach, which I > > think should not matter much because we are planing to extend the relation > > in proportion to wait queue size (probably wait queue size * 10). > > I don't think switching relation extension from heavyweight locks to > lightweight locks is going to work. > Sorry, but I am not suggesting changing it to lightweight locks. I am just suggesting how to make batching work with heavyweight locks, as you asked. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Wed, Mar 9, 2016 at 11:46 AM, Haribabu Kommi wrote: > > > I tried replacing the random() with PostmasterRandom() for a test and it worked. > This is generating different random values, so the issue is not occurring. > > "Global/PostgreSQL.2115609797" > > I feel we should add the data directory path + the random number to > generate the name for dynamic shared memory; this can fix the problem. > As mentioned above, I think investigating why this error is generated will be helpful. Currently the code ensures that if the segment already exists, it retries creating a segment with another name (refer dsm_impl_windows()), so the point of investigation is: why is it not going via that path? I am guessing that for some reason CreateFileMapping() is returning NULL in this case, whereas ideally it should return the existing handle with the error ERROR_ALREADY_EXISTS. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Tue, Mar 8, 2016 at 2:31 AM, Tom Lane wrote: > > I wrote: > >> Attached is a version that addresses today's concerns, and also finishes > >> filling in the loose ends I'd left before, such as documentation and > >> outfuncs.c support. I think this is in a committable state now, though > >> I plan to read through the whole thing again. > > The extra read-through located some minor bugs, mainly places where I'd > forgotten to ensure that Path cost info was transposed into the generated > Plan. That would only have the cosmetic effect that EXPLAIN would print > zeroes for estimated costs, and since we only use EXPLAIN COSTS OFF in > the regression tests, no test failures ensued :-(. > > I've pushed it now; we'll soon see if the buildfarm finds any problems. > On latest commit 51c0f63e, I am seeing some issues w.r.t. parallel query. Consider the case below:

create table t1(c1 int, c2 char(1000));
insert into t1 values(generate_series(1,30),'');
analyze t1;
set max_parallel_degree=2;

postgres=# explain select c1, count(c1) from t1 where c1 < 1000 group by c1;
ERROR: ORDER/GROUP BY expression not found in targetlist

Without setting max_parallel_degree, it works fine and generates the appropriate results. The issue seems to be that the code in grouping_planner doesn't apply the required PathTarget to the Path below the Gather path. Because of that, when we generate the target list for the scan plan, it misses the required information (in this case, sortgrouprefs), and the same target list is then propagated to upper nodes, which eventually leads to the above failure. For the same reason, if the target list contains an expression, it will give wrong results when a parallel query is used. I can see the following ways to solve this issue. Approach-1 - The first way to solve this issue is to jam the PathTarget into partial paths, similar to what we are doing for Paths, and I have verified that this resolves the issue; a patch for it is attached to this mail. 
However, it won't work as-is, because it would push target lists to workers (since we have applied them below the Gather path), which we don't want if the target list has any parallel-unsafe functions. To make this approach work, we need to ensure that we jam the PathTarget into partial paths (or generate a projection) only if they contain parallel-safe expressions. Then, while creating the Gather plan (create_gather_plan()), we need to ensure in some way that if the path below doesn't generate the required tlist, the Gather plan generates its own tlist rather than using the one from the subplan. Approach-2 - Always generate a tlist in the Gather plan, as we do in many other cases. I think this approach would also resolve the issue, but I haven't tried it yet. Approach-1 has the benefit that it allows target lists to be pushed to workers, which will speed up parallel queries, especially when the target list contains costly expressions, so I am slightly inclined to follow it even though it looks like more work. Thoughts? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com apply_tlist_partial_path_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 4, 2016 at 7:11 PM, Alexander Korotkov wrote: > >> >> If yes, then the only slight worry is that there will lot of repetition in wait_event_type column, otherwise it is okay. > > > There is morerows attribute of entry tag in Docbook SGML, it behaves like rowspan in HTML. > Thanks for the suggestion. I have updated the patch to include the wait_event_type information in the wait_event table. As asked above by Robert, below is the performance data with the patch.

M/C Details
--
IBM POWER-8
24 cores, 192 hardware threads
RAM = 492GB

Performance Data
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout=15min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9

pgbench read-only (median of 3, 5-min runs)
clients  BASE           PATCH           %
1        19703.549206   19992.141542    1.4646718364
8        120105.542849  127717.835367   6.3380026745
64       487334.338764  495861.7211254  1.7498012521

The read-only data shows some improvement with the patch, but I think this is mostly attributable to run-to-run variation.

pgbench read-write (median of 3, 30-min runs)
clients  BASE          PATCH          %
1        1703.275728   1696.568881    -0.3937616729
8        8884.406185   9442.387472    6.2804567394
64       32648.82798   32113.002416   -1.6411785572

In the above data, the read-write results show a small regression (1.6%) at the higher client count, but when I ran that test individually, the difference was 0.5%. I think it is mostly attributable to run-to-run variation, as we see with the read-only tests. Thanks to Mithun C Y for doing the performance testing of this patch. As this patch adds a 4-byte variable to the shared memory structure PGPROC, it is susceptible to memory alignment issues for shared buffers as discussed in thread [1], but in general the performance data doesn't indicate any regression. [1] - http://www.postgresql.org/message-id/caa4ek1+zeb8pmwwktf+3brs0pt4ux6rs6aom0uip8c6shjw...@mail.gmail.com With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com extend_pg_stat_activity_v13.patch Description: Binary data
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Mar 9, 2016 at 7:17 PM, Robert Haas wrote: > > On Wed, Mar 9, 2016 at 8:31 AM, Amit Kapila wrote: > > Thanks for the suggestion. I have updated the patch to include wait_event_type information in the wait_event table. > > I think we should remove "a server process is" from all of these entries. > > Also, I think this kind of thing should be tightened up: > > + A server process is waiting on any one of the commit_timestamp > + buffer locks to read or write the commit_timestamp page in the > + pg_commit_ts subdirectory. > > I'd just write: Waiting to read or write a commit timestamp buffer. > Okay, changed as per the suggestion, and fixed the morerows issue pointed out by Thom. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com extend_pg_stat_activity_v14.patch Description: Binary data
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On Thu, Mar 10, 2016 at 1:04 PM, Mithun Cy wrote: > > > On Thu, Mar 3, 2016 at 11:50 PM, Robert Haas > wrote: > >What if you apply both this and Amit's clog control log patch(es)? Maybe > the combination of the two helps substantially more than either >one alone. > > > I did the above tests along with Amit's clog patch. Machine: 8 socket, 64 > core, 128 hyperthreads.
>
> clients  BASE          ONLY CLOG CHANGES  % Increase     ONLY SAVE SNAPSHOT  % Increase    CLOG CHANGES + SAVE SNAPSHOT  % Increase
> 64       29247.658034  30855.728835       5.4981181711   29254.532186        0.0235032562  32691.832776                  11.7758992463
> 88       31214.305391  33313.393095       6.7247618606   32109.248609        2.8670931702  35433.655203                  13.5173592978
> 128      30896.673949  34015.362152       10.0939285832  ***                 ***           34947.296355                  13.110221549
> 256      27183.780921  31192.895437       14.7481857938  ***                 ***           32873.782735                  20.9316056164
>
> With clog changes, perf of the save-snapshot patch improves. > > This data looks promising to me and indicates that saving the snapshot has benefits, and we can see a noticeable performance improvement especially once the CLOG contention gets reduced. I wonder if we should try these tests with unlogged tables, because I suspect most of the contention after CLogControlLock and ProcArrayLock is for WAL-related locks, so you might see better gains with these patches. Do you think it makes sense to check the performance with an increased number of CLOG buffers (a patch for this is posted in the Speed Up Clog thread [1]), as that also relieves contention on CLOG per the tests I have done? [1] - http://www.postgresql.org/message-id/caa4ek1lmmgnq439bum0lcs3p0sb8s9kc-cugu_thnqmwa8_...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Wed, Mar 9, 2016 at 8:18 PM, Mithun Cy wrote:
>
> Hi All,
>
> Explain [Analyze] Select Into table produces a plan which uses parallel scans.
>
> Possible fix:
> I tried to make a patch to fix this. Now in ExplainOneQuery, if the into clause is defined, then parallel plans are disabled, similar to their execution.
>
> - /* plan the query */
> - plan = pg_plan_query(query, CURSOR_OPT_PARALLEL_OK, params);
> + /*
> +  * plan the query.
> +  * Note: If Explain is for CreateTableAs / SelectInto Avoid parallel
> +  * plans.
> +  */
> + plan = pg_plan_query(query, into ? 0:CURSOR_OPT_PARALLEL_OK, params);

There should be white space around the colon in 0:CURSOR_OPT_PARALLEL_OK. Also, I don't see that the comment is required, as other similar usages don't have any such comment. I have fixed these two points in the attached patch. In general, the patch looks good to me and solves the problem mentioned. I have run the regression tests with force_parallel_mode and don't see any problem.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

Analyze_select_into_disable_parallel_scan_v2.patch
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Wed, Mar 9, 2016 at 5:46 PM, Haribabu Kommi wrote:
> On Wed, Mar 9, 2016 at 10:06 PM, Amit Kapila wrote:
> > On Wed, Mar 9, 2016 at 11:46 AM, Haribabu Kommi <kommi.harib...@gmail.com> wrote:
> >>
> >> I tried replacing the random() with PostmasterRandom() for a test and it worked.
> >> This generates different random values, so the issue does not occur.
> >>
> >> "Global/PostgreSQL.2115609797"
> >>
> >> I feel we should add the data directory path + the random number to generate the name for dynamic shared memory; this can fix the problem.
> >
> > As mentioned above, I think if we can investigate why this error is generated, that will be helpful. Currently the code ensures that if the segment already exists, it retries to create a segment with another name (refer dsm_impl_windows()), so the point of investigation is: why is it not going via that path? I am guessing that for some reason CreateFileMapping() is returning NULL in this case, whereas ideally it should return the existing handle with the error ERROR_ALREADY_EXISTS.
>
> DEBUG: mapped win32 error code 5 to 13
>
> Yes, CreateFileMapping() is returning NULL with an error of ERROR_ACCESS_DENIED.

Okay, so one probable theory for such an error could be that when an object with the same name already exists, this API requests access to that existing object and finds that it can't access it for some reason. On googling, I found some people suggesting trying with UAC [1] disabled on your m/c; can you try that once to see what the result is (this experiment is just to find out the actual reason for the failure, rather than a permanent change suggestion)?

> I am not able to find the reason for this error. This error occurs only when PostgreSQL is started as a service.

Did you use pg_ctl register/unregister to register the different services? Can you share the detailed steps and the OS version on which you saw this behaviour?
[1] - http://windows.microsoft.com/en-in/windows/turn-user-account-control-on-off#1TC=windows-7 With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
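The retry path referred to above in dsm_impl_windows() can be illustrated abstractly. This is a hypothetical Python sketch of the behaviour, not the real implementation (which is C code around CreateFileMapping()); the names existing_names and make_name are stand-ins invented here for the OS namespace and the random name generator:

```python
def create_segment(existing_names, make_name, max_tries=10):
    # Sketch: when the chosen name collides with an existing segment,
    # generate a fresh name and try again instead of failing outright.
    for _ in range(max_tries):
        name = make_name()
        if name not in existing_names:
            existing_names.add(name)
            return name
    raise RuntimeError("could not create shared memory segment")
```

The reported failure corresponds to the case where the existence check itself errors out with a permission failure, so the loop never gets the chance to pick a new name.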
Re: [HACKERS] [COMMITTERS] pgsql: Provide much better wait information in pg_stat_activity.
On Fri, Mar 11, 2016 at 9:19 AM, Joel Jacobson wrote:
>
> On Fri, Mar 11, 2016 at 9:36 AM, Robert Haas wrote:
> > Well, this was discussed. If we keep the Boolean "waiting" column, then either:
>
> Oh, sorry for missing out on that discussion.
>
> > 1. We make it true only for heavyweight lock waits, and false for other kinds of waits. That's pretty strange.
> > 2. We make it true for all kinds of waits that we now know how to report. That still breaks compatibility.
>
> Why not 3: We make it true for exactly the same type of situations as in previous versions. Or is it not possible to do so for some reason?

That's exactly Robert's first point (1). One thing that will be strange, in my view, is that in some cases waiting will be false while wait_event and wait_event_type still contain some wait information, and I think that will look odd to anybody new looking at the view.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 11, 2016 at 12:28 AM, Robert Haas wrote:
>
> Committed with some further editing. In particular, the way you determined whether we could safely access the tranche information for any given ID was wrong; please check over what I did and make sure that isn't also wrong.

There are a few typos which I have tried to fix with the attached patch. Can you tell me what was wrong with the way it was done in the patch?

@@ -4541,9 +4542,10 @@ AbortSubTransaction(void)
   */
  LWLockReleaseAll();

+ pgstat_report_wait_end();
+ pgstat_progress_end_command();
  AbortBufferIO();
  UnlockBuffers();
- pgstat_progress_end_command();

  /* Reset WAL record construction state */
  XLogResetInsertion();

@@ -4653,6 +4655,9 @@ AbortSubTransaction(void)
   */
  XactReadOnly = s->prevXactReadOnly;

+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+
  RESUME_INTERRUPTS();
 }

AbortSubTransaction() does call pgstat_report_wait_end() twice; is this intentional? I have kept one at the end because there is a chance that in between, APIs can again set the state to wait, and also by that time we have not released buffer pins and heavyweight locks, so I am not sure it makes sense to report wait end before that stage. I have noticed that in WaitOnLock(), on error the wait end is set, but now, thinking about it again, it seems it would be better to set it at the end of AbortTransaction/AbortSubTransaction. What do you think?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_typo_lwlock_v1.patch
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Fri, Mar 11, 2016 at 5:21 PM, Haribabu Kommi wrote:
>
> On Fri, Mar 11, 2016 at 12:00 AM, Amit Kapila wrote:
> >
> > Okay, so one probable theory for such an error could be that when an object with the same name already exists, this API requests access to that existing object and finds that it can't access it for some reason. On googling, I found some people suggesting trying with UAC [1] disabled on your m/c; can you try that once to see what the result is (this experiment is just to find out the actual reason for the failure, rather than a permanent change suggestion)?
>
> Thanks for the details. Currently I am unable to change the UAC settings on my laptop. I will try to do it on a different system and let you know the result later.
>
> >> I am not able to find the reason for this error. This error occurs only when PostgreSQL is started as a service.
> >
> > Did you use pg_ctl register/unregister to register the different services? Can you share the detailed steps and the OS version on which you saw this behaviour?
>
> Operating system - Windows 7
> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the problem)
>
> 1. Create two standard users in the system (test_user1 and test_user2)

I think one possibility is that one user is not able to access the object created by another user. If possible, can you also try with just one user (have the same user for both the services)?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 12, 2016 at 8:16 AM, Dilip Kumar wrote:
>
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby wrote:
>>
>> FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.
>>
>> I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.
>
> I agree. We can have some max limit on the number of extra pages. What do others think?
>
>> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.
>
> *1 will not solve this problem. Here the main problem was that many backends sleep/wake up on the extension lock, and that was causing the bottleneck. So if we do *1, this will satisfy only the current requesters which have already waited on the lock; but our goal is to keep backends from requesting this lock at all.
>
> The idea is to find the number of requesters as a statistic on this lock (the load on the lock) and extend by a multiple of the load, so that this situation is avoided for a long time, and when it happens again, extend by a multiple of the load again.
>
> How did 20 come about? I tested with multiple client loads 1..64, with multiple load sizes from 4-byte records to 1KB records, COPY/INSERT, and found that 20 works best.

Can you post the numbers for 1, 5, 10, 15, 25, or whatever other multipliers you have tried, so that it is clear that 20 is best?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
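The extend-by-a-multiple-of-the-load policy with a clamp, as discussed above, can be sketched as follows. The *20 multiplier comes from the thread; the clamp value (512 pages, i.e. 4MB with 8KB pages) is an illustrative assumption, not a figure from the actual patch:

```python
def pages_to_add(lock_waiters, multiplier=20, max_extra_pages=512):
    # Extend the relation by a multiple of the number of backends waiting
    # on the extension lock, clamped to an upper bound so that a high
    # max_connections setting cannot trigger a huge OS-level extension.
    return min(max(lock_waiters, 1) * multiplier, max_extra_pages)
```

With the clamp in place, a burst of 100 waiters would still only extend by 512 pages rather than 2000.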
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Fri, Mar 11, 2016 at 3:34 PM, Mithun Cy wrote:
>
> On Thu, Mar 10, 2016 at 9:39 PM, Robert Haas wrote:
> > I guess there must not be an occurrence of this pattern in the regression tests, or previous force_parallel_mode testing would have found this problem. Perhaps this patch should add one?
>
> I have added the test to select_into.sql: an Explain select into statement.

I don't see how this test will fail with force_parallel_mode=regress and max_parallel_degree > 0 even without the patch proposed to fix the issue at hand. In short, I don't think this test would have caught the issue, so I don't see much advantage in adding it. Even if we want to add such a test case, I think as proposed it will substantially increase the timing of the "Select Into" test, which might not be an acceptable addition, especially for testing one corner case.

> Explain Analyze produces planning time and execution time even with TIMING OFF, so I am not adding the same to the regression tests.

Yeah, that makes the addition of a test for this functionality difficult. Robert, do you have any idea what kind of test would have caught this issue?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [COMMITTERS] pgsql: Provide much better wait information in pg_stat_activity.
On Sat, Mar 12, 2016 at 5:01 AM, Jim Nasby wrote:
>
> On 3/10/16 8:36 PM, Robert Haas wrote:
>>
>> 1. We make it true only for heavyweight lock waits, and false for other kinds of waits. That's pretty strange.
>> 2. We make it true for all kinds of waits that we now know how to report. That still breaks compatibility.
>
> I would absolutely vote for 2 here. You could even argue that it's a bug fix, since those were waits we technically should have been indicating.

I see it the other way around. I think waiting=true for only heavyweight locks makes sense in the existing versions, as the user can still find out what's actually going on in the system either by looking at "query" in pg_stat_activity or by referring to pg_locks; but OTOH, if waiting is true for all kinds of waits (lwlock, heavyweight lock, I/O, etc.), then I think it will be difficult for the user to make any sense out of it. So I see going for option 2 as confusing users rather than simplifying anything.

> Another random thought... changes like this would probably be easier to handle if we provided backwards compatibility extensions that created views that mimicked the catalog for a specific Postgres version.

That makes sense to me if other people agree to it. I think there will be some maintenance overhead, but I see that as worth the effort in terms of user convenience.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Sat, Mar 12, 2016 at 2:02 PM, Mithun Cy wrote:
>
> On Sat, Mar 12, 2016 at 12:28 PM, Amit Kapila wrote:
> > I don't see how this test will fail with force_parallel_mode=regress and max_parallel_degree > 0 even without the patch proposed to fix the issue at hand. In short, I don't think this test would have caught the issue, so I don't see much advantage in adding such a test. Even if we want to add such a test case, I think as proposed it will substantially increase the timing of the "Select Into" test, which might not be an acceptable addition, especially for testing one corner case.
>
> Without the above patch, make installcheck fails for select_into.sql with the below diff when
> force_parallel_mode = on
> max_parallel_degree = 3

With force_parallel_mode=on, I could see many other failures as well. I think it is better to have a test which tests this functionality with force_parallel_mode=regress.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Background Processes and reporting
On Sat, Mar 12, 2016 at 3:10 AM, Andres Freund wrote:
>
> > Similarly for the wait event stuff - checkpointer, wal writer, background writer are in many cases processes that very often are blocked on locks, IO and such. Thus restricting the facility to database connected processes seems like a loss.
>
> I think one way to address this would be to not only report PgBackendStatus type processes in pg_stat_activity. While that'd obviously be a compatibility break, I think it'd be an improvement.

I think another point here that needs more thought is that many of the pg_stat_activity fields are not relevant for background processes. Of course, one can say that we can keep those fields as NULL, but I still think that indicates it is not the most suitable way to expose such information. Another way could be to have a new view like pg_stat_background_activity with only the relevant fields, or to expose the information via individual views like pg_stat_bgwriter. Do you intend to get this done for 9.6, considering an add-on patch for the wait event information displayed in pg_stat_activity?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Background Processes and reporting
On Sat, Mar 12, 2016 at 2:38 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
>
> On Sat, Mar 12, 2016 at 2:45 AM, Andres Freund wrote:
>>
>> I think I agree with Robert here. Providing hooks into very low level places tends to lead to problems in my experience; tight control over what happens is often important - I certainly don't want any external code to run while we're waiting for an lwlock.
>
> So, I get the following.
>
> 1) Detailed wait monitoring might cause high overhead on some systems.
> 2) We want wait monitoring to be always on. And we don't want options to enable additional features of wait monitoring.

I am not able to see how any of the above comments indicate that wait monitoring needs to be always on. Why can't we consider it off by default, especially for features like timing calculations where we suspect some performance penalty, and then, if during development it is proven that none of the additional wait events cause any overhead, keep them on by default?

> 3) We don't want a hook for wait events to be exposed.
>
> Can I conclude that we reject detailed wait monitoring by design?

I don't think so.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Sat, Mar 12, 2016 at 7:11 PM, Mithun Cy wrote:
>
> On Sat, Mar 12, 2016 at 2:32 PM, Amit Kapila wrote:
> > With force_parallel_mode=on, I could see many other failures as well. I think it is better to have a test which tests this functionality with force_parallel_mode=regress.
>
> As per the user manual: setting this value to regress has all of the same effects as setting it to on, plus some additional effects that are intended to facilitate automated regression testing.

Yes, that is the only reason I mentioned that it is better to have a test which can be checked in an automated way, and I understand that the way you have written the test using Explain won't work in an automated way, so I am not sure it is a good idea to add such a test.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Fri, Mar 11, 2016 at 5:21 PM, Haribabu Kommi wrote:
>
> On Fri, Mar 11, 2016 at 12:00 AM, Amit Kapila wrote:
> >
> >> I am not able to find the reason for this error. This error occurs only when PostgreSQL is started as a service.
> >
> > Did you use pg_ctl register/unregister to register the different services? Can you share the detailed steps and the OS version on which you saw this behaviour?
>
> Operating system - Windows 7
> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the problem)
>
> 1. Create two standard users in the system (test_user1 and test_user2)
> 2. Create two databases belonging to each of the users listed above.
> 3. Now, using pg_ctl, register the services for the two users.
> 4. Provide logon permissions to these users to run the services by changing the service properties.

Did you mean that you changed "Log on as: Local System Account" in the service properties, or something else?

> 5. Now try to start the services; the second service fails with the error message.
> 6. Error details can be found in the Event Log viewer.

If I follow the above steps and do as I mentioned for step 4, I am not able to reproduce the issue on a Windows 7 m/c using code from HEAD.

> Yes, it works as same-user services. The main problem is that PostgreSQL as a service for two different users on the same system does not work, because the same random number gets generated for the two services.

I am not sure why you think the same random number is the problem; as mentioned above, even if the dsm name is the same due to the same random number, the code has logic to process it appropriately (regenerate the name of the dsm). Having said that, I don't mean to say that we shouldn't have logic to generate a unique name, and I think we might want to add the data directory path to name generation as we do for the main shared memory; however, it is better to first completely understand the underlying issue.
If I understand correctly, the problem here is that the second user doesn't have the appropriate access rights to access the object created by the first user. On reading the documentation of CreateFileMapping(), it seems that the user should have the SeCreateGlobalPrivilege privilege to create an object in the Global namespace. Can you once try giving that privilege to the users you created? To give this privilege, go to Control Panel -> System And Security -> Administrative Tools -> Local Security Policy -> Local Policies -> User Rights Assignment; in the right window, select "Create global objects", double-click it, and add the newly created users. Rerun your test after these steps.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
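The name-generation idea raised earlier in this thread (mix the data directory path into the dsm segment name, as is done for the main shared memory) can be sketched like this; the name format and the choice of hash are purely illustrative assumptions, not the actual PostgreSQL scheme:

```python
import hashlib
import random

def dsm_segment_name(data_dir, rng):
    # Hypothetical sketch: derive a short tag from the data directory path
    # and combine it with the random handle, so two clusters (services) on
    # the same machine cannot collide even if their random sequences match.
    dir_tag = hashlib.sha1(data_dir.encode()).hexdigest()[:8]
    return "Global/PostgreSQL.%s.%d" % (dir_tag, rng.randint(1, 2**31 - 1))
```

With identical random seeds, two clusters with different data directories still get distinct names.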
Re: [HACKERS] Prepared Statement support for Parallel query
On Fri, Feb 26, 2016 at 4:37 PM, Robert Haas wrote:
>
> And, I'm going to revert this part. If you'd run the regression tests under force_parallel_mode=regress, max_parallel_degree>0, you would have noticed that this part breaks it, because of CREATE TABLE ... AS EXECUTE.

I have looked into this issue and found that the reason for the failure is that with force_parallel_mode=regress, we enable the parallel mode restrictions even if a parallel plan is not chosen, as part of the below code in standard_planner():

if (force_parallel_mode == FORCE_PARALLEL_OFF || !glob->parallelModeOK)
{
    glob->parallelModeNeeded = false;
    glob->wholePlanParallelSafe = false; /* either false or don't care */
}
else
{
    glob->parallelModeNeeded = true;
    glob->wholePlanParallelSafe =
        !has_parallel_hazard((Node *) parse, false);
}

The failure cases fall into that category: basically, wholePlanParallelSafe will be false, but parallelModeNeeded will be true, which enables the parallel mode restrictions even though the plan won't contain a Gather node. I think if we want to operate in the above way for testing purposes, then we need to force during execution that statements for non-read-only operations do not enter parallel mode, similar to what we are doing for the non-zero tuple count case. The attached patch fixes the problem.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

prepared_stmt_parallel_query_v2.patch
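For readers following along, the planner behaviour quoted above can be restated in a compact, runnable form; this is only an illustration of the C snippet's logic, not the actual PostgreSQL code:

```python
FORCE_PARALLEL_OFF = "off"

def parallel_flags(force_parallel_mode, parallel_mode_ok, has_parallel_hazard):
    # Simplified restatement of the standard_planner() snippet: under
    # force_parallel_mode=regress/on, parallelModeNeeded is set even when
    # the whole plan is not parallel safe, which is exactly the state that
    # trips the parallel-mode restrictions for CREATE TABLE ... AS EXECUTE.
    if force_parallel_mode == FORCE_PARALLEL_OFF or not parallel_mode_ok:
        return {"parallelModeNeeded": False, "wholePlanParallelSafe": False}
    return {"parallelModeNeeded": True,
            "wholePlanParallelSafe": not has_parallel_hazard}
```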
Re: [HACKERS] Prepared Statement support for Parallel query
On Tue, Mar 15, 2016 at 12:21 AM, Robert Haas wrote: > > On Mon, Mar 14, 2016 at 9:18 AM, Amit Kapila wrote: > > On Fri, Feb 26, 2016 at 4:37 PM, Robert Haas wrote: > > > > > > The failure cases fall into that category, basically wholePlanParallelSafe > > will be false, but parallelModeNeeded will be true which will enable > > parallel mode restrictions even though the plan won't contain Gather node. > > I think if we want to operate in above way for testing purpose, then we need > > to force during execution that statements for non read-only operations > > should not enter into parallel mode similar to what we are doing for > > non-zero tuple count case. Attached patch fixes the problem. > > This seems like a really ugly fix. It might be possible to come up > with a fix along these lines, but I don't have much confidence in the > specific new test you've injected into executePlan(). Broadly, any > change of this type implicitly changes the contract between > executePlan() and the planner infrastructure - the planner can now > legally generate parallel plans in some cases where that would > previously have been unacceptable. But I'm not in a hurry to rethink > where we've drawn the line there for 9.6. Let's punt this issue for > now and come back to it in a future release. > No issues. I felt that it might be good to support parallel query via Prepare statement as there is no fundamental issue in the same, but as you say, we can do that in future release as well. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 15, 2016 at 12:00 AM, David Steele wrote:
>
> On 2/26/16 11:37 PM, Amit Kapila wrote:
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila
>> Here, we can see that there is a gain of ~15% to ~38% at higher client counts.
>>
>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains data, mainly focussing on single-client performance. The data is for multiple runs on different machines, so I thought it better to present it in the form of a document rather than dumping everything in e-mail. Do let me know if there is any confusion in understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by reverting commit ac1d794.
>
> This patch no longer applies cleanly:
>
> $ git apply ../other/group_update_clog_v6.patch
> error: patch failed: src/backend/storage/lmgr/proc.c:404
> error: src/backend/storage/lmgr/proc.c: patch does not apply
> error: patch failed: src/include/storage/proc.h:152
> error: src/include/storage/proc.h: patch does not apply

For me, with patch -p1 < it works, but anyhow I have updated the patch based on a recent commit. Can you please check the latest patch and see if it applies cleanly for you now?

> It's not clear to me whether Robert has completed a review of this code or it still needs to be reviewed more comprehensively.
>
> Other than a comment that needs to be fixed, it seems that all questions have been answered by Amit.

I have updated the comments and changed the name of one variable from "all_trans_same_page" to "all_xact_same_page", as pointed out offlist by Alvaro.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

group_update_clog_v7.patch
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Tue, Mar 15, 2016 at 5:22 AM, Robert Haas wrote: > > On Sat, Mar 12, 2016 at 1:58 AM, Amit Kapila wrote: > > Yeah, that makes the addition of test for this functionality difficult. > > Robert, do you have any idea what kind of test would have caught this issue? > > Yep. Committed with that test: > Nice. Thanks! With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Background Processes and reporting
On Tue, Mar 15, 2016 at 1:32 AM, Andres Freund wrote:
>
> On 2016-03-12 16:29:11 +0530, Amit Kapila wrote:
> > On Sat, Mar 12, 2016 at 3:10 AM, Andres Freund wrote:
> > >
> > > > Similarly for the wait event stuff - checkpointer, wal writer, background writer are in many cases processes that very often are blocked on locks, IO and such. Thus restricting the facility to database connected processes seems like a loss.
> > >
> > > I think one way to address this would be to not only report PgBackendStatus type processes in pg_stat_activity. While that'd obviously be a compatibility break, I think it'd be an improvement.
> >
> > I think another point here that needs more thought is that many of the pg_stat_activity fields are not relevant for background processes. Of course, one can say that we can keep those fields as NULL, but I still think that indicates it is not the most suitable way to expose such information.
>
> But neither are all of them relevant for autovacuum workers, and we already show them.

Right; currently any process which has been assigned a BackendId is a probable candidate for being displayed in pg_stat_activity, and all the related information is captured in PgBackendStatus.

> pg_stat_activity as a name imo doesn't really imply that it's about plain queries. ISTM we should add a 'backend_type' column that is one of backend|checkpointer|autovacuum|autovacuum-worker|wal writer|bgwriter|bgworker (or similar). That makes querying easier.

+1 for going that way if we decide to display background process information in the pg_stat_activity view. However, I think we might need some additional space in shared memory to track some of the statistics as we track in PgBackendStatus, or maybe for now just display some minimal stats like wait events for background processes.
> > > Another way could be to have a new view like pg_stat_background_activity with only the relevant fields, or to expose the information via individual views like pg_stat_bgwriter.
>
> I think the second is a pretty bad alternative; it'll force us to add new views with very similar information; and it'll be hard to get information about the whole system. I mean if you want to know which locks are causing problems, you don't primarily care whether it's a background process or a backend that has contention issues.

Agreed. OTOH, adding information from two different kinds of processes (one which has a BackendId and one which doesn't) also doesn't sound neat from the internal code perspective, but maybe this is just an initial fear, or maybe it is because we haven't come up with a patch which can show that it is actually a simple thing to achieve. Yet another idea for 9.6 could be to just define statistics functions to get wait events for background processes, like we have for backends (similar to pg_stat_get_backend_idset() and pg_stat_get_backend_wait_event()). I think we will most probably need those kinds of functions anyway once we expose such information for background processes, so having them now will at least provide some way for the user to get minimal information about background processes. I think that won't need much additional work.

> > Do you intend to get this done for 9.6, considering an add-on patch for the wait event information displayed in pg_stat_activity?
>
> I think we should fix this for 9.6; it's a weakness in a new interface. Let's not yank people around more than we need to.
>
> I'm willing to do some work on that, if we can agree upon a course.

Good to hear.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 15, 2016 at 7:54 PM, David Steele wrote: > > On 3/15/16 1:17 AM, Amit Kapila wrote: > > > On Tue, Mar 15, 2016 at 12:00 AM, David Steele > > >> This patch no longer applies cleanly: > >> > >> $ git apply ../other/group_update_clog_v6.patch > >> error: patch failed: src/backend/storage/lmgr/proc.c:404 > >> error: src/backend/storage/lmgr/proc.c: patch does not apply > >> error: patch failed: src/include/storage/proc.h:152 > >> error: src/include/storage/proc.h: patch does not apply > > > > For me, with patch -p1 < it works, but any how I have > > updated the patch based on recent commit. Can you please check the > > latest patch and see if it applies cleanly for you now. > > Yes, it now applies cleanly (101fd93). > Thanks for verification. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan wrote:
> On Mon, 14 Mar 2016 08:42:26 -0400 David Steele wrote:
>
> > On 2/18/16 10:10 AM, Constantin S. Pan wrote:
> > > On Wed, 17 Feb 2016 23:01:47 +0300 Oleg Bartunov wrote:
> > >
> > >> My feedback is (Mac OS X 10.11.3)
> > >>
> > >> set gin_parallel_workers=2;
> > >> create index message_body_idx on messages using gin(body_tsvector);
> > >> LOG: worker process: parallel worker for PID 5689 (PID 6906) was terminated by signal 11: Segmentation fault
> > >
> > > Fixed this, try the new patch. The bug was in incorrect handling of some GIN categories.
> >
> > Oleg, it looks like Constantin has updated the patch to address the issue you were seeing. Do you have time to retest and review?
>
> Actually, there has been some progress since. The patch is attached.
>
> 1. Added another GUC parameter for changing the amount of shared memory for parallel GIN workers.
>
> 2. Changed the way results are merged. It uses a shared memory message queue now.
>
> 3. Tested on some real data (GIN index on email message body tsvectors). Here are the timings for different values of 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU machine. Seems 'gin_shared_mem' has no visible effect.
>
> wnum  mem(MB)  time(s)
>  0     16      247
>  1     16      256

It seems from your data that with 1 worker you are always seeing a slowdown; have you investigated the reason for this?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Pushdown target list below gather node (WAS Re: [HACKERS] WIP: Upper planner pathification)
On Wed, Mar 9, 2016 at 11:58 PM, Robert Haas wrote: > > On Wed, Mar 9, 2016 at 12:33 PM, Tom Lane wrote: > > > > Gather is a bit weird, because although it can project (and needs to, > > per the example of needing to compute a non-parallel-safe function), > > you would rather push down as much work as possible to the child node; > > and doing so is semantically OK for parallel-safe functions. (Pushing > > functions down past a Sort node, for a counterexample, is not so OK > > if you are concerned about function evaluation order, or even number > > of executions.) > > > > In the current code structure it would perhaps be reasonable to teach > > apply_projection_to_path about that --- although this would require > > logic to separate parallel-safe and non-parallel-safe subexpressions, > > which doesn't quite seem like something apply_projection_to_path > > should be doing. > > I think for v1 it would be fine to make this all-or-nothing; that's > what I had in mind to do. That is, if the entire tlist is > parallel-safe, push it all down. If not, let the workers just return > the necessary Vars and have Gather compute the final tlist. > I found it quite convenient to teach apply_projection_to_path() to push the target list down beneath the Gather node when the target list contains only parallel-safe expressions. The attached patch implements pushing the target list beneath the Gather node. Below is the output of a simple test which shows the effect of the implementation.
Without Patch -

postgres=# explain verbose select c1+2 from t1 where c1<10;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Gather  (cost=0.00..44420.43 rows=30 width=4)
   Output: (c1 + 2)
   Number of Workers: 2
   ->  Parallel Seq Scan on public.t1  (cost=0.00..44420.35 rows=13 width=4)
         Output: c1
         Filter: (t1.c1 < 10)
(6 rows)

With Patch -

postgres=# explain verbose select c1+2 from t1 where c1<10;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Gather  (cost=0.00..45063.75 rows=30 width=4)
   Output: ((c1 + 2))
   Number of Workers: 1
   ->  Parallel Seq Scan on public.t1  (cost=0.00..45063.68 rows=18 width=4)
         Output: (c1 + 2)
         Filter: (t1.c1 < 10)
(6 rows)

In the above plans, you can notice that the target list expression (c1 + 2) is pushed beneath the Gather node after the patch. Thoughts? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com parallel-tlist-pushdown-v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Sat, Mar 19, 2016 at 1:41 AM, Robert Haas wrote: > > On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev > wrote: > > > > So answering your question - it turned out that we _can't_ reduce > > NUM_FREELISTS this way. > > That's perplexing. I would have expected that with all of the mutexes > packed in back-to-back like this, we would end up with a considerable > amount of false sharing. I don't know why it ever helps to have an > array of bytes all in the same cache line of which each one is a > heavily-trafficked mutex. Anybody else have a theory? > > One other thing that concerns me mildly is the way we're handling > nentries. It should be true, with this patch, that the individual > nentries sum to the right value modulo 2^32. But I don't think > there's any guarantee that the values are positive any more, and in > theory after running long enough one of them could manage to overflow > or underflow. > In theory, won't nentries overflow even without the patch after running for a very long time? I think with the patch it is more prone to overflow because we start borrowing from other free lists as well. > So at a very minimum I think we need to remove the Assert() that the > value is not negative. But really, I wonder if we > shouldn't rework things a little more than that. > > One idea is to jigger things so that we maintain a count of the total > number of entries that doesn't change except when we allocate, and > then for each freelist partition we maintain the number of entries in > that freelist partition. So then the size of the hash table, instead > of being sum(nentries) is totalsize - sum(nfree). > To me, your idea sounds much better than the current code in terms of understanding the free-list concept as well. So, +1 for changing the code in this way. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund wrote: > > On 2016-03-18 20:14:07 +0530, Amit Kapila wrote: > > > I have done some > > tests on Windows with 0003 patch which includes running the regressions > > (vcregress check) and it passes. Will look into it tomorrow once again and > > share if I find anything wrong with it, but feel to proceed if you want. > > Thanks for the testing thus far! Let's see what the buildfarm has to > say. > Won't the new code need to ensure that ResetEvent(latchevent) gets called in case WaitForMultipleObjects() comes out when both pgwin32_signal_event and latchevent are signalled at the same time? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund wrote: > > > > On March 18, 2016 11:32:53 PM PDT, Amit Kapila wrote: > >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund > >wrote: > >> > >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote: > >> > >> > I have done some > >> > tests on Windows with 0003 patch which includes running the > >regressions > >> > (vcregress check) and it passes. Will look into it tomorrow once > >again > >and > >> > share if I find anything wrong with it, but feel to proceed if you > >want. > >> > >> Thanks for the testing thus far! Let's see what the buildfarm has to > >> say. > >> > > > >Won't the new code needs to ensure that ResetEvent(latchevent) should > >get > >called in case WaitForMultipleObjects() comes out when both > >pgwin32_signal_event and latchevent are signalled at the same time? > > WaitForMultiple only reports the readiness of on event at a time, no? > I don't think so, please read link [1] with a focus on below paragraph which states how it reports the readiness or signaled state when multiple objects become signaled. "When *bWaitAll* is *FALSE*, this function checks the handles in the array in order starting with index 0, until one of the objects is signaled. If multiple objects become signaled, the function returns the index of the first handle in the array whose object was signaled." [1] - https://msdn.microsoft.com/en-us/library/windows/desktop/ms687025(v=vs.85).aspx With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Pushdown target list below gather node (WAS Re: [HACKERS] WIP: Upper planner pathification)
On Thu, Mar 17, 2016 at 7:10 PM, Tom Lane wrote: > > Amit Kapila writes: > > While reading above code changes, it looks like it is assuming that subpath > > and subplan will always be same (as it is verifying projection capability > > of subpath and attaching the tlist to subplan), but I think it is possible > > that subpath and subplan correspond to different nodes when gating Result > > node is added on to top of scan plan by create_scan_plan(). > > A little more thought will show you that that's not actually relevant, > because the tlist computation would have happened (or not) below the > gating Result. If gating Results had an impact on > apply_projection_to_path's decisions we'd have had to do something about > that before this. > I understand that gating Results won't impact it, but it was just not apparent from looking at the code I had referred to. If you think it is quite an obvious thing, then we can leave the comment as it is. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Thu, Mar 17, 2016 at 2:56 PM, Constantin S. Pan wrote: > > On Thu, 17 Mar 2016 13:21:32 +0530 > Amit Kapila wrote: > > > On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan > > wrote: > > > > > > On Wed, 16 Mar 2016 18:08:38 +0530 > > > Amit Kapila wrote: > > > > > > > > > > > Why backend just waits, why can't it does the same work as any > > > > worker does? In general, for other parallelism features the > > > > backend also behaves the same way as worker in producing the > > > > results if the results from workers is not available. > > > > > > We can make backend do the same work as any worker, but that > > > will complicate the code for less than 2 % perfomance boost. > > > > Why do you think it will be just 2%? I think for single worker case, > > it should be much more as the master backend will be less busy in > > consuming tuples from tuple queue. I can't say much about > > code-complexity, as I haven't yet looked carefully at the logic of > > patch, but we didn't find much difficulty while doing it for parallel > > scans. One of the commit which might help you in understanding how > > currently heap scans are parallelised is > > ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92, you can see if that can > > help you in anyway for writing a generic API for Gin parallel builds. > > I looked at the timing details some time ago, which showed > that the backend spent about 1% of total time on data > transfer from 1 worker, and 3% on transfer and merging from > 2 workers. So if we use (active backend + 1 worker) instead > of (passive backend + 2 workers), we still have to spend > 1.5% on transfer and merging. > I think the comparison here should be between (active backend + 1 worker) and (passive backend + 1 worker), or between (active backend + 2 workers) and (passive backend + 2 workers). I don't think it is a good assumption that workers are always freely available and can be used as and when required for any operation.
> > Or we can look at these measurements (from yesterday's > message): > > wnum mem(MB) time(s) >0 16 247 >1 16 256 >2 16 126 > > If 2 workers didn't have to transfer and merge their > results, they would have finished in 247 / 2 = 123.5 > seconds. But the transfer and merging took another 2.5 > seconds. The merging takes a little longer than the > transfer. If we now use backend+worker we get rid of 1 > transfer, but still have to do 1 transfer and then merge, so > we will save less than a quarter of those 2.5 seconds. > If I understand the above data correctly, then it seems to indicate that the majority of the work is done in processing the data, so I think it would be better if the master and worker could both work together. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Thu, Mar 17, 2016 at 7:34 AM, Andres Freund wrote: > Hi, > > * I can do a blind rewrite of the windows implementation, but I'm > obviously not going to get that entirely right. So I need some help > from a windows person to test this. > I can help you verifying the windows implementation. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Mar 16, 2016 at 11:57 PM, Jesper Pedersen < jesper.peder...@redhat.com> wrote: > > On 03/15/2016 01:17 AM, Amit Kapila wrote: >> >> I have updated the comments and changed the name of one of the variables from >> "all_trans_same_page" to "all_xact_same_page" as pointed out offlist by >> Alvaro. >> >> > > I have done a run, and don't see any regressions. > Can you provide the details of the test, e.g. is this a pgbench read-write test, and, if possible, the steps for executing it? I wonder if you can do the test with unlogged tables (if you are using pgbench, then I think you need to change the Create Table command to use the Unlogged option). > > Intel Xeon 28C/56T @ 2GHz w/ 256GB + 2 x RAID10 (data + xlog) SSD. > Can you provide CPU information (probably by using lscpu)? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
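For reference, a possible set of commands for such a run might look like the following. These are illustrative pgbench invocations only (scale factor, client count, and duration are guesses to be adjusted to the setup under discussion); note that pgbench's --unlogged-tables initialization option avoids hand-editing the CREATE TABLE commands:

```shell
# Initialize with unlogged tables at scale factor 300 (adjust as needed).
pgbench -i -s 300 --unlogged-tables postgres

# A 15-minute read-write run at 64 clients; add -S instead for the
# select-only (read-only) variant.
pgbench -c 64 -j 64 -T 900 -M prepared postgres

# CPU topology details, as requested.
lscpu
```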
Re: [HACKERS] Parallel Aggregate
On Thu, Mar 17, 2016 at 10:35 AM, David Rowley wrote: > > On 17 March 2016 at 01:19, Amit Kapila wrote: > > Few assorted comments: > > > > 2. > > AggPath * > > create_agg_path(PlannerInfo *root, > > @@ -2397,9 +2399,11 @@ create_agg_path(PlannerInfo *root, > > > > List *groupClause, > > List *qual, > > const AggClauseCosts > > *aggcosts, > > - double numGroups) > > + double numGroups, > > + > > bool combineStates, > > + bool finalizeAggs) > > > > Don't you need to set parallel_aware flag in this function as we do for > > create_seqscan_path()? > > I don't really know the answer to that... I mean there's nothing > special done in nodeAgg.c if the node is running in a worker or in the > main process. > Thinking about it again, I think it is okay to set the parallel_aware flag to false. This flag indicates whether that particular node has any parallelism-specific behaviour, which is true for seqscan, but I think not for a partial aggregate node. A few other comments on the latest patch: 1. + /* + * XXX does this need estimated for each partial path, or are they + * all going to be the same anyway? + */ + dNumPartialGroups = get_number_of_groups(root, + clamp_row_est(partial_aggregate_path->rows), + rollup_lists, + rollup_groupclauses); When estimating partial groups, do we need the rollup-related lists? 2. + hashaggtablesize = estimate_hashagg_tablesize(partial_aggregate_path, + &agg_costs, + dNumPartialGroups); + + /* + * Generate a hashagg Path, if we can, but we'll skip this if the hash + * table looks like it'll exceed work_mem. + */ + if (can_hash && hashaggtablesize < work_mem * 1024L) The hash table size should be estimated only if can_hash is true. 3. + foreach(lc, grouped_rel->partial_pathlist) + { + Path *path = (Path *) lfirst(lc); + double total_groups; + + total_groups = path->parallel_degree * path->rows; + + path = (Path *) create_gather_path(root, grouped_rel, path, NULL, + &total_groups); Do you need to perform it for each partial path, or just for the first partial path?
With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund wrote: > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila wrote: > >> >Won't the new code needs to ensure that ResetEvent(latchevent) > >should > >> >get > >> >called in case WaitForMultipleObjects() comes out when both > >> >pgwin32_signal_event and latchevent are signalled at the same time? > >> > >> WaitForMultiple only reports the readiness of on event at a time, no? > >> > > > >I don't think so, please read link [1] with a focus on below paragraph > >which states how it reports the readiness or signaled state when > >multiple > >objects become signaled. > > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the > >array > >in order starting with index 0, until one of the objects is signaled. > >If > >multiple objects become signaled, the function returns the index of the > >first handle in the array whose object was signaled." > > I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the routine is currently handling normal socket and postmaster events, no? > I think the primary difference between the socket/postmaster events and the latch event is that for the former we never start waiting with the event already in a signalled state. For the socket event, we close the event at the end and create it again before entering the wait loop in WaitLatchOrSocket. Even if we don't reset the latch event, I could not see any major problem apart from possible spurious wake-ups in a few cases (since we haven't reset the event to the non-signalled state for the latch event before entering the wait, it can return immediately). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund wrote: > > Hi, > > On 2016-03-17 09:01:36 -0400, Robert Haas wrote: > > 0001: Looking at this again, I'm no longer sure this is a bug. > > Doesn't your patch just check the same conditions in the opposite > > order? > > Which is important, because what's in what pfds[x] depends on > wakeEvents. Folded it into a later patch; it's not harmful as long as > we're only ever testing pfds[0]. > > > > 0003: Mostly boring. But the change to win32_latch.c seems to remove > > an unrelated check. > > Argh. > + * from inside a signal handler in latch_sigusr1_handler(). * * Note: we assume that the kernel calls involved in drainSelfPipe() * and SetLatch() will provide adequate synchronization on machines * with weak memory ordering, so that we cannot miss seeing is_set if * the signal byte is already in the pipe when we drain it. */ - drainSelfPipe(); - The above part of the comment looks redundant after this patch. I have done some tests on Windows with the 0003 patch, which included running the regressions (vcregress check), and it passes. I will look into it once again tomorrow and share if I find anything wrong with it, but feel free to proceed if you want. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund wrote: > > > Attached is a significantly revised version of the earlier series. Most > importantly I have: > * Unified the window/unix latch implementation into one file (0004) > After applying patch 0004* on HEAD, using command patch -p1 < , I am getting a build failure: c1 : fatal error C1083: Cannot open source file: 'src/backend/storage/ipc/latch.c': No such file or directory I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I have tried with git apply, but with no success. Am I doing something wrong? One minor suggestion about the patch: +#ifndef WIN32 void latch_sigusr1_handler(void) { if (waiting) sendSelfPipeByte(); } +#endif /* !WIN32 */ /* Send one byte to the self-pipe, to wake up WaitLatch */ +#ifndef WIN32 static void sendSelfPipeByte(void) Instead of wrapping these functions in #ifndef WIN32 individually, isn't it better to combine them all, as is done at the end of the file? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Pushdown target list below gather node (WAS Re: [HACKERS] WIP: Upper planner pathification)
On Wed, Mar 16, 2016 at 10:57 PM, Robert Haas wrote: > > On Wed, Mar 16, 2016 at 12:57 PM, Tom Lane wrote: > > Yeah, I was thinking about the same thing. The comment block above > > where you're looking would need some adjustment. > > OK, how about this? > + * children. Alternatively, apply_projection_to_path might have created + * a projection path as the subpath of a Gather node even though the + * subpath was projection-capable. So, if the subpath is capable of + * projection or the desired tlist is the same expression-wise as the + * subplan's, just jam it in there. We'll have charged for a Result that + * doesn't actually appear in the plan, but that's better than having a + * Result we don't need. */ - if (tlist_same_exprs(tlist, subplan->targetlist)) + if (is_projection_capable_path(best_path->subpath) || + tlist_same_exprs(tlist, subplan->targetlist)) While reading the above code changes, it looks like they assume that subpath and subplan will always be the same (the code verifies the projection capability of the subpath and attaches the tlist to the subplan), but I think it is possible for subpath and subplan to correspond to different nodes when a gating Result node is added on top of the scan plan by create_scan_plan(). I think it might be better to explain in the comments why it is safe to rely on the projection capability of the subpath to attach the tlist to the subplan. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Wed, Mar 16, 2016 at 2:55 PM, Constantin S. Pan wrote: > > On Wed, 16 Mar 2016 12:14:51 +0530 > Amit Kapila wrote: > > > On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan > > wrote: > > > 3. Tested on some real data (GIN index on email message body > > > tsvectors). Here are the timings for different values of > > > 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU > > > machine. Seems 'gin_shared_mem' has no visible effect. > > > > > > wnum mem(MB) time(s) > > >0 16 247 > > >1 16 256 > > > > > > > > > It seems from you data that with 1 worker, you are always seeing > > slowdown, have you investigated the reason of same? > > > > That slowdown is expected. It slows down because with 1 worker it > has to transfer the results from the worker to the backend. > > The backend just waits for the results from the workers and merges them > (in case wnum > 0). Why does the backend just wait; why can't it do the same work as any worker does? In general, for other parallelism features, the backend also behaves the same way as a worker in producing results when the results from the workers are not available. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Aggregate
of context->subplan_itlist->has_non_vars as it is handled in fix_upper_expr_mutator()? If not, then I think adding the reason in the comments above the function would be better. 7. tlist.c +} \ No newline at end of file There should be a newline at the end of the file. [1] - http://www.postgresql.org/message-id/CAA4eK1Jk8hm-2j-CKjvdd0CZTsdPX=edk_qhzc4689hq0xt...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Aggregate
On Thu, Mar 17, 2016 at 6:41 PM, David Rowley wrote: > > On 18 March 2016 at 01:22, Amit Kapila wrote: > > On Thu, Mar 17, 2016 at 10:35 AM, David Rowley > > wrote: > > Updated patch is attached. Thanks for the re-review. > A few more comments: 1. + if (parse->groupClause) + path = (Path *) create_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + -1.0); For the final path, why do you want to sort only for the GROUP BY case? 2. + path = (Path *) create_gather_path(root, partial_grouped_rel, path, + NULL, &total_groups); + + if (parse->groupClause) + path = (Path *) create_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + -1.0); + + if (parse->hasAggs) + add_path(grouped_rel, (Path *) + create_agg_path(root, + grouped_rel, + path, + target, + parse->groupClause ? AGG_SORTED : AGG_PLAIN, + parse->groupClause, + (List *) parse->havingQual, + &agg_costs, + partial_grouped_rel->rows, + true, + true)); + else + add_path(grouped_rel, (Path *) + create_group_path(root, + grouped_rel, + path, + target, + parse->groupClause, + (List *) parse->havingQual, + total_groups)); In the above part of the patch, it seems you are using the number of groups differently: for create_group_path() and create_gather_path() you have used total_groups, whereas for create_agg_path() partial_grouped_rel->rows is used. Is there a reason for this? 3. + if (grouped_rel->partial_pathlist) + { + Path *path = (Path *) linitial(grouped_rel->partial_pathlist); + double total_groups; + + total_groups = path->rows * path->parallel_degree; + path = (Path *) create_gather_path(root, partial_grouped_rel, path, + NULL, &total_groups); A. Won't passing partial_grouped_rel lead to incomplete information required by create_gather_path() w.r.t. the case of parameterized path info? B. You have mentioned that passing grouped_rel will make the gather path contain the final path target information, but what is the problem with that?
I mean to ask why the Gather node is required to contain the partial path target information instead of the final path target. C. Can we consider passing pathtarget to create_gather_path(), as that seems to save us from inventing a new UpperRelationKind? If you are worried about adding the new parameter (pathtarget) to create_gather_path(), then I think we already pass it in many other path generation functions, so why not for gather path generation as well? 4A. Overall, the function create_grouping_paths() looks better than before, but I think it is still difficult to read. I think it can be improved by generating the partial aggregate paths separately, as we do for nestloop joins; refer to the function consider_parallel_nestloop. 4B. Rather than using create_gather_path() directly, can't we use generate_gather_paths(), since generate_gather_paths() is used in all other places where we generate a Gather node? 5. +make_partialgroup_input_target(PlannerInfo *root, PathTarget *final_target) { .. .. + foreach(lc, final_target->exprs) + { + Expr *expr = (Expr *) lfirst(lc); + + i++; + + if (parse->groupClause) + { + Index sgref = final_target->sortgrouprefs[i]; + + if (sgref && get_sortgroupref_clause_noerr(sgref, parse->groupClause) + != NULL) + { + /* + * It's a grouping column, so add it to the input target as-is. + */ + add_column_to_pathtarget(input_target, expr, sgref); + continue; + } + } + + /* + * Non-grouping column, so just remember the expression for later + * call to pull_var_clause. + */ + non_group_cols = lappend(non_group_cols, expr); + } .. } Do we want to achieve something different in the above foreach loop than in the similar loop in make_group_input_target()? If not, then why are they not exactly the same? 6. + /* XXX this causes some redundant cost calculation ... */ + input_target = set_pathtarget_cost_width(root, input_target); + return input_target; Can't we return set_pathtarget_cost_width() directly rather than storing it in input_target and then returning input_target?
With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Thu, Mar 17, 2016 at 7:33 AM, Michael Paquier wrote: > > On Wed, Mar 16, 2016 at 5:28 AM, Robert Haas wrote: > > On Tue, Mar 15, 2016 at 9:17 AM, Thomas Reiss wrote: > >> Here's a small docpatch to fix two typos in the new documentation. > > > > Thanks, committed. > > I just had a quick look at the wait_event committed, and I got a > little bit disappointed that we actually do not track latch waits yet, > which is perhaps not that useful actually as long as an event name is > not associated to a given latch wait when calling WaitLatch. > You are right, and a few more, like I/O waits, were also intentionally left out of the initial patch just to get the base functionality in. One of the reasons I did not keep them in the patch was that they need much more thorough performance testing (even though theoretically there shouldn't be any overhead) with specific kinds of tests. > > I am not > asking for that with this release, this is just for the archive's > sake, and I don't mind coding that myself anyway if need be. Thanks, feel free to pick it up in the next release (or for this release, if everybody feels strongly about having it in this release) if you don't see any patch for it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On Wed, Mar 16, 2016 at 10:59 PM, Robert Haas wrote: > On Wed, Mar 16, 2016 at 4:40 AM, Mithun Cy > wrote: > >> >> On Thu, Mar 3, 2016 at 6:20 AM, Mithun Cy >> wrote: >> > I will continue to benchmark above tests with much wider range of >> clients. >> >> Latest Benchmarking shows following results for unlogged tables. >>
>> clients  BASE          ONLY CLOG CHANGES  % Increase     CLOG CHANGES + SAVE SNAPSHOT  % Increase
>> 1        1198.326337   1328.069656        10.8270439357  1234.078342                   2.9834948875
>> 32       37455.181727  38295.250519       2.2428640131   41023.126293                  9.5259037641
>> 64       48838.016451  50675.845885       3.7631123611   51662.814319                  5.7840143259
>> 88       36878.187766  53173.577363       44.1870671639  56025.454917                  51.9203038731
>> 128      35901.537773  52026.024098       44.9130798434  53864.486733                  50.0339263281
>> 256      28130.354402  46793.134156       66.3439197647  46817.04602                   66.4289235427
>> > > Whoa. At 64 clients, we're hardly getting any benefit, but then by 88 > clients, we're getting a huge benefit. I wonder why there's that sharp > change there. > > If you look at the Base readings, there is a performance increase up to 64 clients and then a fall at 88 clients, which to me indicates that it hits very high contention around CLogControlLock at 88 clients, which the CLog patch is able to control to a great degree (if required, I think this can be verified with LWLock stats data). One reason for hitting contention at 88 clients is that this machine seems to have 64 cores (it has 8 sockets and 8 cores per socket), as per the lscpu output below.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
BogoMIPS:              4266.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan wrote: > > On Wed, 16 Mar 2016 18:08:38 +0530 > Amit Kapila wrote: > > > > > Why backend just waits, why can't it does the same work as any worker > > does? In general, for other parallelism features the backend also > > behaves the same way as worker in producing the results if the > > results from workers is not available. > > We can make backend do the same work as any worker, but that > will complicate the code for less than 2 % perfomance boost. Why do you think it will be just 2%? I think for the single-worker case it should be much more, as the master backend will be less busy consuming tuples from the tuple queue. I can't say much about code complexity, as I haven't yet looked carefully at the logic of the patch, but we didn't find much difficulty while doing it for parallel scans. One of the commits which might help you understand how heap scans are currently parallelised is ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92; you can see if that helps you in any way in writing a generic API for GIN parallel builds. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas wrote: > > On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila wrote: > > Won't in theory, without patch as well nentries can overflow after running > > for very long time? I think with patch it is more prone to overflow because > > we start borrowing from other free lists as well. > > Uh, I don't think so. Without the patch, there is just one entries > counter and it goes up and down. How would it ever overflow? > I thought it could overflow because we haven't kept any upper limit on incrementing it short of running out of memory (of course that is just a theoretical assumption, as the decrements will keep the number under control). So, are you thinking about the risk of overflow with the patch because we have to use the sum of all the nentries from all the arrays for the total, or is there anything else which makes you think that changing nentries into an array of nentries makes it prone to overflow? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Mon, Mar 21, 2016 at 5:12 AM, Robert Haas wrote: > > On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila wrote: > > On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas wrote: > >> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila > >> wrote: > >> > Won't in theory, without patch as well nentries can overflow after > >> > running > >> > for very long time? I think with patch it is more prone to overflow > >> > because > >> > we start borrowing from other free lists as well. > >> > >> Uh, I don't think so. Without the patch, there is just one entries > >> counter and it goes up and down. How would it ever overflow? > > > > I thought it can overflow because we haven't kept any upper limit on > > incrementing it unless the memory finishes (ofcourse that is just a > > theoretical assumption, as the decrements will keep the number in control), > > so are you thinking about the risk of overflow with patch because we have to > > use sum of all the nentries from all the arrays for total or is there any > > thing else which makes you think that changing nentries into arrays of > > nentries can make it prone to overflow? > > Well, I mean, perhaps nentries could overflow if you had more than > 2^32 elements, but I'm not even positive we support that. If you > assume a fixed table with a million entries, the nentries value can > vary only between 0 and a million. But now split that into a bunch of > separate counters. The increment when you allocate an entry and the > decrement when you put one back don't have to hit the same bucket, > This is the point where I think I am missing something about the patch. As far as I can understand, it uses the same freelist index (freelist_idx) for allocating an entry and putting it back, so an increment in one list and a decrement in another can only happen if the value of freelist_idx is calculated differently for the same input. Is that so, or is there something else in the patch that I am missing? With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
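The point under discussion, that an increment on allocation and a decrement on release always hit the same per-freelist counter when freelist_idx is derived deterministically from the same input, can be illustrated with a small Python model. This is only a sketch of the idea, not PostgreSQL's dynahash code; NUM_FREELISTS and the modulo mapping are assumptions for illustration:

```python
# Toy model of partitioned freelist counters (hypothetical, not the
# actual dynahash implementation). Because freelist_idx() is a pure
# function of the hash value, alloc and free of the same entry touch
# the same partition, so per-partition counters cannot drift apart.
NUM_FREELISTS = 32

nentries = [0] * NUM_FREELISTS

def freelist_idx(hashvalue: int) -> int:
    # Deterministic mapping from hash value to freelist partition.
    return hashvalue % NUM_FREELISTS

def alloc_entry(hashvalue: int) -> None:
    nentries[freelist_idx(hashvalue)] += 1

def free_entry(hashvalue: int) -> None:
    nentries[freelist_idx(hashvalue)] -= 1

def total_entries() -> int:
    # The aggregate count is the sum over all partitions.
    return sum(nentries)
```

Under this scheme, the total varies in exactly the same range as a single counter would, which is why splitting nentries into an array does not by itself create an overflow hazard.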
Re: [HACKERS] Performance degradation in commit ac1d794
On Sun, Mar 20, 2016 at 7:13 AM, Andres Freund wrote: > > On 2016-03-19 15:43:27 +0530, Amit Kapila wrote: > > On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund wrote: > > > > > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila < amit.kapil...@gmail.com> > > wrote: > > > >> >Won't the new code needs to ensure that ResetEvent(latchevent) > > > >should > > > >> >get > > > >> >called in case WaitForMultipleObjects() comes out when both > > > >> >pgwin32_signal_event and latchevent are signalled at the same time? > > > > >> WaitForMultiple only reports the readiness of one event at a time, no? > > > >> > > > > > > > >I don't think so, please read link [1] with a focus on below paragraph > > > >which states how it reports the readiness or signaled state when > > > >multiple > > > >objects become signaled. > > > > > > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the > > > >array > > > >in order starting with index 0, until one of the objects is signaled. > > > >If > > > >multiple objects become signaled, the function returns the index of the > > > >first handle in the array whose object was signaled." > > I think this is just incredibly bad documentation. See > https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273 > (Raymond Chen can be considered an authority here imo). > The article you pointed to confirms that the way ResetEvent is done by the patch is correct. I am not sure, but you can weigh whether there is a need for a comment, so that if we want to enhance this part of the code (or write something similar) in future, we don't need to rediscover this fact. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund wrote: > > > On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila < > amit.kapil...@gmail.com> wrote: > > >The article pointed by you justifies that the way ResetEvent is done by > >patch is correct. I am not sure, but you can weigh, if there is a need > >of > >comment so that if we want enhance this part of code (or want to write > >something similar) in future, we don't need to rediscover this fact. > > I've added a reference in a comment. > > Did you have a chance of running the patched versions on windows? > > I am planning to do it in next few hours. > I plan to push this sometime today, so I can get on to some performance > patches I was planning to look into committing. > > Have we done testing to ensure that it actually mitigates the impact of the performance degradation due to commit ac1d794? I wanted to do that, but unfortunately the high-end m/c on which this problem is reproducible is down. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Mon, Mar 21, 2016 at 10:26 AM, Amit Kapila wrote: > > On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund wrote: >> >> >> >> On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila < amit.kapil...@gmail.com> wrote: >> >> >The article pointed by you justifies that the way ResetEvent is done by >> >patch is correct. I am not sure, but you can weigh, if there is a need >> >of >> >comment so that if we want enhance this part of code (or want to write >> >something similar) in future, we don't need to rediscover this fact. >> >> I've added a reference in a comment. >> >> Did you have a chance of running the patched versions on windows? >> > > I am planning to do it in next few hours. > With 0002-Introduce-new-WaitEventSet-API, initdb is successful, but server start leads to below problem: LOG: database system was shut down at 2016-03-21 11:17:13 IST LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections LOG: autovacuum launcher started FATAL: failed to set up event for socket: error code 10022 LOG: statistics collector process (PID 668) exited with exit code 1 LOG: autovacuum launcher process (PID 3868) was terminated by exception 0xC005 HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. I haven't investigated the problem still. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Fri, Mar 18, 2016 at 2:38 PM, Dilip Kumar wrote: > > > On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek wrote: >> >> Great. >> >> Just small notational thing, maybe this would be simpler?: >> extraBlocks = Min(512, lockWaiters * 20); > > > Done, new patch attached. > Review comments: 1. /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) Shall we mention in the comments that this API returns a locked buffer and it's the responsibility of the caller to unlock it? 2. + /* To avoid the cases when there are huge number of lock waiters, and + * extend file size by big amount at a time, put some limit on The first line of a multi-line comment should not contain anything. 3. + extraBlocks = Min(512, lockWaiters * 20); I think we should explain in a comment the reason for choosing 20 in the above calculation. 4. Some time back [1], you agreed to do some analysis of the overhead that XLogFlush can cause during buffer eviction, but I don't see the results of the same; are you planning to complete it? 5. + if (LOCKACQUIRE_OK + != RelationExtensionLockConditional(relation, ExclusiveLock)) I think the coding style is to keep the constant on the right side of the condition; did you see any other place in the code which uses the check in a similar way? 6. - /* - * Now acquire lock on the new page. + /* For all case we need to add at least one block to satisfy our + * own request. */ Same problem as in point 2. 7. @@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len, if (needLock) UnlockRelationForExtension(relation, ExclusiveLock); + + Spurious newline addition. 8. +int +RelationExtensionLockWaiter(Relation relation) How about naming this function RelationExtensionLockWaiterCount? 9. + /* Other waiter has extended the block for us*/ Provide an extra space before ending the comment. 10. 
+ if (use_fsm) + { + if (lastValidBlock != InvalidBlockNumber) + { + targetBlock = RecordAndGetPageWithFreeSpace(relation, + lastValidBlock, + pageFreeSpace, + len + saveFreeSpace); + } Are you using RecordAndGetPageWithFreeSpace() instead of GetPageWithFreeSpace() to get the page close to the previous target page? If yes, then do you see enough benefit of the same that it can compensate the additional write operation which Record* API might cause due to additional dirtying of buffer? 11. + { + /* + * First try to get the lock in no-wait mode, if succeed extend one + * block, else get the lock in normal mode and after we get the lock + * extend some extra blocks, extra blocks will be added to satisfy + * request of other waiters and avoid future contention. Here instead + * of directly taking lock we try no-wait mode, this is to handle the + * case, when there is no contention - it should not find the lock + * waiter and execute extra instructions. + */ + if (LOCKACQUIRE_OK + != RelationExtensionLockConditional(relation, ExclusiveLock)) + { + LockRelationForExtension(relation, ExclusiveLock); + + if (use_fsm) + { + if (lastValidBlock != InvalidBlockNumber) + { + targetBlock = RecordAndGetPageWithFreeSpace(relation, + lastValidBlock, + pageFreeSpace, + len + saveFreeSpace); + } + + /* Other waiter has extended the block for us*/ + if (targetBlock != InvalidBlockNumber) + { + UnlockRelationForExtension(relation, ExclusiveLock); + goto loop; + } + + RelationAddExtraBlocks(relation, bistate); + } + } + } - /* - * Now acquire lock on the new page. + /* For all case we need to add at least one block to satisfy our + * own request. */ - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + buffer = RelationAddOneBlock(relation, otherBuffer, bistate); Won't this cause one extra block addition after backend extends the relation for multiple blocks, what is the need of same? 12. 
I think it is good to once test pgbench read-write tests to ensure that this doesn't introduce any new regression. [1] - http://www.postgresql.org/message-id/caa4ek1lonxz4qa_dquqbanspxsctjxrkexjii8h3gnd9z8u...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
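The heuristic quoted in the review above, extraBlocks = Min(512, lockWaiters * 20), is simple enough to sketch in a few lines of Python (a model of the expression under discussion, not the patch itself). The idea is that the lock holder extends the relation in proportion to the number of backends waiting on the extension lock, capped at 512 blocks so a transient spike in waiters cannot bloat the relation:

```python
# Model of the proposed extension sizing: each waiter is assumed to
# need roughly 20 blocks' worth of headroom, with a hard cap of 512
# blocks per extension to bound worst-case relation growth.
def extra_blocks(lock_waiters: int) -> int:
    return min(512, lock_waiters * 20)
```

For example, a single waiter triggers a 20-block extension, ten waiters trigger 200 blocks, and anything above 25 waiters hits the 512-block cap.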
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Mon, Mar 21, 2016 at 6:16 PM, Haribabu Kommi wrote: > > On Mon, Mar 14, 2016 at 4:51 PM, Amit Kapila wrote: > >> Operating system - windows 7 > >> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the > >> problem) > >> > >> 1. Create two standard users in the system (test_user1 and test_user2) > >> 2. Create two databases belongs each user listed above. > >> 3. Now using pg_ctl register the services for the two users. > >> 4. Provide logon permissions to these users to run the services by > >> changing > >> service properties. > > > > Did you mean to say that you changed Log on as: Local System Account in > > service properties or something else? > > No. Not as local service. The user should be the new standard user > that is created > in the system. > So what do you exactly mean by "Provide logon permissions to these users", can you describe in detail what exactly you have done to give those permissions. If I try to do with a new user, it gives me error "could not open service manager" at start of service. > > >> 5. Now try to start the services, the second service fails with the > >> error message. > >> 6. Error details can be found out in Event log viewer. > >> > > > > If I follow above steps and do as I mentioned for step-4, I am not able to > > reproduce the issue on Windows-7 m/c using code of HEAD. > > I am not able to start a service with HEAD code in the same machine, where > as it is working for 9.5. I will look into it later and update it. > Okay. But it is confusing for me because you told earlier that you are able to reproduce problem in 9.5. > >> Yes, it is working as same user services. The main problem is, PostgreSQL > >> as a service for two different users in the same system is not working > >> because > >> of same random getting generated for two services. 
> >> > > > > I am not sure why you think same random number is problem, as mentioned > > above, even if the dsm name is same due to same random number, the code has > > logic to process it appropriately (regenerate the name of dsm). Having said > > that, I don't mean to say that we shouldn't have logic to generate unique > > name and I think we might want to add data dir path to name generation as we > > do for main shared memory, however it is better to first completely > > understand the underneath issue. > > Yes, same random number generation is not the problem. In windows apart > from EEXIST error, EACCES also needs to be validated and returned for > new random number generation, instead of throwing an error. > Doing the same handling for EACCES doesn't seem to be sane because if EACCES came for reason other than duplicate dsm name, then we want to report the error instead of trying to regenerate the name. I think here fix should be to append data_dir path as we do for main shared memory. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote: > > On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier > wrote: > > Thoughts? I have registered that in the CF app, and a patch is attached. > > It is very difficult to believe that this is a good idea: > > --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) > if (PQresultStatus(lastResult) == PGRES_COPY_IN || > PQresultStatus(lastResult) == PGRES_COPY_OUT || > PQresultStatus(lastResult) == PGRES_COPY_BOTH || > +PQresultStatus(lastResult) == PGRES_FATAL_ERROR || > PQstatus(streamConn) == CONNECTION_BAD) > break; > } > > I mean, why would it be a good idea to blindly skip over fatal errors? > I think it is not about skipping the FATAL error, but rather about stopping trying to get further results on a FATAL error. This is to ensure that the OOM error is reported rather than ignored. There has been discussion about this previously as well [1]. [1] - http://www.postgresql.org/message-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Tue, Mar 22, 2016 at 9:13 AM, Haribabu Kommi wrote: > > On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila wrote: > > On Mon, Mar 21, 2016 at 6:16 PM, Haribabu Kommi < kommi.harib...@gmail.com> > > wrote: > >> > >> On Mon, Mar 14, 2016 at 4:51 PM, Amit Kapila > >> wrote: > >> >> Operating system - windows 7 > >> >> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the > >> >> problem) > >> >> > >> >> 1. Create two standard users in the system (test_user1 and test_user2) > >> >> 2. Create two databases belongs each user listed above. > >> >> 3. Now using pg_ctl register the services for the two users. > >> >> 4. Provide logon permissions to these users to run the services by > >> >> changing > >> >> service properties. > >> > > >> > Did you mean to say that you changed Log on as: Local System Account in > >> > service properties or something else? > >> > >> No. Not as local service. The user should be the new standard user > >> that is created > >> in the system. > >> > > > > So what do you exactly mean by "Provide logon permissions to these users", > > can you describe in detail what exactly you have done to give those > > permissions. If I try to do with a new user, it gives me error "could not > > open service manager" at start of service. > > 1. Start the cmd with administrator user and add the new postgresql service > with a standard user that is created. > 2. Start the services window with the user having administrator privileges and > go to the corresponding added service. > 3. Right click on the service provides an properties option. > 4. In the properties, there is an logon tab. Click it > 5. Provide the password for the new user that is used for creating the service. > 6. This adds the user to log on permissions. > I am also able to reproduce the issue with these steps. > >> > >> Yes, same random number generation is not the problem. 
In windows apart > >> from EEXIST error, EACCES also needs to be validated and returned for > >> new random number generation, instead of throwing an error. > >> > > > > Doing the same handling for EACCES doesn't seem to be sane because if EACCES > > came for reason other than duplicate dsm name, then we want to report the > > error instead of trying to regenerate the name. I think here fix should be > > to append data_dir path as we do for main shared memory. > > Yes, EACCES may be possible other than duplicate dsm name. > So as far as I can see there are two ways to resolve this issue: one is to retry generation of the dsm name if CreateFileMapping returns EACCES, and the second is to append the data_dir name to the dsm name as is done for the main shared memory, which will avoid the error altogether. The first approach has a minor flaw: if CreateFileMapping returns EACCES for a reason other than a duplicate dsm name (which I am not sure is possible to identify), then we should report the error instead of trying to regenerate the name. Robert and/or others, can you share your opinion on what is the best way to proceed for this issue? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
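The second option discussed above, making the segment name unique per cluster by mixing in the data directory, can be sketched in Python. This is an illustrative model only; the function name, the digest scheme, and the name layout are assumptions, not PostgreSQL's actual dsm naming code:

```python
import hashlib

def dsm_segment_name(data_dir: str, handle: int) -> str:
    # Hypothetical per-cluster dsm name: mixing a digest of the data
    # directory into the name means two clusters (e.g. two Windows
    # services run by different users) can never generate colliding
    # names, even if they draw the same random handle.
    tag = hashlib.sha1(data_dir.encode()).hexdigest()[:8]
    return "Global/PostgreSQL.%s.%d" % (tag, handle)
```

With a scheme like this, the EACCES-on-collision question goes away for the cross-cluster case, because the names differ by construction rather than by retry.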
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane wrote: > > Amit Kapila writes: > > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote: > >> It is very difficult to believe that this is a good idea: > >> > >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) > >> if (PQresultStatus(lastResult) == PGRES_COPY_IN || > >> PQresultStatus(lastResult) == PGRES_COPY_OUT || > >> PQresultStatus(lastResult) == PGRES_COPY_BOTH || > >> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR || > >> PQstatus(streamConn) == CONNECTION_BAD) > >> break; > >> > >> I mean, why would it be a good idea to blindly skip over fatal errors? > > > I think it is not about skipping the FATAL error, rather to stop trying to > > get further results on FATAL error. > > If the code already includes "lost the connection" as a case to break on, > I'm not quite sure why "got a query error" is not. > This error check is exactly the same as in PQexecFinish(), and some explanation is given there in the comments hinting at the reason for continuing on error; basically, both PQexecFinish() and libpqrcv_PQexec() return the last result if there are many, and PQexecFinish() concatenates the errors as well in some cases. Do we see any use in continuing to get results after getting a PGRES_FATAL_ERROR? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
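The result-draining loop under discussion can be modeled in a few lines of Python (a simplified sketch of the libpqrcv_PQexec() control flow, not libpq itself): results are consumed and the last one is kept, but the loop stops early on the COPY states, and, with the proposed change, on PGRES_FATAL_ERROR too, so an OOM error is handed back to the caller instead of being overwritten by a later result.

```python
# Statuses on which the (patched) loop stops draining further results.
STOP_STATES = {"PGRES_COPY_IN", "PGRES_COPY_OUT",
               "PGRES_COPY_BOTH", "PGRES_FATAL_ERROR"}

def drain_results(statuses):
    """Return the status the loop would hand back to the caller."""
    last = None
    for status in statuses:
        last = status          # keep only the most recent result
        if status in STOP_STATES:
            break              # stop early; don't let later results
                               # mask a fatal error or a COPY state
    return last
```

Without PGRES_FATAL_ERROR in the stop set, a trailing non-error result would replace the error as the returned status, which is exactly the masking behaviour the patch is trying to prevent.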
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund wrote: > > Hi, > > On 2016-03-15 10:47:12 +0530, Amit Kapila wrote: > > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids, > > * Record the final state of transaction entries in the commit log for > > * all entries on a single page. Atomic only on this page. > > * > > + * Group the status update for transactions. This improves the efficiency > > + * of the transaction status update by reducing the number of lock > > + * acquisitions required for it. To achieve the group transaction status > > + * update, we need to populate the transaction status related information > > + * in shared memory and doing it for overflowed sub-transactions would need > > + * a big chunk of shared memory, so we are not doing this optimization for > > + * such cases. This optimization is only applicable if the transaction and > > + * all child sub-transactions belong to same page which we presume to be the > > + * most common case, we might be able to apply this when they are not on same > > + * page, but that needs us to map sub-transactions in proc's XidCache based > > + * on pageno for which each time a group leader needs to set the transaction > > + * status and that can lead to some performance penalty as well because it > > + * needs to be done after acquiring CLogControlLock, so let's leave that > > + * case for now. We don't do this optimization for prepared transactions > > + * as the dummy proc associated with such transactions doesn't have a > > + * semaphore associated with it and the same is required for group status > > + * update. We choose not to create a semaphore for dummy procs for this > > + * purpose as the advantage of using this optimization for prepared transactions > > + * is not clear. > > + * > > I think you should try to break up some of the sentences, one of them > spans 7 lines. > Okay, I will try to do so in next version. 
> I'm actually rather unconvinced that it's all that common that all > subtransactions are on one page. If you have concurrency - otherwise > there'd be not much point in this patch - they'll usually be heavily > interleaved, no? You can argue that you don't care about subxacts, > because they're more often used in less concurrent scenarios, but if > that's the argument, it should actually be made. > Note that we are doing it only when a transaction has less than or equal to 64 subtransactions. Now, I am not denying that there will be cases where subtransactions fall on different pages, but I think the chances of such transactions participating in group mode will be low, and this patch is mainly targeting scalability for short transactions. > > > * Otherwise API is same as TransactionIdSetTreeStatus() > > */ > > static void > > TransactionIdSetPageStatus(TransactionId xid, int nsubxids, > > TransactionId *subxids, XidStatus status, > > -XLogRecPtr lsn, int pageno) > > +XLogRecPtr lsn, int pageno, > > +bool all_xact_same_page) > > +{ > > + /* > > + * If we can immediately acquire CLogControlLock, we update the status > > + * of our own XID and release the lock. If not, use group XID status > > + * update to improve efficiency and if still not able to update, then > > + * acquire CLogControlLock and update it. > > + */ > > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE)) > > + { > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + LWLockRelease(CLogControlLock); > > + } > > + else if (!all_xact_same_page || > > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS || > > + IsGXactActive() || > > + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno)) > > + { > > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); > > + > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + > > + LWLockRelease(CLogControlLock); > > + } > > +} > > > > This code is a bit arcane. 
I think it should be restructured to > a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids > >PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional >lock acquire first
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 6:29 PM, Andres Freund wrote: > > On 2016-03-22 18:19:48 +0530, Amit Kapila wrote: > > > I'm actually rather unconvinced that it's all that common that all > > > subtransactions are on one page. If you have concurrency - otherwise > > > there'd be not much point in this patch - they'll usually be heavily > > > interleaved, no? You can argue that you don't care about subxacts, > > > because they're more often used in less concurrent scenarios, but if > > > that's the argument, it should actually be made. > > > > > > > Note, that we are doing it only when a transaction has less than equal to > > 64 sub transactions. > > So? > They should fall on one page, unless they are heavily interleaved as pointed out by you. Whether or not subtransactions are present, I think this patch won't help bigger transactions. I will address your other review comments and send an updated patch. Thanks for the review. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund wrote: > > Hi, > > On 2016-03-15 10:47:12 +0530, Amit Kapila wrote: > > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids, > > * Record the final state of transaction entries in the commit log for > > * all entries on a single page. Atomic only on this page. > > * > > + * Group the status update for transactions. This improves the efficiency > > + * of the transaction status update by reducing the number of lock > > + * acquisitions required for it. To achieve the group transaction status > > + * update, we need to populate the transaction status related information > > + * in shared memory and doing it for overflowed sub-transactions would need > > + * a big chunk of shared memory, so we are not doing this optimization for > > + * such cases. This optimization is only applicable if the transaction and > > + * all child sub-transactions belong to same page which we presume to be the > > + * most common case, we might be able to apply this when they are not on same > > + * page, but that needs us to map sub-transactions in proc's XidCache based > > + * on pageno for which each time a group leader needs to set the transaction > > + * status and that can lead to some performance penalty as well because it > > + * needs to be done after acquiring CLogControlLock, so let's leave that > > + * case for now. We don't do this optimization for prepared transactions > > + * as the dummy proc associated with such transactions doesn't have a > > + * semaphore associated with it and the same is required for group status > > + * update. We choose not to create a semaphore for dummy procs for this > > + * purpose as the advantage of using this optimization for prepared transactions > > + * is not clear. > > + * > > I think you should try to break up some of the sentences, one of them > spans 7 lines. > Okay, I have simplified the sentences in the comment. 
> > > > * Otherwise API is same as TransactionIdSetTreeStatus() > > */ > > static void > > TransactionIdSetPageStatus(TransactionId xid, int nsubxids, > > TransactionId *subxids, XidStatus status, > > -XLogRecPtr lsn, int pageno) > > +XLogRecPtr lsn, int pageno, > > +bool all_xact_same_page) > > +{ > > + /* > > + * If we can immediately acquire CLogControlLock, we update the status > > + * of our own XID and release the lock. If not, use group XID status > > + * update to improve efficiency and if still not able to update, then > > + * acquire CLogControlLock and update it. > > + */ > > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE)) > > + { > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + LWLockRelease(CLogControlLock); > > + } > > + else if (!all_xact_same_page || > > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS || > > + IsGXactActive() || > > + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno)) > > + { > > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); > > + > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + > > + LWLockRelease(CLogControlLock); > > + } > > +} > > > > This code is a bit arcane. I think it should be restructured to > a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids > >PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional >lock acquire first can be rather expensive. > b) I'd rather see an explicit fallback for the >!TransactionGroupUpdateXidStatus case, this way it's too hard to >understand. It's also harder to add probes to detect whether that > Changed. > > > > The first process to add itself to the list will acquire > > + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal > > + * on behalf of all group members. 
This avoids a great deal of contention > > + * around CLogControlLock when many processes are trying to commit at once, > > + * since the lock need not be repeatedly handed off from one committing > > + * process to the next. > > + * > > + * Returns true, if transaction status is updated in clog page, else return
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Mar 23, 2016 at 12:26 PM, Amit Kapila wrote: > > On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund wrote: > > > > > > I think it's worthwhile to create a benchmark that does something like > > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time); > > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms, > > completely realistic values for network RTT + application computation), > > the success rate of group updates shrinks noticeably. > > > > Will do some tests based on above test and share results. > Forgot to mention that the effect of the patch is more visible with unlogged tables, so I will do the test with those, and request you to use the same if you yourself are also planning to perform some tests. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 12:33 PM, Dilip Kumar wrote: > > > On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila wrote: > > I have reviewed the patch.. here are some review comments, I will continue to review.. > > 1. > > + > + /* > + * Add the proc to list, if the clog page where we need to update the > > + */ > + if (nextidx != INVALID_PGPROCNO && > + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage) > + return false; > > Should we clear all these structure variables that we set above in case we are not adding ourselves to the group? I can see it will not have any problem even if we don't clear them; > I think if we don't want to clear them, we can add a comment mentioning the same. > I have updated the patch to just clear clogGroupMember as that is what is done when we wake the processes. > > 2. > > Here we are updating in our own proc, I think we don't need atomic operation here, we are not yet added to the list. > > + if (nextidx != INVALID_PGPROCNO && + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage) > + return false; > + > + pg_atomic_write_u32(&proc->clogGroupNext, nextidx); > > We won't be able to assign nextidx to clogGroupNext directly, as clogGroupNext is of type pg_atomic_uint32. Thanks for the review. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
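The group-update protocol being reviewed here, each backend pushes itself onto a shared list, and whoever lands on an empty list becomes the leader that updates the clog for everyone, can be modeled compactly in Python. This is a toy sketch of the pattern, not the patch: the class and method names are invented, and a plain lock stands in for the atomic compare-and-swap the real code uses on the list head:

```python
import threading

class GroupUpdateList:
    """Toy model of the clog group-update list. push() returns True
    for the member that becomes the group leader (i.e. the one that
    found the list empty); the leader later pops the whole group and
    performs the status update on behalf of every member."""
    INVALID = -1  # stands in for INVALID_PGPROCNO

    def __init__(self, nprocs):
        self.head = self.INVALID
        self.next = [self.INVALID] * nprocs
        self.lock = threading.Lock()  # stand-in for the atomic CAS

    def push(self, idx):
        with self.lock:
            self.next[idx] = self.head   # link to previous head
            self.head = idx
            return self.next[idx] == self.INVALID  # True => leader

    def pop_all(self):
        with self.lock:
            members, cur = [], self.head
            while cur != self.INVALID:
                members.append(cur)
                cur = self.next[cur]
            self.head = self.INVALID     # detach the whole group
            return members
```

The payoff of the pattern is that CLogControlLock is taken once per group rather than once per committing backend, which is where the contention reduction comes from.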
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 12:09 AM, Robert Haas wrote: > > On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek wrote: > > I've read this over several times and looked at > RecordAndGetPageWithFreeSpace() and I'm still confused. First of all, > if the lock was acquired by some other backend which did > RelationAddExtraBlocks(), it *will* have updated the FSM - that's the > whole point. > It doesn't update the FSM all the way up to the root in some cases, as per the comments on top of RecordPageWithFreeSpace and the code as well. > > Second, if the other backend extended the relation in > some other manner and did not extend the FSM, how does calling > RecordAndGetPageWithFreeSpace help? As far as I can see, > GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just > searching the FSM, so if one is stymied the other will be too. What > am I missing? > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to it, rather than from the top, so even if RecordPageWithFreeSpace() doesn't update up to the root, it will be able to find the newly added page. I agree with what you have said in another mail that we should introduce a new API to do a more targeted search for such cases. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote: > > On 2016-03-23 21:43:41 +0100, Andres Freund wrote: > > I'm playing around with SELECT txid_current(); right now - that should > > be about the most specific load for setting clog bits. > > Or so I thought. > > In my testing that showed just about zero performance difference between > the patch and master. And more surprisingly, profiling showed very > little contention on the control lock. Hacking > TransactionIdSetPageStatus() to return without doing anything, actually > only showed minor performance benefits. > > [there's also the fact that txid_current() indirectly acquires two > lwlock twice, which showed up more prominently than control lock, but > that I could easily hack around by adding a xid_current().] > > Similar with an INSERT only workload. And a small scale pgbench. > > > Looking through the thread showed that the positive results you'd posted > all were with relatively big scale factors. > I have seen smaller benefits at 300 scale factor and somewhat larger benefits at 1000 scale factor. Also, Mithun has done similar testing with unlogged tables and the results of the same [1] also look good. > > Which made me think. Running > a bigger pgbench showed that most the interesting (i.e. long) lock waits > were both via TransactionIdSetPageStatus *and* TransactionIdGetStatus(). > Yes, this is the same as what I have observed as well. > > So I think what happens is that once you have a big enough table, the > UPDATEs standard pgbench does start to often hit *old* xids (in unhinted > rows). Thus old pages have to be read in, potentially displacing slru > content needed very shortly after. > > > Have you, in your evaluation of the performance of this patch, done > profiles over time? I.e. whether the performance benefits are there > immediately, or only after a significant amount of test time? Comparing > TPS over time, for both patched/unpatched looks relevant. 
I have mainly done it with half-hour read-write tests. What do you want to observe via smaller tests? Sometimes they give inconsistent data for read-write tests.

> Even after changing to scale 500, the performance benefits on this,
> older 2-socket, machine were minor; even though contention on the
> ClogControlLock was the second most severe (after ProcArrayLock).
>

I have tried this patch mainly on an 8-socket machine with 300 and 1000 scale factors. I am hoping that you have tried this test on unlogged tables; by the way, at what client count have you seen these results?

> Afaics that squares with Jesper's result, which basically also didn't
> show a difference either way?
>

One difference was that I think Jesper has done testing with synchronous_commit mode off, whereas my tests were with synchronous_commit mode on.

> I'm afraid that this patch might be putting a bandaid on some of the
> absolutely worst cases, without actually addressing the core
> problem. Simon's patch in [1] seems to come closer to addressing that
> (which I don't believe is safe without doing every status
> manipulation atomically, as individual status bits are smaller than 4
> bytes). Now it's possible to argue that the bandaid might slow the
> bleeding to a survivable level, but I have to admit I'm doubtful.
>
> Here's the stats for a -s 500 run btw:
> Performance counter stats for 'system wide':
>   18,747  probe_postgres:TransactionIdSetTreeStatus
>   68,884  probe_postgres:TransactionIdGetStatus
>    9,718  probe_postgres:PGSemaphoreLock
> (the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
> SimpleLruReadPage_ReadOnly)
>
> My suspicion is that a better approach for now would be to take Simon's
> patch, but add a (per-page?) 'ClogModificationLock' to avoid the need
> of doing something fancier in TransactionIdSetStatusBit().
>

I think we can try that as well, and if you see better results by that approach, then we can use that instead of the current patch.
[1] - http://www.postgresql.org/message-id/cad__oujrdwqdjdovhahqldg-6ivu6ibci9ij1qpu6atuqpl...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
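To make the clog mechanics in this exchange concrete: clog stores two status bits per transaction, so one 8 kB page covers 32768 xids, and every xid mapping onto the same page funnels through the same control lock — which is why per-page locking and atomic status manipulation keep coming up. A simplified sketch of the addressing (hypothetical helper names, not the real clog.c/slru.c code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified sketch of clog addressing: 2 status bits per xact means
 * 4 xacts per byte and 32768 xacts per 8 kB page.  Status values here
 * follow the usual convention (0 = in progress, 1 = committed,
 * 2 = aborted, 3 = subcommitted), but this is illustrative code only.
 */
#define BLCKSZ              8192
#define CLOG_BITS_PER_XACT  2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

typedef uint32_t TransactionId;

/* which clog page an xid's status lives on */
static int clog_page_no(TransactionId xid)
{
    return (int) (xid / CLOG_XACTS_PER_PAGE);
}

static void clog_set_status(uint8_t *page, TransactionId xid, int status)
{
    int byteno = (int) ((xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE);
    int bshift = (int) ((xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT);

    /* clear the xid's 2 bits, then set the new status */
    page[byteno] = (uint8_t) ((page[byteno] & ~(0x03 << bshift)) |
                              (status << bshift));
}

static int clog_get_status(const uint8_t *page, TransactionId xid)
{
    int byteno = (int) ((xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE);
    int bshift = (int) ((xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT);

    return (page[byteno] >> bshift) & 0x03;
}
```

Note how four xids share a byte: this is why individual status updates cannot be made independent without either a lock covering the byte/page or a compare-and-swap loop, as discussed above.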
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2-socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >

One more point I wanted to make here is that I think the benefit will show up mainly when contention on the ClogControlLock is more than or near to that on ProcArrayLock; otherwise, even if the patch reduces contention (which you can see via LWLock stats), performance doesn't increase. From Mithun's data [1] related to LWLocks, it seems that at 88 clients in his test, the contention on CLogControlLock becomes more than that on ProcArrayLock, and that is the point where it starts showing a noticeable performance gain. I have explained some more about this point on that thread [2].

Is it possible for you to test once in a similar situation (like for a client count greater than the number of cores) and see the behaviour w.r.t. locking contention and TPS?

[1] - http://www.postgresql.org/message-id/cad__ouh6pahj+q1mzzjdzlo4v9gjyufegtjnayxc0_lfh-4...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/caa4ek1lboq4e3ycge+fe0euzvu89cqgtugneajoienxjr0a...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 1:48 PM, Petr Jelinek wrote:
>
> On 24/03/16 07:04, Dilip Kumar wrote:
>>
>> On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas wrote:
>>
>> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila wrote:
>> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
>> > it, rather than from the top, so even if RecordPageWithFreeSpace() doesn't
>> > update till the root, it will be able to search the newly added page. I agree
>> > with whatever you have said in another mail that we should introduce a new
>> > API to do a more targeted search for such cases.
>>
>> OK, let's do that, then.
>>
>> Ok, I have added a new API which just finds the free block, starting the
>> search from the last given page.
>>
>> 1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I
>> don't like this name, but I could not come up with any better one. Please
>> suggest one.
>>

1.
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+ Size spaceNeeded)
{
..
+ /*
+ * If fsm_set_and_search found a suitable new block, return that.
+ * Otherwise, search as usual.
+ */
..
}

In the above comment, you are referring to the wrong function.

2.
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+ Buffer buf;
+ int newslot = -1;
+
+ buf = fsm_readbuf(rel, addr, true);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (minValue != 0)
+ {
+ /* Search while we still hold the lock */
+ newslot = fsm_search_avail(buf, minValue,
+ addr.level == FSM_BOTTOM_LEVEL,
+ false);

In this new API, I don't understand why we need the minValue != 0 check; basically, if the user of the API doesn't want to search for space > 0, then what is the need of calling this API? I think this API should use an Assert for minValue != 0, unless you see a reason for not doing so.

> GetNearestPageWithFreeSpace?
> (although not sure that's an accurate description; maybe Nearby would be better)
>

Better than what is used in the patch. Yet another possibility could be to call it GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with the value of oldPage as InvalidBlockNumber.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
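The GetPageWithFreeSpaceExtended suggestion above follows a familiar shape: the plain entry point becomes a thin wrapper over an "Extended" variant, passing InvalidBlockNumber when there is no hint page. A minimal hypothetical sketch of that shape — the function bodies are stubs standing in for the real FSM tree lookup:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Hypothetical sketch of the suggested API shape.  The "Extended"
 * variant optionally starts the search near a caller-supplied old
 * page; the plain variant is a wrapper that searches from the top.
 * The bodies below are stubs, not the real freespace.c logic.
 */
static BlockNumber
GetPageWithFreeSpaceExtended(BlockNumber oldPage, int spaceNeeded)
{
    (void) spaceNeeded;         /* a real search would use this */

    if (oldPage != InvalidBlockNumber)
        return oldPage + 1;     /* stub: pretend the page after oldPage fits */
    return 0;                   /* stub: pretend a top-down search found block 0 */
}

static BlockNumber
GetPageWithFreeSpace(int spaceNeeded)
{
    /* existing call sites keep this signature, unchanged behaviour */
    return GetPageWithFreeSpaceExtended(InvalidBlockNumber, spaceNeeded);
}
```

The design benefit is that existing callers of GetPageWithFreeSpace need no changes, while the relation-extension path can pass the last-extended block as a hint.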
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote:
> >
> > Have you, in your evaluation of the performance of this patch, done
> > profiles over time? I.e. whether the performance benefits are there
> > immediately, or only after a significant amount of test time? Comparing
> > TPS over time, for both patched/unpatched, looks relevant.
> >
> I have mainly done it with half-hour read-write tests. What do you want to observe via smaller tests? Sometimes they give inconsistent data for read-write tests.
>

I have done some tests on both Intel and POWER m/c (configurations of which are mentioned at the end of this mail) to see the results at different time intervals, and it always shows greater than 50% improvement on the POWER m/c at 128 client count and greater than 29% improvement on the Intel m/c at 88 client count.

Non-default parameters
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout=35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup
scale factor - 300
used *unlogged* tables: pgbench -i --unlogged-tables -s 300 ..
pgbench -M prepared tpc-b

Results on Intel m/c (client count - 88)
Time (minutes)   Base    Patch   %
5                39978   51858   29.71
10               38169   52195   36.74
20               36992   52173   41.03
30               37042   52149   40.78

Results on POWER m/c (client count - 128)
Time (minutes)   Base    Patch   %
5                42479   65655   54.55
10               41876   66050   57.72
20               38099   65200   71.13
30               37838   61908   63.61

> > Even after changing to scale 500, the performance benefits on this,
> > older 2-socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
> I have tried this patch mainly on an 8-socket machine with 300 and 1000 scale factors. I am hoping that you have tried this test on unlogged tables; by the way, at what client count have you seen these results?
Do you think we don't see an increase in performance in your tests because of the m/c difference (sockets/CPU cores) or the client count?

Intel m/c config (lscpu) -
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
BogoMIPS: 4266.62
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K
NUMA node0 CPU(s): 0,65-71,96-103
NUMA node1 CPU(s): 72-79,104-111
NUMA node2 CPU(s): 80-87,112-119
NUMA node3 CPU(s): 88-95,120-127
NUMA node4 CPU(s): 1-8,33-40
NUMA node5 CPU(s): 9-16,41-48
NUMA node6 CPU(s): 17-24,49-56
NUMA node7 CPU(s): 25-32,57-64

Power m/c config (lscpu) -
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 4
Model: IBM,8286-42A
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
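For anyone re-deriving the result tables above: the % column is the plain relative TPS gain of patched over base. A trivial sketch of that arithmetic, checked against the first rows of the Intel and POWER tables:

```c
#include <assert.h>

/*
 * Relative TPS improvement, as used in the % column of the tables
 * above: 100 * (patched - base) / base.
 */
static double
tps_improvement_pct(double base, double patched)
{
    return 100.0 * (patched - base) / base;
}
```

For example, the Intel 5-minute row is 100 * (51858 - 39978) / 39978 ≈ 29.71%.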
Re: [HACKERS] Performance degradation in commit 6150a1b0
On Wed, Mar 23, 2016 at 1:59 PM, Ashutosh Sharma wrote:
>
> Hi All,
>
> I have been working on this issue for the last few days, trying to investigate what could be the probable reasons for the performance degradation at commit 6150a1b0. After going through Andres' patch for moving the buffer I/O and content lock out of the main tranche, the following two things come to my mind.
>
> 1. The content lock is no longer used as a pointer in the BufferDesc structure; instead it is included as an LWLock structure. This basically increases the overall structure size from 64 bytes to 80 bytes. Just to investigate this, I have reverted the changes related to the content lock from commit 6150a1b0 and taken at least 10 readings, and with this change I can see that the overall performance is similar to what was observed earlier, i.e. before commit 6150a1b0.
>
> 2. Secondly, I can see that the BufferDesc structure padding is 64 bytes; however, the PG cache line alignment is 128 bytes. After changing the BufferDesc structure padding size to 128 bytes, along with the changes mentioned in point #1 above, I see that the overall performance is again similar to what was observed before commit 6150a1b0.
>
> Please have a look at the attached test report that contains the performance test results for all the scenarios discussed above, and let me know your thoughts.
>

So this indicates that changing the content lock back to LWLock* in BufferDesc brings back the performance, which indicates that the increase in BufferDesc size to more than 64 bytes on this platform has caused the regression. I think it is worth trying the patch [1] suggested by Andres, as that will reduce the size of BufferDesc, which can bring back the performance. Can you try the same?

[1] - http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
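The two effects described above come down to simple padding arithmetic: descriptors are padded up to a fixed unit, and once the struct grows past that unit (64 → 80 bytes after embedding the lock), each descriptor occupies two units and neighbouring descriptors start sharing cache lines. A small illustrative sketch — the numbers are taken from the mail above, not from a real build:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a struct size up to a multiple of the padding unit -- the same
 * arithmetic that a union-with-char-array padding trick performs at
 * compile time for padded buffer descriptors.
 */
static size_t
pad_to(size_t size, size_t unit)
{
    return ((size + unit - 1) / unit) * unit;
}
```

With 64-byte padding, a 64-byte descriptor fits exactly one unit; an 80-byte descriptor pads out to 128 bytes, spanning two 64-byte lines (or, on a 128-byte-line platform, padding the struct directly to 128 bytes keeps each descriptor on its own line, matching observation #2 above).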
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2-socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
> I have tried this patch mainly on an 8-socket machine with 300 and 1000 scale factors. I am hoping that you have tried this test on unlogged tables; by the way, at what client count have you seen these results?
>
> > Afaics that squares with Jesper's result, which basically also didn't
> > show a difference either way?
> >
> One difference was that I think Jesper has done testing with synchronous_commit mode off, whereas my tests were with synchronous_commit mode on.
>

On again looking at the results posted by Jesper [1] and Mithun [2], I have one more observation, which is that in HEAD, the performance doesn't dip even at higher client counts (>75) in the tests done by Jesper, whereas the results of the tests done by Mithun indicate that it dips at high client counts (>64) in HEAD, and that is where the patch is helping. Now there is certainly some difference in test environment: Jesper has done testing on a 2-socket m/c, whereas mine and Mithun's tests were done on 4- or 8-socket m/c. So I think the difference in TPS due to reduced contention on CLogControlLock is mainly visible on high-socket m/c.

Can anybody having access to a 4-or-more-socket m/c help in testing this patch with --unlogged-tables?

[1] - http://www.postgresql.org/message-id/56e9a596.2000...@redhat.com
[2] - http://www.postgresql.org/message-id/cad__oujrdwqdjdovhahqldg-6ivu6ibci9ij1qpu6atuqpl...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas wrote:
>
> On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar wrote:
> >
> > Results:
> > --------
> > 1. With this, performance is a little less than v14, but the problem of extra
> > relation size is solved.
> > 2. With this, we can conclude that the extra size of the relation in v14 is
> > because, while extending the pages, some are not immediately available, and at
> > the end some of the pages are left unused.
>
> I agree with that conclusion. I'm not quite sure where that leaves
> us, though. We can go back to v13, but why isn't that producing extra
> pages? It seems like it should: whenever a bulk extend rolls over to
> a new FSM page, concurrent backends will search either the old or the
> new one but not both.
>

I have not debugged the flow, but by looking at the v13 code, it looks like it will search both old and new. In the call chain GetPageWithFreeSpaceExtended() -> fsm_search_from_addr() -> fsm_search_avail(), the basic idea of the search is: start the search from the target slot; at every step, move one node to the right, then climb up to the parent; stop when we reach a node with enough free space (as we must, since the root has enough space). So shouldn't it be able to find the new FSM page where the bulk extend rolls over?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
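The search order described above can be sketched on a toy max-heap FSM page (root at index 0, children of node i at 2i+1 and 2i+2, leaves holding free-space categories and each internal node the max of its children). In that indexing, "move one node to the right, then climb to the parent" collapses to node/2, since the parent of node+1 is (node+1-1)/2 = node/2. This is only an illustration — the real fsm_search_avail() also wraps at level boundaries and copes with concurrently changing values:

```c
#include <assert.h>

/*
 * Toy version of the FSM within-page search sketched above.  tree[] is
 * a perfect binary max-heap with nleaves leaves stored after the
 * internal nodes; returns the slot number of a leaf with at least
 * `min` free space, or -1 if none exists.
 */
static int
fsm_search_avail_toy(const int *tree, int nleaves, int start_slot, int min)
{
    int node = nleaves - 1 + start_slot;    /* heap index of the starting leaf */

    if (tree[0] < min)
        return -1;                          /* root says no slot has enough space */

    /*
     * Climb phase: while the current node lacks space, move one node to
     * the right and up to that node's parent -- simply node/2 here,
     * because parent(node + 1) == node / 2 in this layout.  The loop
     * must terminate, since the root is known to have enough space.
     */
    while (node != 0 && tree[node] < min)
        node = node / 2;

    /* Descend phase: walk down to a leaf, preferring the left child. */
    while (node < nleaves - 1)
    {
        int left = 2 * node + 1;

        node = (tree[left] >= min) ? left : left + 1;
    }
    return node - (nleaves - 1);            /* convert heap index back to slot */
}
```

Because the climb phase keeps moving right before going up, a search that starts in the "old" part of the tree can still reach subtrees added to its right — which is the behaviour being argued above.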
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila wrote:
>
> On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund wrote:
> >
> Updated comments and the patch (increate_clog_bufs_v2.patch)
> containing the same is attached.
>

Andres mentioned to me in an off-list discussion that he thinks we should first try to fix the clog buffers problem, as he sees in his tests that clog buffer replacement is one of the bottlenecks. He also suggested a test to me to see if the increase in buffers could lead to a regression. The basic idea of the test was to ensure that every clog access becomes a disk access. Based on his suggestion, I have written a SQL statement which will force every CLOG access to be a disk access, and the query used for the same is as below:

With ins AS (INSERT INTO test_clog_access values(default) RETURNING c1)
Select * from test_clog_access where c1 = (Select c1 from ins) - 32768 * :client_id;

Test Results -
HEAD - commit d12e5bb7, Clog Buffers - 32
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128

Patch_Ver/Client_Count   1       64
HEAD                     12677   57470
Patch-1                  12305   58079
Patch-2                  12761   58637

The above data is the median of 3 10-min runs. It indicates that there is no substantial dip from increasing clog buffers.

Test scripts used in the testing are attached with this mail. In perf_clog_access.sh, you need to change the data_directory path as per your m/c; you might also want to change the binary name if you want to create postgres binaries with different names.

Andres, is this test in line with what you have in mind?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

access_clog_prep.sql (attachment: binary data)
access_clog.sql (attachment: binary data)
perf_clog_access.sh (attachment: Bourne shell script)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 3:21 AM, Petr Jelinek wrote:
>
> On 28/03/16 14:46, Dilip Kumar wrote:
> >
> > Conclusion:
> > -----------
> > 1. I think v15 is solving the problem that exists with v13, and performance
> > is significantly higher compared to base, and relation size is also
> > stable. So IMHO v15 is the winner over the other solutions; what do others think?
> >
> It seems so. Do you have the ability to reasonably test with 64 clients? I am
> mostly wondering if we see the performance going further down or just
> plateau.
>

Yes, that makes sense. One more point is that if the reason for v13 giving better performance is extra blocks (which we believe in certain cases can leak till the time Vacuum updates the FSM tree), do you think it makes sense to test once by increasing the lockWaiters * 20 limit to maybe lockWaiters * 25 or lockWaiters * 30?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
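As a concrete reading of the knob being discussed: the patch extends the relation by a number of extra blocks proportional to the number of backends waiting on the extension lock, so one extension satisfies the whole queue. A hedged sketch of that heuristic — the multiplier comes from the thread (lockWaiters * 20), while the upper clamp is an assumed illustrative safety cap, not a value stated in this mail:

```c
#include <assert.h>

/* assumed illustrative cap on a single extension, not from this thread */
#define MAX_EXTRA_BLOCKS 512

/*
 * Sketch of the relation-extension heuristic under discussion: extend
 * by (waiters * multiplier) blocks, clamped, so the backends queued on
 * the extension lock all find a free page without re-contending.
 */
static int
extra_blocks_to_extend(int lock_waiters, int multiplier)
{
    int blocks = lock_waiters * multiplier;

    return (blocks > MAX_EXTRA_BLOCKS) ? MAX_EXTRA_BLOCKS : blocks;
}
```

Raising the multiplier to 25 or 30, as proposed above, trades more potentially leaked blocks (until Vacuum repairs the FSM) for fewer extension-lock waits.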
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 29, 2016 at 8:31 PM, David Steele wrote:
>
> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
> >
> > Hope this brings some light in.
>
> Do you know when you'll have time to respond to Michael's last email? I've
> marked this patch "waiting on author" in the meantime.
>

I think this patch needs to be looked at by a committer now. I did a review and added some code to this patch quite a while back (see the e-mail [1]); the patch is just the same as it was in October 2015. I think Michael and I are in agreement that this patch solves the reported problem. There is one similar problem [2], reported by Heikki, which I think can be fixed separately.

I think the main reason for moving this thread from bugs to hackers was to gain some more traction, which I see it has achieved to some extent, but I find that we are more or less in the same situation as we were back in October. Let me know if you think anything more from my side can help in moving the patch forward.

[1] - http://www.postgresql.org/message-id/cab7npqtnrw8lr1hd7zlnojjc1bp1evw_emadgorv+s-sbcr...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/566ef84f.1030...@iki.fi

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Updated backup APIs for non-exclusive backups
On Tue, Mar 29, 2016 at 11:39 PM, Magnus Hagander wrote:
>
> On Tue, Mar 29, 2016 at 6:40 PM, Magnus Hagander wrote:
> >
> > So - I can definitely see the argument for returning the stop wal
> > *location*. But I'm still not sure what the definition of the time would
> > be? We can't return it before we know what it means...
> >
>
> I had a chat with Heikki, and here's another suggestion:
>
> 1. We don't touch the current exclusive backups at all, as previously
> discussed, other than deprecating their use. For backwards compat.
>
> 2. For new backups, we return the contents of pg_control as a bytea from
> pg_stop_backup(). We tell backup programs they are supposed to write this
> out as pg_control.backup, *not* as pg_control.
>
> 3a. On recovery, if it's an exclusive backup, we do as we did before.
>
> 3b. On recovery, in non-exclusive backups (determined from backup_label),
> we check that pg_control.backup exists *and* that pg_control does *not*
> exist.
>

Currently pg_control is read before the backup_label file, so as per this proposal, do you want to change that? If yes, I think that will make this patch more invasive with respect to the handling of failure modes. Also, as David points out, I too feel that it will raise the bar for usage of this API.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com