Re: [HACKERS] WAL consistency check facility

2016-09-13 Thread Kuntal Ghosh
On Wed, Sep 14, 2016 at 6:34 AM, Michael Paquier
 wrote:
>>>2) Regarding page comparison:
>>>We could just use memcpy() here. compareImages was useful to get a
>>>clear image of what the inconsistencies were, but you don't do that
>>>anymore.
>> memcmp(), right?
>
>Yep :)
If I use memcmp(), I won't get the byte location where the first mismatch
occurred. It will be helpful to display the byte location that causes
the inconsistency.
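
For illustration, the kind of comparison loop I have in mind instead of
memcmp() (just a sketch of the idea, not the patch itself):

/* Return the offset of the first differing byte, or -1 if the images match. */
static int
first_mismatch_offset(const char *primary_image, const char *replay_image,
					  int size)
{
	int			i;

	for (i = 0; i < size; i++)
	{
		if (primary_image[i] != replay_image[i])
			return i;
	}
	return -1;
}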

>>>6)
>>>+   /*
>>>+* Remember that, if WAL consistency check is enabled for
>>>the current rmid,
>>>+* we always include backup image with the WAL record.
>>>But, during redo we
>>>+* restore the backup block only if needs_backup is set.
>>>+*/
>>>+   if (needs_backup)
>>>+   bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
>>>This should use wal_consistency[rmid]?
>>
>> needs_backup is set when XLogRecordAssemble decides that backup image
>> should be included in the record for redo purpose. This image will be
>> restored during
>> redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the included
>> image should be restored during redo(or has_image should be set or not).
>
>When decoding a record, I think that you had better not use has_image
>to assume that a FPW has to be double-checked. This has better be a
>different boolean flag, say check_page or similar. This way after
>decoding a record it is possible to know if there is a FPW, and if a
>check on it is needed or not.
I've made some modifications that remove the need to add
anything in DecodeXLogRecord().

Master
---
- If the wal_consistency check is enabled or needs_backup is set in
XLogRecordAssemble(), we include a FPW.
- If a FPW is to be included, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
which in turn sets the has_image flag while decoding the record.
- If the FPW needs to be restored during redo, i.e., needs_backup is true,
then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Standby
---
- In XLogReadBufferForRedoExtended(), if both XLogRecHasBlockImage() and
XLogRecHasBlockImageForRedo() (added by me*) return true, we restore the
backup image.
- In checkConsistency, we check a block only if XLogRecHasBlockImage() returns
true and the wal_consistency check is enabled for this rmid.

*XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
BKPIMAGE_IS_REQUIRED_FOR_REDO.

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Michael Paquier
On Wed, Sep 14, 2016 at 7:06 AM, Tom Lane  wrote:
> It would be good for someone else to reproduce my results though.
> For one thing, 5%-ish is not that far above the noise level; maybe
> what I'm measuring here is just good luck from relocation of critical
> loops into more cache-line-friendly locations.

From an OSX laptop with -S, -c 1 and -M prepared (9 runs, removed the
three best and three worst):
- HEAD: 9356/9343/9369
- HEAD + patch: 9433/9413/9461.071168
This laptop has a lot of I/O overhead... Still there is a slight
improvement here as well. Looking at the progress report, per-second
TPS climbs into the 9500~9600 range more frequently with the patch. So
at least I am seeing something.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-13 Thread Dilip Kumar
On Mon, Sep 5, 2016 at 9:33 AM, Amit Kapila  wrote:
> USE_CONTENT_LOCK on my windows box, you can try by commenting that as
> well, if it works for you).  So, in short we have to compare three
> approaches here.
>
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations

I have tested performance with approach 1 and approach 2.

1. Transaction (script.sql): I have used the transaction below to run my
benchmark. One can argue that this may not be an ideal workload, but I
tested this to put more load on CLOGControlLock during commit of the
transaction.

---
\set aid random (1,3000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
END;
---

2. Results
./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
scale factor: 300
Clients   head(tps)   grouplock(tps)   granular(tps)
-------   ---------   --------------   -------------
128       29367       39326            37421
180       29777       37810            36469
256       28523       37418            35882


grouplock --> 1) Group mode to reduce CLOGControlLock contention
granular  --> 2) Use granular locking model

I will test with 3rd approach also, whenever I get time.

3. Summary:
1. I can see that, compared to head, we are gaining almost ~30% performance
at higher client counts (128 and beyond).
2. Group lock is ~5% better compared to granular lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-13 Thread Kouhei Kaigai
Hello,

It looks to me like pg_buffercache tries to allocate more than 1GB using
palloc() when shared_buffers is more than 256GB.

# show shared_buffers ;
 shared_buffers

 280GB
(1 row)

# SELECT buffers, d.datname, coalesce(c.relname, '???')
FROM (SELECT count(*) buffers, reldatabase, relfilenode
FROM pg_buffercache group by reldatabase, relfilenode) b
   LEFT JOIN pg_database d ON d.oid = b.reldatabase
   LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
 WHERE datname = current_database())
   AND b.relfilenode = pg_relation_filenode(c.oid)
   ORDER BY buffers desc;
ERROR:  invalid memory alloc request size 1174405120

This is a situation where MemoryContextAllocHuge() should be used instead
of palloc(). Also, it may need back-patching?
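
For reference, the kind of change I have in mind (a sketch only, assuming the
failing allocation is the per-buffer array in pg_buffercache_pages()):

	/*
	 * With 280GB of shared_buffers, NBuffers * sizeof(BufferCachePagesRec)
	 * exceeds MaxAllocSize (1GB - 1), so plain palloc() rejects it; the
	 * "huge" allocator lifts that limit.
	 */
	fctx->record = (BufferCachePagesRec *)
		MemoryContextAllocHuge(CurrentMemoryContext,
							   sizeof(BufferCachePagesRec) * NBuffers);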

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-fix-pg_buffercache-palloc-huge.patch
Description: pgsql-fix-pg_buffercache-palloc-huge.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Thomas Munro
On Wed, Sep 14, 2016 at 12:06 AM, Tom Lane  wrote:
> I wrote:
>> At -j 10 -c 10, all else the same, I get 84928 TPS on HEAD and 90357
>> with the patch, so about 6% better.
>
> And at -j 1 -c 1, I get 22390 and 24040 TPS, or about 7% better with
> the patch.  So what I am seeing on OS X isn't contention of any sort,
> but just a straight speedup that's independent of the number of clients
> (at least up to 10).  Probably this represents less setup/teardown cost
> for kqueue() waits than poll() waits.

Thanks for running all these tests.  I hadn't considered OS X performance.

> So you could spin this as "FreeBSD's poll() implementation is better than
> OS X's", or as "FreeBSD's kqueue() implementation is worse than OS X's",
> but either way I do not think we're seeing the same issue that was
> originally reported against Linux, where there was no visible problem at
> all till you got to a couple dozen clients, cf
>
> https://www.postgresql.org/message-id/CAB-SwXbPmfpgL6N4Ro4BbGyqXEqqzx56intHHBCfvpbFUx1DNA%40mail.gmail.com
>
> I'm inclined to think the kqueue patch is worth applying just on the
> grounds that it makes things better on OS X and doesn't seem to hurt
> on FreeBSD.  Whether anyone would ever get to the point of seeing
> intra-kernel contention on these platforms is hard to predict, but
> we'd be ahead of the curve if so.

I was originally thinking of this as simply the obvious missing
implementation of Andres's WaitEventSet API which would surely pay off
later as we do more with that API (asynchronous execution with many
remote nodes for sharding, built-in connection pooling/admission
control for large numbers of sockets?, ...).  I wasn't really
expecting it to show performance increases in simple one or two
pipe/socket cases on small core count machines, and it's interesting
that it clearly does on OS X.

> It would be good for someone else to reproduce my results though.
> For one thing, 5%-ish is not that far above the noise level; maybe
> what I'm measuring here is just good luck from relocation of critical
> loops into more cache-line-friendly locations.

Similar results here on a 4 core 2.2GHz Core i7 MacBook Pro running OS
X 10.11.5.  With default settings except fsync = off, I ran pgbench -i
-s 100, then took the median result of three runs of pgbench -T 60 -j
4 -c 4 -M prepared -S.  I used two different compilers in case it
helps to see results with different random instruction cache effects,
and got the following numbers:

Apple clang 703.0.31: 51654 TPS -> 55739 TPS = 7.9% improvement
GCC 6.1.0 from MacPorts: 52552 TPS -> 55143 TPS = 4.9% improvement

I reran the tests under FreeBSD 10.3 on a 4 core laptop and again saw
absolutely no measurable difference at 1, 4 or 24 clients.  Maybe a
big enough server could be made to contend on the postmaster pipe's
selinfo->si_mtx, in selrecord(), in pipe_poll() -- maybe that'd be
directly equivalent to what happened on multi-socket Linux with
poll(), but I don't know.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Push down more full joins in postgres_fdw

2016-09-13 Thread Ashutosh Bapat
On Tue, Sep 13, 2016 at 10:28 PM, Robert Haas  wrote:
> On Tue, Sep 6, 2016 at 9:07 AM, Ashutosh Bapat
>  wrote:
>> That's not true with the alias information. As long as we detect which
>> relations need subqueries, their RTIs are enough to create unique aliases
>> e.g. if a relation involves RTIs 1, 2, 3 corresponding subquery can have
>> alias r123 without conflicting with any other alias.
>
> What if RTI 123 is also used in the query?

Good catch. I actually meant some combination of 1, 2 and 3, which is
unique for a join between r1, r2 and r3.  How about r1_2_3 or
r1_r2_r3?
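
Something along these lines, for illustration (a rough sketch; "relids" is
assumed to be the joinrel's Relids set and the function name is made up):

static char *
make_subquery_alias(Relids relids)
{
	StringInfoData buf;
	int			rti = -1;
	bool		first = true;

	initStringInfo(&buf);
	appendStringInfoChar(&buf, 'r');

	/* e.g. relids {1,2,3} => "r1_2_3", which can't collide with "r123" */
	while ((rti = bms_next_member(relids, rti)) >= 0)
	{
		appendStringInfo(&buf, first ? "%d" : "_%d", rti);
		first = false;
	}
	return buf.data;
}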

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 12:21 AM, Robert Haas  wrote:

> On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada 
> wrote:
> > Attached PoC patch changes the representation of dead tuple locations
> > to the hashmap having tuple bitmap.
> > The one hashmap entry consists of the block number and the TID bitmap
> > of corresponding block, and the block number is the hash key of
> > hashmap.
> > The current implementation of this patch is not smart yet because each
> > hashmap entry allocates the tuple bitmap with a fixed
> > size (LAZY_ALLOC_TUPLES), so each hash entry can store up to
> > LAZY_ALLOC_TUPLES (291 if the block size is 8kB) tuples.
> > In the case where one block can store only several tens of tuples,
> > most of the bits would be wasted.
> >
> > After improving this patch as you suggested, I will measure the
> > performance benefit.
>
>
>
> Now, if I'm reading it correctly, this patch allocates a 132-byte
> structure for every page with at least one dead tuple.  In the worst
> case where there's just one dead tuple per page, that's a 20x
> regression in memory usage.  Actually, it's more like 40x, because
> AllocSetAlloc rounds small allocation sizes up to the next-higher
> power of two, which really stings for a 132-byte allocation, and then
> adds a 16-byte header to each chunk.  But even 20x is clearly not
> good.  There are going to be lots of real-world cases where this uses
> way more memory to track the same number of dead tuples, and I'm
> guessing that benchmarking is going to reveal that it's not faster,
> either.
>
>
Sawada-san offered to reimplement the patch based on what I proposed
upthread. In the new scheme of things, we will allocate a fixed-size bitmap
of length 2D bits per page, where D is the average page density of live + dead
tuples. (The rationale behind multiplying D by a factor of 2 is to consider the
worst-case scenario where every tuple also has an LP_REDIRECT line
pointer.) The value of D in most real-world, large tables should not go
much beyond, say, 100, assuming an 80-byte-wide tuple and 8K block size. That
translates to about 25 bytes/page. So all TIDs with offset less than 2D can
be represented by a single bit. We augment this with an overflow to track
tuples which fall outside this limit. I believe this area will be small,
say 10% of the total allocation.
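
A rough sketch of the per-page entry, just to illustrate how I read the
proposal (the names, the 2D constant and the overflow helper are placeholders,
not Sawada-san's code):

#define VAC_PAGE_BITS	200		/* 2 * D, assuming D ~ 100 tuples per page */

typedef struct DeadTuplesEntry
{
	BlockNumber blkno;						/* hash key */
	uint8		bitmap[VAC_PAGE_BITS / 8];	/* one bit per offset 1..2D */
} DeadTuplesEntry;

static void
record_dead_tuple(DeadTuplesEntry *entry, OffsetNumber offnum)
{
	if (offnum <= VAC_PAGE_BITS)
		entry->bitmap[(offnum - 1) / 8] |= 1 << ((offnum - 1) % 8);
	else
		record_dead_tuple_overflow(entry->blkno, offnum);	/* placeholder */
}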

This representation is at least as good as the current representation if there
are at least 4-5% dead tuples. I don't think very large tables will be
vacuumed with a scale factor less than that. And assuming 10% dead tuples,
this representation will actually be much more optimal.

The idea can fail when (a) there are very few dead tuples in the table, say
less than 5%, or (b) there are a large number of tuples falling outside the 2D
limit. While I don't expect either of these to hold for real-world, very large
tables, (a) can be anticipated when the vacuum starts, so we can use the
current representation, and (b) can be detected at run time, with a one-time
switch between representations. You may argue that managing two representations
is clumsy, which I agree with, but the code is completely isolated and probably
not more than a few hundred lines.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] condition variables

2016-09-13 Thread Peter Geoghegan
On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas  wrote:
> Another approach to the problem is to use a latch wait loop.  That
> almost works.  Interrupts can be serviced, and you can recheck shared
> memory to see whether the condition for proceeding is satisfied after
> each iteration of the loop.  There's only one problem: when you do
> something that might cause the condition to be satisfied for other
> waiting backends, you need to set their latch - but you don't have an
> easy way to know exactly which processes are waiting, so how do you
> call SetLatch?  I originally thought of adding a function like
> SetAllLatches(ParallelContext *) and maybe that can work, but then I
> had what I think is a better idea, which is to introduce a notion of
> condition variables.
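
(For reference, the latch-wait pattern described there looks roughly like the
sketch below; shared_condition_is_satisfied() is just a placeholder, and the
exact WaitLatch() arguments vary between releases.)

static void
wait_for_condition(void)
{
	for (;;)
	{
		ResetLatch(MyLatch);

		if (shared_condition_is_satisfied())	/* recheck shared memory */
			break;

		WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L);
		CHECK_FOR_INTERRUPTS();
	}
}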

I don't see a CF entry for this. Are you planning to work on this
again soon, Robert?

I have an eye on this patch due to my work on parallel CREATE INDEX.
It would be nice to have some rough idea of when you intend to commit
this.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Comment typo in execIndexing.c

2016-09-13 Thread Amit Langote
Spotted a typo in the header comment of execIndexing.c.  Attached fixes it.

s/exclusive constraints/exclusion constraints/g

Thanks,
Amit
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0e2d834..009c1b7 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -2,7 +2,7 @@
  *
  * execIndexing.c
  *	  routines for inserting index tuples and enforcing unique and
- *	  exclusive constraints.
+ *	  exclusion constraints.
  *
  * ExecInsertIndexTuples() is the main entry point.  It's called after
  * inserting a tuple to the heap, and it inserts corresponding index tuples

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-13 Thread Amit Kapila
On Tue, Sep 13, 2016 at 5:46 PM, Robert Haas  wrote:
> On Thu, Sep 8, 2016 at 12:32 AM, Amit Kapila  wrote:
>> Hmm.  I think page or block is a concept of database systems and
>> buckets is a general concept used in hashing technology.  I think the
>> difference is that there are primary buckets and overflow buckets. I
>> have checked how they are referred to in one of the wiki pages [1];
>> search for overflow on that wiki page. Now, I think we shouldn't be
>> inconsistent in using them. I will change them to be the same if I find
>> any inconsistency, based on what you or other people think is the
>> better way to refer to the overflow space.
>
> In the existing source code, the terminology 'overflow page' is
> clearly preferred to 'overflow bucket'.
>

Okay, point taken.  Will update it in the next version of the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Tatsuo Ishii
> As I understand the issue, the main reason is that event triggers
> execute procedures, and those exist in a single database only.  If you
> were to create an event trigger in database A, then a user gets created
> in database B, your function would not be invoked, which becomes a
> problem.

Can't we just create the event trigger in database B as well? We have
been living with a similar situation with, for example, functions, for years.
(A workaround would be to create the functions in template1, though this only
works for freshly created databases.)
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Craig Ringer
On 14 Sep. 2016 9:44 am, "Alvaro Herrera"  wrote:
>
> Tatsuo Ishii wrote:
> > Simple question: Is there any reason for event trigger to not support
> > CREATE/ALTER ROLE/USER?
>
> As I understand the issue, the main reason is that event triggers
> execute procedures, and those exist in a single database only.  If you
> were to create an event trigger in database A, then a user gets created
> in database B, your function would not be invoked, which becomes a
> problem.

Yeah... You'd need something at the C level as a hook because you can't
rely on pg_proc etc.

For BDR I've been thinking of optionally replicating such actions using a
ProcessUtility_hook and deparse, or simply verbatim SQL capture, then
replicating via generic logical WAL messages.
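
A rough sketch of that kind of hook, for illustration only (the hook signature
shown is the 9.6-era one and changes between releases; the capture action is
just an elog placeholder):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"
#include "tcop/utility.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static void
capture_role_ddl(Node *parsetree, const char *queryString,
				 ProcessUtilityContext context, ParamListInfo params,
				 DestReceiver *dest, char *completionTag)
{
	/* Role DDL never fires event triggers, but it does pass through here. */
	if (IsA(parsetree, CreateRoleStmt) || IsA(parsetree, AlterRoleStmt))
		elog(LOG, "captured role DDL: %s", queryString);

	if (prev_ProcessUtility)
		prev_ProcessUtility(parsetree, queryString, context,
							params, dest, completionTag);
	else
		standard_ProcessUtility(parsetree, queryString, context,
								params, dest, completionTag);
}

void
_PG_init(void)
{
	prev_ProcessUtility = ProcessUtility_hook;
	ProcessUtility_hook = capture_role_ddl;
}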


Re: [HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Alvaro Herrera
Tatsuo Ishii wrote:
> Simple question: Is there any reason for event trigger to not support
> CREATE/ALTER ROLE/USER?

As I understand the issue, the main reason is that event triggers
execute procedures, and those exist in a single database only.  If you
were to create an event trigger in database A, then a user gets created
in database B, your function would not be invoked, which becomes a
problem.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DISABLE_PAGE_SKIPPING option for vacuumdb

2016-09-13 Thread Masahiko Sawada
On Wed, Sep 14, 2016 at 2:17 AM, Robert Haas  wrote:
> On Thu, Sep 8, 2016 at 1:32 PM, Masahiko Sawada  wrote:
>> The attached patch adds a --disable-page-skipping option to the vacuumdb
>> command for 9.6.
>> If by any chance the freeze map causes a serious data corruption bug,
>> then the user will want to run pg_check_visible() and vacuum with the
>> DISABLE_PAGE_SKIPPING option on the multiple tables having the problem.
>> In this case, I think that this option for vacuumdb would be convenient.
>>
>> Please give me feedback.
>
> I think this isn't really necessary.
>

Thank you for comment.

Okay, I thought it would be useful for users who want to vacuum specified
tables with DISABLE_PAGE_SKIPPING, but it's non-essential.
I'll withdraw it. Thanks.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
> On Tue, Sep 13, 2016 at 3:48 AM, Kouhei Kaigai  wrote:
> > Sorry for my late.
> >
> > The attached patch fixed the wording problems on SGML part.
> 
> I agree that we should have some way for foreign data wrappers and
> custom scans and perhaps also other executor nodes to find out whether
> there's a known limit to the number of tuples that they might need to
> produce, but I wonder if we should be doing something more general
> than this.  For example, suppose we add a new PlanState member "long
> numTuples" where 0 means that the number of tuples that will be needed
> is unknown (so that most node types need not initialize it), a
> positive value is an upper bound on the number of tuples that will be
> fetched, and -1 means that it is known for certain that we will need
> all of the tuples.  This might be relevant to the executor batching
> stuff that Andres has been working on, because you could for example
> set ps->numTuples == 1 on the inner side of a semi-join, warning the
> executor node that it shouldn't bother trying to batch anything.
>
I also think the generic approach is a preferable direction.

The current implementation calls recompute_limits() on the first
invocation of ExecLimit and ExecReScanLimit. Do we expect that
ps->numTuples will also be passed down to the child nodes at the same
time?
I also think this new executor contract should be considered a hint
(but not a requirement) for the child nodes, because that allows the
parent node to re-distribute the upper limit regardless of the type
of the child nodes, as long as the parent node works correctly and
still benefits even if the child node returns only a part of the tuples.
It makes the decision of whether the upper limit should be passed down
much simpler.
The child node "can" ignore the hint but can also utilize it for more
optimization.

> On a more practical level, I notice that you haven't adapted
> postgres_fdw or file_fdw to benefit from this new callback.  It seems
> like postgres_fdw could benefit, because it could fetch only the
> required number of tuples if that happens to be a smaller number than
> the configured fetch_size.
>
That is just because of time pressure around the patch submission days.
I'll try to enhance postgres_fdw as a user of this run-time optimization.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-13 Thread Michael Paquier
On Tue, Sep 13, 2016 at 6:07 PM, Kuntal Ghosh
 wrote:
>> For now, I've kept this as a WARNING message to detect all inconsistencies
>> at once. Once the patch is finalized, I'll change it to an ERROR message.
>
> Or say FATAL. This way the server is taken down.

What I'd really like to see here is a way to quickly identify in the
buildfarm the moment an inconsistent WAL has been introduced by a new
feature, some bug fix, or perhaps a deficiency in the masking
routines. We could definitely tune that later on, by controlling with
a GUC if this generates a WARNING instead of a FATAL, the former being
more useful for production environments, and the latter for tests. It
would be good to think as well about a set of tests; one rough approach
would be to modify an on-disk page of a table, and work on that to
force an inconsistency to be triggered.

>>A couple of extra thoughts:
>>1) The routines of each rmgr are located in a dedicated file, for
>>example GIN stuff is in ginxlog.c, etc. It seems to me that it would
>>be better to move each masking function where it should be instead
>>being centralized. A couple of routines need to be centralized, so I'd
>>suggest putting them in a new file, like xlogmask.c, xlog because now
>>this is part of WAL replay completely, including the lsn, the hint
>>bint and the other common routines.
> Sounds good. But I think that the file name for the common masking routines
> should be bufmask.c, since we are masking buffers only.

That makes sense as well. No objections to that.

>>2) Regarding page comparison:
>>We could just use memcpy() here. compareImages was useful to get a
>>clear image of what the inconsistencies were, but you don't do that
>>anymore.
> memcmp(), right?

Yep :)

>>6)
>>+   /*
>>+* Remember that, if WAL consistency check is enabled for
>>the current rmid,
>>+* we always include backup image with the WAL record.
>>But, during redo we
>>+* restore the backup block only if needs_backup is set.
>>+*/
>>+   if (needs_backup)
>>+   bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
>>This should use wal_consistency[rmid]?
>
> needs_backup is set when XLogRecordAssemble decides that backup image
> should be included in the record for redo purpose. This image will be
> restored during
> redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the included
> image should be restored during redo(or has_image should be set or not).

When decoding a record, I think that you had better not use has_image
to assume that a FPW has to be double-checked. This has better be a
different boolean flag, say check_page or similar. This way after
decoding a record it is possible to know if there is a FPW, and if a
check on it is needed or not.

>>7) This patch has zero documentation. Please add some. Any human being
>>on this list other than those who worked on the first versions
>>(Heikki, Simon and I?) is going to have a hard time to review this
>>patch in details moving on if there is no reference to tell what this
>>feature does for the user...
>>
>>This patch is going to the good direction, but I don't think it's far
>>from being ready for commit yet. So I am going to mark it as returned
>>with feedback if there are no objections
>
> I think the only major change that this patch needs is proper and detailed
> documentation. Other than that, there are very minor changes which can
> be done quickly. Right?

It seems to me that you need to think about the way to document things
properly first, with for example:
- Have a first documentation patch that explains what is a resource
manager for WAL, and what are the types available with a nice table.
- Add in your patch documentation to explain what are the benefits of
using this facility, the main purpose is testing, but there are also
mention upthread about users that would like to get that into
production, assuming that the overhead is minimal.
- Add more comments in your code to finish. One example is
checkConsistency() that is here, but explains nothing.

Well, if you'd simply use an on/off switch to control the feature, the
documentation load for rmgrs would be zero, but as I am visibly
outnumbered in this fight... We could also have an off/on switch
implemented first, and extend that later on depending on the feedback
from other users. We discussed rmgr-level or relation-level tuning of
FPW compression at some point, but we ended up with the simplest
approach, and we still stick with it.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gettimeofday is at the end of its usefulness?

2016-09-13 Thread Haribabu Kommi
On Thu, Aug 25, 2016 at 3:12 PM, Haribabu Kommi 
wrote:

> On Thu, Jun 9, 2016 at 12:56 AM, Tom Lane  wrote:
> > Thom Brown  writes:
> >> On 15 May 2014 at 19:56, Bruce Momjian  wrote:
> >>> On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:
>  A recent question from Tim Kane prompted me to measure the overhead
>  costs of EXPLAIN ANALYZE, which I'd not checked in awhile.  Things
>  are far worse than I thought.  On my current server (by no means
>  lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
>  at something like 110 nsec per row:
> >
> >> Did this idea die, or is it still worth considering?
> >
> > We still have a problem, for sure.  I'm not sure that there was any
> > consensus on what to do about it.  Using clock_gettime(CLOCK_REALTIME)
> > if available would be a straightforward change that should ameliorate
> > gettimeofday()'s 1-usec-precision-limit problem; but it doesn't do
> > anything to fix the excessive-overhead problem.  The ideas about the
> > latter were all over the map, and none of them looked easy.
> >
> > If you're feeling motivated to work on this area, feel free.
>
> How about using both CLOCK_REALTIME and CLOCK_REALTIME_COARSE
> as the clock ids in clock_gettime() wherever applicable? The COARSE option
> would be used wherever no precise timing calculation is required, because
> on my laptop a significant performance difference (about 8x) is observed
> compared to CLOCK_REALTIME.
>
> If it is fine, I will try to update the code and send a patch.
>

Attached is a patch that replaces most of the gettimeofday() function calls,
except the user-callable timeofday() and the GetCurrentTimestamp functions.

I didn't add any configure checks for the case where the clock_gettime
function is not available, nor the fallback logic to a gettimeofday() call.

Any comments on proceeding further?
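
For anyone who wants to reproduce the difference, here is a minimal standalone
comparison of the two clock ids (illustration only, not part of the attached
patch; assumes a Linux system where CLOCK_REALTIME_COARSE is available):

#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec precise;
	struct timespec coarse;

	/* full-precision clock: used where the timings actually matter */
	clock_gettime(CLOCK_REALTIME, &precise);

	/* tick-resolution clock: much cheaper, for non-critical timestamps */
	clock_gettime(CLOCK_REALTIME_COARSE, &coarse);

	printf("precise: %ld.%09ld\n", (long) precise.tv_sec, precise.tv_nsec);
	printf("coarse:  %ld.%09ld\n", (long) coarse.tv_sec, coarse.tv_nsec);
	return 0;
}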

Regards,
Hari Babu
Fujitsu Australia


clock_gettime_1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Tatsuo Ishii
 Simple question: Is there any reason for event trigger to not support
 CREATE/ALTER ROLE/USER?
>>>
>>> Because it performs a global administrative task, no?
>>
>> So is there any technical difficulty?
> 
> I don't recall the exact details but...
> 
>> Can you please elaborate?
> 
> Roles are global objects, and event triggers cannot operate on such
> ones. See for example commit 296f3a60, which mentions why REVOKE/GRANT
> can only operate on local objects.

Thanks. I will look into it.

Pgpool-II has its own password file which needs to be kept in sync with the
md5-hashed passwords in PostgreSQL. So I thought it would be nice if
an event trigger could be used for the purpose.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Michael Paquier
On Wed, Sep 14, 2016 at 9:34 AM, Tatsuo Ishii  wrote:
>>> Simple question: Is there any reason for event trigger to not support
>>> CREATE/ALTER ROLE/USER?
>>
>> Because it performs a global administrative task, no?
>
> So is there any technical difficulty?

I don't recall the exact details but...

> Can you please elaborate?

Roles are global objects, and event triggers cannot operate on such
ones. See for example commit 296f3a60, which mentions why REVOKE/GRANT
can only operate on local objects.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Tatsuo Ishii
>> Simple question: Is there any reason for event trigger to not support
>> CREATE/ALTER ROLE/USER?
> 
> Because it performs a global administrative task, no?

So is there any technical difficulty? Can you please elaborate?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Michael Paquier
On Wed, Sep 14, 2016 at 9:26 AM, Tatsuo Ishii  wrote:
> Simple question: Is there any reason for event trigger to not support
> CREATE/ALTER ROLE/USER?

Because it performs a global administrative task, no?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Event trigger and CREATE/ALTER ROLE/USER

2016-09-13 Thread Tatsuo Ishii
Simple question: Is there any reason for event trigger to not support
CREATE/ALTER ROLE/USER?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup wish list

2016-09-13 Thread Michael Paquier
On Wed, Sep 14, 2016 at 6:52 AM, Peter Eisentraut
 wrote:
> On 9/12/16 3:31 PM, Tom Lane wrote:
>> Hm, there was just a kerfuffle about spelling things like this
>> "--no-clean" etc.  I wasn't on board with removing existing spellings,
>> but surely all newly added switches should use the "no-" spelling?
>
> This is supposed to be parallel to initdb's option.  So if we rename or
> migrate the one in initdb, then we can just hard-rename this one.  But
> until that is decided, it seems better to keep it the same as initdb has
> right now.

PostgresNode.pm had better use the new --noclean option in its calls;
the new default is not useful for debugging. Please see attached.
-- 
Michael
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index fede1e6..0b8c8e6 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -476,7 +476,7 @@ sub backup
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-p', $port,
-		'-x');
+		'-x', '--noclean');
 	print "# Backup finished\n";
 }
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autonomous transactions

2016-09-13 Thread Robert Haas
On Tue, Sep 13, 2016 at 5:06 PM, Petr Jelinek  wrote:
> I mostly agree. I think if this was called something like background
> transactions it might be better. It's definitely useful functionality but
> the naming is clearly contentious. It won't stop people using it for the same
> use-cases as autonomous transactions though (which is fine).

Quite right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-13 Thread Andres Freund
Hi,

First read through the current version. Hence no real architectural
comments.

On 2016-09-09 00:59:26 +0200, Petr Jelinek wrote:

> diff --git a/src/backend/commands/publicationcmds.c 
> b/src/backend/commands/publicationcmds.c
> new file mode 100644
> index 000..e0c719d
> --- /dev/null
> +++ b/src/backend/commands/publicationcmds.c
> @@ -0,0 +1,761 @@
> +/*-
> + *
> + * publicationcmds.c
> + *   publication manipulation
> + *
> + * Copyright (c) 2015, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *   publicationcmds.c
>

Not that I'm a fan of this line in the first place, but usually it does
include the path.


> +static void
> +check_replication_permissions(void)
> +{
> + if (!superuser() && !has_rolreplication(GetUserId()))
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +  (errmsg("must be superuser or replication role 
> to manipulate publications";
> +}

Do we want to require owner privileges for replication roles? I'd say
no, but want to raise the question.


> +ObjectAddress
> +CreatePublication(CreatePublicationStmt *stmt)
> +{
> + Relationrel;
> + ObjectAddress myself;
> + Oid puboid;
> + boolnulls[Natts_pg_publication];
> + Datum   values[Natts_pg_publication];
> + HeapTuple   tup;
> + boolreplicate_insert_given;
> + boolreplicate_update_given;
> + boolreplicate_delete_given;
> + boolreplicate_insert;
> + boolreplicate_update;
> + boolreplicate_delete;
> +
> + check_replication_permissions();
> +
> + rel = heap_open(PublicationRelationId, RowExclusiveLock);
> +
> + /* Check if name is used */
> + puboid = GetSysCacheOid1(PUBLICATIONNAME, 
> CStringGetDatum(stmt->pubname));
> + if (OidIsValid(puboid))
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_DUPLICATE_OBJECT),
> +  errmsg("publication \"%s\" already exists",
> + stmt->pubname)));
> + }
> +
> + /* Form a tuple. */
> + memset(values, 0, sizeof(values));
> + memset(nulls, false, sizeof(nulls));
> +
> + values[Anum_pg_publication_pubname - 1] =
> + DirectFunctionCall1(namein, CStringGetDatum(stmt->pubname));
> +
> + parse_publication_options(stmt->options,
> +   &replicate_insert_given, &replicate_insert,
> +   &replicate_update_given, &replicate_update,
> +   &replicate_delete_given, &replicate_delete);
> +
> + values[Anum_pg_publication_puballtables - 1] =
> + BoolGetDatum(stmt->for_all_tables);
> + values[Anum_pg_publication_pubreplins - 1] =
> + BoolGetDatum(replicate_insert);
> + values[Anum_pg_publication_pubreplupd - 1] =
> + BoolGetDatum(replicate_update);
> + values[Anum_pg_publication_pubrepldel - 1] =
> + BoolGetDatum(replicate_delete);
> +
> + tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
> +
> + /* Insert tuple into catalog. */
> + puboid = simple_heap_insert(rel, tup);
> + CatalogUpdateIndexes(rel, tup);
> + heap_freetuple(tup);
> +
> + ObjectAddressSet(myself, PublicationRelationId, puboid);
> +
> + /* Make the changes visible. */
> + CommandCounterIncrement();
> +
> + if (stmt->tables)
> + {
> + List   *rels;
> +
> + Assert(list_length(stmt->tables) > 0);
> +
> + rels = GatherTableList(stmt->tables);
> + PublicationAddTables(puboid, rels, true, NULL);
> + CloseTables(rels);
> + }
> + else if (stmt->for_all_tables || stmt->schema)
> + {
> + List   *rels;
> +
> + rels = GatherTables(stmt->schema);
> + PublicationAddTables(puboid, rels, true, NULL);
> + CloseTables(rels);
> + }

Isn't this (and ALTER) racy? What happens if tables are concurrently
created? This session wouldn't necessarily see the tables, and other
sessions won't see for_all_tables/schema.   Evaluating
for_all_tables/all_in_schema when the publication is used, would solve
that problem.

> +/*
> + * Gather all tables, optionally filtered by schema name.
> + * The gathered tables are locked in access share lock mode.
> + */
> +static List *
> +GatherTables(char *nspname)
> +{
> + Oid nspid = InvalidOid;
> + List   *rels = NIL;
> + Relationrel;
> + SysScanDesc scan;
> + ScanKeyData key[1];
> + HeapTuple   tup;
> +
> + /* Resolve and validate the schema if specified */
> + if (nspname)
> + {
> +

Re: [HACKERS] kqueue

2016-09-13 Thread Tom Lane
I wrote:
> At -j 10 -c 10, all else the same, I get 84928 TPS on HEAD and 90357
> with the patch, so about 6% better.

And at -j 1 -c 1, I get 22390 and 24040 TPS, or about 7% better with
the patch.  So what I am seeing on OS X isn't contention of any sort,
but just a straight speedup that's independent of the number of clients
(at least up to 10).  Probably this represents less setup/teardown cost
for kqueue() waits than poll() waits.

So you could spin this as "FreeBSD's poll() implementation is better than
OS X's", or as "FreeBSD's kqueue() implementation is worse than OS X's",
but either way I do not think we're seeing the same issue that was
originally reported against Linux, where there was no visible problem at
all till you got to a couple dozen clients, cf

https://www.postgresql.org/message-id/CAB-SwXbPmfpgL6N4Ro4BbGyqXEqqzx56intHHBCfvpbFUx1DNA%40mail.gmail.com

I'm inclined to think the kqueue patch is worth applying just on the
grounds that it makes things better on OS X and doesn't seem to hurt
on FreeBSD.  Whether anyone would ever get to the point of seeing
intra-kernel contention on these platforms is hard to predict, but
we'd be ahead of the curve if so.

It would be good for someone else to reproduce my results though.
For one thing, 5%-ish is not that far above the noise level; maybe
what I'm measuring here is just good luck from relocation of critical
loops into more cache-line-friendly locations.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v19)

2016-09-13 Thread Tomas Vondra

Hi,

Thanks for looking into this!

On 09/12/2016 04:08 PM, Dean Rasheed wrote:

On 3 August 2016 at 02:58, Tomas Vondra  wrote:

Attached is v19 of the "multivariate stats" patch series


Hi,

I started looking at this - just at a very high level - I've not read
much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS
makes sense, and I agree with other suggestions upthread that it would
be useful to be able to build statistics on arbitrary expressions,
although that doesn't need to be part of this patch, it's useful to
keep that in mind as a possible future extension of this initial
design.

I can imagine it being useful to be able to create user-defined
statistics on an arbitrary list of expressions, and I think that would
include univariate as well as multivariate statistics. Perhaps that's
something to take into account in the naming of things, e.g., as David
Rowley suggested, something like pg_statistic_ext, rather than
pg_mv_statistic.

I also like the idea that this might one day be extended to support
statistics across multiple tables, although I think that might be
challenging to achieve -- you'd need a method of taking a random
sample of rows from a join between 2 or more tables. However, if the
intention is to be able to support that one day, I think that needs to
be accounted for in the syntax now -- specifically, I think it will be
too limiting to only support things extending the current syntax of
the form table1(col1, col2, ...), table2(col1, col2, ...), because
that precludes building statistics on an expression referring to
columns from more than one table. So I think we should plan further
ahead and use a syntax giving greater flexibility in the future, for
example something structured more like a query (like CREATE VIEW):

CREATE STATISTICS name
  [ WITH (options) ]
  ON expression [, ...]
  FROM table [, ...]
  WHERE condition

where the first version of the patch would only support expressions
that are simple column references, and would require at least 2 such
columns from a single table with no WHERE clause, i.e.:

CREATE STATISTICS name
  [ WITH (options) ]
  ON column1, column2 [, ...]
  FROM table

For multi-table statistics, a WHERE clause would typically be needed
to specify how the tables are expected to be joined, but potentially
such a clause might also be useful in single-table statistics, to
build partial statistics on a commonly queried subset of the table,
just like a partial index.


Hmm, the "partial statistics" idea seems interesting, It would allow us 
to provide additional / more detailed statistics only for a subset of a 
table.


I'm however not sure about the join case - how would the syntax work 
with outer joins? But as you said, we only need


 CREATE STATISTICS name
   [ WITH (options) ]
   ON (column1, column2 [, ...])
   FROM table
   WHERE condition

until we add support for join statistics.



Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using

  P(a,b) = P(a) * (f + (1-f) * P(b))
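
(Expressed as code, that suggested combination rule would be something like
the tiny sketch below, where f is the estimated degree of dependence of b on
a; this is just for concreteness, not anything in the patch:)

static double
combine_selectivities(double sel_a, double sel_b, double f)
{
	/* f = 0 gives plain independence; f = 1 gives full dependence, P(a) */
	return sel_a * (f + (1.0 - f) * sel_b);
}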



Yeah, I agree that the thresholds resulting in sudden changes between 
"dependent" and "not dependent" are annoying. The question is whether it 
makes sense to fix that, though - the functional dependencies were meant 
as the simplest form of statistics, allowing us to get the rest of the 
infrastructure in.


I'm OK with replacing the true/false dependencies with a degree of 
dependency between 0 and 1, but I'm a bit afraid it'll result in 
complaints that the first patch got too large / complicated.


It also contradicts the idea of using functional dependencies as a 
low-overhead type of statistics, filtering the list of clauses that need 
to be estimated using more expensive types of statistics (MCV lists, 
histograms, ...). Switching to a degree of dependency would prevent 
removal of "unnecessary" clauses.



Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not 

Re: [HACKERS] An extra error for client disconnection on Windows

2016-09-13 Thread Michael Paquier
On Tue, Sep 13, 2016 at 10:00 PM, Alvaro Herrera
 wrote:
> Yeah, I looked into this a few days ago and that was my conclusion also:
> let's drop this.

Okay. Just done so in the CF app.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup wish list

2016-09-13 Thread Peter Eisentraut
On 9/12/16 3:31 PM, Tom Lane wrote:
> Hm, there was just a kerfuffle about spelling things like this
> "--no-clean" etc.  I wasn't on board with removing existing spellings,
> but surely all newly added switches should use the "no-" spelling?

This is supposed to be parallel to initdb's option.  So if we rename or
migrate the one in initdb, then we can just hard-rename this one.  But
until that is decided, it seems better to keep it the same as initdb has
right now.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inappropriate use of NameGetDatum macro

2016-09-13 Thread Tom Lane
I wrote:
> Mark Dilger  writes:
>> there are several places in the code where variables defined as
>> (char *) or as (const char *) are passed to the NameGetDatum()
>> macro.  I believe it would be better form to use CStringGetDatum()
>> in these locations.  I am aware that these two macros are internally
>> the same.

> I'm tempted to propose that we redefine NameGetDatum as
> #define NameGetDatum(X) CStringGetDatum(NameStr(*(X)))

Pushed with that change.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autonomous transactions

2016-09-13 Thread Petr Jelinek

On 13/09/16 22:24, Robert Haas wrote:

On Wed, Sep 7, 2016 at 9:14 PM, Craig Ringer
 wrote:

That's probably going to be one of the smaller costs. Doing this with
bgworkers won't be cheap, but you need to consider the alternative.
Factoring out all transaction-specific data currently stored in or
pointed to by globals into a transaction state struct that can be
swapped out. Making PgXact and PGPROC capable of representing
multiple/suspended transactions. Lots more. Much of which would have a
performance impact on all day to day operations whether or not
autononomous xacts are actually in use.

I haven't looked into it in detail. Peter can probably explain more
and better. I'm just pointing out that I doubt there's any way to do
this without a cost somewhere, and having that cost limited to actual
uses of autonomous xacts would be nice.


I don't really believe this line of argument.  I mean, sure, it's nice
to limit the cost of features to the people who are using those
features.  Totally agreed.  But if the cost really wouldn't be that
high anyway, which I suspect is the case here, then that argument
loses its force.  And if that separation increases the performance
cost of the feature by two or three orders of magnitude in realistic
use cases, then you really have to wonder if you've picked the right
approach.

Again, I'm not saying that having ways to run commands in the
background in a worker isn't useful.  If I thought that wasn't useful,
I would not have written pg_background and developed it as far as I
did.  But I still don't think that's the right way to implement an
autonomous transaction facility.



I mostly agree. I think if this was called something like background 
transactions it might be better. It's definitely useful functionality 
but the naming is clearly contentious. It won't stop people using it for the
same use-cases as autonomous transactions though (which is fine).


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()

2016-09-13 Thread Peter Eisentraut
On 7/30/16 2:16 PM, Bruce Momjian wrote:
> The second patch does what Tom suggests above by outputting only "KB",
> and it supports "kB" for backward compatibility.  What it doesn't do is
> to allow arbitrary case, which I think would be a step backward.  The
> second patch actually does match the JEDEC standard, except for allowing
> "kB".

Btw., just to show that I'm not all crazy, the following programs also
display a small "k" for file sizes and download rates:

apt
curl
dnf
pip
yum
vagrant

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-13 Thread Petr Jelinek

On 13/09/16 02:55, Craig Ringer wrote:

On 13 September 2016 at 06:03, Petr Jelinek  wrote:


Oh sure, I don't see that as a big problem. The TupleData already contains
the type of the data it sends (to distinguish between nulls and text data), so
that's mostly about adding some different type there, and we'll also need
type info in the column part of the Relation message, but that should be easy
to fence with one if for a different protocol version.


The missing piece seems to be negotiation.

If a binary-aware client connects to a non-binary aware server, the
non-binary-aware server needs a way to say "you requested this option
I don't understand, go away" or "you asked for binary but I don't
support that".



Not sure what you mean by negotiation. Why would that be needed? You
know the server version when you connect, and once you know that, you also
know what capabilities that version of Postgres has. If you send an
unrecognized option you get a corresponding error.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheriting PostgresNode object

2016-09-13 Thread Tom Lane
Alvaro Herrera  writes:
> Yeah, as I recall the only thing the get_new_node thingy does is assign
> a nonconflicting port number to each instance, and make sure the
> instances are all torn down at END.  I don't remember now why we didn't
> just do the port check in the constructor, but we messed with that
> code a lot after the commit.  Maybe there's no good reason and we should
> change that, for convenience of inheritance.  As for the teardown, I
> remember trying to do that using DESTROY instead of an END block, but
> there was some problem I couldn't figure out (I think there was some
> ugly warning message because the data dir for the node was removed
> before the DESTROY for the object had the chance to run)... maybe you
> can figure that one out.

We changed that in 08af92190 --- changing it back would require
finding a different solution to the order-of-shutdown problem.

> Overall I think it'd be an improvement to use a regular constructor
> instead of the current arrangement.

Constructor si, destructor no.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheriting PostgresNode object

2016-09-13 Thread Alvaro Herrera
Victor Wagner wrote:
> Hi hackers!
> 
> I've encountered need to extend functionality of PostgresNode class
> from the TAP test framework. What I want can easily be done via perl
> object inheritation. 
> 
> But documentation in the PostgresNode.pm recommends to use get_new_node
> function rather than call PostgresNode constructor directly.

Yeah, as I recall the only thing the get_new_node thingy does is assign
a nonconflicting port number to each instance, and make sure the
instances are all torn down at END.  I don't remember now why we didn't
just do the port check in the constructor, but we messed with that
code a lot after the commit.  Maybe there's no good reason and we should
change that, for convenience of inheritance.  As for the teardown, I
remember trying to do that using DESTROY instead of an END block, but
there was some problem I couldn't figure out (I think there was some
ugly warning message because the data dir for the node was removed
before the DESTROY for the object had the chance to run)... maybe you
can figure that one out.

Overall I think it'd be an improvement to use a regular constructor
instead of the current arrangement.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-13 15:37:22 -0400, Tom Lane wrote:
>> (It's a 4-core CPU so I saw little point in pressing harder than
>> that.)

> I think in reality most busy machines, where performance and scalability
> matter, are overcommitted in the number of connections vs. cores.  And
> if you look at throughput graphs that makes sense; they tend to increase
> considerably after reaching #hardware-threads, even if all connections
> are full throttle busy.

At -j 10 -c 10, all else the same, I get 84928 TPS on HEAD and 90357
with the patch, so about 6% better.

>> So at this point I'm wondering why Thomas and Heikki could not measure
>> any win.  Based on my results it should be easy.  Is it possible that
>> OS X is better tuned for multi-CPU hardware than FreeBSD?

> Hah!

Well, there must be some reason why this patch improves matters on OS X
and not FreeBSD ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autonomous transactions

2016-09-13 Thread Robert Haas
On Wed, Sep 7, 2016 at 9:14 PM, Craig Ringer
 wrote:
> That's probably going to be one of the smaller costs. Doing this with
> bgworkers won't be cheap, but you need to consider the alternative.
> Factoring out all transaction-specific data currently stored in or
> pointed to by globals into a transaction state struct that can be
> swapped out. Making PgXact and PGPROC capable of representing
> multiple/suspended transactions. Lots more. Much of which would have a
> performance impact on all day to day operations whether or not
> autononomous xacts are actually in use.
>
> I haven't looked into it in detail. Peter can probably explain more
> and better. I'm just pointing out that I doubt there's any way to do
> this without a cost somewhere, and having that cost limited to actual
> uses of autonomous xacts would be nice.

I don't really believe this line of argument.  I mean, sure, it's nice
to limit the cost of features to the people who are using those
features.  Totally agreed.  But if the cost really wouldn't be that
high anyway, which I suspect is the case here, then that argument
loses its force.  And if that separation increases the performance
cost of the feature by two or three orders of magnitude in realistic
use cases, then you really have to wonder if you've picked the right
approach.

Again, I'm not saying that having ways to run commands in the
background in a worker isn't useful.  If I thought that wasn't useful,
I would not have written pg_background and developed it as far as I
did.  But I still don't think that's the right way to implement an
autonomous transaction facility.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2016-09-13 Thread Mithun Cy
On Thu, Sep 8, 2016 at 11:21 PM, Jesper Pedersen  wrote:
>
> > For the archives, this patch conflicts with the WAL patch [1].
>
> > [1] https://www.postgresql.org/message-id/CAA4eK1JS%2BSiRSQBzEFpnsSmxZKingrRH7WNyWULJeEJSj1-%3D0w%40mail.gmail.com
>

Updated the patch; it applies on top of Amit's concurrent hash index
patch [1] and Amit's WAL for hash index patch [2] together.

[1] Concurrent Hash index.

[2] Wal for hash index.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_hash_index_metapage_onAmit_05_02_with_wall.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Andres Freund
On 2016-09-13 15:37:22 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2016-09-13 14:47:08 -0400, Tom Lane wrote:
> >> Also I notice that the WaitEventSet thread started with a simple
> >> pgbench test, so I don't really buy the claim that that's not a
> >> way that will reach the problem.
> 
> > You can reach it, but not when using 1 core:one pgbench thread:one
> > client connection, there need to be more connections than that. At least
> > that was my observation on x86 / linux.
> 
> Well, that original test was 
> 
> >> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
> >> machine:
> 
> so no, not 1 client ;-)

What I meant wasn't one client, but less than one client per cpu, and
using a pgbench thread per backend. That way usually, at least on linux,
there'll be a relatively small amount of poll/epoll/whatever, because
the recvmsg()s will always have data available.


> Anyway, I decided to put my money where my mouth was and run my own
> benchmark.

Cool.


> (It's a 4-core CPU so I saw little point in pressing harder than
> that.)

I think in reality most busy machines, where performance and scalability
matter, are overcommitted in the number of connections vs. cores.  And
if you look at throughput graphs that makes sense; they tend to increase
considerably after reaching #hardware-threads, even if all connections
are full throttle busy.  It might not make sense if you just run large
analytics queries, or if you want the lowest latency possible, but in
everything else, the reality is that machines are often overcommitted
for good reason.


> So at this point I'm wondering why Thomas and Heikki could not measure
> any win.  Based on my results it should be easy.  Is it possible that
> OS X is better tuned for multi-CPU hardware than FreeBSD?

Hah!


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-13 14:47:08 -0400, Tom Lane wrote:
>> Also I notice that the WaitEventSet thread started with a simple
>> pgbench test, so I don't really buy the claim that that's not a
>> way that will reach the problem.

> You can reach it, but not when using 1 core:one pgbench thread:one
> client connection, there need to be more connections than that. At least
> that was my observation on x86 / linux.

Well, that original test was 

>> I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
>> machine:

so no, not 1 client ;-)

Anyway, I decided to put my money where my mouth was and run my own
benchmark.  On my couple-year-old Macbook Pro running OS X 10.11.6,
using a straight build of today's HEAD, asserts disabled, fsync off
but no other parameters changed, I did "pgbench -i -s 100" and then
did this a few times:
pgbench -T 60 -j 4 -c 4 -M prepared -S bench
(It's a 4-core CPU so I saw little point in pressing harder than
that.)  Median of 3 runs was 56028 TPS.  Repeating the runs with
kqueue-v5.patch applied, I got a median of 58975 TPS, or 5% better.
Run-to-run variation was only around 1% in each case.

So that's not a huge improvement, but it's clearly above the noise
floor, and this laptop is not what anyone would use for production
work eh?  Presumably you could show even better results on something
closer to server-grade hardware with more active clients.

So at this point I'm wondering why Thomas and Heikki could not measure
any win.  Based on my results it should be easy.  Is it possible that
OS X is better tuned for multi-CPU hardware than FreeBSD?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Claudio Freire
On Tue, Sep 13, 2016 at 4:06 PM, Robert Haas  wrote:
> On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire  
> wrote:
>> I've finished writing that patch, I'm in the process of testing its CPU 
>> impact.
>>
>> First test seemed to hint at a 40% increase in CPU usage, which seems
>> rather steep compared to what I expected, so I'm trying to rule out
>> some methodology error here.
>
> Hmm, wow.  That's pretty steep.  Maybe lazy_heap_reaptid() is hotter
> than I think it is, but even if it accounts for 10% of total CPU usage
> within a vacuum, which seems like an awful lot, you'd have to make it
> 4x as expensive, which also seems like an awful lot.

IIRC perf top reported a combined 45% between lazy_heap_reaptid +
vac_cmp_itemptr (after patching).

vac_cmp_itemptr was around 15% on its own

Debug build of course (I need the assertions and the debug symbols);
I'll retest with optimizations once the debug tests make sense.

>>> For example, we could keep a bitmap with one bit per K
>>> pages.  If the bit is set, there is at least 1 dead tuple on that
>>> page; if clear, there are none.  When we see an index tuple, we
>>> consult the bitmap to determine whether we need to search the TID
>>> list.  We select K to be the smallest power of 2 such that the bitmap
>>> uses less memory than some threshold, perhaps 64kB.
>>
>> I've been pondering something like that, but that's an optimization
>> that's quite orthogonal to the multiarray stuff.
>
> Sure, but if this really does increase CPU time, it'd be reasonable to
> do something to decrease it again in order to get the other benefits
> of this patch - i.e. increasing the maintenance_work_mem limit while
> reducing the chances that overallocation will cause OOM.

I was hoping it wouldn't regress performance so much. I'd rather
micro-optimize the multiarray implementation until it doesn't and then
think of orthogonal optimizations.

>>>  Assuming that
>>> updates and deletes to the table have some locality, we should be able
>>> to skip a large percentage of the TID searches with a probe into this
>>> very compact bitmap.
>>
>> I don't think you can assume locality
>
> Really?  If you have a 1TB table, how many 2MB ranges of that table do
> you think will contain dead tuples for a typical vacuum?  I think most
> tables of that size are going to be mostly static, and the all-visible
> and all-frozen bits are going to be mostly set.  You *could* have
> something like a pgbench-type workload that does scattered updates
> across the entire table, but that's going to perform pretty poorly
> because you'll constantly be updating blocks that have to be pulled in
> from disk.

I have a few dozen of those in my biggest database. They do updates
and deletes all over the place and, even when there are few of them,
they're scattered almost uniformly.

Thing is, I think we really need to not worsen that case, which seems
rather common (almost any OLTP with a big enough user base, or a K-V
type of table, or TOAST tables).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Robert Haas
On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire  wrote:
> I've finished writing that patch, I'm in the process of testing its CPU 
> impact.
>
> First test seemed to hint at a 40% increase in CPU usage, which seems
> rather steep compared to what I expected, so I'm trying to rule out
> some methodology error here.

Hmm, wow.  That's pretty steep.  Maybe lazy_heap_reaptid() is hotter
than I think it is, but even if it accounts for 10% of total CPU usage
within a vacuum, which seems like an awful lot, you'd have to make it
4x as expensive, which also seems like an awful lot.

>> It would require
>> replacing the call to bsearch() in lazy_heap_reaptid() with an
>> open-coded implementation of bsearch, or with one bsearch to find the
>> chunk and another to find the TID within the chunk, but that shouldn't
>> be very expensive.
>
> I did a linear search to find the chunk, with exponentially growing
> chunks, and then a bsearch to find the item inside the chunk.
>
> With the typical number of segments and given the 12GB limit, the
> segment array size is well within the range that favors linear search.

Ah, OK.

>> For example, we could keep a bitmap with one bit per K
>> pages.  If the bit is set, there is at least 1 dead tuple on that
>> page; if clear, there are none.  When we see an index tuple, we
>> consult the bitmap to determine whether we need to search the TID
>> list.  We select K to be the smallest power of 2 such that the bitmap
>> uses less memory than some threshold, perhaps 64kB.
>
> I've been pondering something like that, but that's an optimization
> that's quite orthogonal to the multiarray stuff.

Sure, but if this really does increase CPU time, it'd be reasonable to
do something to decrease it again in order to get the other benefits
of this patch - i.e. increasing the maintenance_work_mem limit while
reducing the chances that overallocation will cause OOM.

>>  Assuming that
>> updates and deletes to the table have some locality, we should be able
>> to skip a large percentage of the TID searches with a probe into this
>> very compact bitmap.
>
> I don't think you can assume locality

Really?  If you have a 1TB table, how many 2MB ranges of that table do
you think will contain dead tuples for a typical vacuum?  I think most
tables of that size are going to be mostly static, and the all-visible
and all-frozen bits are going to be mostly set.  You *could* have
something like a pgbench-type workload that does scattered updates
across the entire table, but that's going to perform pretty poorly
because you'll constantly be updating blocks that have to be pulled in
from disk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-09-13 Thread Robert Haas
On Fri, Sep 9, 2016 at 4:49 AM, Victor Wagner  wrote:
> Random permutation is much more computationally complex than random
> picking.

It really isn't.  The pseudocode given at
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle is all of 4
lines long, and one of those lines is a comment.  The C implementation
is longer, but not by much.

Mind you, I haven't read the patch, so I don't know whether using
something like this would make the code simpler or more complicated.
However, if the two modes of operation are (1) try all hosts in random
order and (2) try all hosts in the listed order, it's worth noting
that the code for the second thing can be used to implement the first
thing just by shuffling the list before you begin.
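
For illustration, a C version of that shuffle over an array of host
entries might look roughly like this (the char ** array, the function
name, and the use of random() are assumptions for the sketch, not code
from the patch):

    /* requires <stdlib.h> for random() */
    /* Fisher-Yates: shuffle hosts[0..nhosts-1] in place */
    static void
    shuffle_hosts(char **hosts, int nhosts)
    {
        int     i;

        for (i = nhosts - 1; i > 0; i--)
        {
            /* pick j uniformly from 0..i (ignoring modulo bias here) */
            int     j = random() % (i + 1);
            char   *tmp = hosts[i];

            hosts[i] = hosts[j];
            hosts[j] = tmp;
        }
    }

After shuffling once up front, the existing "try hosts in listed order"
code path could be reused unchanged for the random-order mode.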

> So, using random permutation instead of random pick wouldn't help.
> And keeping the number of elements in the list somewhere wouldn't help
> either unless we change the linked list to a completely different data
> structure which allows random access.

Is there a good reason not to use an array rather than a linked list?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Claudio Freire
On Tue, Sep 13, 2016 at 3:51 PM, Robert Haas  wrote:
> On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada  wrote:
>> Attached PoC patch changes the representation of dead tuple locations
>> to the hashmap having tuple bitmap.
>> The one hashmap entry consists of the block number and the TID bitmap
>> of corresponding block, and the block number is the hash key of
>> hashmap.
>> Current implementation of this patch is not smart yet because each
>> hashmap entry allocates the tuple bitmap with fixed
>> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
>> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
>> In the case where one block can store only several tens of tuples,
>> most bits would be wasted.
>>
>> After improved this patch as you suggested, I will measure performance 
>> benefit.
>
> We also need to consider the amount of memory gets used.  What I
> proposed - replacing the array with an array of arrays - would not
> increase memory utilization significantly.  I don't think it would
> have much impact on CPU utilization either.

I've finished writing that patch, I'm in the process of testing its CPU impact.

First test seemed to hint at a 40% increase in CPU usage, which seems
rather steep compared to what I expected, so I'm trying to rule out
some methodology error here.

> It would require
> replacing the call to bsearch() in lazy_heap_reaptid() with an
> open-coded implementation of bsearch, or with one bsearch to find the
> chunk and another to find the TID within the chunk, but that shouldn't
> be very expensive.

I did a linear search to find the chunk, with exponentially growing
chunks, and then a bsearch to find the item inside the chunk.

With the typical number of segments and given the 12GB limit, the
segment array size is well within the range that favors linear search.
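
As a rough sketch of the kind of lookup described (the struct and
function names here are invented for illustration, not taken from the
actual patch; imagine it living in vacuumlazy.c next to
vac_cmp_itemptr):

    typedef struct DeadTuplesSegment
    {
        ItemPointerData *itemptrs;      /* sorted dead-tuple TIDs */
        int              num_itemptrs;
    } DeadTuplesSegment;

    static bool
    multiarray_lookup(DeadTuplesSegment *segs, int nsegs, ItemPointer tid)
    {
        int     i;

        /* linear scan over the few, exponentially growing segments */
        for (i = 0; i < nsegs; i++)
        {
            DeadTuplesSegment *seg = &segs[i];

            if (seg->num_itemptrs == 0)
                continue;
            /* the first segment whose last TID is >= tid can contain it */
            if (ItemPointerCompare(tid,
                        &seg->itemptrs[seg->num_itemptrs - 1]) <= 0)
                return bsearch(tid, seg->itemptrs, seg->num_itemptrs,
                               sizeof(ItemPointerData),
                               vac_cmp_itemptr) != NULL;
        }
        return false;
    }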

> For example, we could keep a bitmap with one bit per K
> pages.  If the bit is set, there is at least 1 dead tuple on that
> page; if clear, there are none.  When we see an index tuple, we
> consult the bitmap to determine whether we need to search the TID
> list.  We select K to be the smallest power of 2 such that the bitmap
> uses less memory than some threshold, perhaps 64kB.

I've been pondering something like that, but that's an optimization
that's quite orthogonal to the multiarray stuff.

>  Assuming that
> updates and deletes to the table have some locality, we should be able
> to skip a large percentage of the TID searches with a probe into this
> very compact bitmap.

I don't think you can assume locality


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-13 Thread Jesper Pedersen

On 09/13/2016 07:26 AM, Amit Kapila wrote:

Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].



I have been testing patch in various scenarios, and it has a positive 
performance impact in some cases.


This is especially seen in cases where the values of the indexed column 
are unique - SELECTs can see a 40-60% benefit over a similar query using 
b-tree. UPDATE also sees an improvement.


In cases where the indexed column value isn't unique, it takes a long 
time to build the index due to the overflow page creation.


Also in cases where the index column is updated with a high number of 
clients, ala


-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;

-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;

w/ 100 clients - it is much slower than the b-tree counterpart (2921 tps
for hash vs. 10062 tps for b-tree).


Jeff mentioned upthread the idea of moving the lock to a bucket meta 
page instead of having it on the main meta page. Likely a question for 
the assigned committer.


Thanks for working on this !

Best regards,
 Jesper




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Peter Geoghegan
On Tue, Sep 13, 2016 at 11:51 AM, Robert Haas  wrote:
> I think it's probably wrong to worry that an array-of-arrays is going
> to be meaningfully slower than a single array here.  It's basically
> costing you some small number of additional memory references per
> tuple, which I suspect isn't all that relevant for a bulk operation
> that does I/O, writes WAL, locks buffers, etc.

This analysis makes perfect sense to me.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-13 12:43:36 -0400, Tom Lane wrote:
>> Also, if it's only a win on machines with dozens of CPUs, how many
>> people are running *BSD on that kind of iron?  I think Linux is by
>> far the dominant kernel for such hardware.  For sure Apple isn't
>> selling any machines like that.

> I'm not sure you need quite that big a machine, if you test a workload
> that currently reaches the poll().

Well, Thomas stated in
https://www.postgresql.org/message-id/CAEepm%3D1CwuAq35FtVBTZO-mnGFH1xEFtDpKQOf_b6WoEmdZZHA%40mail.gmail.com
that he hadn't been able to measure any performance difference, and
I assume he was trying test cases from the WaitEventSet thread.

Also I notice that the WaitEventSet thread started with a simple
pgbench test, so I don't really buy the claim that that's not a
way that will reach the problem.

I'd be happy to see this go in if it can be shown to provide a measurable
performance improvement, but so far we have only guesses that someday
it *might* make a difference.  That's not good enough to add to our
maintenance burden IMO.

Anyway, the patch is in the archives now, so it won't be hard to resurrect
if the situation changes.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Andres Freund
On 2016-09-13 14:47:08 -0400, Tom Lane wrote:
> Also I notice that the WaitEventSet thread started with a simple
> pgbench test, so I don't really buy the claim that that's not a
> way that will reach the problem.

You can reach it, but not when using 1 core:one pgbench thread:one
client connection, there need to be more connections than that. At least
that was my observation on x86 / linux.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Robert Haas
On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada  wrote:
> Attached PoC patch changes the representation of dead tuple locations
> to the hashmap having tuple bitmap.
> The one hashmap entry consists of the block number and the TID bitmap
> of corresponding block, and the block number is the hash key of
> hashmap.
> Current implementation of this patch is not smart yet because each
> hashmap entry allocates the tuple bitmap with fixed
> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
> In the case where one block can store only several tens of tuples,
> most bits would be wasted.
>
> After improved this patch as you suggested, I will measure performance 
> benefit.

We also need to consider the amount of memory gets used.  What I
proposed - replacing the array with an array of arrays - would not
increase memory utilization significantly.  I don't think it would
have much impact on CPU utilization either.  It would require
replacing the call to bsearch() in lazy_heap_reaptid() with an
open-coded implementation of bsearch, or with one bsearch to find the
chunk and another to find the TID within the chunk, but that shouldn't
be very expensive.  For one thing, if the array chunks are around the
size I proposed (64MB), you've got more than ten million tuples per
chunk, so you can't have very many chunks unless your table is both
really large and possessed of quite a bit of dead stuff.

Now, if I'm reading it correctly, this patch allocates a 132-byte
structure for every page with at least one dead tuple.  In the worst
case where there's just one dead tuple per page, that's a 20x
regression in memory usage.  Actually, it's more like 40x, because
AllocSetAlloc rounds small allocation sizes up to the next-higher
power of two, which really stings for a 132-byte allocation, and then
adds a 16-byte header to each chunk.  But even 20x is clearly not
good.  There are going to be lots of real-world cases where this uses
way more memory to track the same number of dead tuples, and I'm
guessing that benchmarking is going to reveal that it's not faster,
either.
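
Spelling that arithmetic out with the numbers above (back-of-the-envelope
only, assuming a 64-bit build):

    plain array:    one dead tuple costs sizeof(ItemPointerData) = 6 bytes
    hashmap entry:  132 bytes requested
                    -> AllocSetAlloc rounds that up to 256 bytes
                    -> plus the 16-byte chunk header = 272 bytes
    worst case (one dead tuple per page): 132 / 6 is roughly 22x,
    and 272 / 6 is roughly 45x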

I think it's probably wrong to worry that an array-of-arrays is going
to be meaningfully slower than a single array here.  It's basically
costing you some small number of additional memory references per
tuple, which I suspect isn't all that relevant for a bulk operation
that does I/O, writes WAL, locks buffers, etc.  But if it is relevant,
then I think there are other ways to buy that performance back which
are likely to be more memory efficient than converting this to use a
hash table.  For example, we could keep a bitmap with one bit per K
pages.  If the bit is set, there is at least 1 dead tuple on that
page; if clear, there are none.  When we see an index tuple, we
consult the bitmap to determine whether we need to search the TID
list.  We select K to be the smallest power of 2 such that the bitmap
uses less memory than some threshold, perhaps 64kB.  Assuming that
updates and deletes to the table have some locality, we should be able
to skip a large percentage of the TID searches with a probe into this
very compact bitmap.  Note that we can set K = 1 for tables up to 4GB
in size, and even a 1TB table only needs K = 256.  Odds are very good
that a 1TB table being vacuumed has many 256-page ranges containing no
dead tuples at all ... and if this proves to be false and the dead
tuples are scattered uniformly throughout the table, then you should
probably be more worried about the fact that you're dumping a bare
minimum of 4GB of random I/O on your hapless disk controller than
about how efficient the TID search is.
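
A minimal sketch of such a probe (the names and bitmap layout are
invented for illustration only, with K = 2^k_shift):

    /* a set bit means "this K-page range may contain dead tuples,
     * so the TID list must be searched" */
    static inline bool
    range_may_have_dead_tuples(const uint8 *bitmap,
                               BlockNumber blkno, int k_shift)
    {
        BlockNumber chunk = blkno >> k_shift;

        return (bitmap[chunk >> 3] & (1 << (chunk & 7))) != 0;
    }

If the bit is clear, the index tuple's heap TID can be treated as live
without consulting the (much larger) dead-TID list at all.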

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-13 Thread Jesper Pedersen

On 09/13/2016 07:41 AM, Amit Kapila wrote:

README:
+in_complete split flag.  The reader algorithm works correctly, as it will
scan

What flag ?



The incomplete-split flag indicates that the split still has to be finished
for that particular bucket.  The values of these flags are
LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and old
bucket respectively.  They are set in hasho_flag in the special area of the
page.  I have slightly expanded the definition in the README, but I'm not
sure it is a good idea to mention flags defined in hash.h.  Let me know if
it is still unclear or you want something additional added to the README.



I think it is better now.


hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+   if (BufferIsValid(bukcetbuf));

->

+   if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.



Fixed all the above cosmetic issues.



I meant there is an extra ';' on the "if" statements:

+   if (BufferIsValid(bukcetbuf)); <--

in hash_xlog_move_page_contents, hash_xlog_squeeze_page and 
hash_xlog_delete.


Best regards,
 Jesper





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Andres Freund
On 2016-09-13 12:43:36 -0400, Tom Lane wrote:
> > I think it's not necessarily about the current system, but more about
> > future uses of the WaitEventSet stuff. Some of that is going to use a
> > lot more sockets. E.g. doing a parallel append over FDWs.

(note that I'm talking about network sockets not cpu sockets here)


> All fine, but the burden of proof has to be on the patch to show that
> it does something significant.  We don't want to be carrying around
> platform-specific code, which necessarily has higher maintenance cost
> than other code, without a darn good reason.

No argument there.


> Also, if it's only a win on machines with dozens of CPUs, how many
> people are running *BSD on that kind of iron?  I think Linux is by
> far the dominant kernel for such hardware.  For sure Apple isn't
> selling any machines like that.

I'm not sure you need quite that big a machine, if you test a workload
that currently reaches the poll().

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Inheriting PostgresNode object

2016-09-13 Thread Victor Wagner
Hi hackers!

I've encountered the need to extend the functionality of the PostgresNode
class from the TAP test framework. What I want can easily be done via
Perl object inheritance.

But the documentation in PostgresNode.pm recommends using the get_new_node
function rather than calling the PostgresNode constructor directly.

I see the following ways to solve this problem:

1. Ignore this recommendation and write a new class which calls
SUPER::new() from its constructor, and call that constructor directly

2. Get the PostgresNode object from the get_new_node function and re-bless
it into the new class

3. Use delegation instead of inheritance. This would require some black
magic with the AutoLoader module to avoid writing a wrapper for each
PostgresNode method.

Which approach would people recommend to take?

-- 
   Victor Wagner 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-09-13 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-13 12:07:35 -0400, Tom Lane wrote:
>> /*
>> - * Don't pull up a subquery that has any set-returning functions in its
>> - * targetlist.  Otherwise we might well wind up inserting set-returning
>> - * functions into places where they mustn't go, such as quals of higher
>> - * queries.  This also ensures deletion of an empty jointree is valid.
>> - */
>> -if (expression_returns_set((Node *) subquery->targetList))
>> -return false;

> I don't quite understand parts of the comment you removed here. What
> does "This also ensures deletion of an empty jointree is valid." mean?

TBH, I don't remember what that was about anymore.  Whatever it was might
not apply now, anyway.  If there was something to it, maybe we'll
rediscover it while we're fooling with tSRFs, and then we can insert a
less cryptic comment.

> Looks good, except that you didn't adopt the hunk adjusting
> src/backend/executor/README, which still seems to read:

Ah, I missed that there was anything to change docs-wise.  Will fix.

Thanks for looking it over!

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-13 Thread Andrew Borodin
Hi hackers!

Here is gist cube penaly patch v5.
Changes:
1. union version of pack_float() is used, without warings and with
minimum asm commands emitted
2. test for __STDC_IEC_559__ is added, this is a standard way of C99
to determine float compliance with IEEE754. ANSI-only compiler+libs,
along with non-IEEE-float-systems, will fallback to pre-patch
behavior.

Best regards, Andrey Borodin.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..c86418d 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -91,14 +91,18 @@ PG_FUNCTION_INFO_V1(cube_enlarge);
 /*
 ** For internal use only
 */
-int32  cube_cmp_v0(NDBOX *a, NDBOX *b);
-bool   cube_contains_v0(NDBOX *a, NDBOX *b);
-bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
-NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
-void   rt_cube_size(NDBOX *a, double *sz);
-NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
-bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
-bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static int32   cube_cmp_v0(NDBOX *a, NDBOX *b);
+static boolcube_contains_v0(NDBOX *a, NDBOX *b);
+static boolcube_overlap_v0(NDBOX *a, NDBOX *b);
+static NDBOX  *cube_union_v0(NDBOX *a, NDBOX *b);
+#ifdef __STDC_IEC_559__
+static float   pack_float(float value, int realm);
+#endif
+static voidrt_cube_size(NDBOX *a, double *sz);
+static voidrt_cube_edge(NDBOX *a, double *sz);
+static NDBOX  *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
+static boolg_cube_leaf_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static boolg_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
 
 /*
 ** Auxiliary funxtions
@@ -419,7 +423,36 @@ g_cube_decompress(PG_FUNCTION_ARGS)
}
PG_RETURN_POINTER(entry);
 }
+/*
+ * Function to pack floats of different realms
+ * This function serves to pack bit flags inside float type
+ * Resulted value represent can be from four different "realms"
+ * Every value from realm 3 is greater than any value from realms 2,1 and 0.
+ * Every value from realm 2 is less than every value from realm 3 and greater
+ * than any value from realm 1 and 0, and so on. Values from the same realm
+ * loose two bits of precision. This technique is possible due to floating
+ * point numbers specification according to IEEE 754: exponent bits are highest
+ * (excluding sign bits, but here penalty is always positive). If float a is
+ * greater than float b, integer A with same bit representation as a is greater
+ * than integer B with same bits as b.
+ */
+#ifdef __STDC_IEC_559__
+static float
+pack_float(const float value, const int realm)
+{
+  union {
+float f;
+struct { unsigned value:31, sign:1; } vbits;
+struct { unsigned value:29, realm:2, sign:1; } rbits;
+  } a;
 
+  a.f = value;
+  a.rbits.value = a.vbits.value >> 2;
+  a.rbits.realm = realm;
+
+  return a.f;
+}
+#endif
 
 /*
 ** The GiST Penalty method for boxes
@@ -428,6 +461,11 @@ g_cube_decompress(PG_FUNCTION_ARGS)
 Datum
 g_cube_penalty(PG_FUNCTION_ARGS)
 {
+   /* REALM 0: No extension is required, volume is zero, return edge   
*/
+   /* REALM 1: No extension is required, return nonzero volume 
*/
+   /* REALM 2: Volume extension is zero, return nonzero edge extension 
*/
+   /* REALM 3: Volume extension is nonzero, return it  
*/
+
GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
float  *result = (float *) PG_GETARG_POINTER(2);
@@ -441,9 +479,36 @@ g_cube_penalty(PG_FUNCTION_ARGS)
rt_cube_size(DatumGetNDBOX(origentry->key), );
*result = (float) (tmp1 - tmp2);
 
-   /*
-* fprintf(stderr, "penalty\n"); fprintf(stderr, "\t%g\n", *result);
-*/
+#ifdef __STDC_IEC_559__
+   /* Realm tricks are in use only in case of IEEE754 support(IEC 60559)*/
+   if( *result == 0 )
+   {
+   double tmp3 = tmp1; /* remember entry volume */
+   rt_cube_edge(ud, );
+   rt_cube_edge(DatumGetNDBOX(origentry->key), );
+   *result = (float) (tmp1 - tmp2);
+   if( *result == 0 )
+   {
+   if( tmp3 != 0 )
+   {
+   *result = pack_float(tmp3, 1); /* REALM 1 */
+   }
+   else
+   {
+   *result = pack_float(tmp1, 0); /* REALM 0 */
+   }
+   }
+   else
+   {
+   *result = pack_float(*result, 2); /* REALM 2 */
+   }
+   }
+   else
+   {
+   

Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-09-13 Thread Andres Freund
On 2016-09-13 12:07:35 -0400, Tom Lane wrote:
> diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
> index e28a8dc..74e4245 100644
> --- a/src/backend/optimizer/plan/analyzejoins.c
> +++ b/src/backend/optimizer/plan/analyzejoins.c
> @@ -650,6 +650,11 @@ rel_is_distinct_for(PlannerInfo *root, R
>  bool
>  query_supports_distinctness(Query *query)
>  {
> + /* we don't cope with SRFs, see comment below */
> + if (query->hasTargetSRFs)
> + return false;
> +
> + /* check for features we can prove distinctness with */
>   if (query->distinctClause != NIL ||
>   query->groupClause != NIL ||
>   query->groupingSets != NIL ||
> @@ -695,7 +700,7 @@ query_is_distinct_for(Query *query, List
>* specified columns, since those must be evaluated before 
> de-duplication;
>* but it doesn't presently seem worth the complication to check that.)
>*/
> - if (expression_returns_set((Node *) query->targetList))
> + if (query->hasTargetSRFs)
>   return false;

Maybe make this hasTargetSRFs && expression_returns_set()? The SRF could
have been optimized away. (Oh, I see you recheck below. Forget that then).

> @@ -1419,8 +1419,8 @@ is_simple_subquery(Query *subquery, Rang
>   return false;
>  
>   /*
> -  * Can't pull up a subquery involving grouping, aggregation, sorting,
> -  * limiting, or WITH.  (XXX WITH could possibly be allowed later)
> +  * Can't pull up a subquery involving grouping, aggregation, SRFs,
> +  * sorting, limiting, or WITH.  (XXX WITH could possibly be allowed 
> later)
>*
>* We also don't pull up a subquery that has explicit FOR UPDATE/SHARE
>* clauses, because pullup would cause the locking to occur semantically
> @@ -1430,6 +1430,7 @@ is_simple_subquery(Query *subquery, Rang
>*/
>   if (subquery->hasAggs ||
>   subquery->hasWindowFuncs ||
> + subquery->hasTargetSRFs ||
>   subquery->groupClause ||
>   subquery->groupingSets ||
>   subquery->havingQual ||
> @@ -1543,15 +1544,6 @@ is_simple_subquery(Query *subquery, Rang
>   }
>  
>   /*
> -  * Don't pull up a subquery that has any set-returning functions in its
> -  * targetlist.  Otherwise we might well wind up inserting set-returning
> -  * functions into places where they mustn't go, such as quals of higher
> -  * queries.  This also ensures deletion of an empty jointree is valid.
> -  */
> - if (expression_returns_set((Node *) subquery->targetList))
> - return false;

I don't quite understand parts of the comment you removed here. What
does "This also ensures deletion of an empty jointree is valid." mean?


Looks good, except that you didn't adopt the hunk adjusting
src/backend/executor/README, which still seems to read:
> We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
> so as to ensure that at most one tuple can be returned for any particular
> set of scan tuples.  Otherwise we'd get duplicates due to the original
> query returning the same set of scan tuples multiple times.  (Note: there
> is no explicit prohibition on SRFs in UPDATE, but the net effect will be
> that only the first result row of an SRF counts, because all subsequent
> rows will result in attempts to re-update an already updated target row.
> This is historical behavior and seems not worth changing.)

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-09-13 Thread Tom Lane
Andres Freund  writes:
> On September 13, 2016 9:07:35 AM PDT, Tom Lane  wrote:
>> I'd like to go ahead and push this, since it's a necessary prerequisite
>> for either approach we might adopt for the rest of the patch series,
>> and the improved error reporting and avoidance of expensive
>> expression_returns_set searches make it a win IMO even if we were not
>> planning to do anything more with SRFs.

> Can't look at the code just now, on my way to the airport for pgopen, but
> the idea sounds good to me.

OK, I went ahead and pushed it.  We can tweak later if needed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Tom Lane
Robert Haas  writes:
> I agree that we should have some way for foreign data wrappers and
> custom scans and perhaps also other executor nodes to find out whether
> there's a known limit to the number of tuples that they might need to
> produce, but I wonder if we should be doing something more general
> than this.

I had the same feeling that this could stand to be considered more
generally, but had not had time to consider it in detail.  The bound
passdown from Limit to Sort was never anything but a quick-and-dirty
private hack, and I'm not very comfortable with just exposing it to the
whole world without reconsidering the design.

> On a more practical level, I notice that you haven't adapted
> postgres_fdw or file_fdw to benefit from this new callback.  It seems
> like postgres_fdw could benefit, because it could fetch only the
> required number of tuples if that happens to be a smaller number than
> the configured fetch_size.

I think we should insist on that, either in the base patch or a followup
patch, because you never know if a hook is actually convenient to use
until you try.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DISABLE_PAGE_SKIPPING option for vacuumdb

2016-09-13 Thread Robert Haas
On Thu, Sep 8, 2016 at 1:32 PM, Masahiko Sawada  wrote:
> Attached patch adds a --disable-page-skipping option to the vacuumdb command for 9.6.
> If by any chance the freeze map causes a serious data corruption bug,
> then the user will want to run pg_check_visible() and VACUUM with the
> DISABLE_PAGE_SKIPPING option on the multiple tables having the problem.
> In this case, I think that this option for vacuumdb would be convenient.
>
> Please give me feedback.

I think this isn't really necessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Push down more full joins in postgres_fdw

2016-09-13 Thread Robert Haas
On Tue, Sep 6, 2016 at 9:07 AM, Ashutosh Bapat
 wrote:
> That's not true with the alias information. As long as we detect which
> relations need subqueries, their RTIs are enough to create unique aliases
> e.g. if a relation involves RTIs 1, 2, 3 corresponding subquery can have
> alias r123 without conflicting with any other alias.

What if RTI 123 is also used in the query?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Robert Haas
On Tue, Sep 13, 2016 at 3:48 AM, Kouhei Kaigai  wrote:
> Sorry for my late.
>
> The attached patch fixed the wording problems on SGML part.

I agree that we should have some way for foreign data wrappers and
custom scans and perhaps also other executor nodes to find out whether
there's a known limit to the number of tuples that they might need to
produce, but I wonder if we should be doing something more general
than this.  For example, suppose we add a new PlanState member "long
numTuples" where 0 means that the number of tuples that will be needed
is unknown (so that most node types need not initialize it), a
positive value is an upper bound on the number of tuples that will be
fetched, and -1 means that it is known for certain that we will need
all of the tuples.  This might be relevant to the executor batching
stuff that Andres has been working on, because you could for example
set ps->numTuples == 1 on the inner side of a semi-join, warning the
executor node that it shouldn't bother trying to batch anything.
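
As a purely hypothetical sketch of that idea (the numTuples field does
not exist today, and the helper and the sjstate variable below are
invented for illustration):

    /* 0 = unknown, >0 = upper bound on tuples needed, -1 = need them all */
    static void
    pass_down_tuple_bound(PlanState *ps, long bound)
    {
        ps->numTuples = bound;      /* hypothetical field, per the idea above */
    }

    /* ... e.g. while initializing a semi-join: warn the inner side that
     * at most one tuple will be pulled per outer tuple ... */
    pass_down_tuple_bound(innerPlanState(sjstate), 1);

A Limit node would similarly set a positive bound on its child, and an
FDW scan could look at the field to cap its fetch size.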

On a more practical level, I notice that you haven't adapted
postgres_fdw or file_fdw to benefit from this new callback.  It seems
like postgres_fdw could benefit, because it could fetch only the
required number of tuples if that happens to be a smaller number than
the configured fetch_size.

Andres, anyone, thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-13 16:08:39 +0300, Heikki Linnakangas wrote:
>> So, if I've understood correctly, the purpose of this patch is to improve
>> performance on a multi-CPU system, which has the kqueue() function. Most
>> notably, FreeBSD?

> I think it's not necessarily about the current system, but more about
> future uses of the WaitEventSet stuff. Some of that is going to use a
> lot more sockets. E.g. doing a parallel append over FDWs.

All fine, but the burden of proof has to be on the patch to show that
it does something significant.  We don't want to be carrying around
platform-specific code, which necessarily has higher maintenance cost
than other code, without a darn good reason.

Also, if it's only a win on machines with dozens of CPUs, how many
people are running *BSD on that kind of iron?  I think Linux is by
far the dominant kernel for such hardware.  For sure Apple isn't
selling any machines like that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-13 Thread Jeff Janes
On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila  wrote:

> On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes  wrote:
> > On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila 
> wrote:
> >>
> >>
> >> I have fixed all other issues you have raised.  Updated patch is
> >> attached with this mail.
> >
> >
> > I am finding the comments (particularly README) quite hard to follow.
> There
> > are many references to an "overflow bucket", or similar phrases.  I think
> > these should be "overflow pages".  A bucket is a conceptual thing
> consisting
> > of a primary page for that bucket and zero or more overflow pages for the
> > same bucket.  There are no overflow buckets, unless you are referring to
> the
> > new bucket to which things are being moved.
> >
>
> Hmm.  I think page or block is a concept of database systems and
> buckets are a general concept used in hashing technology.  I think the
> difference is that there are primary buckets and overflow buckets. I
> have checked how they are referred to in one of the wiki pages [1];
> search for overflow on that wiki page.


That page seems to use "slot" to refer to the primary bucket/page and all
the overflow buckets/pages which cover the same post-masked values.  I
don't think that would be an improvement for us, because "slot" is already
pretty well-used for other things.   Their use of "bucket" does seem to be
mostly the same as "page" (or maybe "buffer" or "block"?) but I don't think
we gain anything from creating yet another synonym for page/buffer/block.
I think the easiest thing would be to keep using the meanings which the
existing committed code uses, so that we at least have internal consistency.



> Now, I think we shouldn't be
> inconsistent in using them. I will change them to be consistent if I find
> any inconsistency, based on what you or other people think is the
> better way to refer to the overflow space.
>

I think just "overflow page" or "buffer containing the overflow page".


Here are some more notes I've taken, mostly about the README and comments.

It took me a while to understand that once a tuple is marked as moved by
split, it stays that way forever.  It doesn't mean "recently moved by
split", but "ever moved by split".  Which works, but is rather subtle.
Perhaps this deserves a parenthetical comment in the README the first time
the flag is mentioned.



#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */

This is no longer true, maybe:
/* bit 0x2000 is reserved for index-AM specific usage */



   Note that this is designed to allow concurrent splits and scans.  If a
   split occurs, tuples relocated into the new bucket will be visited twice
   by the scan, but that does no harm.  As we are releasing the locks during
   scan of a bucket, it will allow concurrent scan to start on a bucket and
   ensures that scan will always be behind cleanup.

Above, the abrupt transition from splits (first sentence) to cleanup is
confusing.  If the cleanup referred to is vacuuming, it should be a new
paragraph or at least have a transition sentence.  Or is it referring to
clean-up locks used for control purposes, rather than for actual vacuum
clean-up?  I think it is the first one, the vacuum.  (I find the committed
version of this comment confusing as well--how in the committed code would
a tuple be visited twice, and why does that not do harm in the committed
coding? So maybe the issue here is me, not the comment.)


===

+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
+that are moved due to split.  The need for cleanup lock to remove dead tuples
+is to ensure that scans' returns correct results.  Scan that returns multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.

Perhaps it would be better to teach scans to restart anywhere on the page,
than to force more cleanup locks to be taken?

===
This comment no longer seems accurate (as long as it is just an ERROR and
not a PANIC):

 * XXX we have a problem here if we fail to get space for a
 * new overflow page: we'll error out leaving the bucket split
 * only partially complete, meaning the index is corrupt,
 * since searches may fail to find entries they should find.

The split will still be marked as being in progress, so any scanner will
have to scan the old page and see the tuple there.


in _hash_splitbucket comments, this needs updating:

 * The caller must hold exclusive locks on both buckets to ensure that
 * no one else is trying to access them (see README).

The true prereq here is a buffer clean up lock (pin plus exclusive buffer
content lock), correct?

And then:

 * Split needs to hold pin on primary bucket pages of both old and new
 * buckets till end of operation.

'retain' is probably better than 'hold', to emphasize 

Re: [HACKERS] cost_sort() may need to be updated

2016-09-13 Thread Peter Geoghegan
On Tue, Sep 13, 2016 at 5:59 AM, Robert Haas  wrote:
> I think that the only real way to judge whether cost_sort() is any
> good is to see whether it causes the wrong plan to be chosen in some
> cases.  For example, it could cause merge joins to be picked too
> frequently or too infrequently, or it could cause a wrong decision
> between hash and group aggregates.  Now, there is bound to be an
> argument about whether any test cases that we might pick are
> representative and the results will further differ between hardware
> platforms, but I think changing the formula based on an abstract idea
> about what's happening is probably not a good idea.

I don't disagree with any of this. And, I don't pretend to have the
best understanding of how real world observations on the quality of
plans can make its way back into the query plan, and be integrated,
without regressing too many other things.

The basis of my complaint is that cost_sort() typically costs external
sorts at twice or more the cost of internal sorts, and that does not
seem at all consistent with my *recent* observations, without
even bringing caching effects into it. That seems safe enough for
someone that is relatively unfamiliar with the optimizer to have as a
general concern, because external sorts and internal sorts really
aren't so different these days.

> Because it's necessary to set work_mem so conservatively to avoid
> running out of memory, most supposedly-external sorts aren't really
> doing any I/O at all, and therefore arguing about whether we should be
> charging seq_page_cost or random_page_cost for I/O is kinda missing
> the point.

I don't take issue with how internal sorts are costed at all. I think
that the factors that direct the tuning of work_mem might prominently
include the anti-scaling effect of replacement selection. To the DBA,
that can just look like simple memory pressure, perhaps. Time will
tell how true this is.

> Those numbers may just be reflecting other work that the
> sort is doing that isn't being properly accounted for elsewhere, or
> it's possible that the current costing of sorts is fairly far away
> from reality.  How would we know?  The only thing that makes me think
> they probably aren't *too* bad is that in a somewhat-systematic study
> of query performance problems
> (https://sites.google.com/site/robertmhaas/query-performance) done
> several years ago I found no evidence of a problem with sort costing,
> and my experience outside of that study bears that out.  But that's
> pretty back of the envelope.

It's also out of date. As I said, we don't have any real field
experience with the 9.6 external sort changes yet, but we'll have some
soon enough.

I'm happy to leave it at that, for now -- maybe someone will recall
this thread at a later date, when there is a complaint about a
suboptimal query plan from a user.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-09-13 Thread Andres Freund


On September 13, 2016 9:07:35 AM PDT, Tom Lane  wrote:
>Andres Freund  writes:
>> Attached is a significantly updated patch series (see the mail one up
>> for details about what this is, I don't want to quote it in its
>> entirety).
>
>I've reviewed the portions of 0005 that have to do with making the
>parser
>mark queries with hasTargetSRF.  The code as you had it was wrong
>because
>it would set the flag as a consequence of SRFs in function RTEs, which
>we don't want.

I'd taken it more as indicating the possibility that there's an SRF rather
than a guarantee so far. There might be cases where the planner removes the
SRF during folding or such.  Makes sense to make it more accurate.


>  It seemed to me that the best way to fix that was to
>rely
>on the parser's p_expr_kind mechanism to tell which part of the query
>we're in, whereupon we might as well make the parser act more like it
>does
>for aggregates and window functions and give a suitable error at parse
>time for misplaced SRFs.

That's a nice improvement. The execution time errors are ugly.

>I also renamed the flag to hasTargetSRFs, which is more parallel to
>hasAggs and hasWindowFuncs, and made some effort to use it in place
>of expression_returns_set() searches.
>
>I'd like to go ahead and push this, since it's a necessary prerequisite
>for either approach we might adopt for the rest of the patch series,
>and the improved error reporting and avoidance of expensive
>expression_returns_set searches make it a win IMO even if we were not
>planning to do anything more with SRFs.

Can't look at the code just now, on my way to the airport for pgopen, but the
idea sounds good to me.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-09-13 Thread Tom Lane
Andres Freund  writes:
> Attached is a significantly updated patch series (see the mail one up
> for details about what this is, I don't want to quote it in its
> entirety).

I've reviewed the portions of 0005 that have to do with making the parser
mark queries with hasTargetSRF.  The code as you had it was wrong because
it would set the flag as a consequence of SRFs in function RTEs, which
we don't want.  It seemed to me that the best way to fix that was to rely
on the parser's p_expr_kind mechanism to tell which part of the query
we're in, whereupon we might as well make the parser act more like it does
for aggregates and window functions and give a suitable error at parse
time for misplaced SRFs.  The attached isn't perfect, in that it doesn't
know about nesting restrictions (ie that SRFs must be at top level of a
function RTE), but we could improve that later if we wanted, and anyway
it's definitely a good bit nicer than before.

This also incorporates the part of 0002 that I agree with, namely
disallowing SRFs in UPDATE, since check_srf_call_placement() naturally
would do that.

I also renamed the flag to hasTargetSRFs, which is more parallel to
hasAggs and hasWindowFuncs, and made some effort to use it in place
of expression_returns_set() searches.

I'd like to go ahead and push this, since it's a necessary prerequisite
for either approach we might adopt for the rest of the patch series,
and the improved error reporting and avoidance of expensive
expression_returns_set searches make it a win IMO even if we were not
planning to do anything more with SRFs.

regards, tom lane

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index e997b57..dbd6094 100644
*** a/src/backend/catalog/heap.c
--- b/src/backend/catalog/heap.c
*** cookDefault(ParseState *pstate,
*** 2560,2573 
  
  	/*
  	 * transformExpr() should have already rejected subqueries, aggregates,
! 	 * and window functions, based on the EXPR_KIND_ for a default expression.
! 	 *
! 	 * It can't return a set either.
  	 */
- 	if (expression_returns_set(expr))
- 		ereport(ERROR,
- (errcode(ERRCODE_DATATYPE_MISMATCH),
-  errmsg("default expression must not return a set")));
  
  	/*
  	 * Coerce the expression to the correct type and typmod, if given. This
--- 2560,2568 
  
  	/*
  	 * transformExpr() should have already rejected subqueries, aggregates,
! 	 * window functions, and SRFs, based on the EXPR_KIND_ for a default
! 	 * expression.
  	 */
  
  	/*
  	 * Coerce the expression to the correct type and typmod, if given. This
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 4f39dad..71714bc 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** _copyQuery(const Query *from)
*** 2731,2736 
--- 2731,2737 
  	COPY_SCALAR_FIELD(resultRelation);
  	COPY_SCALAR_FIELD(hasAggs);
  	COPY_SCALAR_FIELD(hasWindowFuncs);
+ 	COPY_SCALAR_FIELD(hasTargetSRFs);
  	COPY_SCALAR_FIELD(hasSubLinks);
  	COPY_SCALAR_FIELD(hasDistinctOn);
  	COPY_SCALAR_FIELD(hasRecursive);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 4800165..29a090f 100644
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
*** _equalQuery(const Query *a, const Query 
*** 921,926 
--- 921,927 
  	COMPARE_SCALAR_FIELD(resultRelation);
  	COMPARE_SCALAR_FIELD(hasAggs);
  	COMPARE_SCALAR_FIELD(hasWindowFuncs);
+ 	COMPARE_SCALAR_FIELD(hasTargetSRFs);
  	COMPARE_SCALAR_FIELD(hasSubLinks);
  	COMPARE_SCALAR_FIELD(hasDistinctOn);
  	COMPARE_SCALAR_FIELD(hasRecursive);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 90fecb1..7e092d7 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outQuery(StringInfo str, const Query *n
*** 2683,2688 
--- 2683,2689 
  	WRITE_INT_FIELD(resultRelation);
  	WRITE_BOOL_FIELD(hasAggs);
  	WRITE_BOOL_FIELD(hasWindowFuncs);
+ 	WRITE_BOOL_FIELD(hasTargetSRFs);
  	WRITE_BOOL_FIELD(hasSubLinks);
  	WRITE_BOOL_FIELD(hasDistinctOn);
  	WRITE_BOOL_FIELD(hasRecursive);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 894a48f..917e6c8 100644
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*** _readQuery(void)
*** 238,243 
--- 238,244 
  	READ_INT_FIELD(resultRelation);
  	READ_BOOL_FIELD(hasAggs);
  	READ_BOOL_FIELD(hasWindowFuncs);
+ 	READ_BOOL_FIELD(hasTargetSRFs);
  	READ_BOOL_FIELD(hasSubLinks);
  	READ_BOOL_FIELD(hasDistinctOn);
  	READ_BOOL_FIELD(hasRecursive);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 04264b4..99b6bc8 100644
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*** check_output_expressions(Query *subquery
*** 2422,2428 
  			continue;
  
  		/* Functions 

Re: [HACKERS] kqueue

2016-09-13 Thread Robert Haas
On Tue, Sep 13, 2016 at 11:36 AM, Simon Riggs  wrote:
> On 13 September 2016 at 08:08, Heikki Linnakangas  wrote:
>> So, if I've understood correctly, the purpose of this patch is to improve
>> performance on a multi-CPU system, which has the kqueue() function. Most
>> notably, FreeBSD?
>
> I'm getting a little fried from "self-documenting" patches, from
> multiple sources.
>
> I think we should make it a firm requirement to explain what a patch
> is actually about, with extra points for including with it a test that
> allows us to validate that. We don't have enough committer time to
> waste on such things.

You've complained about this a whole bunch of times recently, but in
most of those cases I didn't think there was any real unclarity.  I
agree that it's a good idea for a patch to be submitted with suitable
submission notes, but it also isn't reasonable to expect those
submission notes to be reposted with every single version of every
patch.  Indeed, I'd find that pretty annoying.  Thomas linked back to
the previous thread where this was discussed, which seems more or less
sufficient.  If committers are too busy to click on links in the patch
submission emails, they have no business committing anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Simon Riggs
On 13 September 2016 at 08:08, Heikki Linnakangas  wrote:

> So, if I've understood correctly, the purpose of this patch is to improve
> performance on a multi-CPU system, which has the kqueue() function. Most
> notably, FreeBSD?

I'm getting a little fried from "self-documenting" patches, from
multiple sources.

I think we should make it a firm requirement to explain what a patch
is actually about, with extra points for including with it a test that
allows us to validate that. We don't have enough committer time to
waste on such things.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Andres Freund
Hi,


On 2016-09-13 16:08:39 +0300, Heikki Linnakangas wrote:
> So, if I've understood correctly, the purpose of this patch is to improve
> performance on a multi-CPU system, which has the kqueue() function. Most
> notably, FreeBSD?

I think it's not necessarily about the current system, but more about
future uses of the WaitEventSet stuff. Some of that is going to use a
lot more sockets. E.g. doing a parallel append over FDWs.


> I launched a FreeBSD 10.3 instance on Amazon EC2 (ami-e0682b80), on a
> m4.10xlarge instance. That's a 40 core system, biggest available, I believe.
> I built PostgreSQL master on it, and ran pgbench to benchmark:
> 
> pgbench -i -s 200 postgres
> pgbench -M prepared  -j 36 -c 36 -S postgres -T20 -P1

This seems likely to exercise the relevant code path only seldom. We
only do the poll()/epoll_wait()/... when a read() doesn't return
anything, and that seems unlikely to happen often here.  Using a
lower thread count and a much higher client count might change that.

Note that the case where poll vs. epoll made a large difference (after
the regression due to ac1d7945f86) on Linux was only on fairly large
machines, with high client counts.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Heikki Linnakangas

On 09/13/2016 04:33 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

So, if I've understood correctly, the purpose of this patch is to
improve performance on a multi-CPU system, which has the kqueue()
function. Most notably, FreeBSD?


OS X also has this, so it might be worth trying on a multi-CPU Mac.


If there's no measurable difference in performance, between kqueue() and
poll(), I think we should forget about this.


I agree that we shouldn't add this unless it's demonstrably a win.
No opinion on whether your test is adequate.


I'm marking this as "Returned with Feedback", waiting for someone to 
post test results that show a positive performance benefit from this.


- Heikki




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-09-13 Thread Robert Haas
On Tue, Aug 2, 2016 at 3:41 AM, Kyotaro HORIGUCHI
 wrote:
> Thank you for the comment.
>
> At Mon, 1 Aug 2016 10:44:56 +0530, Amit Khandekar  
> wrote in 
>> On 21 July 2016 at 15:20, Kyotaro HORIGUCHI  wrote:
>>
>> >
>> > After some consideration, I found that ExecAsyncWaitForNode
>> > cannot be reentrant because it means that the control goes into
>> > async-unaware nodes while having not-ready nodes, that is
>> > inconsistent state. To inhibit such reentering, I allocated node
>> > identifiers in depth-first order so that ascendant-descendant
>> > relationship can be checked (nested-set model) in simple way and
>> > call ExecAsyncConfigureWait only for the descendant nodes of the
>> > parameter planstate.
>> >
>> >
>> We have estate->waiting_nodes containing a mix of async-aware and
>> non-async-aware nodes. I was thinking, an asynchrony tree would have only
>> async-aware nodes, with possible multiple asynchrony sub-trees in a tree.
>> Somehow, if we restrict the bubbling up of events only upto the root of the
>> asynchrony subtree, do you think we can simplify some of the complexities ?
>
> The current code prohibits registration of nodes outside the
> current subtree to avoid the reentering disaster.
>
> Indeed, leaving the "waiting node" mark or something like it on every
> root node at the first visit will enable the propagation to stop
> up to the root of any async-subtree. Nevertheless, when an
> async-child in an inactive async-root fires, the new tuple is
> loaded but not consumed, and then the succeeding firing on the same
> child leads to a deadlock (without result queueing). However,
> that can be avoided if ExecAsyncConfigureWait doesn't register
> nodes in ready state.

Why would a node call ExecAsyncConfigureWait in the first place if it
already had a result ready?  I think it shouldn't do that.

> On the other hand, any two or more asynchronous nodes can share a
> synchronization object. For instance, multiple postgres_fdw scan
> nodes can share one server connection, and only one of them can get
> into a waitable state at once. If no async-child in the current
> async subtree is waitable, it must be stuck. So I think it is
> crucial for ExecAsyncWaitForNode to force at least one child *in
> the current async subtree* to get into waiting state in such a
> situation. The ascendant-descendant relationship is necessary to
> do that anyway.

This is another example of a situation where waiting only for nodes
within a subtree causes problems.

Suppose there are two Foreign Scans in completely different parts of
the plan tree that are going to use, in alternation, the same
connection to the same remote server.  When we encounter the first
one, it kicks off the query, uses ExecAsyncConfigureWait to register
itself as waiting, and returns without becoming ready.  When we
encounter the second one, it can't kick off the query and therefore
has no chance of becoming ready until after the first one has finished
with the connection.  Suppose we then wait for the second Foreign
Scan.  Well, we had better wait for the first one, too!  If we don't,
it will never finish with the connection, so the second node will
never get to use it, and now we're in trouble.

I think what we need is for the ConnCacheEntry to have a place to note
the ForeignScanState that is using the connection and any other
PlanState objects that would like to use it.  When one
ForeignScanState is done with the ConnCacheEntry, it activates the
next one, which then takes over.  That seems simple enough, but
there's a problem here for suspended queries: if we stop executing a
plan while some scan within that plan is holding onto a
ConnCacheEntry, and then we run some other query that wants to use the
same one, we've got a problem.  Maybe we can get by with letting the
other query finish running and then executing our own query, but that
might be messy to implement.  Another idea is to somehow let any
in-progress query finish running before allowing the first query to be
suspended; that would need some new infrastructure.
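
To make that concrete, here is a rough sketch of the sort of bookkeeping
I have in mind.  Every name below is invented for illustration; nothing
like this exists in postgres_fdw today, so treat it as a sketch of the
idea rather than a concrete proposal:

typedef struct ConnCacheEntrySketch
{
    PGconn     *conn;            /* the shared libpq connection */
    ForeignScanState *owner;     /* scan currently using the connection */
    List       *waiters;         /* PlanState nodes waiting for their turn */
} ConnCacheEntrySketch;

/* Called when the owning scan has finished with its current remote query. */
static void
connection_release_sketch(ConnCacheEntrySketch *entry)
{
    entry->owner = NULL;
    if (entry->waiters != NIL)
    {
        /* hand the connection to the next waiter and let it start up */
        entry->owner = (ForeignScanState *) linitial(entry->waiters);
        entry->waiters = list_delete_first(entry->waiters);
        /* ... kick off that node's pending remote query here ... */
    }
}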

My main point here is that I think waiting for only a subtree is an
idea that cannot work out well.  Whatever problems are pushing you
into that design, we need to confront those problems directly and fix
them.  There shouldn't be any unsolvable problems in waiting for
everything in the whole query, and I'm pretty sure that's going to be
a more elegant and better-performing design.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-09-13 Thread Robert Haas
On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
 wrote:
> [ new patches ]

+/*
+ * We assume that few nodes are async-aware and async-unaware
+ * nodes cannot be reverse-dispatched from lower nodes that are
+ * async-aware. Firing of an async node that is not a descendant
+ * of the planstate will cause such reverse-dispatching to
+ * async-aware nodes, which is unexpected behavior for them.
+ *
+ * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+ * where the OUTER is running asynchronously but the Hashjoin is
+ * waiting on the async INNER during inner-hash creation. If the
+ * OUTER fires for the case, since anyone is waiting on it,
+ * ExecAsyncWaitForNode finally dispatches to the Hashjoin which
+ * is now in the middle of doing its work.
+ */
+if (!IsParent(planstate, node))
+continue;

I'm not entirely sure that I understand this comment, but I don't
think it's going in the right direction.  Let's start with the example
in the second paragraph. If the hash join is async-unaware, then it
isn't possible for the hash join to be both running the outer side of
the join asynchronously and at the same time waiting on the inner
side.  Once it tries to pull the first tuple from the outer side, it's
waiting for that to finish and can't do anything else.  So, the inner
side can't possibly get touched in any way until the outer side
finishes.  For anything else to happen, the hash join would have to be
async-aware.  Even if we did that, I don't think it would be right to
kick off both sides of the hash join at the same time.  Right now, if
the outer side turns out to be empty, we never need to build the hash
table, and that's good.

I don't think it's a good idea to wait for only nodes that are in the
current subtree.  For example, consider a plan like this:

Append
-> Foreign Scan on a
-> Hash Join
  -> Foreign Scan on b
  -> Hash
-> Seq Scan on x

Suppose Append and Foreign Scan are parallel-aware but the other nodes
are not.  Append kicks off the Foreign Scan on a and then waits for
the hash join to produce a tuple; the hash join kicks off the Foreign
Scan on b and waits for it to return a tuple.  If, while we're waiting
for the foreign scan on b, the foreign scan on a needs some attention
- either to produce tuples, or maybe just to call PQconsumeInput() so
that more data can be sent from the other side, I think we need to be
able to do that.  There's no real problem here; even if the Append
becomes result-ready before the hash join returns, that is fine.  We
will not actually be able to return from the append until the hash
join returns because of what's on the call stack, but that doesn't
mean that the Append can't be marked result-ready sooner than that.
The situation can be improved by making the hash join node
parallel-aware, but if we don't do that it's still not broken.

I think the reason that you originally got backed into this design was
because of problems with reentrancy.  I don't think I really
understand in any detail exactly what problem you hit there, but it
seems to me that the following problem could occur:
ExecAsyncWaitForNode finds two events and schedules two callbacks.  It
calls the first of those two callbacks.  Before that callback returns,
it again calls ExecAsyncWaitForNode.  But the new invocation of
ExecAsyncWaitForNode doesn't know that there is a second callback
pending, so it somehow gets confused.  However, I think this problem
can fixed using a different method.  The occurred_event and callbacks
arrays defined by ExecAsyncWaitForNode can be made part of the EState
rather than being local variables.  When ExecAsyncWaitForNode is
called, it checks whether there are any pending callbacks; if so, it
removes and calls the first one.  Only if there are no pending
callbacks does it actually wait; when a wait event fires, one or more
new callbacks are generated.  This is very similar to the reason why
ReceiveSharedInvalidMessages uses a static messages array rather than
a normal local variable.  That function is solving a problem which I
suspect is very similar to the one we have here.  However, it would be
helpful if you could provide some more details on what you think the
reentrancy problems are, because I'm not able to figure them out from
your messages so far.
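
In other words, something along these lines -- a deliberately simplified
sketch with invented names, where the pending list is imagined as a field
of the EState so that a re-entrant call sees what an outer call collected:

#define MAX_PENDING_ASYNC 64

typedef struct AsyncPendingState
{
    PlanState  *node[MAX_PENDING_ASYNC];    /* nodes whose events fired */
    int         npending;
} AsyncPendingState;

static void
ExecAsyncWaitSketch(AsyncPendingState *pending)
{
    /* Deliver one previously collected callback, if any, and return. */
    if (pending->npending > 0)
    {
        PlanState  *ps = pending->node[--pending->npending];

        /* ... invoke ps's async-ready callback here ... */
        (void) ps;
        return;
    }

    /*
     * Nothing pending: block in WaitEventSetWait() (not shown), then push
     * every node whose event fired onto pending->node[], so that this call
     * and any re-entrant call drain them one at a time.
     */
}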

The most mysterious part of this hunk to me is the comment that
"Firing of an async node that is not a descendant of the planstate
will cause such reverse-dispatching to async-aware nodes, which is
unexpected behavior for them."  It is the async-unaware nodes which
might have a problem.  The nodes that have been taught about the new
system should know what to expect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] _hash_alloc_buckets() safety

2016-09-13 Thread Amit Kapila
On Tue, Sep 13, 2016 at 6:58 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> While working on write-ahead-logging of hash indexes, I noticed that
>> this function allocates buckets in batches and the mechanism it uses
>> is that it initializes the last page of the batch with zeros and expects
>> that the filesystem will ensure the intervening pages read as zeroes
>> too.
>
> Yes.  AFAIK that filesystem behavior is required by POSIX.
>
>> I think to make it WAL enabled, we need to initialize the page header
>> (using PageInit() or equivalent) instead of initializing it with
>> zeroes as some part of our WAL replay machinery expects that the page
>> should not be new as indicated by me in other thread [1].
>
> I don't really see why that's a problem.  The only way one of the fill
> pages would get to be not-zero is if there is a WAL action later in the
> stream that overwrites it.  So how would things become inconsistent?
>

It could lead to an error/log message indicating "unexpected zero
page". Refer below code in XLogReadBufferExtended():

if (mode == RBM_NORMAL)
{

/* check that page has been initialized */
Page page = (Page) BufferGetPage(buffer);
..
if (PageIsNew(page))
{
..
log_invalid_page(rnode, forknum, blkno, true);
}

Now, during replay of XLOG_FPI (the WAL record written for a new page by
log_newpage), it won't hit that condition because it is ensured in the
caller that if the block has an image, then we use the
RBM_ZERO_AND_CLEANUP_LOCK or RBM_ZERO_AND_LOCK buffer mode.  However, a
generic reader of blocks like pgstattuple
(pgstattuple->pgstat_hash_page->_hash_getbuf_with_strategy->_hash_checkpage())
or the new consistency check tool [1] being developed by Kuntal would lead
to an error/log message - " .. contains unexpected zero page at block ..".  I
think the consistency check tool might use some checks to avoid it,
but I am not sure.


>> Offhand, I don't see any problem with just
>> initializing the last page and writing the WAL for it with
>> log_newpage(); however, if we try to initialize all pages, there could
>> be some performance penalty on the split operation.
>
> "Some" seems like rather an understatement.  And it's not just the
> added I/O, it's the fact that you'd need to lock each bucket as you
> went through them to avoid clobbering concurrently-inserted data.
>

The question of concurrent data insertion can only come into the picture
when we update the metapage information (maxbucket, lowmask, highmask, ..).
Without that, I don't think concurrent users would ever be aware of
the new buckets.  However, as this work is done with the metapage locked,
it is not advisable to do more work here unless it is required.

> If you weren't talking about such an enormous penalty, I might be okay
> with zeroing the intervening pages explicitly rather than depending on
> the filesystem to do it.  But since you are, I think you need a clearer
> explanation of why this is necessary.
>

I am also not in favor of doing the extra work unless it is required.
The point of this mail is to check if anybody sees that it is
required, because I can't find a reason to initialize all pages.


[1] - 
https://www.postgresql.org/message-id/CAGz5QCLC=0jbmgag-barpyaavu9cn0b-scoepecgbe6rhh8...@mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 TAP tests and extensions

2016-09-13 Thread Tom Lane
Craig Ringer  writes:
> While updating an extension for 9.6 I noticed that while the
> $(prove_check) definition is exposed for use by PGXS in
> Makefile.global, extensions can't actually use the TAP tests because
> we don't install the required Perl modules like PostgresNode.pm.

> I don't see any reason not to make this available to extension authors
> and doing so is harmless, so here's a small patch to install it. I
> think it's reasonable to add this to 9.6 even at this late stage; IMO
> it should've been installed from the beginning.

Without taking a position on the merits of this patch per se, I'd like
to say that I find the argument for back-patching into 9.6 and not
further than that to be pretty dubious.  $(prove_check) has been there
since 9.4, and in the past we've often regretted it when we failed
to back-patch TAP infrastructure fixes all the way back to 9.4.

Or to be concrete: how is an extension author, or more to the point an
extension Makefile, supposed to know whether it can use $(prove_check)?
How would this patch change that, and how would extension authors cope
with building against both patched and unpatched trees?  

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Tom Lane
Heikki Linnakangas  writes:
> So, if I've understood correctly, the purpose of this patch is to 
> improve performance on a multi-CPU system, which has the kqueue() 
> function. Most notably, FreeBSD?

OS X also has this, so it might be worth trying on a multi-CPU Mac.

> If there's no measurable difference in performance, between kqueue() and 
> poll(), I think we should forget about this.

I agree that we shouldn't add this unless it's demonstrably a win.
No opinion on whether your test is adequate.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] _hash_alloc_buckets() safety

2016-09-13 Thread Tom Lane
Amit Kapila  writes:
> While working on write-ahead-logging of hash indexes, I noticed that
> this function allocates buckets in batches and the mechanism it uses
> is that it initializes the last page of the batch with zeros and expects
> that the filesystem will ensure the intervening pages read as zeroes
> too.

Yes.  AFAIK that filesystem behavior is required by POSIX.

> I think to make it WAL enabled, we need to initialize the page header
> (using PageInit() or equivalent) instead of initializing it with
> zeroes as some part of our WAL replay machinery expects that the page
> should not be new as indicated by me in other thread [1].

I don't really see why that's a problem.  The only way one of the fill
pages would get to be not-zero is if there is a WAL action later in the
stream that overwrites it.  So how would things become inconsistent?

> Offhand, I don't see any problem with just
> initializing the last page and writing the WAL for it with
> log_newpage(); however, if we try to initialize all pages, there could
> be some performance penalty on the split operation.

"Some" seems like rather an understatement.  And it's not just the
added I/O, it's the fact that you'd need to lock each bucket as you
went through them to avoid clobbering concurrently-inserted data.
If you weren't talking about such an enormous penalty, I might be okay
with zeroing the intervening pages explicitly rather than depending on
the filesystem to do it.  But since you are, I think you need a clearer
explanation of why this is necessary.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-13 Thread Robert Haas
On Sun, Sep 11, 2016 at 2:05 PM, Peter Geoghegan  wrote:
> On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas  wrote:
>> Pushed this "displace root" patch, with some changes:
>
> Attached is rebased version of the entire patch series, which should
> be applied on top of what you pushed to the master branch today.

0003 looks like a sensible cleanup of our #include structure
regardless of anything this patch series is trying to accomplish, so
I've committed it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-13 Thread Heikki Linnakangas
So, if I've understood correctly, the purpose of this patch is to 
improve performance on a multi-CPU system, which has the kqueue() 
function. Most notably, FreeBSD?


I launched a FreeBSD 10.3 instance on Amazon EC2 (ami-e0682b80), on a 
m4.10xlarge instance. That's a 40 core system, biggest available, I 
believe. I built PostgreSQL master on it, and ran pgbench to benchmark:


pgbench -i -s 200 postgres
pgbench -M prepared  -j 36 -c 36 -S postgres -T20 -P1

I set shared_buffers to 10 GB, so that the whole database fits in cache. 
I tested that with and without kqueue-v5.patch


Result: I don't see any difference in performance. pgbench reports 
between 80,000 and 97,000 TPS, with or without the patch:


[ec2-user@ip-172-31-17-174 ~/postgresql]$ ~/pgsql-install/bin/pgbench -M 
prepared  -j 36 -c 36 -S postgres -T20 -P1

starting vacuum...end.
progress: 1.0 s, 94537.1 tps, lat 0.368 ms stddev 0.145
progress: 2.0 s, 96745.9 tps, lat 0.368 ms stddev 0.143
progress: 3.0 s, 93870.1 tps, lat 0.380 ms stddev 0.146
progress: 4.0 s, 89482.9 tps, lat 0.399 ms stddev 0.146
progress: 5.0 s, 87815.0 tps, lat 0.406 ms stddev 0.148
progress: 6.0 s, 86415.5 tps, lat 0.413 ms stddev 0.145
progress: 7.0 s, 86011.0 tps, lat 0.415 ms stddev 0.147
progress: 8.0 s, 84923.0 tps, lat 0.420 ms stddev 0.147
progress: 9.0 s, 84596.6 tps, lat 0.422 ms stddev 0.146
progress: 10.0 s, 84537.7 tps, lat 0.422 ms stddev 0.146
progress: 11.0 s, 83910.5 tps, lat 0.425 ms stddev 0.150
progress: 12.0 s, 83738.2 tps, lat 0.426 ms stddev 0.150
progress: 13.0 s, 83837.5 tps, lat 0.426 ms stddev 0.147
progress: 14.0 s, 83578.4 tps, lat 0.427 ms stddev 0.147
progress: 15.0 s, 83609.5 tps, lat 0.427 ms stddev 0.148
progress: 16.0 s, 83423.5 tps, lat 0.428 ms stddev 0.151
progress: 17.0 s, 83318.2 tps, lat 0.428 ms stddev 0.149
progress: 18.0 s, 82992.7 tps, lat 0.430 ms stddev 0.149
progress: 19.0 s, 83155.9 tps, lat 0.429 ms stddev 0.151
progress: 20.0 s, 83209.0 tps, lat 0.429 ms stddev 0.152
transaction type: 
scaling factor: 200
query mode: prepared
number of clients: 36
number of threads: 36
duration: 20 s
number of transactions actually processed: 1723759
latency average = 0.413 ms
latency stddev = 0.149 ms
tps = 86124.484867 (including connections establishing)
tps = 86208.458034 (excluding connections establishing)


Is this test setup reasonable? I know very little about FreeBSD, I'm 
afraid, so I don't know how to profile or test that further than that.


If there's no measurable difference in performance, between kqueue() and 
poll(), I think we should forget about this. If there's a FreeBSD hacker 
out there that can demonstrate better results, I'm all for committing 
this, but I'm reluctant to add code if no-one can show the benefit.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] An extra error for client disconnection on Windows

2016-09-13 Thread Alvaro Herrera
Michael Paquier wrote:
> On Tue, Sep 13, 2016 at 10:42 AM, Kyotaro HORIGUCHI
>  wrote:
> > If we take a policy to try to imitate the behavior of some
> > reference platform (specifically Linux) on other platforms, this
> > is required disguising. Another potential policy on this problem
> > is "following the platform's behavior". From this viewpoint, this
> > message should be shown to users because Windows says
> > so. Especially for socket operations, the simulation layer is
> > intending the former for non-error behaviors, but I'm not sure
> > about the behaviors on errors.
> 
> The more you hack windows, the more you'll notice that it is full of
> caveats, behavior exceptions, and that it runs in its way as nothing
> else in this world... This patch looks like a tempest in a teapot at
> the end. Why is it actually a problem to show this message? Just
> useless noise? If that's the only reason let's drop the patch and move
> on.

Yeah, I looked into this a few days ago and that was my conclusion also:
let's drop this.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cost_sort() may need to be updated

2016-09-13 Thread Robert Haas
On Sun, Sep 11, 2016 at 12:13 PM, Peter Geoghegan  wrote:
> On Sun, Sep 11, 2016 at 9:01 AM, Tom Lane  wrote:
>> Peter Geoghegan  writes:
>>> I think that we *can* refine this guess, and should, because random
>>> I/O is really quite unlikely to be a large cost these days (I/O in
>>> general often isn't a large cost, actually). More fundamentally, I
>>> think it's a problem that cost_sort() thinks that external sorts are
>>> far more expensive than internal sorts in general. There is good
>>> reason to think that that does not reflect the reality. I think we can
>>> expect external sorts to be *faster* than internal sorts with
>>> increasing regularity in Postgres 10.
>>
>> TBH, if that's true, haven't you broken something?
>
> It's possible for external sorts to be faster some of the time because
> the memory access patterns can be more cache efficient: smaller runs
> are better when accessing tuples in sorted order, scattered across
> memory. More importantly, the sort can start returning tuples earlier
> in the common case where a final on-the-fly merge can be performed. In
> principle, you could adopt internal sorts to have the same advantages,
> but that hasn't and probably won't happen. Finally, the external sort
> I/O costs grow linearly, whereas the CPU costs grow in a linearithmic
> fashion, which will eventually come to dominate. We can hide the
> latency of those costs pretty well, too, with asynchronous I/O.
>
> I'm not arguing that cost_sort() should think that external sorts are
> cheaper under any circumstances, since all of this is very hard to
> model. I only mention this because it illustrates nicely that
> cost_sort() has the wrong idea.

I think that the only real way to judge whether cost_sort() is any
good is to see whether it causes the wrong plan to be chosen in some
cases.  For example, it could cause merge joins to be picked too
frequently or too infrequently, or it could cause a wrong decision
between hash and group aggregates.  Now, there is bound to be an
argument about whether any test cases that we might pick are
representative and the results will further differ between hardware
platforms, but I think changing the formula based on an abstract idea
about what's happening is probably not a good idea.

Because it's necessary to set work_mem so conservatively to avoid
running out of memory, most supposedly-external sorts aren't really
doing any I/O at all, and therefore arguing about whether we should be
charging seq_page_cost or random_page_cost for I/O is kinda missing
the point.  Those numbers may just be reflecting other work that the
sort is doing that isn't being properly accounted for elsewhere, or
it's possible that the current costing of sorts is fairly far away
from reality.  How would we know?  The only thing that makes me think
they probably aren't *too* bad is that in a somewhat-systematic study
of query performance problems
(https://sites.google.com/site/robertmhaas/query-performance) done
several years ago I found no evidence of a problem with sort costing,
and my experience outside of that study bears that out.  But that's
pretty back of the envelope.

Of course, the issue of supposed I/O actually being reads of cached
data is not unique to sorting.  Sequential scans have the same
problem.  We've previously discussed the idea of adding a
cached_page_cost, but this has fallen down over the issue that you
also need to know what percentage of the table is likely to be read
from cache, and you don't want that estimate to be too unstable or
your plans will bounce all over the place.  One idea I had for dealing
with that is to just base it on the size of the table: assume small
tables are probably cached, and large tables probably aren't.  The
latter is certainly true at least in the case where "large" means
"bigger than physical memory", and the former isn't a bad guess
either.  So basically where this would go is to assume that scanning a
small table is cheaper per page than scanning a big one, which I
suspect is probably better than the uniform estimate we're using
today.  However, I haven't got a single test case to prove it, and in
fact I'm not even sure how one could go about figuring out whether
this makes things better or worse in general.
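
Just to illustrate the shape of the idea -- cached_page_cost is not an
existing GUC, and this formula is invented purely for illustration, not a
proposal -- the per-page cost could be interpolated based on relation
size, something like:

static Cost
sketch_page_cost(BlockNumber rel_pages, double effective_cache_pages,
                 Cost cached_page_cost, Cost seq_page_cost)
{
    double      cached_fraction;

    if (rel_pages == 0 || effective_cache_pages <= 0)
        return seq_page_cost;

    /* crude guess at the fraction of the relation we'd find in cache */
    cached_fraction = Min(1.0, effective_cache_pages / rel_pages);

    return cached_fraction * cached_page_cost +
        (1.0 - cached_fraction) * seq_page_cost;
}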

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] _hash_alloc_buckets() safety

2016-09-13 Thread Amit Kapila
While working on write-ahead-logging of hash indexes, I noticed that
this function allocates buckets in batches and the mechanism it uses
is that it initialize the last page of batch with zeros and expect
that the filesystem will ensure the intervening pages read as zeroes
too.

I think to make it WAL enabled, we need to initialize the page header
(using PageInit() or equivalent) instead of initializing it with
zeroes as some part of our WAL replay machinery expects that the page
should not be new as indicated by me in other thread [1].  I think WAL
consistency check tool [2] also uses same part of replay functions and
will show this as problem, if we don't initialize the page header.

The point which is not clear to me is whether it is okay as-is or
whether we should try to initialize each page of the batch during
_hash_alloc_buckets(), considering that now we are trying to make hash
indexes WAL enabled.  Offhand, I don't see any problem with just
initializing the last page and writing the WAL for it with
log_newpage(); however, if we try to initialize all pages, there could
be some performance penalty on the split operation.
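
To be concrete, what I have in mind for the last page is roughly the
sketch below, reusing the rel and lastblock variables that
_hash_alloc_buckets() already has.  This is untested; a real version
would also need to set up the hash special space (and think about buffer
alignment) rather than passing 0 to PageInit():

    char        pagebuf[BLCKSZ];
    Page        page = (Page) pagebuf;

    /* give the last page of the batch a valid header instead of zeroes */
    PageInit(page, BLCKSZ, 0);

    /* WAL-log it as a full-page image before it can reach disk */
    if (RelationNeedsWAL(rel))
        log_newpage(&rel->rd_node, MAIN_FORKNUM, lastblock, page, true);

    RelationOpenSmgr(rel);
    smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, pagebuf, false);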

Thoughts?


[1] - 
https://www.postgresql.org/message-id/CAA4eK1JS%2BSiRSQBzEFpnsSmxZKingrRH7WNyWULJeEJSj1-%3D0w%40mail.gmail.com
[2] - https://commitfest.postgresql.org/10/741/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-13 Thread Robert Haas
On Thu, Sep 8, 2016 at 12:32 AM, Amit Kapila  wrote:
> Hmm.  I think page or block is a concept of database systems and
> buckets is a general concept used in hashing technology.  I think the
> difference is that there are primary buckets and overflow buckets. I
> have checked how they are referred in one of the wiki pages [1],
> search for overflow on that wiki page. Now, I think we shouldn't be
> inconsistent in using them. I will change to make it same if I find
> any inconsistency based on what you or other people think is the
> better way to refer overflow space.

In the existing source code, the terminology 'overflow page' is
clearly preferred to 'overflow bucket'.

[rhaas pgsql]$ git grep 'overflow page' | wc -l
  75
[rhaas pgsql]$ git grep 'overflow bucket' | wc -l
   1

In our off-list conversations, I too have found it very confusing when
you've made reference to an overflow bucket.  A hash table has a fixed
number of buckets, and depending on the type of hash table the storage
for each bucket may be linked together into some kind of a chain;
here, a chain of pages.  The 'bucket' logically refers to all of the
entries that have hash codes such that (hc % nbuckets) == bucketno,
regardless of which pages contain them.
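
For reference, the mapping is essentially the following (modeled on
_hash_hashkey2bucket(), simplified here just to illustrate the
terminology, not a proposed change):

static uint32
hashkey_to_bucket(uint32 hashkey, uint32 maxbucket,
                  uint32 highmask, uint32 lowmask)
{
    uint32      bucket = hashkey & highmask;

    /* masks are of the form 2^k - 1, so this is (hc % nbuckets) in spirit */
    if (bucket > maxbucket)
        bucket &= lowmask;

    return bucket;
}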

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-13 Thread Jesper Pedersen

On 09/12/2016 10:42 PM, Amit Kapila wrote:

The following script hangs on idx_val creation - just with v5, WAL patch
not applied.



Are you sure it is actually hanging? I see 100% cpu for a few minutes but
the index eventually completes ok for me (v5 patch applied to today's
master).



It completed for me as well.  The second index creation is taking more
time and cpu, because it is just inserting duplicate values which need
a lot of overflow pages.



Yeah, sorry for the false alarm. It just took 3m45s to complete on my 
machine.


Best regards,
 Jesper



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-13 Thread Amit Kapila
Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JuKt%3D-%3DY0FheiFL-i8Z5_5660%3D3n8JUA8s3zG53t_ArQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


concurrent_hash_index_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-13 Thread Amit Kapila
On Mon, Sep 12, 2016 at 11:29 AM, Jeff Janes  wrote:
>
>
> My test program (as posted) injects crashes and then checks the
> post-crash-recovery system for consistency, so it cannot be run as-is
> without the WAL patch.  I also ran the test with crashing turned off (just
> change the JJ* variables at the top of the do.sh to all be set to the empty
> string), and in that case I didn't see either problem, but it it could just
> be that I that I didn't run it long enough.
>
> It should have been long enough to detect the rather common "buffer  is
> not owned by resource owner Portal" problem, so that one I think is specific
> to the WAL patch (probably the part which tries to complete bucket splits
> when it detects one was started but not completed?)
>

Both the problems seem to be due to the same reason, and it is in the
concurrent hash index patch.  What is happening here is that we are
trying to unpin the old bucket buffer twice.  We need to scan the old
bucket when there is an incomplete split, so the issue here is that
during a split the system has crashed and after restart, during the old
bucket scan it tries to unpin the old primary bucket buffer twice when
the new bucket has additional overflow pages.  I will post the updated
patch on the concurrent hash index thread.  Once I post the patch, it
would be good if you try to reproduce the issue once more.

Thanks to Ashutosh Sharma who has offlist shared the call stack after
reproducing the problem.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 TAP tests and extensions

2016-09-13 Thread Alvaro Herrera
Craig Ringer wrote:

> I suggest that the above patches be applied to 9.6 and v10. Then for
> v10

I don't object to patching 9.6 in this way, but kindly do not pollute
this thread with future ideas on what to do on pg10, at least until the
current release is sorted out.  You'll only distract people from your
present objective.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested loop join condition does not get pushed down to foreign scan

2016-09-13 Thread Ashutosh Bapat
On Tue, Sep 13, 2016 at 4:05 PM, Albe Laurenz  wrote:
> I just noticed something surprising:
>
> -- create a larger local table
> CREATE TABLE llarge (id integer NOT NULL, val integer NOT NULL);
> INSERT INTO llarge SELECT i, i%100 FROM generate_series(1, 1) i;
> ALTER TABLE llarge ADD PRIMARY KEY (id);
>
> -- create a small local table
> CREATE TABLE small (id integer PRIMARY KEY, val text NOT NULL);
> INSERT INTO small VALUES (1, 'one');
>
> -- create a foreign table based on llarge
> CREATE EXTENSION postgres_fdw;
> CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 
> 'localhost', port '5432', dbname 'test');
> CREATE USER MAPPING FOR myself SERVER loopback OPTIONS (user 'myself', 
> password 'mypassword');
> CREATE FOREIGN TABLE rlarge (id integer NOT NULL, val integer NOT NULL) 
> SERVER loopback OPTIONS (table_name 'llarge');
>
> SET enable_hashjoin = off;
> -- plan for a nested loop join with a local table
> EXPLAIN (COSTS off) SELECT * FROM small JOIN llarge USING (id);
>   QUERY PLAN
> --
>  Nested Loop
>->  Seq Scan on small
>->  Index Scan using llarge_pkey on llarge
>  Index Cond: (id = small.id)
> (4 rows)
>
> -- plan for a nested loop join with a foreign table
> EXPLAIN (COSTS off) SELECT * FROM small JOIN rlarge USING (id);
>   QUERY PLAN
> ---
>  Nested Loop
>Join Filter: (small.id = rlarge.id)
>->  Seq Scan on small
>->  Foreign Scan on rlarge
> (4 rows)
>
>
> Is there a fundamental reason why the join condition does not get pushed down 
> into
> the foreign scan or is that an omission that can easily be fixed?
>

While creating the foreign table, if you specify use_remote_estimate =
true for the table OR do so for the foreign server, postgres_fdw
creates parameterized paths for that foreign relation. If using a
parameterized path reduces cost of the join, it will use a nested loop
join with inner relation parameterized by the outer relation, pushing
join conditions down into the foreign scan.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Nested loop join condition does not get pushed down to foreign scan

2016-09-13 Thread Albe Laurenz
I just noticed something surprising:

-- create a larger local table
CREATE TABLE llarge (id integer NOT NULL, val integer NOT NULL);
INSERT INTO llarge SELECT i, i%100 FROM generate_series(1, 1) i;
ALTER TABLE llarge ADD PRIMARY KEY (id);

-- create a small local table
CREATE TABLE small (id integer PRIMARY KEY, val text NOT NULL);
INSERT INTO small VALUES (1, 'one');

-- create a foreign table based on llarge
CREATE EXTENSION postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 
'localhost', port '5432', dbname 'test');
CREATE USER MAPPING FOR myself SERVER loopback OPTIONS (user 'myself', password 
'mypassword');
CREATE FOREIGN TABLE rlarge (id integer NOT NULL, val integer NOT NULL) SERVER 
loopback OPTIONS (table_name 'llarge');

SET enable_hashjoin = off;
-- plan for a nested loop join with a local table
EXPLAIN (COSTS off) SELECT * FROM small JOIN llarge USING (id);
  QUERY PLAN
--
 Nested Loop
   ->  Seq Scan on small
   ->  Index Scan using llarge_pkey on llarge
 Index Cond: (id = small.id)
(4 rows)

-- plan for a nested loop join with a foreign table
EXPLAIN (COSTS off) SELECT * FROM small JOIN rlarge USING (id);
  QUERY PLAN
---
 Nested Loop
   Join Filter: (small.id = rlarge.id)
   ->  Seq Scan on small
   ->  Foreign Scan on rlarge
(4 rows)


Is there a fundamental reason why the join condition does not get pushed down 
into
the foreign scan or is that an omission that can easily be fixed?

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-13 Thread Kuntal Ghosh
 - If WAL consistency check is enabled for a rmgrID, we always include
 the backup image in the WAL record.
>>>
>>>What happens if wal_consistency has different settings on a standby
>>>and its master? If for example it is set to 'all' on the standby, and
>>>'none' on the master, or vice-versa, how do things react? An update of
>>>this parameter should be WAL-logged, no?
>>
>> If wal_consistency is enabled for a rmid, standby will always check whether
>> backup image exists or not i.e. BKPBLOCK_HAS_IMAGE is set or not.
>> (I guess Amit and Robert also suggested the same in the thread)
>> Basically, BKPBLOCK_HAS_IMAGE is set if a block contains image and
>> BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
>> image is required during redo. When we decode a wal record, has_image
>> flag of DecodedBkpBlock is set to BKPIMAGE_IS_REQUIRED_FOR_REDO.
>
>Ah I see. But do we actually store the status in the record itself,
>then at replay we don't care of the value of wal_consistency at
>replay. That's the same concept used by wal_compression. So shouldn't
>you have more specific checks related to that in checkConsistency? You
>actually don't need to check for anything in xlogreader.c, just check
>for the consistency if there is a need to do so, or do nothing.
I'm sorry, but I don't quite follow you here. If a wal record contains
an image, has_image should be set since it helps decoding the
record. But, during redo if XLogRecHasBlockImage() returns true, i.e.,
has_image is set, then it always restores the block. But, in our case,
a record can have a backup image which should not be restored. So, we need
to decide two things:
1. Does a record contain a backup image? (required for decoding the record)
2. If it has an image, should we restore it during redo?
I think we should decide these in DecodeXLogRecord() only. BKPBLOCK_HAS_IMAGE
answers the first question whereas BKPIMAGE_IS_REQUIRED_FOR_REDO
answers the second one. In DecodeXLogRecord(), we check that
BKPBLOCK_HAS_IMAGE should be set if wal_consistency is enabled for
this record. The flag has_image is set to
BKPIMAGE_IS_REQUIRED_FOR_REDO which is later used to decide whether we
want to restore a block or not.

>> For now, I've kept this as a WARNING message to detect all inconsistencies
>> at once. Once the patch is finalized, I'll modify it as an ERROR message.

>Or say FATAL. This way the server is taken down.

>> Thoughts?
+1. I'll do that.

>A couple of extra thoughts:
>1) The routines of each rmgr are located in a dedicated file, for
>example GIN stuff is in ginxlog.c, etc. It seems to me that it would
>be better to move each masking function where it should be instead
>being centralized. A couple of routines need to be centralized, so I'd
>suggest putting them in a new file, like xlogmask.c, xlog because now
>this is part of WAL replay completely, including the lsn, the hint
>bint and the other common routines.
Sounds good. But, I think that the file name for common masking routines
should be bufmask.c since we are masking the buffers only.

>2) Regarding page comparison:
>+/*
>+ * Compare the contents of two pages.
>+ * If the two pages are exactly same, it returns BLCKSZ. Otherwise,
>+ * it returns the location where the first mismatch has occurred.
>+ */
>+int
>+comparePages(char *page1, char *page2)
>We could just use memcpy() here. compareImages was useful to get a
>clear image of what the inconsistencies were, but you don't do that
>anymore.
memcmp(), right?

>5)
>+   has_image = record->blocks[block_id].has_image;
>+   record->blocks[block_id].has_image = true;
>+   if (!RestoreBlockImage(record, block_id, old_page))
>+   elog(ERROR, "failed to restore block image");
>+   record->blocks[block_id].has_image = has_image;
>Er, what? And BKPIMAGE_IS_REQUIRED_FOR_REDO?
Sorry, I completely missed this.

>6)
>+   /*
>+* Remember that, if WAL consistency check is enabled for
>the current rmid,
>+* we always include backup image with the WAL record.
>But, during redo we
>+* restore the backup block only if needs_backup is set.
>+*/
>+   if (needs_backup)
>+   bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
>This should use wal_consistency[rmid]?
needs_backup is set when XLogRecordAssemble decides that backup image
should be included in the record for redo purpose. This image will be
restored during
redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the included
image should be restored during redo(or has_image should be set or not).

>7) This patch has zero documentation. Please add some. Any human being
>on this list other than those who worked on the first versions
>(Heikki, Simon and I?) is going to have a hard time to review this
>patch in details moving on if there is no reference to tell what this
>feature does for the user...
>
>This patch is going to the good direction, but I don't think it's far

Re: [HACKERS] Supporting SJIS as a database encoding

2016-09-13 Thread Heikki Linnakangas

On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:

Returning in UTF-8 bloats the result string by about 1.5 times so
it doesn't seem to make sense comparing with it. But it takes
real = 47.35s.


Nice!

I was hoping that this would also make the binaries smaller. A few dozen 
kB of storage is perhaps not a big deal these days, but still. And 
smaller tables would also consume less memory and CPU cache.


I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that 
the old table isn't included anymore, compiled, and ran "strip 
utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's 
160 kB. So the radix tree takes a little bit more space.


That's not too bad, and I'm sure we could live with that, but with a few 
simple tricks, we could do better. First, since all the values we store 
in the tree are < 0xffff, we could store them in int16 instead of int32, 
and halve the size of the table right off the bat. That won't work for 
all encodings, of course, but it might be worth it to have two versions 
of the code, one for int16 and another for int32.


Another trick is to eliminate redundancies in the tables. Many of the 
tables contain lots of zeros, as in:



  /*   c3xx */{
/*   c380 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/*   c388 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/*   c390 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
/*   c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/*   c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/*   c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/*   c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180,
/*   c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
  },


and


  /* e388xx */{
/* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388b0 */ 0x0000, 0xfa58, 0x878b, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388b8 */ 0x0000, 0x878c, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
  },


You could overlay the last row of the first table, which is all zeros, 
with the first row of the second table, which is also all zeros. (Many 
of the tables have a lot more zero-rows than this example.)
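
As a toy illustration of that kind of sharing (this is not the layout the
patch uses, just the idea): keep all the 8-entry rows in one flat array
and let each second-level table refer to its rows by offset, so that
every all-zeros row can point at a single shared row:

static const uint16 rows[] = {
    /* offset 0: shared all-zeros row */
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* offset 8: row for c390..c397 */
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
    /* offset 16: row for c3b0..c3b7 */
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180
};

/* the c3xx table shrinks to eight row offsets; all-zero rows share offset 0 */
static const uint16 c3xx_rows[8] = {0, 0, 8, 0, 0, 0, 16, 0};

/*
 * Lookup for a second byte b2 in 0x80..0xbf:
 *     rows[c3xx_rows[(b2 - 0x80) >> 3] + ((b2 - 0x80) & 7)]
 */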


But yes, this patch looks very promising in general. I think we should 
switch over to radix trees for the all the encodings.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow to store select results into variables

2016-09-13 Thread Fabien COELHO


Hello Amit,


[...]
There still seems to be a change in behavior of the -r option due to the
patch. Consider the following example:

select a from a where a = 1 \;
select a+1 from a where a = 1;
...
- statement latencies in milliseconds:
2.889  select a from a where a = 1 ;


vs


select a from a where a = 1 \; \into a
select a+1 from a where a = 1; \into b
...
2.093  select a from a where a = 1 ; select a+1 from a where a = 1;


Yep.

Note that there is a small logical conundrum in this argument: as the 
script is not the same, especially as into was not possible before, 
strictly speaking there is no behavior "change".


This said, what you suggest can be done.

After giving it some thought, I suggest that it is neither needed nor 
desirable. If you want to achieve the initial effect, you just have to put 
the "into a" on the next line:


  select a from a where a = 1 \;
  \into a
  select a+1 from a where a = 1; \into b

Then you would get the -r cut at the end of the compound command. Thus the 
current version gives full control of what will appear in the summary. If 
I change "\into xxx\n" to mean "also cut here", then there is less control 
on when the cut occurs when into is used.



One more thing I observed which I am not sure if it's a fault of this
patch is illustrated below:

$ cat /tmp/into.sql
\;
select a from a where a = 1 \;
select a+1 from a where a = 1;

$ pgbench -r -n -t 1 -f /tmp/into.sql postgres

- statement latencies in milliseconds:
2.349  ;

Note that the compound select statement is nowhere to be seen in the
latencies output. The output remains the same even if I use the \into's.
What seems to be going on is that the empty statement on the first line
(\;) is the only part kept of the compound statement spanning lines 1-3.


Yes.

This is really the (debatable) current behavior, and is not affected by 
the patch. The "-r" summary takes the first line of the command, whatever 
it is. In your example the first line is "\;", so you get what you asked 
for, even if it looks rather strange, obviously.



+boolsql_command_in_progress = false;

[...]
I understood that it refers to what you explain here.  But to me it
sounded like the name is referring to the progress of *execution* of a SQL
command whereas the code in question is simply expecting to finish
*parsing* the SQL command using the next lines.


Ok. I changed it "sql_command_lexing_in_progress".


The attached patch takes into all your comments but:
 - comment about discarded results...
 - the sql_command_in_progress variable name change
 - the longer message on into at the start of a script


The patch seems fine without these, although please consider the concern I
raised with regard to the -r option above.


I have considered it. As the legitimate behavior you suggested can be 
achieved just by putting the into on the next line, ISTM that the current 
proposition gives more control than doing a mandatory cut when into is 
used.


Attached is a new version with the boolean renaming.

The other thing I have considered is whether to implement a "\gset" 
syntax, as suggested by Pavel and Tom. Aesthetics aside, the main issue I 
have with it is that it does not work with compound commands, and what I 
want is to get the values out of compound commands... because of my focus 
on latency... so basically "\gset" does not do the job I want. Now I 
recognize that other people would like it, so I'll probably do it anyway 
in another patch.


--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 285608d..0a474e1 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -809,6 +809,30 @@ pgbench  options  dbname
   
 
   
+   
+
+ \into var1 [var2 ...]
+
+
+
+ 
+  Stores the first fields of the resulting row from the current or preceding
+  SQL command into these variables.
+  The queries must yield exactly one row and the number of provided
+  variables must be less than the total number of columns of the results.
+  This meta command does not end the current SQL command.
+ 
+
+ 
+  Example:
+
+SELECT abalance \into abalance
+  FROM pgbench_accounts WHERE aid=5432;
+
+ 
+
+   
+

 
  \set varname expression
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 56c37d5..8817ac5 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -302,11 +302,14 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
 
 typedef struct
 {
-	char	   *line;			/* text of command line */
+	char	   *line;			/* first line for short display */
+	char	   *lines;			/* full multi-line text of command */
 	int			command_num;	/* unique index of this Command struct */
 	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
 	int			argc;			/* number of command words */
 	char	   *argv[MAX_ARGS]; /* command word list */
+	int			

Re: [HACKERS] PoC: Partial sort

2016-09-13 Thread Alexander Korotkov
On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan  wrote:

> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>  wrote:
> > Hmm... I'm not completely agree with that. In typical usage partial sort
> > should definitely use quicksort.  However, fallback to other sort
> methods is
> > very useful.  Decision of partial sort usage is made by planner.  But
> > planner makes mistakes.  For example, our HashAggregate is purely
> in-memory.
> > In the case of planner mistake it causes OOM.  I met such situation in
> > production and not once.  This is why I'd like partial sort to have
> graceful
> > degradation for such cases.
>
> I think that this should be moved to the next CF, unless a committer
> wants to pick it up today.
>

Patch was rebased to current master.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


partial-sort-basic-9.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-13 Thread Michael Paquier
On Fri, Sep 9, 2016 at 1:25 PM, Tom Lane  wrote:
> Andres Freund  writes:
>> On 2016-09-08 20:15:46 -0400, Peter Eisentraut wrote:
>>> We don't support build directories with spaces in them, but we support
>>> installation directories with spaces in them.  So I guess that means
>>> your point is valid.
>
>> Even if not necessary in this specific case, I personally think it's
>> just a good idea to quote halfway sensibly. Especially when it's trivial to
>> do so.
>
> The problem is it's only halfway ... won't paths with quotes in them
> still break it?

Trying to build Postgres from source with top_srcdir containing a
quote already fails at ./configure... So I don't see a new problem
here but...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
Sorry for my late reply.

The attached patch fixed the wording problems on SGML part.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Jeevan Chalke [mailto:jeevan.cha...@enterprisedb.com]
> Sent: Tuesday, September 06, 2016 11:22 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Etsuro Fujita
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> 
> Hi,
> 
> Changes look good to me.
> 
> However there are couple of minor issues need to be fixed.
> 
> 1.
> "under" repeated on second line. Please remove.
> +if and when CustomScanState is located under
> +under LimitState; which implies the underlying node is not
> 
> 2.
> Typo: dicsussion => discussion
> Please fix.
> 
> Apart from this I see no issues.
> 
> 
> Thanks
> 
> 
> --
> 
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 



pgsql-v10-fdw-css-limit-bound.v3.patch
Description: pgsql-v10-fdw-css-limit-bound.v3.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Refactoring of heapam code.

2016-09-13 Thread Michael Paquier
On Mon, Sep 12, 2016 at 7:12 PM, Pavan Deolasee
 wrote:
> I was thinking about locking, bulk reading (like page-mode API) etc. While
> you've added an API for vacuuming, we would probably also need an API to
> collect dead tuples, pruning etc (not sure if that can be combined with
> vacuum). Of course, these are probably very specific to current
> implementation of heap/MVCC and not all storages will need that.

Likely not, but it is likely that people would like to be able to get
that if some custom method needs it. I think that what would be a good
first step here is to brainstorm a potential set of custom AMs for
tables, get the smallest, meaningful one from the set of ideas
proposed, and define a basic set of relation/storage/table AM routines
that we can come up with to support it. And then build up a prototype
using it. We have been talking a lot about refactoring and reordering
stuff, which is something that would be good in the long term to make
things more generic with heap, but getting an idea of "we-want-that"
may prove to help in designing a minimum set of features that we'd
like to have here.

This would anyway likely require extending CREATE TABLE to be able to
pass a custom AM for a given relation and to store it in pg_catalog, but
that's a detail in the whole picture...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 TAP tests and extensions

2016-09-13 Thread Craig Ringer
On 13 September 2016 at 14:36, Craig Ringer  wrote:

>
>
> prove_check:
> rm -rf $(CURDIR)/tmp_check/log
> cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(shell pg_config
> --bindir):$$PATH" PGPORT='6$(DEF_PGPORT)'
> top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='pg_regress'
> $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
>
> .PHONY: prove_check

Actually, that should be


prove_check:
	rm -rf $(CURDIR)/tmp_check/log
	cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(shell $(PG_CONFIG) --bindir):$$PATH" \
		PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' \
		PG_REGRESS='$(pgxsdir)/src/test/regress/pg_regress' \
		$(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl


for a typical install.

If PGXS defined that as a var that can be invoked with $(PGXS_PROVE)
or something that'd be handy, but can be done later. It's trivial to
do in an extension Makefile so long as the required files actually get
installed.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] An extra error for client disconnection on Windows

2016-09-13 Thread Michael Paquier
On Tue, Sep 13, 2016 at 10:42 AM, Kyotaro HORIGUCHI
 wrote:
> If we take a policy of trying to imitate the behavior of some
> reference platform (specifically Linux) on other platforms, this
> disguising is required. Another potential policy on this problem
> is "following the platform's behavior". From this viewpoint, this
> message should be shown to users because Windows says
> so. Especially for socket operations, the simulation layer is
> intending the former for non-error behaviors, but I'm not sure
> about the behaviors on errors.

The more you hack Windows, the more you'll notice that it is full of
caveats and behavior exceptions, and that it runs its own way like
nothing else in this world... This patch looks like a tempest in a
teapot in the end. Why is it actually a problem to show this message?
Is it just useless noise? If that's the only reason, let's drop the
patch and move on. It seems that the extra information that could be
fetched, depending on what caused the connection reset, is not worth
the risk.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-13 Thread Michael Paquier
On Sun, Sep 11, 2016 at 12:03 AM, Robert Haas  wrote:
> It seems entirely unnecessary for the master and the standby to agree
> here.  I think what we need is two GUCs.  One of them, which affects
> only the master, controls whether the validation information is
> included in the WAL, and the other, which affects only the standby,
> affects whether validation is performed when the necessary information
> is present.  Or maybe skip the second one and just decree that
> standbys will always validate if the necessary information is present.
> Using the same GUC on both the master and the standby but making it
> mean different things in each of those places (whether to log the
> validation info in one case, whether to perform validation in the
> other case) is another option that also avoids needing to enforce that
> the setting is the same in both places, but probably an inferior one.

Thinking more about that, there is no actual need to do anything
complicated here. We could just track at the record level whether a
consistency check needs to be done at replay, and do it. If nothing is
set, just do nothing. That would allow us to promote this parameter to
SIGHUP. wal_compression does something similar.
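
To illustrate the idea (the flag name and the wal_consistency array are
hypothetical here, simply mirroring how wal_compression stamps its
per-image status into bimg_info):

/* Hypothetical bit in bimg_info, alongside the existing BKPIMAGE_* flags. */
#define BKPIMAGE_CHECK_CONSISTENCY	0x08

/* Master, in XLogRecordAssemble(): stamp the intent into the record. */
if (wal_consistency[rmid])
	bimg.bimg_info |= BKPIMAGE_CHECK_CONSISTENCY;

/* Standby, at replay: no GUC lookup needed, the record says it all. */
if ((record->blocks[block_id].bimg_info & BKPIMAGE_CHECK_CONSISTENCY) != 0)
	checkConsistency(record);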
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 TAP tests and extensions

2016-09-13 Thread Craig Ringer
On 13 September 2016 at 13:27, Craig Ringer  wrote:
> This was wrong for out-of-tree builds, updated.
>
> Still pending fix for PG_REGRESS path when invoked using
> $(prove_check) from PGXS

Looking further at this, I think a pgxs-specific patch to add support
for prove tests and isolation tests would be best, and can be done
separately. Possibly even snuck into a point release, but if not, at
least extension authors can invoke prove in their own Makefile if the
required modules get installed. It just needs an adaptation of the
command used in the $(prove_check) definition.

Extension makefiles run tests by listing the tests in REGRESS .
Something similar would make sense for isolation checks. For prove,
probably just a macro that can be invoked to enable prove tests in
pgxs makefiles.

I suggest that the above patches be applied to 9.6 and v10. Then for
v10 I'll look at enhancing PGXS to make it easier to use isolation
tests and prove tests; extensions that want to use them in 9.6 can
just add something like:


prove_check:
	rm -rf $(CURDIR)/tmp_check/log
	cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(shell pg_config --bindir):$$PATH" \
		PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' \
		PG_REGRESS='pg_regress' \
		$(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl

.PHONY: prove_check



to their Makefile, so it's not necessary to have PGXS support for
this for it to be useful in 9.6.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-13 Thread Michael Paquier
On Mon, Sep 12, 2016 at 5:06 AM, Kuntal Ghosh
 wrote:
>>+   void(*rm_checkConsistency) (XLogReaderState *record);
>>All your _checkConsistency functions share the same pattern, in short
>>they all use a for loop for each block, call each time
>>XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
>>You would get a reduction by a couple of hundreds of lines by having a
>>smarter refactoring. And to be honest, if I look at your patch what I
>>think is the correct way of doing things is to add to the rmgr not
>>this check consistency function, but just a pointer to the masking
>>function.
>
> +1. In rmgrlist, I've added a pointer to the masking function for each rmid.
> A common function named checkConsistency calls these masking functions
> based on their rmid and does comparison for each block.

The patch size is down from 79kB to 38kB. That gets better :)
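
For reference, the shape I have in mind is roughly the following; a
sketch only, with illustrative names (an rm_mask field does not exist
in the rmgr table today, and the page read/restore part is elided):

/* One masking callback per resource manager, stored in the rmgr table. */
typedef void (*rm_mask_fn) (char *page, BlockNumber blkno);

/* A single common routine replaces the per-rmgr checkConsistency ones. */
static void
checkConsistency(XLogReaderState *record)
{
	RmgrId		rmid = XLogRecGetRmid(record);
	int			block_id;

	for (block_id = 0; block_id <= record->max_block_id; block_id++)
	{
		RelFileNode rnode;
		ForkNumber	forknum;
		BlockNumber blkno;
		char		old_page[BLCKSZ];
		char		new_page[BLCKSZ];

		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
			continue;

		/* read the replayed page into new_page and restore the backup
		 * image into old_page here (elided) */

		/* mask fields that may legitimately differ, then compare */
		RmgrTable[rmid].rm_mask(old_page, blkno);
		RmgrTable[rmid].rm_mask(new_page, blkno);

		if (memcmp(old_page, new_page, BLCKSZ) != 0)
			elog(WARNING, "inconsistent page found for rmid %d, block %u",
				 rmid, blkno);
	}
}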

>>> - If WAL consistency check is enabled for a rmgrID, we always include
>>> the backup image in the WAL record.
>>
>>What happens if wal_consistency has different settings on a standby
>>and its master? If for example it is set to 'all' on the standby, and
>>'none' on the master, or vice-versa, how do things react? An update of
>>this parameter should be WAL-logged, no?
>
> If wal_consistency is enabled for a rmid, standby will always check whether
> backup image exists or not i.e. BKPBLOCK_HAS_IMAGE is set or not.
> (I guess Amit and Robert also suggested the same in the thread)
> Basically, BKPBLOCK_HAS_IMAGE is set if a block contains image and
> BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
> image is required during redo. When we decode a wal record, has_image
> flag of DecodedBkpBlock is set to BKPIMAGE_IS_REQUIRED_FOR_REDO.

Ah I see. But if we actually store the status in the record itself,
then at replay we don't care about the value of wal_consistency.
That's the same concept used by wal_compression. So shouldn't you have
more specific checks related to that in checkConsistency? You don't
actually need to check for anything in xlogreader.c; just check the
consistency if there is a need to do so, and otherwise do nothing.

> For now, I've kept this as a WARNING message to detect all inconsistencies
> at once. Once, the patch is finalized, I'll modify it as an ERROR message.

Or say FATAL. This way the server is taken down.

> Thoughts?

A couple of extra thoughts:
1) The routines of each rmgr are located in a dedicated file, for
example GIN stuff is in ginxlog.c, etc. It seems to me that it would
be better to move each masking function where it belongs instead of
centralizing them all. A couple of routines do need to be shared, so
I'd suggest putting them in a new file, like xlogmask.c (xlog because
this is now entirely part of WAL replay): the LSN, the hint-bit and
the other common masking routines.
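
For instance, the common part could boil down to helpers of this kind
(a sketch only; tuple-level hint bits would stay in the per-rmgr heap
masking function):

#include "postgres.h"
#include "storage/bufpage.h"

/* The page LSN legitimately differs between master and standby. */
static void
mask_page_lsn(Page page)
{
	PageHeader	phdr = (PageHeader) page;

	PageXLogRecPtrSet(phdr->pd_lsn, (XLogRecPtr) 0);
}

/* Page-header hint flags are set lazily and are not WAL-logged. */
static void
mask_page_hint_bits(Page page)
{
	PageHeader	phdr = (PageHeader) page;

	phdr->pd_flags &= ~(PD_PAGE_FULL | PD_HAS_FREE_LINES);
}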

2) Regarding page comparison:
+/*
+ * Compare the contents of two pages.
+ * If the two pages are exactly same, it returns BLCKSZ. Otherwise,
+ * it returns the location where the first mismatch has occurred.
+ */
+int
+comparePages(char *page1, char *page2)
We could just use memcpy() here. compareImages was useful to get a
clear image of what the inconsistencies were, but you don't do that
anymore.
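
That said, if the offset of the first mismatch is still wanted for
debugging, both can be combined cheaply: memcmp() for the common
identical case, and a byte walk only when a difference exists. A
sketch:

static int
compare_pages(const char *page1, const char *page2)
{
	int			i;

	if (memcmp(page1, page2, BLCKSZ) == 0)
		return BLCKSZ;			/* pages are identical */

	/* pages differ: find the offset of the first mismatching byte */
	for (i = 0; i < BLCKSZ; i++)
	{
		if (page1[i] != page2[i])
			break;
	}
	return i;
}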

3)
+static void checkConsistency(RmgrId rmid, XLogReaderState *record);
The RMGR id is part of the decoded record, so you could just remove
RmgrId from the list of arguments and simplify this interface.

4) If this patch still goes with the possibility to set up a list of
RMGRs, documentation is needed for that. I'd suggest first writing a
patch to explain what RMGRs are for WAL, then applying the WAL
consistency facility on top of it and linking wal_consistency to it.

5)
+   has_image = record->blocks[block_id].has_image;
+   record->blocks[block_id].has_image = true;
+   if (!RestoreBlockImage(record, block_id, old_page))
+   elog(ERROR, "failed to restore block image");
+   record->blocks[block_id].has_image = has_image;
Er, what? And BKPIMAGE_IS_REQUIRED_FOR_REDO?

6)
+   /*
+* Remember that, if WAL consistency check is enabled for
the current rmid,
+* we always include backup image with the WAL record.
But, during redo we
+* restore the backup block only if needs_backup is set.
+*/
+   if (needs_backup)
+   bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
This should use wal_consistency[rmid]?

7) This patch has zero documentation. Please add some. Any human being
on this list other than those who worked on the first versions
(Heikki, Simon and I?) is going to have a hard time reviewing this
patch in detail moving forward if there is no reference explaining what
this feature does for the user...

This patch is going in the right direction, but I don't think it's quite
ready for commit yet. So I am going to mark it as returned
with feedback if there