Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Ashutosh Bapat
On Wed, Sep 14, 2016 at 5:31 PM, Pavan Deolasee
 wrote:
>
> On Wed, Sep 14, 2016 at 3:46 PM, Pavan Deolasee 
> wrote:
>>
>>
>>
>>  lately I'm using the LLDB debugger (which probably does not have something
>> equivalent),
>
>
> And I was so clueless about lldb's powerful scripting interface. For
> example, you can write something like this in bms_utils.py:
>
> import lldb
>
> def print_bms_members (bms):
>     words = bms.GetChildMemberWithName("words")
>     nwords = int(bms.GetChildMemberWithName("nwords").GetValue())
>
>     ret = 'nwords = {0} bitmap: '.format(nwords)
>     for i in range(0, nwords):
>         # index with i (not 0) so that every word of the set is printed
>         ret += hex(int(words.GetChildAtIndex(i, lldb.eNoDynamicValues,
>                                              True).GetValue()))
>
>     return ret
>

Thanks a lot for digging into it.

> And then do this while attached to lldb debugger:
>
> Process 99659 stopped
> * thread #1: tid = 0x59ba69, 0x0001090b012f
> postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673,
> queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
> frame #0: 0x0001090b012f
> postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673
>670 int wordnum,
>671 bitnum;
>672
> -> 673 if (x < 0)
>674 elog(ERROR, "negative bitmapset member not allowed");
>675 if (a == NULL)
>676 return bms_make_singleton(x);
> (lldb) script
> Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
> >>> from bms_utils import *
> >>> bms = lldb.frame.FindVariable ("a")
> >>> print print_bms_members(bms)
> nwords = 1 bitmap: 0x200
>

I can get that kind of output in gdb with the commands
p *bms
p/x (or p/b) *bms->words@(bms->nwords)

But I can certainly extend the script you wrote above to print more
meaningful output, similar to outBitmapset().  That script would then be
specific to lldb, but GDB has a similar Python interface
(https://sourceware.org/gdb/wiki/PythonGdbTutorial), so I could probably
use the script above, with some modifications, with GDB as well.  A Python
script would also be easier to maintain than a patch that needs to be
applied and compiled.

That said, I am not sure whether every debugger supported on every platform
we support has these features.  Or maybe developers only work on platforms
that have such powerful debuggers, so it's OK.

Every debugger usually has a much easier way to call a function and print
its output, so having a function like the one in my patch makes things easy
for all debuggers, and maybe also for developers not familiar with Python.
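For illustration, a helper of that sort might look roughly like the sketch
below (the function name and the details are made up for this example, not
taken from the actual patch):

#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/bitmapset.h"

/*
 * Return a palloc'd, human-readable rendering of a Bitmapset, so that a
 * debugger session can simply do "call bms_to_cstring(set)".
 */
char *
bms_to_cstring(const Bitmapset *bms)
{
	StringInfoData buf;
	int			x = -1;

	initStringInfo(&buf);
	appendStringInfoChar(&buf, '(');
	/* bms_next_member() walks the set bits in ascending order */
	while ((x = bms_next_member(bms, x)) >= 0)
		appendStringInfo(&buf, " %d", x);
	appendStringInfoString(&buf, " )");
	return buf.data;
}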

>
> The complete API reference is available here
> http://lldb.llvm.org/python_reference/index.html
>
> Looks like an interesting SoC project to write useful lldb/gdb scripts to
> print internal structures for ease of debugging :-)
>

+1, if we can include something like that in the repository so as to
avoid every developer maintaining a script of his/her own.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Hash Indexes

2016-09-14 Thread Amit Kapila
On Thu, Sep 15, 2016 at 4:44 AM, Jeff Janes  wrote:
> On Tue, May 10, 2016 at 5:09 AM, Amit Kapila 
> wrote:
>>
>>
>>
>> Although, I don't think it is a very good idea to take any performance
>> data with WIP patch, still I couldn't resist myself from doing so and below
>> are the performance numbers.  To get the performance data, I have dropped
>> the primary key constraint on pgbench_accounts and created a hash index on
>> aid column as below.
>>
>> alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
>> create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);
>
>
>
> To be rigorously fair, you should probably replace the btree primary key
> with a non-unique btree index and use that in the btree comparison case.  I
> don't know how much difference that would make, probably none at all for a
> read-only case.
>
>>
>>
>>
>> Below data is for read-only pgbench test and is a median of 3 5-min runs.
>> The performance tests are executed on a power-8 m/c.
>
>
> With pgbench -S where everything fits in shared_buffers and the number of
> cores I have at my disposal, I am mostly benchmarking interprocess
> communication between pgbench and the backend.  I am impressed that you can
> detect any difference at all.
>
> For this type of thing, I like to create a server side function for use in
> benchmarking:
>
> create or replace function pgbench_query(scale integer,size integer)
> RETURNS integer AS $$
> DECLARE sum integer default 0;
> amount integer;
> account_id integer;
> BEGIN FOR i IN 1..size LOOP
>account_id=1+floor(random()*scale);
>SELECT abalance into strict amount FROM pgbench_accounts
>   WHERE aid = account_id;
>sum := sum + amount;
> END LOOP;
> return sum;
> END $$ LANGUAGE plpgsql;
>
> And then run using a command like this:
>
> pgbench -f <(echo 'select pgbench_query(40,1000)')  -c$j -j$j -T 300
>
> Where the first argument ('40', here) must be manually set to the same value
> as the scale-factor.
>
> With 8 cores and 8 clients, the values I get are, for btree, hash-head,
> hash-concurrent, hash-concurrent-cache, respectively:
>
> 598.2
> 577.4
> 668.7
> 664.6
>
> (each transaction involves 1000 select statements)
>
> So I do see that the concurrency patch is quite an improvement.  The cache
> patch does not produce a further improvement, which was somewhat surprising
> to me (I thought that that patch would really shine in a read-write
> workload, but I expected at least improvement in read only)
>

To see the benefit of the cache-metapage patch, you might want to test with
more clients than the number of cores; at least that is what the data by
Mithun [1] indicates.  Or perhaps test on a somewhat larger machine.


[1] - 
https://www.postgresql.org/message-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-14 Thread Amit Kapila
On Thu, Sep 15, 2016 at 4:04 AM, Jeff Janes  wrote:
> On Tue, Sep 13, 2016 at 9:31 AM, Jeff Janes  wrote:
>>
>> ===
>>
>> +Vacuum acquires cleanup lock on bucket to remove the dead tuples and or
>> tuples
>> +that are moved due to split.  The need for cleanup lock to remove dead
>> tuples
>> +is to ensure that scans' returns correct results.  Scan that returns
>> multiple
>> +tuples from the same bucket page always restart the scan from the
>> previous
>> +offset number from which it has returned last tuple.
>>
>> Perhaps it would be better to teach scans to restart anywhere on the page,
>> than to force more cleanup locks to be taken?
>
>
> Commenting on one of my own questions:
>
> This won't work when the vacuum removes the tuple which an existing scan is
> currently examining and thus will be used to re-find it's position when it
> realizes it is not visible and so takes up the scan again.
>
> The index tuples in a page are stored sorted just by hash value, not by the
> combination of (hash value, tid).  If they were sorted by both, we could
> re-find our position even if the tuple had been removed, because we would
> know to start at the slot adjacent to where the missing tuple would be were
> it not removed. But unless we are willing to break pg_upgrade, there is no
> feasible way to change that now.
>

I think it is possible without breaking pg_upgrade, if we match all items
of a page at once (and save them as a local copy), rather than matching
item-by-item as we do now.  We already do something similar for btree; see
the explanation of BTScanPosItem and BTScanPosData in nbtree.h.
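For illustration, a hash-index analogue of those btree structures might look
roughly like the sketch below (the names are invented for this example and
are not taken from any posted patch; the member types are the ones used by
the existing index AM headers):

typedef struct HashScanPosItem		/* what we remember about each match */
{
	ItemPointerData heapTid;		/* TID of referenced heap item */
	OffsetNumber	indexOffset;	/* index item's location within page */
} HashScanPosItem;

typedef struct HashScanPosData
{
	Buffer		buf;				/* currently pinned bucket page, if any */
	BlockNumber	currPage;			/* block number of that page */

	/* matches copied from the page while the pin/lock was held */
	HashScanPosItem items[MaxIndexTuplesPerPage];

	int			firstItem;			/* first valid index in items[] */
	int			lastItem;			/* last valid index in items[] */
	int			itemIndex;			/* current index in items[] */
} HashScanPosData;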


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Re: [HACKERS] Error running custom plugin: “output plugins have to declare the _PG_output_plugin_init symbol”

2016-09-14 Thread Ashutosh Bapat
On Wed, Sep 14, 2016 at 4:03 PM, valeriof  wrote:
> Hi, I'm kind of new to Postgres and I'm trying to create a custom output
> plugin for logical replication (Postgres is 9.5.4 version and I'm building
> the project from a Windows 8/64 bit machine - same machine where the db is
> installed).
> I started from the code of the sample test_decoding, and I was trying to
> simply rebuild it under a new name and install it into Postgres to see if
> the module works. Once done, I will start modifying the code.
>
> My project is built in Visual Studio 2013 and the only steps I took were to
> copy the built assembly under Postgres lib folder. When I run the command:
>
> postgres=# SELECT * FROM
> pg_create_logical_replication_slot('regression_slot', 'my_decoding');
>
> I get an error saying that "output plugins have to declare the
> _PG_output_plugin_init symbol".
>

The error comes from LoadOutputPlugin(), when the call to
load_external_function() fails.  load_external_function() tries to look up
the given function in the given file, and the file name is the same as the
plugin name.  So it may be that there is no my_decoding library in the
installation.  If the test_decoding plugin is working in your case, please
check whether you can find the my_decoding library in the same location.
If not, that's the problem.

> In my code, I have the function declared as extern:
>
> extern void _PG_init(void);
> extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);
>
> and the body of the function is defined somewhere below it.
>
> The actual code is just a copy of the test_decoding sample:
> https://github.com/postgres/postgres/blob/REL9_5_STABLE/contrib/test_decoding/test_decoding.c
>
> I'm not sure if the deployment process is ok (just copying the dll) or if
> there is some other step to take. Can anyone shed some light?
>

It's hard to tell what's wrong exactly without seeing the changes you have
made.  But it looks like, while copying the test_decoding directory, you
forgot to replace some instance(s) of test_decoding with my_decoding.
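For reference, a minimal sketch of what such a module has to export is shown
below; the PGDLLEXPORT markers are an assumption for a Windows/MSVC build,
and the my_decoding_* callback names are hypothetical:

#include "postgres.h"
#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

/* callbacks, defined later in the module (bodies omitted in this sketch) */
static void my_decoding_startup(LogicalDecodingContext *ctx,
								OutputPluginOptions *opt, bool is_init);
static void my_decoding_begin(LogicalDecodingContext *ctx,
							  ReorderBufferTXN *txn);
static void my_decoding_change(LogicalDecodingContext *ctx,
							   ReorderBufferTXN *txn, Relation relation,
							   ReorderBufferChange *change);
static void my_decoding_commit(LogicalDecodingContext *ctx,
							   ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
static void my_decoding_shutdown(LogicalDecodingContext *ctx);

extern PGDLLEXPORT void _PG_init(void);
extern PGDLLEXPORT void _PG_output_plugin_init(OutputPluginCallbacks *cb);

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->startup_cb = my_decoding_startup;
	cb->begin_cb = my_decoding_begin;
	cb->change_cb = my_decoding_change;
	cb->commit_cb = my_decoding_commit;
	cb->shutdown_cb = my_decoding_shutdown;
}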

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] make async slave to wait for lsn to be replayed

2016-09-14 Thread Thomas Munro
On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov
 wrote:
> Hi hackers,
>
> Few days earlier I've finished my work on WAITLSN statement utility, so I’d
> like to share it.
> [...]
> Your feedback is welcome!
>
> [waitlsn_10dev.patch]

Hi Ivan,

Thanks for working on this.  Here are some general thoughts on the
feature, and an initial review.

+1 for this feature.  Explicitly waiting for a given commit to be
applied is one of several approaches to achieve "causal consistency"
for reads on replica nodes, and I think it will be very useful if
combined with a convenient way to get the values to wait for when you
run COMMIT.  This could be used either by applications directly, or by
middleware that somehow keeps track of dependencies between
transactions and inserts waits.

I liked the way Heikki Linnakangas imagined this feature[1]:

  BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;

... or perhaps it could be spelled like this:

  BEGIN [isolation stuff] WAIT FOR COMMIT TOKEN <token> TIMEOUT <timeout>;

That allows waiting only at the start of a transaction, whereas your
idea of making a utility command would allow a single READ COMMITTED
transaction to wait multiple times for transactions it has heard about
through side channels, which may be useful.  Perhaps we could support
the same syntax in a stand alone statement inside a transaction OR as
part of a BEGIN ... statement.  Being able to do it as part of BEGIN
means that you can use this feature for single-snapshot transactions,
ie REPEATABLE READ and SERIALIZABLE (of course you can't use
SERIALIZABLE on hot standbys yet but that'll be fixed one day).
Otherwise you'd be waiting for the LSN in the middle of your
transaction but not be able to see the result because you don't take a
new snapshot.  Or maybe it's enough to use a standalone WAIT ...
statement inside a REPEATABLE READ or SERIALIZABLE transaction as long
as it's the first statement, and should be an error to do so any time
later?

I think working in terms of LSNs or XIDs explicitly is not a good
idea: encouraging clients to think in terms of anything other than
opaque 'commit tokens' seems like a bad idea because it limits future
changes.  For example, there is on-going discussion about introducing
CSNs (commit sequence numbers), and there are some related concepts
lurking in the SSI code; maybe we'd want to use those one day.  Do you
think it would make sense to have a concept of a commit token that is
a non-analysable string as far as clients are concerned, so that
clients are not encouraged to do anything at all with them except use
them in a WAIT FOR COMMIT TOKEN <token> statement?

INITIAL FEEDBACK ON THE PATCH

I didn't get as far as testing or detailed review because it has some
obvious bugs and compiler warnings which I figured we should talk
about first, and I also have some higher level questions about the
design.

gram.y:12882:15: error: assignment makes pointer from integer without
a cast [-Werror=int-conversion]
  n->delay = $3;

It looks like struct WaitLSNStmt accidentally has 'delay' as a pointer
to int.  Perhaps you want an int?  Maybe it would be useful to include
the units (millisecond, ms) in the name?

waitlsn.c: In function 'WLDisownLatch':
waitlsn.c:82:2: error: suggest parentheses around assignment used as
truth value [-Werror=parentheses]
  if (MyBackendId = state->backend_maxid)
  ^~

Pretty sure you want == here.

waitlsn.c: In function 'WaitLSNUtility':
waitlsn.c:153:17: error: initialization makes integer from pointer
without a cast [-Werror=int-conversion]
  int   tdelay = delay;
 ^

Another place where I think you wanted an int but used a pointer to
int?  To fix that warning you need tdelay = *delay, but I think delay
should really not be taken by pointer at all.

@@ -6922,6 +6923,11 @@ StartupXLOG(void)
+ /*
+ * After update lastReplayedEndRecPtr set Latches in SHMEM array
+ */
+ WaitLSNSetLatch();
+

I think you should try to do this only after commit records are
replayed, not after every record.  Only commit records can make
transactions visible, and the raison d'être for this feature is to let
users wait for transactions they know about to become visible.  You
probably can't do it directly in xact_redo_commit though, because at
that point XLogCtl->lastReplayedEndRecPtr hasn't been updated yet so a
backend that wakes up might not see that it has advanced and go back
to sleep.  It is updated in the StartupXLOG loop after the redo
function runs.  That is the reason why WalRcvForceReply() is called
from there rather than in xact_redo_commit, to implement remote_apply
for replication.  Perhaps you need something similar?

+ tdelay -= (GetCurrentTimestamp() - timer);

You can't do arithmetic with TimestampTz like this.  Depending on
configure option --disable-integer-datetimes (which controls macro
HAVE_INT64_TIMESTAMP), it may be a floating point number of seconds
since the epoch, or an integer number of microseconds since the epoch.
It looks 
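(To illustrate the point about TimestampTz arithmetic: the remaining delay
can be computed with the existing timestamp helpers, roughly as in the
sketch below.  The variable names here are assumptions, not taken from the
patch.)

	long		secs;
	int			microsecs;
	int			elapsed_ms;

	/* works for both the float and the int64 timestamp representations */
	TimestampDifference(start_time, GetCurrentTimestamp(),
						&secs, &microsecs);
	elapsed_ms = secs * 1000 + microsecs / 1000;
	tdelay = (elapsed_ms < delay_ms) ? delay_ms - elapsed_ms : 0;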

Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Craig Ringer
On 14 September 2016 at 04:56, Petr Jelinek  wrote:

> Not sure what you mean by negotiation. Why would that be needed? You know
> the server version when you connect, and once you know that you also know
> what capabilities that version of Postgres has. If you send an unrecognized
> option you get a corresponding error.

Right, because we can rely on the server version = the logical
replication version now.

All good.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH v2] Add overflow checks to money type input function

2016-09-14 Thread Peter Eisentraut
On 9/9/16 3:19 AM, Fabien COELHO wrote:
> 
>> I have updated the patch with additional tests and comments per your
>> review.  Final(?) version attached.
> 
> Applied on head, make check ok. No more comments on the code which does 
> what it says. I'm fine with this patch.

Pushed, thanks.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

2016-09-14 Thread Michael Paquier
On Thu, Sep 15, 2016 at 9:44 AM, Peter Eisentraut
 wrote:
> On 9/12/16 11:16 PM, Michael Paquier wrote:
>>> I don't think tar file output in pg_basebackup needs to be fsynced.
>>> > It's just a backup file like what pg_dump produces, and we don't fsync
>>> > that either.  The point of this change is to leave a data directory in
>>> > a synced state equivalent to what initdb leaves behind.
>> Here I don't agree. The point of the patch is to make sure that what
>> gets generated by pg_basebackup is consistent on disk, for all the
>> formats. It seems weird to me that we could say that the plain format
>> makes things consistent while the tar format may cause some files to
>> be lost in case of power outage. The docs make it clear I think:
>> +By default, pg_basebackup will wait for all files
>> +to be written safely to disk.
>> But perhaps this should directly mention that this is done for all the 
>> formats?
>
> That doesn't really explain why we fsync.

Data durability, particularly on ext4, as was discussed a couple of months
back [1].  In case of a crash it would be perfectly possible to lose files,
hence we need to be sure that the files themselves are flushed, as well as
their parent directory, to prevent any problems.  I think that we should
protect users' backups as much as we can, within the range we can.
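(As a rough sketch of the pattern being described, independent of the actual
patch: flush the file itself and then its parent directory, so that the
directory entry also survives a crash.  Error handling is omitted here for
brevity.)

#include <fcntl.h>
#include <unistd.h>

static void
fsync_file_and_parent_dir(const char *path, const char *parent_dir)
{
	int			fd;

	fd = open(path, O_RDONLY);
	fsync(fd);
	close(fd);

	fd = open(parent_dir, O_RDONLY);
	fsync(fd);					/* may fail harmlessly on some platforms */
	close(fd);
}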

> If we think that all
> "important" files should be fsynced, why aren't we doing it in pg_dump?

pg_dump should do it where it can, see thread [2].  I am tackling these
problems one at a time, and that is also a reason why I would like us to
have a common set of routines in src/common or src/fe_utils to help improve
this handling.

> Or psql, or server-side copy?  Similarly, there is no straightforward
> mechanism by which you can unpack the tar file generated by
> pg_basebackup and get the unpacked directory fsynced properly.  (I
> suppose initdb -S should be recommended.)

Yes, those are cases that cannot be covered.  Imagine for example
pg_basebackup's tar output or pg_dump data sent to stdout.  There is nothing
we can actually do for those cases.  However, what we can do is provide a
set of options that makes it possible for users to get consistent backups.

[1] Silent data loss on ext4:
https://www.postgresql.org/message-id/56583bdd.9060...@2ndquadrant.com

[2] Data durability:
https://www.postgresql.org/message-id/20160327233033.gd20...@awork2.anarazel.de
-- 
Michael




Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Amit Langote
On 2016/09/15 0:04, Tom Lane wrote:
> Alvaro Herrera  writes:
>> I don't understand.  Why don't you just use "call pprint(the bitmapset)"
>> in the debugger?
> 
> Bitmapsets aren't Nodes, so pprint doesn't work directly on them.
> I usually find that I can pprint some node containing the value(s)
> I'm interested in, but maybe that isn't working for Ashutosh's
> particular case.

There are many loose (ie, not inside any Node) Relids variables within the
optimizer code.  Perhaps Ashutosh ended up needing to look at those a lot.

Thanks,
Amit






Re: [HACKERS] pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

2016-09-14 Thread Peter Eisentraut
On 9/12/16 11:16 PM, Michael Paquier wrote:
>> I don't think tar file output in pg_basebackup needs to be fsynced.
>> > It's just a backup file like what pg_dump produces, and we don't fsync
>> > that either.  The point of this change is to leave a data directory in
>> > a synced state equivalent to what initdb leaves behind.
> Here I don't agree. The point of the patch is to make sure that what
> gets generated by pg_basebackup is consistent on disk, for all the
> formats. It seems weird to me that we could say that the plain format
> makes things consistent while the tar format may cause some files to
> be lost in case of power outage. The docs make it clear I think:
> +By default, pg_basebackup will wait for all files
> +to be written safely to disk.
> But perhaps this should directly mention that this is done for all the 
> formats?

That doesn't really explain why we fsync.  If we think that all
"important" files should be fsynced, why aren't we doing it in pg_dump?
Or psql, or server-side copy?  Similarly, there is no straightforward
mechanism by which you can unpack the tar file generated by
pg_basebackup and get the unpacked directory fsynced properly.  (I
suppose initdb -S should be recommended.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_basebackup wish list

2016-09-14 Thread Michael Paquier
On Thu, Sep 15, 2016 at 9:26 AM, Peter Eisentraut
 wrote:
> On 9/13/16 7:24 PM, Michael Paquier wrote:
>> PostgresNode.pm had better use the new --noclean option in its calls,
>> the new default is not useful for debugging.
>
> We don't do it for initdb either.  Is that a problem?

Right, but in case of a failure it would prove useful for debugging if we
used it.
-- 
Michael




Re: [HACKERS] pg_basebackup wish list

2016-09-14 Thread Peter Eisentraut
On 9/13/16 7:24 PM, Michael Paquier wrote:
> PostgresNode.pm had better use the new --noclean option in its calls,
> the new default is not useful for debugging.

We don't do it for initdb either.  Is that a problem?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] OpenSSL 1.1 breaks configure and more

2016-09-14 Thread Andreas Karlsson

On 09/15/2016 02:03 AM, Andreas Karlsson wrote:

On 09/12/2016 06:51 PM, Heikki Linnakangas wrote:

Changes since last version:

* Added more error checks to the my_BIO_s_socket() function. Check for
NULL result from malloc(). Check the return code of BIO_meth_set_*()
functions; looking at OpenSSL sources, they always succeed, but all the
test/example programs that come with OpenSSL do check them.

* Use BIO_get_new_index() to get the index number for the wrapper BIO.

* Also call BIO_meth_set_puts(). It was missing in previous patch
versions.

* Fixed src/test/ssl test suite to also work with OpenSSL 1.1.0.

* Changed all references (in existing code) to SSLEAY_VERSION_NUMBER
into OPENSSL_VERSION_NUMBER, for consistency.

* Squashed all into one patch.

I intend to apply this to all supported branches, so please have a look!
This is now against REL9_6_STABLE, but there should be little difference
between branches in the code that this touches.


This patch no longer seems to apply to head after the removal of support
for OpenSSL 0.9.6. Is that intentional?


Never mind. I just failed at reading.

Now for a review:

It looks generally good but I think I saw one error. In 
fe-secure-openssl.c your code still calls SSL_library_init() in OpenSSL 
1.1. I think it should be enough to just call 
OPENSSL_init_ssl(OPENSSL_INIT_LOAD_CONFIG, NULL) like you do in be-secure.
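(For reference, the kind of version-conditional initialization being
discussed looks roughly like the sketch below; this is an illustration, not
the patch itself.)

#include <openssl/conf.h>
#include <openssl/err.h>
#include <openssl/ssl.h>

static void
initialize_ssl_library(void)
{
#if OPENSSL_VERSION_NUMBER >= 0x10100000L
	/* OpenSSL 1.1.0 and later: a single call initializes the library */
	OPENSSL_init_ssl(OPENSSL_INIT_LOAD_CONFIG, NULL);
#else
	/* older OpenSSL: the classic three-step initialization */
	OPENSSL_config(NULL);
	SSL_library_init();
	SSL_load_error_strings();
#endif
}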


Andreas




Re: [HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-14 Thread Kouhei Kaigai
> On Wed, Sep 14, 2016 at 12:13 AM, Kouhei Kaigai  wrote:
> > It looks to me pg_buffercache tries to allocate more than 1GB using
> > palloc(), when shared_buffers is more than 256GB.
> >
> > # show shared_buffers ;
> >  shared_buffers
> > 
> >  280GB
> > (1 row)
> >
> > # SELECT buffers, d.datname, coalesce(c.relname, '???')
> > FROM (SELECT count(*) buffers, reldatabase, relfilenode
> > FROM pg_buffercache group by reldatabase, relfilenode) b
> >LEFT JOIN pg_database d ON d.oid = b.reldatabase
> >LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
> >  WHERE datname = current_database())
> >AND b.relfilenode = pg_relation_filenode(c.oid)
> >ORDER BY buffers desc;
> > ERROR:  invalid memory alloc request size 1174405120
> >
> > It is a situation to use MemoryContextAllocHuge(), instead of palloc().
> > Also, it may need a back patching?
> 
> I guess so.  Although it's not very desirable for it to use that much
> memory, I suppose if you have a terabyte of shared_buffers you
> probably have 4GB of memory on top of that to show what they contain.
>
Exactly.  I found this problem when someone asked me why
shared_buffers=280GB was slower than shared_buffers=128MB for scanning a
350GB table.  As I expected, most of the shared buffers were not in use, and
the setting also reduced the amount of free memory usable for the page
cache.
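(For illustration, the kind of change being suggested might look roughly
like this, following the existing allocation in pg_buffercache_pages.c; a
sketch, not necessarily the exact committed fix:)

	/* use the "huge" variant so the request may exceed 1 GB */
	fctx->record = (BufferCachePagesRec *)
		MemoryContextAllocHuge(CurrentMemoryContext,
							   sizeof(BufferCachePagesRec) * NBuffers);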

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] OpenSSL 1.1 breaks configure and more

2016-09-14 Thread Andreas Karlsson

On 09/12/2016 06:51 PM, Heikki Linnakangas wrote:

Changes since last version:

* Added more error checks to the my_BIO_s_socket() function. Check for
NULL result from malloc(). Check the return code of BIO_meth_set_*()
functions; looking at OpenSSL sources, they always succeed, but all the
test/example programs that come with OpenSSL do check them.

* Use BIO_get_new_index() to get the index number for the wrapper BIO.

* Also call BIO_meth_set_puts(). It was missing in previous patch versions.

* Fixed src/test/ssl test suite to also work with OpenSSL 1.1.0.

* Changed all references (in existing code) to SSLEAY_VERSION_NUMBER
into OPENSSL_VERSION_NUMBER, for consistency.

* Squashed all into one patch.

I intend to apply this to all supported branches, so please have a look!
This is now against REL9_6_STABLE, but there should be little difference
between branches in the code that this touches.


This patch no longer seems to apply to head after the removal of support
for OpenSSL 0.9.6. Is that intentional?


Andreas




Re: [HACKERS] Tracking timezone abbreviation removals in the IANA tz database

2016-09-14 Thread Tom Lane
Robert Haas  writes:
> I suggest that if you (or someone else who understands it, if there is
> any such person) were willing to either improve the existing
> documentation or maybe even write a whole new chapter on how we do
> time zone handling, it would be most welcome.

Well, I'll put that on my to-do list, but it's a very long list.
And TBH, I thought the SGML docs around this were adequate already,
so I might not be in the best position to rewrite them to satisfy you.

regards, tom lane




Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-09-14 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-12 19:35:22 -0400, Tom Lane wrote:
>> Anyway I'll draft a prototype and then we can compare.

> Ok, cool.

Here's a draft patch that is just meant to investigate what the planner
changes might look like if we do it in the pipelined-result way.
Accordingly, I didn't touch the executor, but just had it emit regular
Result nodes for SRF-execution steps.  However, the SRFs are all
guaranteed to appear at top level of their respective tlists, so that
those Results could be replaced with something that works like
nodeFunctionscan.

A difficulty with this restriction is that if you have a query like
"select f1, generate_series(1,2) / 10 from tab" then you end up with both
a SRF-executing Result and a separate scalar-projection Result above it,
because the division-by-ten has to happen in a separate plan level.
The planner's notions about the cost of Result make it think that this is
quite expensive --- mainly because the upper Result will be iterated once
per SRF output row, so that you get hit with cpu_tuple_cost per output row.
And that in turn leads it to change plans in one or two cases in the
regression tests.  Maybe that's fine.  I'm worried though that it's right
that this will be unduly expensive.  So I'm kind of tempted to define the
SRF-executing node as acting more like, say, Agg or WindowFunc, in that
it has a list of SRFs to execute and then it has the ability to project a
scalar tlist on top of those results.  That would likely save some cycles
at execution, and it would also eliminate a few of the planner warts seen
below, like the rule about not pushing a new scalar tlist down onto a
SRF-executing Result.  I'd have to rewrite split_pathtarget_at_srfs(),
because it'd be implementing quite different rules about how to refactor
targetlists, but that's not a big problem.

On the whole I'm pretty pleased with this approach, at least from the
point of view of the planner.  The net addition of planner code is
smaller than what you had, and though I'm no doubt biased, I think this
version is much cleaner.  Also, though this patch doesn't address exactly
how we might do it, it's fairly clear that it'd be possible to allow
FDWs and CustomScans to implement SRF execution, eg pushing a SRF down to
a foreign server, in a reasonably straightforward extension of the
existing upper-pathification hooks.  If we go with the lateral function
RTE approach, that's going to be somewhere between hard and impossible.

So I think we should continue investigating this way of doing things.
I'll try to take a look at the executor end of it tomorrow.  However
I'm leaving Friday for a week's vacation, and may not have anything to
show before that.

regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7e092d7..9052273 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outProjectionPath(StringInfo str, const
*** 1817,1822 
--- 1817,1823 
  
  	WRITE_NODE_FIELD(subpath);
  	WRITE_BOOL_FIELD(dummypp);
+ 	WRITE_BOOL_FIELD(srfpp);
  }
  
  static void
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 47158f6..7c59c3d 100644
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*** create_projection_plan(PlannerInfo *root
*** 1421,1428 
  	Plan	   *subplan;
  	List	   *tlist;
  
! 	/* Since we intend to project, we don't need to constrain child tlist */
! 	subplan = create_plan_recurse(root, best_path->subpath, 0);
  
  	tlist = build_path_tlist(root, &best_path->path);
  
--- 1421,1441 
  	Plan	   *subplan;
  	List	   *tlist;
  
! 	/*
! 	 * XXX Possibly-temporary hack: if the subpath is a dummy ResultPath,
! 	 * don't bother with it, just make a Result with no input.  This avoids an
! 	 * extra Result plan node when doing "SELECT srf()".  Depending on what we
! 	 * decide about the desired plan structure for SRF-expanding nodes, this
! 	 * optimization might have to go away, and in any case it'll probably look
! 	 * a good bit different.
! 	 */
! 	if (IsA(best_path->subpath, ResultPath) &&
! 		((ResultPath *) best_path->subpath)->path.pathtarget->exprs == NIL &&
! 		((ResultPath *) best_path->subpath)->quals == NIL)
! 		subplan = NULL;
! 	else
! 		/* Since we intend to project, we don't need to constrain child tlist */
! 		subplan = create_plan_recurse(root, best_path->subpath, 0);
  
  	tlist = build_path_tlist(root, &best_path->path);
  
*** create_projection_plan(PlannerInfo *root
*** 1441,1448 
  	 * creation, but that would add expense to creating Paths we might end up
  	 * not using.)
  	 */
! 	if (is_projection_capable_path(best_path->subpath) ||
! 		tlist_same_exprs(tlist, subplan->targetlist))
  	{
  		/* Don't need a separate Result, just assign tlist to subplan */
  		plan = subplan;
--- 1454,1462 
  	 * creation, but that would 

Re: [HACKERS] Tracking timezone abbreviation removals in the IANA tz database

2016-09-14 Thread Robert Haas
On Fri, Sep 2, 2016 at 8:55 AM, Tom Lane  wrote:
> I had thought that this wouldn't really affect us because PG's
> interpretations of zone abbreviations are driven by the info in
> timezone/tznames/ rather than the IANA files in timezone/data/.
> I forgot however that the "dynamic abbreviation" code relies on
> being able to find matching abbreviations in the IANA data.
> For example, timezone/tznames/Default lists
>
> NOVST   Asia/Novosibirsk  # Novosibirsk Summer Time (obsolete)
>
> but that zone abbreviation now fails entirely:
>
> # select '2016-09-02 08:00:00 NOVST'::timestamptz;
> ERROR:  time zone abbreviation "novst" is not used in time zone 
> "Asia/Novosibirsk"
>
> (This is also the cause of bug #14307.)
>
> So the question is what to do about it.
>
> An easy answer is to just delete such entries from the timezone/tznames/
> files altogether.  If we believe the IANA crew's assumption that nobody
> uses these abbreviations in the real world, then that seems like it
> removes useless maintenance overhead for little cost.  I'm a little
> worried though that somebody might have followed IANA's lead and actually
> started using these abbreviations, in which case we'd get complaints.
>
> Another possibility is to keep these entries but get rid of the dynamic
> zone aliases, reducing them to plain numeric UTC offsets.  Probably the
> value to use would be whatever was shown as the most recent active value
> in the IANA list ... although that's open to interpretation.  For
> instance, according to the above we'd likely define NOVT as UTC+7,
> but then logically NOVST ought to be UTC+8, which doesn't match up
> with the fact that when it actually last appeared in the IANA data
> it meant UTC+7.  So that sounds like it'd be a bit of a mess involving
> some judgment calls.
>
> In the end, part of the reason we've got these files is so that users
> can make their own decisions about abbreviation meanings.  So the
> ultimate answer to any complaints is going to be "if you think NOVT
> means thus-and-such then put in an entry that says so".
>
> So I'm leaning to the just-remove-it answer for any deleted abbreviation
> that relies on a dynamic definition.  Names that never had more than one
> UTC offset can remain in the tznames list.
>
> Comments?

I tried to understand these emails today and mostly failed. I think
this stuff needs to be better documented.
src/timezones/tznames/README pretty much presumes that you more or
less know what's going on:

# This directory contains files with timezone sets for PostgreSQL.  The problem
# is that time zone abbreviations are not unique throughout the world and you
# might find out that a time zone abbreviation in the `Default' set collides
# with the one you wanted to use.  This can be fixed by selecting a timezone
# set that defines the abbreviation the way you want it.  There might already
# be a file here that serves your needs.  If not, you can create your own.

Of course, it doesn't bother to define the term 'timezone set', a
phrase that is not used in any other file in the source tree.
Eventually, after some poking around, I figured out that this is
described in the documentation under "B.3. Date/Time Configuration
Files", but I didn't notice that when I searched the documentation for
it.  I only figured it out after I actually looked at the contents of
"src/timezone/tznames/Default" and saw that it recommended looking at
the "Date/Time Support", which I had done, but which I gave up on
doing pretty quickly because there was no section whose title included
the phrase "time zone".

Once I read that section, I ran across this:

# The offset is the equivalent offset in seconds from UTC, positive being east
# from Greenwich and negative being west. For example, -18000 would be five
# hours west of Greenwich, or North American east coast standard time. D
# indicates that the zone name represents local daylight-savings time rather
# than standard time. Alternatively, a time_zone_name can be given, in which
# case that time zone definition is consulted, and the abbreviation's meaning
# in that zone is used.

I dimly understand that "the abbreviation's meaning in that zone" is
referring to an EST vs. EDT type of distinction, but a novice reader
of this documentation will probably not understand that from this
language.

Similarly, your email mentions "dynamic zone aliases" but:

[rhaas pgsql]$ git grep -i 'dynamic zone alias'
[rhaas pgsql]$ git grep -i 'dynamic zone'
[rhaas pgsql]$ git grep -i 'zone alias'
[rhaas pgsql]$

I'm guessing that's referring to the same thing as the above
documentation paragraph, wherein a "time_zone_name" is given, but it
doesn't really refer to the time zone name, but rather the DST or
non-DST configuration of that time zone.  But I'm not sure I'm right
about that.

Another thing that's not very well-documented is the use of our own
tzdata vs. the system's tzdata.  There is a paragraph on it in the
installation 

Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Petr Jelinek

On 14/09/16 21:53, Andres Freund wrote:

Hi,

On 2016-09-14 21:17:42 +0200, Petr Jelinek wrote:

+/*
+ * Gather Relations based o provided by RangeVar list.
+ * The gathered tables are locked in access share lock mode.
+ */


Why access share? Shouldn't we make this ShareUpdateExclusive or
similar, to prevent schema changes?



Hm, I thought AccessShare would be enough to prevent schema changes that
matter to us (which is basically just drop afaik).


Doesn't e.g. dropping an index matter as well?



Drop of primary key matters I guess.




+   if (strcmp($1, "replicate_insert") == 0)
+       $$ = makeDefElem("replicate_insert",
+                        (Node *)makeInteger(TRUE), @1);
+   else if (strcmp($1, "noreplicate_insert") == 0)
+       $$ = makeDefElem("replicate_insert",
+                        (Node *)makeInteger(FALSE), @1);
+   else if (strcmp($1, "replicate_update") == 0)
+       $$ = makeDefElem("replicate_update",
+                        (Node *)makeInteger(TRUE), @1);
+   else if (strcmp($1, "noreplicate_update") == 0)
+       $$ = makeDefElem("replicate_update",
+                        (Node *)makeInteger(FALSE), @1);
+   else if (strcmp($1, "replicate_delete") == 0)
+       $$ = makeDefElem("replicate_delete",
+                        (Node *)makeInteger(TRUE), @1);
+   else if (strcmp($1, "noreplicate_delete") == 0)
+       $$ = makeDefElem("replicate_delete",
+                        (Node *)makeInteger(FALSE), @1);
+   else
+       ereport(ERROR,
+               (errcode(ERRCODE_SYNTAX_ERROR),
+                errmsg("unrecognized publication option \"%s\"", $1),
+                parser_errposition(@1)));
+   }
+   ;


I'm kind of inclined to do this checking at execution (or transform)
time instead.  That allows extensions to add options, and handle them in
utility hooks.



That's an interesting point.  I prefer the parsing to be done in gram.y,
but it might be worth moving it for extensibility, although there are so far
other barriers to that.


Citus uses the lack of such a check for COPY to implement copy over its
distributed tables, for example. So there's some benefit.



Yeah, I am not saying that I am fundamentally against it; I am just saying
it probably won't help all that much.






+   check_subscription_permissions();
+
+   rel = heap_open(SubscriptionRelationId, RowExclusiveLock);
+
+   /* Check if name is used */
+   subid = GetSysCacheOid2(SUBSCRIPTIONNAME, MyDatabaseId,
+                           CStringGetDatum(stmt->subname));
+   if (OidIsValid(subid))
+   {
+   ereport(ERROR,
+   (errcode(ERRCODE_DUPLICATE_OBJECT),
+errmsg("subscription \"%s\" already exists",
+   stmt->subname)));
+   }
+
+   /* Parse and check options. */
+   parse_subscription_options(stmt->options, _given, , , );
+
+   /* TODO: improve error messages here. */
+   if (conninfo == NULL)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("connection not specified")));


Probably also makes sense to parse the conninfo here, to verify it looks
sane.  Although that's fairly annoying to do, because the relevant code
is in libpq :(



Well, the connection is eventually used (in later patches), so maybe that's
not a problem.


Well, it's nicer if it's immediately parsed, before doing complex and
expensive stuff, especially if that happens outside of the transaction.



Maybe, it's not too hard to add another function to libpqwalreceiver I 
guess.







diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 65230e2..f3d54c8 100644
--- a/src/backend/nodes/copyfuncs.c
+++ 

Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Petr Jelinek

On 14/09/16 18:21, Andres Freund wrote:

(continuing, uh, a bit happier)

On 2016-09-09 00:59:26 +0200, Petr Jelinek wrote:


+/*
+ * Relcache invalidation callback for our relation map cache.
+ */
+static void
+logicalreprelmap_invalidate_cb(Datum arg, Oid reloid)
+{
+   LogicalRepRelMapEntry  *entry;
+
+   /* Just to be sure. */
+   if (LogicalRepRelMap == NULL)
+   return;
+
+   if (reloid != InvalidOid)
+   {
+   HASH_SEQ_STATUS status;
+
+   hash_seq_init(&status, LogicalRepRelMap);
+
+   /* TODO, use inverse lookup hastable? */


*hashtable


+   while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+   {
+   if (entry->reloid == reloid)
+   entry->reloid = InvalidOid;


can't we break here?



Probably.




+/*
+ * Initialize the relation map cache.
+ */
+static void
+remoterelmap_init(void)
+{
+   HASHCTL ctl;
+
+   /* Make sure we've initialized CacheMemoryContext. */
+   if (CacheMemoryContext == NULL)
+   CreateCacheMemoryContext();
+
+   /* Initialize the hash table. */
+   MemSet(&ctl, 0, sizeof(ctl));
+   ctl.keysize = sizeof(uint32);
+   ctl.entrysize = sizeof(LogicalRepRelMapEntry);
+   ctl.hcxt = CacheMemoryContext;


Wonder if this (and similar code earlier) should try to do everything in
a sub-context of CacheMemoryContext instead. That'd make some issues
easier to track down.


Sure, I don't see why not.
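(For example, something along these lines could work; a sketch with an
assumed context variable name, not part of the posted patch:)

static MemoryContext LogicalRepRelMapContext = NULL;

	LogicalRepRelMapContext =
		AllocSetContextCreate(CacheMemoryContext,
							  "LogicalRepRelMapContext",
							  ALLOCSET_DEFAULT_MINSIZE,
							  ALLOCSET_DEFAULT_INITSIZE,
							  ALLOCSET_DEFAULT_MAXSIZE);
	ctl.hcxt = LogicalRepRelMapContext;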




+/*
+ * Open the local relation associated with the remote one.
+ */
+static LogicalRepRelMapEntry *
+logicalreprel_open(uint32 remoteid, LOCKMODE lockmode)
+{
+   LogicalRepRelMapEntry  *entry;
+   boolfound;
+
+   if (LogicalRepRelMap == NULL)
+   remoterelmap_init();
+
+   /* Search for existing entry. */
+   entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+                       HASH_FIND, &found);
+
+   if (!found)
+   elog(FATAL, "cache lookup failed for remote relation %u",
+remoteid);
+
+   /* Need to update the local cache? */
+   if (!OidIsValid(entry->reloid))
+   {
+   Oid nspid;
+   Oid relid;
+   int i;
+   TupleDesc   desc;
+   LogicalRepRelation *remoterel;
+
+   remoterel = &entry->remoterel;
+
+   nspid = LookupExplicitNamespace(remoterel->nspname, false);
+   if (!OidIsValid(nspid))
+   ereport(FATAL,
+           (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+            errmsg("the logical replication target %s not found",
+                   quote_qualified_identifier(remoterel->nspname,
+                                              remoterel->relname))));

+   relid = get_relname_relid(remoterel->relname, nspid);
+   if (!OidIsValid(relid))
+   ereport(FATAL,
+           (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+            errmsg("the logical replication target %s not found",
+                   quote_qualified_identifier(remoterel->nspname,
+                                              remoterel->relname))));
+
+   entry->rel = heap_open(relid, lockmode);


This seems rather racy. I think this really instead needs something akin
to RangeVarGetRelidExtended().


Maybe, I am not sure if it really matters here given how it's used, but 
I can change that.





+/*
+ * Executor state preparation for evaluation of constraint expressions,
+ * indexes and triggers.
+ *
+ * This is based on similar code in copy.c
+ */
+static EState *
+create_estate_for_relation(LogicalRepRelMapEntry *rel)
+{
+   EState *estate;
+   ResultRelInfo *resultRelInfo;
+   RangeTblEntry *rte;
+
+   estate = CreateExecutorState();
+
+   rte = makeNode(RangeTblEntry);
+   rte->rtekind = RTE_RELATION;
+   rte->relid = RelationGetRelid(rel->rel);
+   rte->relkind = rel->rel->rd_rel->relkind;
+   estate->es_range_table = list_make1(rte);
+
+   resultRelInfo = makeNode(ResultRelInfo);
+   InitResultRelInfo(resultRelInfo, rel->rel, 1, 0);
+
+   estate->es_result_relations = resultRelInfo;
+   estate->es_num_result_relations = 1;
+   estate->es_result_relation_info = resultRelInfo;
+
+   /* Triggers might need a slot */
+   if (resultRelInfo->ri_TrigDesc)
+   estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);
+
+   return estate;
+}


Ugh, we do this 

Re: [HACKERS] Hash Indexes

2016-09-14 Thread Jeff Janes
On Tue, May 10, 2016 at 5:09 AM, Amit Kapila 
wrote:

>
>
> Although, I don't think it is a very good idea to take any performance
> data with WIP patch, still I couldn't resist myself from doing so and below
> are the performance numbers.  To get the performance data, I have dropped
> the primary key constraint on pgbench_accounts and created a hash index on
> aid column as below.
>
> alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
> create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);
>


To be rigorously fair, you should probably replace the btree primary key
with a non-unique btree index and use that in the btree comparison case.  I
don't know how much difference that would make, probably none at all for a
read-only case.


>
>
> Below data is for read-only pgbench test and is a median of 3 5-min runs.
> The performance tests are executed on a power-8 m/c.
>

With pgbench -S where everything fits in shared_buffers and the number of
cores I have at my disposal, I am mostly benchmarking interprocess
communication between pgbench and the backend.  I am impressed that you can
detect any difference at all.

For this type of thing, I like to create a server side function for use in
benchmarking:

create or replace function pgbench_query(scale integer,size integer)
RETURNS integer AS $$
DECLARE sum integer default 0;
amount integer;
account_id integer;
BEGIN FOR i IN 1..size LOOP
   account_id=1+floor(random()*scale);
   SELECT abalance into strict amount FROM pgbench_accounts
  WHERE aid = account_id;
   sum := sum + amount;
END LOOP;
return sum;
END $$ LANGUAGE plpgsql;

And then run using a command like this:

pgbench -f <(echo 'select pgbench_query(40,1000)')  -c$j -j$j -T 300

Where the first argument ('40', here) must be manually set to the same
value as the scale-factor.

With 8 cores and 8 clients, the values I get are, for btree, hash-head,
hash-concurrent, hash-concurrent-cache, respectively:

598.2
577.4
668.7
664.6

(each transaction involves 1000 select statements)

So I do see that the concurrency patch is quite an improvement.  The cache
patch does not produce a further improvement, which was somewhat surprising
to me (I thought that that patch would really shine in a read-write
workload, but I expected at least improvement in read only)

I've run this was 128MB shared_buffers and scale factor 40.  Not everything
fits in shared_buffers, but quite easily fits in RAM, and there is no
meaningful IO caused by the benchmark.

Cheers,

Jeff


Re: [HACKERS] kqueue

2016-09-14 Thread Thomas Munro
On Thu, Sep 15, 2016 at 10:48 AM, Keith Fiske  wrote:
> Thomas Munro brought up in #postgresql on freenode needing someone to test a
> patch on a larger FreeBSD server. I've got a pretty decent machine (3.1Ghz
> Quad Core Xeon E3-1220V3, 16GB ECC RAM, ZFS mirror on WD Red HDD) so offered
> to give it a try.
>
> Bench setup was:
> pgbench -i -s 100 -d postgres
>
> I ran this against 96rc1 instead of HEAD like most of the others in this
> thread seem to have done. Not sure if that makes a difference and can re-run
> if needed.
> With higher concurrency, this seems to cause decreased performance. You can
> tell which of the runs is the kqueue patch by looking at the path to
> pgbench.

Thanks Keith.  So to summarise, you saw no change with 1 client, but
with 4 clients you saw a significant drop in performance (~93K TPS ->
~80K TPS), and a smaller drop for 64 clients (~72K TPS -> ~68K TPS).
These results seem to be a nail in the coffin for this patch for now.

Thanks to everyone who tested.  I might be back in a later commitfest
if I can figure out why and how to fix it.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] kqueue

2016-09-14 Thread Keith Fiske
On Wed, Sep 14, 2016 at 9:09 AM, Matteo Beccati  wrote:

> Hi,
>
> On 14/09/2016 00:06, Tom Lane wrote:
>
>> I'm inclined to think the kqueue patch is worth applying just on the
>> grounds that it makes things better on OS X and doesn't seem to hurt
>> on FreeBSD.  Whether anyone would ever get to the point of seeing
>> intra-kernel contention on these platforms is hard to predict, but
>> we'd be ahead of the curve if so.
>>
>> It would be good for someone else to reproduce my results though.
>> For one thing, 5%-ish is not that far above the noise level; maybe
>> what I'm measuring here is just good luck from relocation of critical
>> loops into more cache-line-friendly locations.
>>
>
> FWIW, I've tested HEAD vs patch on a 2-cpu low end NetBSD 7.0 i386 machine.
>
> HEAD: 1890/1935/1889 tps
> kqueue: 1905/1957/1932 tps
>
> no weird surprises, and basically no differences either.
>
>
> Cheers
> --
> Matteo Beccati
>
> Development & Consulting - http://www.beccati.com/



Thomas Munro brought up in #postgresql on freenode needing someone to test
a patch on a larger FreeBSD server. I've got a pretty decent machine
(3.1Ghz Quad Core Xeon E3-1220V3, 16GB ECC RAM, ZFS mirror on WD Red HDD)
so offered to give it a try.

Bench setup was:
pgbench -i -s 100 -d postgres

I ran this against 96rc1 instead of HEAD like most of the others in this
thread seem to have done. Not sure if that makes a difference and can
re-run if needed.
With higher concurrency, this seems to cause decreased performance. You can
tell which of the runs is the kqueue patch by looking at the path to
pgbench.

SINGLE PROCESS
[keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1_kqueue/bin/pgbench -T
60 -j 1 -c 1 -M prepared -S postgres -p
5496
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1547387
latency average: 0.039 ms
tps = 25789.750236 (including connections establishing)
tps = 25791.018293 (excluding connections establishing)
[keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1_kqueue/bin/pgbench -T
60 -j 1 -c 1 -M prepared -S postgres -p 5496
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1549442
latency average: 0.039 ms
tps = 25823.981255 (including connections establishing)
tps = 25825.189871 (excluding connections establishing)
[keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1_kqueue/bin/pgbench -T
60 -j 1 -c 1 -M prepared -S postgres -p 5496
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1547936
latency average: 0.039 ms
tps = 25798.572583 (including connections establishing)
tps = 25799.917170 (excluding connections establishing)


[keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1/bin/pgbench -T 60 -j 1
-c 1 -M prepared -S postgres -p
5496
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1520722
latency average: 0.039 ms
tps = 25343.122533 (including connections establishing)
tps = 25344.357116 (excluding connections establishing)
[keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1/bin/pgbench -T 60 -j 1
-c 1 -M prepared -S postgres -p 5496~
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1549282
latency average: 0.039 ms
tps = 25821.107595 (including connections establishing)
tps = 25822.407310 (excluding connections establishing)
[keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1/bin/pgbench -T 60 -j 1
-c 1 -M prepared -S postgres -p 5496~
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1541907
latency average: 0.039 ms
tps = 25698.025983 (including connections establishing)
tps = 25699.270663 (excluding connections establishing)


FOUR
/home/keith/pgsql96rc1_kqueue/bin/pgbench -T 60 -j 4 -c 4 -M prepared -S
postgres -p 5496
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 4
number of threads: 4
duration: 60 s
number of transactions actually processed: 4282185
latency average: 0.056 ms
tps = 71369.146931 (including connections establishing)
tps = 71372.646243 (excluding connections establishing)
[keith@corpus ~/postgresql-9.6rc1_kqueue]$
/home/keith/pgsql96rc1_kqueue/bin/pgbench -T 60 -j 4 -c 4 -M prepared -S
postgres -p 5496
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: prepared
number of clients: 4
number of threads: 4
duration: 60 

Re: [HACKERS] PATCH: Avoid use of __attribute__ when building with old Sun compiler versions

2016-09-14 Thread Robert Haas
On Fri, Sep 2, 2016 at 11:55 AM, Andy Grundman  wrote:
> The use of __attribute__ is currently enabled by checking for
> __SUNPRO_C. However, this compiler only added support for
> __attribute__ keywords in version 5.10 [1]. This patch adds a version
> check to only enable this for supported versions [2].
>
> I have tested this with Sun C 5.8 on a Solaris 9 SPARC system,
> building 9.5.4 (+OpenSSL 1.1.0 patches).

Wow, so we'd need this if building on a compiler released before June
2007.  Is this the only change that's needed to build with the older
compiler?  Can you set up a buildfarm member for this platform?
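(For context, the kind of guard being described might look roughly like the
sketch below; Sun C 5.10 reports __SUNPRO_C as 0x5100.  The exact form of
the check in the submitted patch may differ.)

#if defined(__GNUC__) || defined(__IBMC__) || \
	(defined(__SUNPRO_C) && __SUNPRO_C >= 0x5100)
/* this compiler understands __attribute__ */
#define pg_attribute_unused() __attribute__((unused))
#else
#define pg_attribute_unused()
#endif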

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-14 Thread Jeff Janes
On Tue, Sep 13, 2016 at 9:31 AM, Jeff Janes  wrote:




> ===
>
> +Vacuum acquires cleanup lock on bucket to remove the dead tuples and or
> tuples
> +that are moved due to split.  The need for cleanup lock to remove dead
> tuples
> +is to ensure that scans' returns correct results.  Scan that returns
> multiple
> +tuples from the same bucket page always restart the scan from the previous
> +offset number from which it has returned last tuple.
>
> Perhaps it would be better to teach scans to restart anywhere on the page,
> than to force more cleanup locks to be taken?
>

Commenting on one of my own questions:

This won't work when vacuum removes the tuple which an existing scan is
currently examining and would otherwise use to re-find its position once it
realizes the tuple is not visible and resumes the scan.

The index tuples in a page are stored sorted just by hash value, not by the
combination of (hash value, tid).  If they were sorted by both, we could
re-find our position even if the tuple had been removed, because we would
know to start at the slot adjacent to where the missing tuple would be were
it not removed. But unless we are willing to break pg_upgrade, there is no
feasible way to change that now.
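
For what it's worth, a minimal sketch of the comparator such a (hash value,
TID) ordering would need - the names here are illustrative, not taken from
the hash AM code:

#include "postgres.h"
#include "storage/itemptr.h"

/* illustrative only: order index entries by hash key, then by heap TID */
static int
hash_entry_cmp(uint32 hashkey_a, ItemPointer tid_a,
               uint32 hashkey_b, ItemPointer tid_b)
{
    if (hashkey_a != hashkey_b)
        return (hashkey_a < hashkey_b) ? -1 : 1;
    /* tie-break on TID, so a removed tuple leaves a unique, findable gap */
    return ItemPointerCompare(tid_a, tid_b);
}

With that ordering, "the slot adjacent to where the missing tuple would be"
is well defined even after a vacuum has removed it.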

Cheers,

Jeff


Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()

2016-09-14 Thread Gavin Flower

On 15/09/16 03:45, Robert Haas wrote:

On Wed, Sep 14, 2016 at 5:22 AM, Thomas Berger  wrote:

Today, I found the time to read all the mails in this thread, and I think I
have to explain why we decided to open a bug for this behavior.

On Tuesday, 23. August 2016, 13:30:29 Robert Haas wrote:

J. Random User: I'm having a problem!
Mailing List: Gee, how big are your tables?
J. Random User: Here's some pg_size_pretty output.
Mailing List: Gosh, we don't know what that means, what do you have
this obscure GUC set to?
J. Random User: Maybe I'll just give up on SQL and use MongoDB.

In fact, we had it just the other way around. One of our most critical databases
had some extreme bloat.
Some of our internal customers were very confused about the sizes reported by
the database.
This confusion led to wrong decisions. (And a long discussion about the
choice of DBMS, btw.)

I think there is a point missing in this whole discussion, or I just didn't see
it:

Yeah, the behavior of "kB" is defined in the "postgresql.conf" documentation.
But no _user_ reads this. There is no link or hint in the documentation of 
"pg_size_pretty()" [1].

Interesting.  I think that our documentation should only describe the
way we use unit suffixes in one central place, but other places (like
pg_size_pretty) could link to that central place.

I don't believe that there is any general unanimity among users or
developers about the question of which suffixes denote units
denominated in 2^10 bytes vs. 10^3 bytes.  About once a year,
somebody makes an argument that we're doing it wrong, but the evidence
that I've seen is very mixed.  So when people say that there is only
one right way to do this and we are not in compliance with that one
right way, I guess I just don't believe it.  Not everybody likes the
way we do it, but I am fairly sure that if we change it, we'll make
some currently-unhappy people happy and some currently-happy people
unhappy.  And the people who don't care but wanted to preserve
backward compatibility will all be in the latter camp.

However, that is not to say that the documentation couldn't be better.

Well, I started programming in 1968, and was taught that 1 kilobyte was
1024 (2^10) bytes.


I object to Johnny-come-latelies who try to insist it is 1000.

As regards 'kB' vs 'KB', I'm not too worried either way - I think
consistency is more important.



Cheers,
Gavin



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Choosing parallel_degree

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 3:54 PM, Simon Riggs  wrote:
>> I do think this comment is confusing:
>>
>> + *This value is not locked by the transaction, so this value may
>> + *be changed while a SELECT that has used these values for planning
>> + *is still executing.
>>
>> I don't know what it means for "this value" to be locked, or not
>> locked, by the transaction.  Basically, I have no idea what this is
>> trying to explain.
>
> You're quoting that without context from the line above, which is
> "get_tablespace_io_concurrency"

Sure, but it doesn't make any sense to talk about
tablespace_io_concurrency being locked by a transaction.  At least not
that I can see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What is the posix_memalign() equivalent for the PostgreSQL?

2016-09-14 Thread Robert Haas
On Fri, Sep 2, 2016 at 1:17 PM, Andres Freund  wrote:
> On 2016-09-02 13:05:37 -0400, Tom Lane wrote:
>> Anderson Carniel  writes:
>> > If not, according to your experience, is there a
>> > significance difference between the performance of the O_DIRECT or not?
>>
>> AFAIK, nobody's really bothered to measure whether that would be useful
>> for Postgres.  The results would probably be quite platform-specific
>> anyway.
>
> I've played with patches to make postgres use O_DIRECT. On linux, it's
> rather beneficial for some workloads (fits into memory), but it also
> works really badly for some others, because our IO code isn't
> intelligent enough.  We pretty much rely on write() being nearly
> instantaneous when done by normal backends (during buffer replacement),
> we rely on readahead, we rely on the kernel to stopgap some bad
> replacement decisions we're making.

So, suppose we changed the world so that backends don't write dirty
buffers, or at least not normally.  If they need to perform a buffer
eviction, they first check the freelist, then run the clock sweep.
The clock sweep puts clean buffers on the freelist and dirty buffers
on a to-be-cleaned list.  A background process writes buffers on the
to-be-cleaned list and then adds them to the freelist afterward if the
usage count hasn't been bumped meanwhile.  As in Amit's bgreclaimer
patch, we have a target size for the freelist, with a low watermark
and a high watermark.  When we drop below the low watermark, the
background processes run the clock sweep and write from the
to-be-cleaned list to try to populate it; when we surge above the high
watermark, they go back to sleep.
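
Hand-wavy sketch of that background cleaner, just to make the shape concrete;
every function name here is hypothetical and this is not code from Amit's
bgreclaimer patch:

static void
reclaimer_main_loop(void)
{
    for (;;)
    {
        while (FreelistSize() < HIGH_WATERMARK)
        {
            int     buf_id = ClockSweepNextVictim();    /* advances usage counts */

            if (BufferIsDirtyNoLock(buf_id))
                PushToCleanList(buf_id);        /* written out below */
            else
                PushToFreelist(buf_id);
        }

        WriteCleanListAndRecycle();     /* re-check usage_count before freeing */
        SleepUntilFreelistBelow(LOW_WATERMARK);
    }
}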

Further, suppose we also create a prefetch system, maybe based on the
synchronous scan machinery.  It preemptively pulls data into
shared_buffers if an ongoing scan will need it soon.  Or maybe don't
base it on the synchronous scan machinery, but instead just have a
queue that lets backends throw prefetch requests over the wall; when
the queue wraps, old requests are discarded.  A background process -
or perhaps one per tablespace or something like that - pulls the data
in.

Neither of those things seems that hard.  And if we could do those
things and make them work, then maybe we could offer direct I/O as an
option.  We'd still lose heavily in the case where our buffer eviction
decisions are poor, but that'd probably spur some improvement in that
area, which IMHO would be a good thing.

I personally think direct I/O would be a really good thing, not least
because O_ATOMIC is designed to allow MySQL to avoid double buffering,
their alternative to full page writes.  But we can't use it because it
requires O_DIRECT.  The savings are probably massive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-09-14 Thread Jeff Janes
On Tue, Aug 30, 2016 at 10:06 AM, Alvaro Herrera 
wrote:

> Jesper Pedersen wrote:
> > Hi,
> >
> > Attached is a patch that adds support for hash indexes in pageinspect.
> >
> > The functionality (hash_metap, hash_page_stats and hash_page_items)
> follows
> > the B-tree functions.
>
> I suggest that pageinspect functions are more convenient to use via the
> get_raw_page interface, that is, instead of reading the buffer
> themselves, the buffer is handed over from elsewhere and they receive it
> as bytea.  This enables other use cases such as grabbing a page from one
> server and examining it in another one.
>

I've never needed to do that, but it does sound like it might be useful.
But it is also annoying to have to do that when you want to examine a
current server.  Could we use overloading, so that it can be used in either
way?



For hash_page_items, the 'data' is always a 32 bit integer, isn't it?
(unlike other pageinspect features, where the data could be any
user-defined data type).  If so, I'd rather see it in plain hex (without
the spaces, without the trailing zeros)

+   /* page type (flags) */
+   if (opaque->hasho_flag & LH_UNUSED_PAGE)
+   stat->type = 'u';

This can never be true, because LH_UNUSED_PAGE is zero.  You should make
this be the fall-through case, and have LH_META_PAGE be explicitly tested
for.
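
Concretely, something along these lines (the letter codes are just
placeholders, not necessarily what the patch uses):

/* hasho_flag holds bit flags; LH_UNUSED_PAGE is 0, so it must be the default */
if (opaque->hasho_flag & LH_META_PAGE)
    stat->type = 'm';
else if (opaque->hasho_flag & LH_BUCKET_PAGE)
    stat->type = 'b';
else if (opaque->hasho_flag & LH_OVERFLOW_PAGE)
    stat->type = 'v';
else if (opaque->hasho_flag & LH_BITMAP_PAGE)
    stat->type = 'i';
else
    stat->type = 'u';       /* unused page: the fall-through case */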

Cheers,

Jeff


Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)

2016-09-14 Thread Robert Haas
On Fri, Sep 2, 2016 at 2:33 AM, Haribabu Kommi  wrote:
> On Wed, Aug 31, 2016 at 3:19 AM, Alvaro Herrera 
> wrote:
>> Haribabu Kommi wrote:
>>
>>> Apart from the above, here are the following list of command tags that
>>> are generated in the code, I took only the first word of the command tag
>>> just to see how many categories present. The number indicates the
>>> subset of operations or number of types it is used. Like create table,
>>> create function and etc.
>>
>> Sounds about right.  I suppose all those cases that you aggregated here
>> would expand to full tags in the actual code.  I furthermore suppose
>> that some of these could be ignored, such as the transaction ones and
>> things like load, lock, move, fetch, discard, deallocate (maybe lump
>> them all together under "other", or some other rough categorization, as
>> Tom suggests).
>
> Following is the pg_stat_sql view with the SQL categories that I considered
> that are important. Rest of the them will be shown under others category.
>
> postgres=# \d pg_stat_sql
>  View "pg_catalog.pg_stat_sql"
>  Column  |   Type   | Modifiers
> -+--+---
>  inserts | bigint   |
>  deletes | bigint   |
>  updates | bigint   |
>  selects | bigint   |
>  declare_cursors | bigint   |
>  closes  | bigint   |
>  creates | bigint   |
>  drops   | bigint   |
>  alters  | bigint   |
>  imports | bigint   |
>  truncates   | bigint   |
>  copies  | bigint   |
>  grants  | bigint   |
>  revokes | bigint   |
>  clusters| bigint   |
>  vacuums | bigint   |
>  analyzes| bigint   |
>  refreshs| bigint   |
>  locks   | bigint   |
>  checkpoints | bigint   |
>  reindexes   | bigint   |
>  deallocates | bigint   |
>  others  | bigint   |
>  stats_reset | timestamp with time zone |
>
>
> If any additions/deletions, I can accommodate them.

I think it is not a good idea to make the command names used here the
plural forms of the command tags.  Instead of "inserts", "updates",
"imports", etc. just use "INSERT", "UPDATE", "IMPORT".  That's simpler
and less error prone - e.g. you won't end up with things like
"refreshs", which is not a word.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Andres Freund
Hi,

On 2016-09-14 21:17:42 +0200, Petr Jelinek wrote:
> > > +/*
> > > + * Gather Relations based o provided by RangeVar list.
> > > + * The gathered tables are locked in access share lock mode.
> > > + */
> > 
> > Why access share? Shouldn't we make this ShareUpdateExclusive or
> > similar, to prevent schema changes?
> > 
> 
> Hm, I thought AccessShare would be enough to prevent schema changes that
> matter to us (which is basically just drop afaik).

Doesn't e.g. dropping an index matter as well?


> > > + if (strcmp($1, "replicate_insert") == 0)
> > > + $$ = 
> > > makeDefElem("replicate_insert",
> > > + 
> > >  (Node *)makeInteger(TRUE), @1);
> > > + else if (strcmp($1, 
> > > "noreplicate_insert") == 0)
> > > + $$ = 
> > > makeDefElem("replicate_insert",
> > > + 
> > >  (Node *)makeInteger(FALSE), @1);
> > > + else if (strcmp($1, "replicate_update") 
> > > == 0)
> > > + $$ = 
> > > makeDefElem("replicate_update",
> > > + 
> > >  (Node *)makeInteger(TRUE), @1);
> > > + else if (strcmp($1, 
> > > "noreplicate_update") == 0)
> > > + $$ = 
> > > makeDefElem("replicate_update",
> > > + 
> > >  (Node *)makeInteger(FALSE), @1);
> > > + else if (strcmp($1, "replicate_delete") 
> > > == 0)
> > > + $$ = 
> > > makeDefElem("replicate_delete",
> > > + 
> > >  (Node *)makeInteger(TRUE), @1);
> > > + else if (strcmp($1, 
> > > "noreplicate_delete") == 0)
> > > + $$ = 
> > > makeDefElem("replicate_delete",
> > > + 
> > >  (Node *)makeInteger(FALSE), @1);
> > > + else
> > > + ereport(ERROR,
> > > + 
> > > (errcode(ERRCODE_SYNTAX_ERROR),
> > > +  
> > > errmsg("unrecognized publication option \"%s\"", $1),
> > > +  
> > > parser_errposition(@1)));
> > > + }
> > > + ;
> > 
> > I'm kind of inclined to do this checking at execution (or transform)
> > time instead.  That allows extension to add options, and handle them in
> > utility hooks.
> > 
> 
> That's an interesting point; I prefer the parsing to be done in gram.y, but it
> might be worth moving it for extensibility. Although there are so far other
> barriers to that.

Citus uses the lack of such a check for COPY to implement COPY over its
distributed tables, for example. So there's some benefit.



> > > + check_subscription_permissions();
> > > +
> > > + rel = heap_open(SubscriptionRelationId, RowExclusiveLock);
> > > +
> > > + /* Check if name is used */
> > > + subid = GetSysCacheOid2(SUBSCRIPTIONNAME, MyDatabaseId,
> > > + 
> > > CStringGetDatum(stmt->subname));
> > > + if (OidIsValid(subid))
> > > + {
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_DUPLICATE_OBJECT),
> > > +  errmsg("subscription \"%s\" already exists",
> > > + stmt->subname)));
> > > + }
> > > +
> > > + /* Parse and check options. */
> > > + parse_subscription_options(stmt->options, _given, ,
> > > +, 
> > > );
> > > +
> > > + /* TODO: improve error messages here. */
> > > + if (conninfo == NULL)
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_SYNTAX_ERROR),
> > > +  errmsg("connection not specified")));
> > 
> > Probably also makes sense to parse the conninfo here to verify it looks
> > sane.  Although that's fairly annoying to do, because the relevant code
> > is libpq :(
> > 
> 
> Well, the connection is eventually used (in later patches), so maybe that's
> not a problem.

Well, it's nicer if it's immediately parsed, before doing complex and
expensive stuff, especially if that happens outside of the transaction.
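
Just as a sketch of what I mean (whether and where the backend can call into
libpq for this is exactly the annoying part mentioned above, so treat this as
illustration only):

#include "postgres.h"
#include "libpq-fe.h"

/* sketch only: reject obviously malformed connection strings up front */
static void
check_conninfo_syntax(const char *conninfo)
{
    char               *errmsg_buf = NULL;
    PQconninfoOption   *opts = PQconninfoParse(conninfo, &errmsg_buf);

    if (opts == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("invalid connection string: %s",
                        errmsg_buf ? errmsg_buf : "(unknown error)")));
    PQconninfoFree(opts);
}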


> > 
> > > diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
> > > index 65230e2..f3d54c8 100644
> > > --- a/src/backend/nodes/copyfuncs.c
> > > +++ b/src/backend/nodes/copyfuncs.c
> > 
> > I 

Re: [HACKERS] Choosing parallel_degree

2016-09-14 Thread Simon Riggs
On 14 September 2016 at 14:48, Robert Haas  wrote:
> On Thu, Sep 1, 2016 at 9:39 AM, Simon Riggs  wrote:
 > Can I change this to a lower setting? I would have done this before
 > applying
 > the patch, but you beat me to it.

 I don't have a problem with reducing the lock level there, if we're
 convinced that it's safe.
>>>
>>>
>>> I'll run up a patch, with appropriate comments.
>>
>> Attached
>
> This should really be posted on a new thread, since it changes a bunch
> of reloptions, not only parallel_workers.  I can't immediately think
> of a reason why the changes wouldn't be safe, but I've failed to fully
> apprehend all of the possible dangers multiple times previously, so we
> should try to give everyone who might have ideas about this topic a
> chance to chime in with anything we might be missing.

OK. Will post on new thread.

> I do think this comment is confusing:
>
> + *This value is not locked by the transaction, so this value may
> + *be changed while a SELECT that has used these values for planning
> + *is still executing.
>
> I don't know what it means for "this value" to be locked, or not
> locked, by the transaction.  Basically, I have no idea what this is
> trying to explain.

You're quoting that without context from the line above, which is
"get_tablespace_io_concurrency"

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Choosing parallel_degree

2016-09-14 Thread Robert Haas
On Thu, Sep 1, 2016 at 9:39 AM, Simon Riggs  wrote:
>>> > Can I change this to a lower setting? I would have done this before
>>> > applying
>>> > the patch, but you beat me to it.
>>>
>>> I don't have a problem with reducing the lock level there, if we're
>>> convinced that it's safe.
>>
>>
>> I'll run up a patch, with appropriate comments.
>
> Attached

This should really be posted on a new thread, since it changes a bunch
of reloptions, not only parallel_workers.  I can't immediately think
of a reason why the changes wouldn't be safe, but I've failed to fully
apprehend all of the possible dangers multiple times previously, so we
should try to give everyone who might have ideas about this topic a
chance to chime in with anything we might be missing.

I do think this comment is confusing:

+ *This value is not locked by the transaction, so this value may
+ *be changed while a SELECT that has used these values for planning
+ *is still executing.

I don't know what it means for "this value" to be locked, or not
locked, by the transaction.  Basically, I have no idea what this is
trying to explain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comment on GatherPath.single_copy

2016-09-14 Thread Robert Haas
On Thu, Sep 1, 2016 at 3:15 AM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 31 Aug 2016 07:26:22 -0400, Tom Lane  wrote in 
> <5934.1472642...@sss.pgh.pa.us>
>> Robert Haas  writes:
>> > On Tue, Aug 30, 2016 at 6:38 PM, Tom Lane  wrote:
>> >> Robert, could you fix the documentation for that field so it's
>> >> intelligible?
>>
>> > Uh, maybe.  The trick, as you've already noted, is finding something
>> > better.  Maybe this:
>>
>> > -boolsingle_copy;/* path must not be executed >1x */
>> > +boolsingle_copy;/* don't execute path in multiple 
>> > processes */
>>
>> OK by me.
>>
>>   regards, tom lane
>
> Me too, thanks.

OK, changed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Petr Jelinek

On 14/09/16 20:50, Andres Freund wrote:

On 2016-09-14 13:20:02 -0500, Peter Eisentraut wrote:

On 9/14/16 11:21 AM, Andres Freund wrote:

+   ExecInsert(NULL, /* mtstate is only used for onconflict handling which 
we don't support atm */

+  remoteslot,
+  remoteslot,
+  NIL,
+  ONCONFLICT_NONE,
+  estate,
+  false);

I have *severe* doubts about just using the (newly) exposed functions
1:1 here.


It is a valid concern, but what is the alternative?  ExecInsert() and
the others appear to do exactly the right things that are required.


They're actually a lot more heavyweight than what's required. If you
e.g. do a large COPY on the source side, we create a single executor
state (if at all), and then insert the rows using lower level
routines. And that's *vastly* faster, than going through all the setup
costs here for each row.



Are your concerns mainly philosophical about calling into internal
executor code, or do you have technical concerns that this will not do
the right thing in some cases?


Well, not about it being wrong in the sense of returning wrong results,
but wrong in the sense of not even remotely being able to keep up in
common cases.



I'd say in the common case they will. I don't plan to use these forever btw,
but it's simplest to just use them in v1 IMHO instead of trying to 
reinvent new versions of these that perform better but also behave 
correctly (in terms of triggers and stuff for example).


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Petr Jelinek

On 14/09/16 00:48, Andres Freund wrote:


First read through the current version. Hence no real architectural
comments.


Hi,

Thanks for looking!



On 2016-09-09 00:59:26 +0200, Petr Jelinek wrote:


diff --git a/src/backend/commands/publicationcmds.c 
b/src/backend/commands/publicationcmds.c
new file mode 100644
index 000..e0c719d
--- /dev/null
+++ b/src/backend/commands/publicationcmds.c
@@ -0,0 +1,761 @@
+/*-
+ *
+ * publicationcmds.c
+ * publication manipulation
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * publicationcmds.c



Not that I'm a fan of this line in the first place, but usually it does
include the path.



Yes, I don't bother with it in the WIP version though, because this way I
won't forget to change it when it's getting close to ready, in case there
were renames.



+static void
+check_replication_permissions(void)
+{
+   if (!superuser() && !has_rolreplication(GetUserId()))
+   ereport(ERROR,
+   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+(errmsg("must be superuser or replication role to 
manipulate publications";
+}


Do we want to require owner privileges for replication roles? I'd say
no, but want to raise the question.



No. We might want to invent some publish role for which we would, so that
we can do logical replication with higher granularity, but for the
replication role it does not make sense. And I think the higher-granularity
ACLs are something for follow-up patches.





+ObjectAddress
+CreatePublication(CreatePublicationStmt *stmt)
+{
+   Relationrel;
+   ObjectAddress myself;
+   Oid puboid;
+   boolnulls[Natts_pg_publication];
+   Datum   values[Natts_pg_publication];
+   HeapTuple   tup;
+   boolreplicate_insert_given;
+   boolreplicate_update_given;
+   boolreplicate_delete_given;
+   boolreplicate_insert;
+   boolreplicate_update;
+   boolreplicate_delete;
+
+   check_replication_permissions();
+
+   rel = heap_open(PublicationRelationId, RowExclusiveLock);
+
+   /* Check if name is used */
+   puboid = GetSysCacheOid1(PUBLICATIONNAME, 
CStringGetDatum(stmt->pubname));
+   if (OidIsValid(puboid))
+   {
+   ereport(ERROR,
+   (errcode(ERRCODE_DUPLICATE_OBJECT),
+errmsg("publication \"%s\" already exists",
+   stmt->pubname)));
+   }
+
+   /* Form a tuple. */
+   memset(values, 0, sizeof(values));
+   memset(nulls, false, sizeof(nulls));
+
+   values[Anum_pg_publication_pubname - 1] =
+   DirectFunctionCall1(namein, CStringGetDatum(stmt->pubname));
+
+   parse_publication_options(stmt->options,
+ _insert_given, 
_insert,
+ _update_given, 
_update,
+ _delete_given, 
_delete);
+
+   values[Anum_pg_publication_puballtables - 1] =
+   BoolGetDatum(stmt->for_all_tables);
+   values[Anum_pg_publication_pubreplins - 1] =
+   BoolGetDatum(replicate_insert);
+   values[Anum_pg_publication_pubreplupd - 1] =
+   BoolGetDatum(replicate_update);
+   values[Anum_pg_publication_pubrepldel - 1] =
+   BoolGetDatum(replicate_delete);
+
+   tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
+
+   /* Insert tuple into catalog. */
+   puboid = simple_heap_insert(rel, tup);
+   CatalogUpdateIndexes(rel, tup);
+   heap_freetuple(tup);
+
+   ObjectAddressSet(myself, PublicationRelationId, puboid);
+
+   /* Make the changes visible. */
+   CommandCounterIncrement();
+
+   if (stmt->tables)
+   {
+   List   *rels;
+
+   Assert(list_length(stmt->tables) > 0);
+
+   rels = GatherTableList(stmt->tables);
+   PublicationAddTables(puboid, rels, true, NULL);
+   CloseTables(rels);
+   }
+   else if (stmt->for_all_tables || stmt->schema)
+   {
+   List   *rels;
+
+   rels = GatherTables(stmt->schema);
+   PublicationAddTables(puboid, rels, true, NULL);
+   CloseTables(rels);
+   }


Isn't this (and ALTER) racy? What happens if tables are concurrently
created? This session wouldn't necessarily see the tables, and other
sessions won't see for_all_tables/schema.   Evaluating
for_all_tables/all_in_schema when the publication is used, would solve
that problem.


Well, yes it is. It's technically not problem for all_in_schema as 
that's just 

Re: [HACKERS] Hash Indexes

2016-09-14 Thread Jesper Pedersen

Hi,

On 09/14/2016 07:24 AM, Amit Kapila wrote:

On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
 wrote:

On 09/13/2016 07:26 AM, Amit Kapila wrote:


Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].



I have been testing patch in various scenarios, and it has a positive
performance impact in some cases.

This is especially seen in cases where the values of the indexed column are
unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.



Here, I think it is better if we have the data comparing the situation
of hash index with respect to HEAD as well.  What I mean to say is
that you are claiming that after the hash index improvements SELECT
workload is 40-60% better, but where do we stand as of HEAD?



The tests I have done are with a copy of a production database using the 
same queries sent with a b-tree index for the primary key, and the same 
with a hash index. Those are seeing a speed-up of the mentioned 40-60% 
in execution time - some involve JOINs.


The largest of those tables is 390MB, with a CHAR()-based primary key.


UPDATE also sees an improvement.



Can you explain this more?  Is it more as compared to HEAD or as compared
to btree?  Isn't this contradictory to what the test in the below mail
shows?



Same thing here - where the fields involving the hash index aren't updated.


In cases where the indexed column value isn't unique, it takes a long time
to build the index due to the overflow page creation.

Also in cases where the index column is updated with a high number of
clients, ala

-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;

-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;

w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
hash, and 10062 tps for b-tree).



Thanks for doing the tests.  Have you applied both concurrent index
and cache the meta page patch for these tests?  So from the above tests,
we can say that after this set of patches read-only workloads will be
significantly improved - even better than btree in quite a few useful
cases.


Agreed.


 However, when the indexed column is updated, there is still a
large gap as compared to btree (what about the case when the indexed
column is not updated in a read-write transaction, as in our pgbench
read-write transactions - did you by any chance run any such test?).


I have done a run to look at the concurrency / TPS aspect of the 
implementation - to try something different than Mark's work on testing 
the pgbench setup.


With definitions as above, with SELECT as

-- select.sql --
\set id random(1,10)
BEGIN;
SELECT * FROM test WHERE id = :id;
COMMIT;

and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed 
w/o one.


[1] [2] [3] is new_hash - old_hash is the existing hash implementation 
on master. btree is master too.


Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal. 
Clients ran with -M prepared.


[1] 
https://www.postgresql.org/message-id/caa4ek1+erbp+7mdkkahjzwq_dtdkocbpt7lswfwcqvuhbxz...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1wp5vylvk...@mail.gmail.com
[3] 
https://www.postgresql.org/message-id/caa4ek1juyr_ab7bxfnsg5+jqhiwgklkgacfk9bfd4mlffk6...@mail.gmail.com


Don't know if you find this useful due to the small number of rows, but 
let me know if there are other tests I can run, f.ex. bump the number of 
rows.



I
think we need to focus on improving cases where index columns are
updated, but it is better to do that work as a separate patch.



Ok.

Best regards,
 Jesper


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Andres Freund
On 2016-09-14 13:20:02 -0500, Peter Eisentraut wrote:
> On 9/14/16 11:21 AM, Andres Freund wrote:
> >> +  ExecInsert(NULL, /* mtstate is only used for onconflict handling which 
> >> we don't support atm */
> >> > +   remoteslot,
> >> > +   remoteslot,
> >> > +   NIL,
> >> > +   ONCONFLICT_NONE,
> >> > +   estate,
> >> > +   false);
> > I have *severe* doubts about just using the (newly) exposed functions
> > 1:1 here.
> 
> It is a valid concern, but what is the alternative?  ExecInsert() and
> the others appear to do exactly the right things that are required.

They're actually a lot more heavyweight than what's required. If you
e.g. do a large COPY on the source side, we create a single executor
state (if at all), and then insert the rows using lower level
routines. And that's *vastly* faster, than going through all the setup
costs here for each row.


> Are your concerns mainly philosophical about calling into internal
> executor code, or do you have technical concerns that this will not do
> the right thing in some cases?

Well, not about it being wrong in the sense of returning wrong results,
but wrong in the sense of not even remotely being able to keep up in
common cases.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-14 Thread Robert Haas
On Tue, Sep 13, 2016 at 9:04 PM, Michael Paquier
 wrote:
> It seems to me that you need to think about the way to document things
> properly first, with for example:
> - Have a first documentation patch that explains what is a resource
> manager for WAL, and what are the types available with a nice table.
> - Add in your patch documentation to explain what are the benefits of
> using this facility, the main purpose is testing, but there are also
> mention upthread about users that would like to get that into
> production, assuming that the overhead is minimal.

So, I don't think that this patch should be required to document all
of the currently-undocumented stuff that somebody might want to know
just because it is related to this patch.  It should be enough to document
the patch itself.  One paragraph in config.sgml in the usual format
should be fine.  Maybe two paragraphs.  We do need to list the
resource managers, but that can just be something like this:

The default value for this setting is off.  To check
all records written to the write-ahead log, set this parameter to
all.  To check only some records, specify a
comma-separated list of resource managers.  The resource managers
which are currently supported are heap, btree,
hash, BLAH, and BLAH.

If somebody wants to write some user-facing documentation of the
write-ahead log format, great.  That could certainly be very helpful
for people who are running pg_xlogdump.  But I don't think that stuff
goes in this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Peter Eisentraut
On 9/14/16 11:21 AM, Andres Freund wrote:
>> +ExecInsert(NULL, /* mtstate is only used for onconflict handling which 
>> we don't support atm */
>> > + remoteslot,
>> > + remoteslot,
>> > + NIL,
>> > + ONCONFLICT_NONE,
>> > + estate,
>> > + false);
> I have *severe* doubts about just using the (newly) exposed functions
> 1:1 here.

It is a valid concern, but what is the alternative?  ExecInsert() and
the others appear to do exactly the right things that are required.

Are your concerns mainly philosophical about calling into internal
executor code, or do you have technical concerns that this will not do
the right thing in some cases?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-14 Thread Kuntal Ghosh
On Thu, Sep 8, 2016 at 1:20 PM, Michael Paquier
 wrote:
>
>> 2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
>> REVMAP page. This happens only for two cases. I'm not sure what the
>> reason can be.
>
> Hm? This smells like a block reference bug. What are the cases you are
> referring to?
>
Following is the only case where the backup page stored in the wal
record and the current page after redo are not consistent.

test:BRIN using gmake-check

Master
-
STATEMENT:  VACUUM brintest;
LOG:  INSERT @ 0/59E1E0F8:  - BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

Standby
--
LOG:  REDO @ 0/59E1B500; LSN 0/59E1E0F8: prev 0/59E17578; xid 0; len
14; blkref #0: rel 1663/16384/30556, blk 12; blkref #1: rel
1663/16384/30556, blk 1; blkref #2: rel 1663/16384/30556, blk 2 -
BRIN/UPDATE+INIT: heapBlk 100 pagesPerRange 1 old offnum 11, new
offnum 1

WARNING:  Inconsistent page (at byte 26) found, rel 1663/16384/30556,
forknum 0, blkno 1
CONTEXT:  xlog redo at 0/59E1B500 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

thoughts?
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 1:23 PM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> Actually, I think that probably *is* worthwhile, specifically because
>> it might let us avoid multiple index scans in cases where we currently
>> require them.  Right now, our default maintenance_work_mem value is
>> 64MB, which is enough to hold a little over ten million tuples.  It's
>> also large enough to hold a bitmap for a 14GB table.  So right now if
>> you deleted, say, 100 tuples per page you would end up with an index
>> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
>> the bitmap representation for such cases would require only one index
>> vacuum cycle for every 14GB, more than an order of magnitude
>> improvement!
>
> Yeah, this sounds worthwhile.  If we switch to the more compact
> in-memory representation close to the point where we figure the TID
> array is not going to fit in m_w_m, then we're saving some number of
> additional index scans, and I'm pretty sure that the time to transform
> from array to bitmap is going to be more than paid back by the I/O
> savings.

Yes, that seems pretty clear.  The indexes can be arbitrarily large
and there can be arbitrarily many of them, so we could be save a LOT
of I/O.

> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?

I was just thinking about this exact problem while I was out to
lunch.[1]  I wonder if we could do something like this:

1. Allocate an array large enough for one pointer per gigabyte of the
underlying relation.

2. Allocate 64MB, or the remaining amount of maintenance_work_mem if
it's less, to store TIDs.

3. At the beginning of each 1GB chunk, add a pointer to the first free
byte in the slab allocated in step 2 to the array allocated in step 1.
Write a header word that identifies this as a TID list (rather than a
bitmap) and leave space for a TID count; then, write the TIDs
afterwards.  Continue doing this until one of the following things
happens: (a) we reach the end of the 1GB chunk - if that happens,
restart step 3 for the next chunk; (b) we fill the slab - see step 4;
or (c) we have written enough TIDs for the chunk that the space being used
for them exceeds the space a bitmap would need - see step 5.

4. When we fill up one of the slabs allocated in step 2, allocate a
new one and move the tuples for the current 1GB chunk to the beginning
of the new slab using memmove().  This is wasteful of both CPU time
and memory, but I think it's not that bad.  The maximum possible waste
is less than 10%, and many allocators have more overhead than that.
We could reduce the waste by using, say, 256MB chunks rather than 1GB
chunks.   If no new slab can be allocated because maintenance_work_mem
is completely exhausted (or the remaining space isn't enough for the
TIDs that would need to be moved immediately), then stop and do an
index vacuum cycle.

5. When we write a large enough number of TIDs for 1GB chunk that the
bitmap would be smaller, check whether sufficient maintenance_work_mem
remains to allocate a bitmap for 1GB chunk (~4.5MB).  If not, never
mind; continue with step 3 as if the bitmap representation did not
exist.  If so, allocate space for a bitmap, move all of the TIDs for
the current chunk into it, and update the array allocated in step 1 to
point to it.  Then, finish scanning the current 1GB chunk, updating
that bitmap rather than inserting TIDs into the slab.  Rewind our
pointer into the slab to where it was at the beginning of the current
1GB chunk, so that the memory we consumed for TIDs can be reused now
that those TIDs have been transferred to a bitmap.  If, earlier in the
current 1GB chunk, we did a memmove-to-next-slab operation as
described in step 4, this "rewind" might move our pointer back into
the previous slab, in which case we can free the now-empty slab.  (The
next 1GB segment might have few enough TIDs that they will fit into
the leftover space in the previous slab.)
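
To make the shape of that concrete, roughly this kind of structure (all of
the names here are invented for illustration, not taken from any patch):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * One of these per 1GB chunk of the heap; the array slot is NULL when the
 * chunk has no dead tuples at all.
 */
typedef struct DeadTupleChunk
{
    bool    is_bitmap;          /* TID list, or per-page bitmap (~4.5MB)? */
    uint32  ntids;              /* number of TIDs when !is_bitmap */
    union
    {
        ItemPointerData *tids;  /* points into one of the 64MB slabs */
        uint8           *bitmap;    /* MaxHeapTuplesPerPage bits per heap page */
    }       u;
} DeadTupleChunk;

/* sized to (relation size / 1GB) entries at the start of the scan */
static DeadTupleChunk **dead_tuple_chunks;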

With this algorithm, we never exceed maintenance_work_mem, not even
transiently.  When memory is no longer sufficient to convert to the
bitmap representation without bursting above maintenance_work_mem, we
simply don't perform the conversion.  Also, we do very little memory
copying.  An alternative I considered was to do a separate allocation
for each 1GB chunk rather than carving the dead-tuple space out of
slabs.  But the problem with that is that you'll have to start those
out small (in case you don't find many dead tuples) and then grow
them, which means reallocating, which is bad both because it can burst
above maintenance_work_mem while the repalloc is in process and also
because you have to keep copying the data from the old chunk to the
new, bigger chunk.  

Re: [HACKERS] GiST: interpretation of NaN from penalty function

2016-09-14 Thread Tom Lane
Andrew Borodin  writes:
> Currently GiST treats NaN penalty as zero penalty, in terms of
> generalized tree this means "perfect fit". I think that this situation
> should be considered "worst fit" instead.

On what basis?  It seems hard to me to make any principled argument here.
Certainly, "NaN means infinity", as your patch proposes, has no more basis
to it than "NaN means zero".  If the penalty function doesn't like that
interpretation, it shouldn't return NaN.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Tom Lane
Pavan Deolasee  writes:
> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera 
> wrote:
>> One thing not quite clear to me is how do we create the bitmap
>> representation starting from the array representation in midflight
>> without using twice as much memory transiently.  Are we going to write
>> the array to a temp file, free the array memory, then fill the bitmap by
>> reading the array from disk?

> We could do that.

People who are vacuuming because they are out of disk space will be very
very unhappy with that solution.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-14 Thread Heikki Linnakangas
Addressed all your comments one way or another, new patch attached. 
Comments on some specific points below:


On 09/12/2016 01:13 AM, Peter Geoghegan wrote:

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.


That's not true, the "buffers" in batch memory are not accessed 
sequentially. When we pull the next tuple from a tape, we store it in 
the next free buffer. Usually, that buffer was used to hold the previous 
tuple that was returned from gettuple(), and was just added to the free 
list.


It's still quite cache-friendly, though, because we only need a small 
number of slots (one for each tape).



* i think you should move "bool   *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.


Hmm. I think I prefer to keep the memory management stuff in a separate 
section. While it's true that it's currently only used during merging, 
it's not hard to imagine using it when building the initial runs, for 
example. Except for replacement selection, the pattern for building the 
runs is: add a bunch of tuples, sort, flush them out. It would be 
straightforward to use one large chunk of memory to hold all the tuples. 
I'm not going to do that now, but I think keeping the memory management 
stuff separate from merge-related state makes sense.



* I think that you need to comment on why state->tuplecontext is not
used for batch memory now. It is still useful, for multiple merge
passes, but the situation has notably changed for it.


Hmm. We don't actually use it after the initial runs at all anymore. I 
added a call to destroy it in mergeruns().


Now that we use the batch memory buffers for allocations < 1 kB (I 
pulled that number out of a hat, BTW), and we only need one allocation 
per tape (plus one), there's not much risk of fragmentation.



On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan  wrote:

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:


@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool 
forWrite)
Assert(lt->frozen);
datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
}
+
+   /* Allocate a read buffer */
+   if (lt->buffer)
+   pfree(lt->buffer);
+   lt->buffer = palloc(lt->read_buffer_size);
+   lt->buffer_size = lt->read_buffer_size;


Of course, when you do that you're going to have to make the new
"buffer_size" and "read_buffer_size" fields of type "Size" (or,
possibly, "int64", to match tuplesort.c's own buffer sizing variables
ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
and "nbytes" fields next to "buffer_size" within the struct
LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
"long", though, since that reflects an existing hardly-relevant
limitation that you're not making any worse.


True. I fixed that by putting a MaxAllocSize cap on the buffer size 
instead. I doubt that doing > 1 GB of read-ahead of a single tape will 
do any good.


I wonder if we should actually have a smaller cap there. Even 1 GB seems 
excessive. Might be better to start the merging sooner, rather than wait 
for the read of 1 GB to complete. The point of the OS readahead is that 
the OS will do that for us, in the background. And other processes might 
have better use for the memory anyway.
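
A sketch of what a smaller per-tape cap could look like (the 64MB figure is
only an example, not what the patch currently does):

/* sketch: divide the spare memory evenly, but clamp each tape's share */
int64   per_tape_bytes;

per_tape_bytes = (state->availMem / maxTapes) / BLCKSZ * BLCKSZ;
per_tape_bytes = Max(per_tape_bytes, BLCKSZ);               /* at least one block */
per_tape_bytes = Min(per_tape_bytes, 64L * 1024 * 1024);    /* hypothetical 64MB cap */

LogicalTapeAssignReadBufferSize(state->tapeset, tapenum, (Size) per_tape_bytes);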



* It couldn't hurt to make this code paranoid about LACKMEM() becoming
true, which will cause havoc (we saw this recently in 9.6; a patch of
mine to fix that just went in):


+   /*
+* Use all the spare memory we have available for read buffers. Divide it
+* memory evenly among all the tapes.
+*/
+   avail_blocks = state->availMem / BLCKSZ;
+   per_tape = avail_blocks / maxTapes;
+   cutoff = avail_blocks % maxTapes;
+   if (per_tape == 0)
+   {
+   per_tape = 1;
+   cutoff = 0;
+   }
+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+   LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+   (per_tape + (tapenum < cutoff ? 1 : 0)) 
* BLCKSZ);
+   }


In other words, we really don't want availMem to become < 0, since
it's int64, but a derived value is passed to
LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
if LACKMEM() did happen it would be a bug anyway, but I recommend
defensive code also be added. Per grow_memtuples(), "We need to be
sure that we do not cause LACKMEM to become true, else the space
management algorithm will go nuts". Let's be sure that we get that
right, since, as we saw recently, especially since grow_memtuples()
will not actually have the 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera 
wrote:

>
>
> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?
>

We could do that. Or maybe compress the TID array once it has consumed half of
m_w_m and do this repeatedly with the remaining memory. For example, if we
start with 1GB of memory, we decide to compress at 512MB. Say that results in
300MB for the bitmap. We then continue to accumulate TIDs and do another round
of fold-up when another 350MB is consumed.

I think we should maintain a per-offset count of the number of dead tuples to
choose the most optimal bitmap size (that needs overflow region). We can
also track how many blocks or block ranges have at least one dead tuple, to
know whether it's worthwhile to have some sort of indirection. Together that
can tell us how much compression can be achieved and allow us to choose the
most optimal representation.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Simon Riggs
On 14 September 2016 at 11:19, Pavan Deolasee  wrote:

>>  In
>> theory we could even start with the list of TIDs and switch to the
>> bitmap if the TID list becomes larger than the bitmap would have been,
>> but I don't know if it's worth the effort.
>>
>
> Yes, that works too. Or may be even better because we already know the
> bitmap size requirements, definitely for the tuples collected so far. We
> might need to maintain some more stats to further optimise the
> representation, but that seems like unnecessary detailing at this point.

That sounds best to me... build the simple representation, but as we
do so, maintain stats to show to what extent that set of tuples is
compressible.

When we hit the limit on memory we can then selectively compress
chunks to stay within memory, starting with the most compressible
chunks.

I think we should use the chunking approach Robert suggests, though
mainly because that allows us to consider how parallel VACUUM should
work - writing the chunks to shmem. That would also allow us to apply
a single global limit for vacuum memory rather than an allocation per
VACUUM.
We can then scan multiple indexes at once in parallel, all accessing
the shmem data structure.

We should also find the compression is better when we consider chunks
rather than the whole data structure at once.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Alvaro Herrera
Robert Haas wrote:

> Actually, I think that probably *is* worthwhile, specifically because
> it might let us avoid multiple index scans in cases where we currently
> require them.  Right now, our default maintenance_work_mem value is
> 64MB, which is enough to hold a little over ten million tuples.  It's
> also large enough to hold a bitmap for a 14GB table.  So right now if
> you deleted, say, 100 tuples per page you would end up with an index
> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
> the bitmap representation for such cases would require only one index
> vacuum cycle for every 14GB, more than an order of magnitude
> improvement!

Yeah, this sounds worthwhile.  If we switch to the more compact
in-memory representation close to the point where we figure the TID
array is not going to fit in m_w_m, then we're saving some number of
additional index scans, and I'm pretty sure that the time to transform
from array to bitmap is going to be more than paid back by the I/O
savings.

One thing not quite clear to me is how do we create the bitmap
representation starting from the array representation in midflight
without using twice as much memory transiently.  Are we going to write
the array to a temp file, free the array memory, then fill the bitmap by
reading the array from disk?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GiST: interpretation of NaN from penalty function

2016-09-14 Thread Andrew Borodin
Hi hackers!

Currently GiST treats a NaN penalty as zero penalty; in terms of a
generalized tree this means "perfect fit". I think that this situation
should be considered "worst fit" instead.
Here is a patch to highlight the place in the code.
I could not construct a test that generates a bad tree which would be fixed
by this patch. There are not so many cases in which you get NaN, and none of
them can be the result of the usual additions and multiplications of real
values.

Am I missing something? Is there any case when NaN should be considered a good fit?

Greg Stark was talking about this in
BANLkTi=d+bpps1cm4yc8kukhj63hwj4...@mail.gmail.com but that topic
didn't go far (due to triangles). I'm currently messing with floats in
penalties, very close to NaNs, and I think this question can be
settled.

Regards, Andrey Borodin.
diff --git a/src/backend/access/gist/gistutil.c 
b/src/backend/access/gist/gistutil.c
index fac166d..0a62561 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -675,8 +675,14 @@ gistpenalty(GISTSTATE *giststate, int attno,
  PointerGetDatum(orig),
  PointerGetDatum(add),
  PointerGetDatum());
-   /* disallow negative or NaN penalty */
-   if (isnan(penalty) || penalty < 0.0)
+   /*
+* disallow negative or NaN penalty
+* NaN is considered float infinity - i.e. most inappropriate fit
+* negative is considered penaltiless fit
+*/
+   if (isnan(penalty))
+   penalty = get_float4_infinity();
+   if (penalty < 0.0)
penalty = 0.0;
}
else if (isNullOrig && isNullAdd)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-14 Thread Andrew Borodin
Here is v6 of the cube penalty patch.
There was a warning about an unused edge function on systems without
__STDC_IEC_559__ defined; this patch fixes it.

Regards, Andrey Borodin.
diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 3feddef..ad868ac 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -18,6 +18,7 @@
 
 #include "cubedata.h"
 
+
 PG_MODULE_MAGIC;
 
 /*
@@ -27,6 +28,15 @@ PG_MODULE_MAGIC;
 #define ARRNELEMS(x)  ArrayGetNItems( ARR_NDIM(x), ARR_DIMS(x))
 
 /*
+ * If IEEE754 floats are fully supported we use advanced penalty
+ * which takes into account cube volume, perimeter and tend to favor
+ * small nodes over big ones. For more info see g_cube_penalty implementation
+ */
+#ifdef __STDC_IEC_559__
+#define ADVANCED_PENALTY
+#endif
+
+/*
 ** Input/Output routines
 */
 PG_FUNCTION_INFO_V1(cube_in);
@@ -91,14 +101,18 @@ PG_FUNCTION_INFO_V1(cube_enlarge);
 /*
 ** For internal use only
 */
-int32  cube_cmp_v0(NDBOX *a, NDBOX *b);
-bool   cube_contains_v0(NDBOX *a, NDBOX *b);
-bool   cube_overlap_v0(NDBOX *a, NDBOX *b);
-NDBOX *cube_union_v0(NDBOX *a, NDBOX *b);
-void   rt_cube_size(NDBOX *a, double *sz);
-NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
-bool   g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber 
strategy);
-bool   g_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static int32   cube_cmp_v0(NDBOX *a, NDBOX *b);
+static boolcube_contains_v0(NDBOX *a, NDBOX *b);
+static boolcube_overlap_v0(NDBOX *a, NDBOX *b);
+static NDBOX  *cube_union_v0(NDBOX *a, NDBOX *b);
+#ifdef ADVANCED_PENALTY
+static float   pack_float(float value, int realm);
+static voidrt_cube_edge(NDBOX *a, double *sz);
+#endif
+static voidrt_cube_size(NDBOX *a, double *sz);
+static NDBOX  *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep);
+static boolg_cube_leaf_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
+static boolg_cube_internal_consistent(NDBOX *key, NDBOX *query, 
StrategyNumber strategy);
 
 /*
 ** Auxiliary funxtions
@@ -419,7 +433,36 @@ g_cube_decompress(PG_FUNCTION_ARGS)
}
PG_RETURN_POINTER(entry);
 }
+/*
+ * Function to pack floats of different realms
+ * This function serves to pack bit flags inside float type
+ * Resulted value represent can be from four different "realms"
+ * Every value from realm 3 is greater than any value from realms 2,1 and 0.
+ * Every value from realm 2 is less than every value from realm 3 and greater
+ * than any value from realm 1 and 0, and so on. Values from the same realm
+ * loose two bits of precision. This technique is possible due to floating
+ * point numbers specification according to IEEE 754: exponent bits are highest
+ * (excluding sign bits, but here penalty is always positive). If float a is
+ * greater than float b, integer A with same bit representation as a is greater
+ * than integer B with same bits as b.
+ */
+#ifdef ADVANCED_PENALTY
+static float
+pack_float(const float value, const int realm)
+{
+  union {
+float f;
+struct { unsigned value:31, sign:1; } vbits;
+struct { unsigned value:29, realm:2, sign:1; } rbits;
+  } a;
+
+  a.f = value;
+  a.rbits.value = a.vbits.value >> 2;
+  a.rbits.realm = realm;
 
+  return a.f;
+}
+#endif
 
 /*
 ** The GiST Penalty method for boxes
@@ -441,9 +484,42 @@ g_cube_penalty(PG_FUNCTION_ARGS)
rt_cube_size(DatumGetNDBOX(origentry->key), );
*result = (float) (tmp1 - tmp2);
 
-   /*
-* fprintf(stderr, "penalty\n"); fprintf(stderr, "\t%g\n", *result);
-*/
+#ifdef ADVANCED_PENALTY
+   /* Realm tricks are used only in case of IEEE754 support(IEC 60559) */
+
+   /* REALM 0: No extension is required, volume is zero, return edge   
*/
+   /* REALM 1: No extension is required, return nonzero volume 
*/
+   /* REALM 2: Volume extension is zero, return nonzero edge extension 
*/
+   /* REALM 3: Volume extension is nonzero, return it  
*/
+
+   if( *result == 0 )
+   {
+   double tmp3 = tmp1; /* remember entry volume */
+   rt_cube_edge(ud, );
+   rt_cube_edge(DatumGetNDBOX(origentry->key), );
+   *result = (float) (tmp1 - tmp2);
+   if( *result == 0 )
+   {
+   if( tmp3 != 0 )
+   {
+   *result = pack_float(tmp3, 1); /* REALM 1 */
+   }
+   else
+   {
+   *result = pack_float(tmp1, 0); /* REALM 0 */
+   }
+   }
+   else
+   {
+   *result = pack_float(*result, 2); /* REALM 2 */
+   }
+   }
+   else
+   {
+  

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Claudio Freire
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.

If you compress the list into a bitmap a posteriori, you know the
number of tuples per page, so you could encode the bitmap even more
efficiently.

It's not a bad idea, one that can be slapped on top of the multiarray
patch - when closing a segment, it can be decided whether to turn it
into a bitmap or not.
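
For illustration, the close-time decision could be as simple as comparing
the two encodings' sizes for that segment. This is a rough sketch with
made-up names (SEGMENT_PAGES, segment_prefers_bitmap), not code from the
multiarray patch:

#define SEGMENT_PAGES           ((BlockNumber) ((1024 * 1024 * 1024) / BLCKSZ))
#define BITMAP_BYTES_PER_PAGE   ((MaxHeapTuplesPerPage + 7) / 8)

/* true if re-encoding the closed segment as a per-page bitmap would be smaller */
static bool
segment_prefers_bitmap(int num_dead_tids)
{
	Size	array_bytes  = (Size) num_dead_tids * sizeof(ItemPointerData);
	Size	bitmap_bytes = (Size) SEGMENT_PAGES * BITMAP_BYTES_PER_PAGE;

	return bitmap_bytes < array_bytes;
}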


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-14 Thread Andres Freund
(continuing, uh, a bit happier)

On 2016-09-09 00:59:26 +0200, Petr Jelinek wrote:

> +/*
> + * Relcache invalidation callback for our relation map cache.
> + */
> +static void
> +logicalreprelmap_invalidate_cb(Datum arg, Oid reloid)
> +{
> + LogicalRepRelMapEntry  *entry;
> +
> + /* Just to be sure. */
> + if (LogicalRepRelMap == NULL)
> + return;
> +
> + if (reloid != InvalidOid)
> + {
> + HASH_SEQ_STATUS status;
> +
> + hash_seq_init(&status, LogicalRepRelMap);
> +
> + /* TODO, use inverse lookup hastable? */

*hashtable

> + while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
> + {
> + if (entry->reloid == reloid)
> + entry->reloid = InvalidOid;

can't we break here?


> +/*
> + * Initialize the relation map cache.
> + */
> +static void
> +remoterelmap_init(void)
> +{
> + HASHCTL ctl;
> +
> + /* Make sure we've initialized CacheMemoryContext. */
> + if (CacheMemoryContext == NULL)
> + CreateCacheMemoryContext();
> +
> + /* Initialize the hash table. */
> + MemSet(&ctl, 0, sizeof(ctl));
> + ctl.keysize = sizeof(uint32);
> + ctl.entrysize = sizeof(LogicalRepRelMapEntry);
> + ctl.hcxt = CacheMemoryContext;

Wonder if this (and similar code earlier) should try to do everything in
a sub-context of CacheMemoryContext instead. That'd make some issues
easier to track down.
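
Something along these lines, say (a sketch only, reusing the names from the
patch above; the context name and hash table size are my own guesses):

static MemoryContext LogicalRepRelMapContext = NULL;

static void
remoterelmap_init(void)
{
	HASHCTL		ctl;

	/* Make sure we've initialized CacheMemoryContext. */
	if (CacheMemoryContext == NULL)
		CreateCacheMemoryContext();

	/* Keep the relation map in its own child of CacheMemoryContext. */
	if (LogicalRepRelMapContext == NULL)
		LogicalRepRelMapContext =
			AllocSetContextCreate(CacheMemoryContext,
								  "LogicalRepRelMapContext",
								  ALLOCSET_DEFAULT_MINSIZE,
								  ALLOCSET_DEFAULT_INITSIZE,
								  ALLOCSET_DEFAULT_MAXSIZE);

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(uint32);
	ctl.entrysize = sizeof(LogicalRepRelMapEntry);
	ctl.hcxt = LogicalRepRelMapContext;

	LogicalRepRelMap = hash_create("logicalrep relation map cache", 128,
								   &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}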

> +/*
> + * Open the local relation associated with the remote one.
> + */
> +static LogicalRepRelMapEntry *
> +logicalreprel_open(uint32 remoteid, LOCKMODE lockmode)
> +{
> + LogicalRepRelMapEntry  *entry;
> + boolfound;
> +
> + if (LogicalRepRelMap == NULL)
> + remoterelmap_init();
> +
> + /* Search for existing entry. */
> + entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
> + HASH_FIND, &found);
> +
> + if (!found)
> + elog(FATAL, "cache lookup failed for remote relation %u",
> +  remoteid);
> +
> + /* Need to update the local cache? */
> + if (!OidIsValid(entry->reloid))
> + {
> + Oid nspid;
> + Oid relid;
> + int i;
> + TupleDesc   desc;
> + LogicalRepRelation *remoterel;
> +
> + remoterel = &entry->remoterel;
> +
> + nspid = LookupExplicitNamespace(remoterel->nspname, false);
> + if (!OidIsValid(nspid))
> + ereport(FATAL,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +  errmsg("the logical replication target %s not found",
> + quote_qualified_identifier(remoterel->nspname,
> +remoterel->relname))));
> + relid = get_relname_relid(remoterel->relname, nspid);
> + if (!OidIsValid(relid))
> + ereport(FATAL,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +  errmsg("the logical replication target %s not found",
> + quote_qualified_identifier(remoterel->nspname,
> +remoterel->relname))));
> +
> + entry->rel = heap_open(relid, lockmode);

This seems rather racy. I think this really instead needs something akin
to RangeVarGetRelidExtended().

> +/*
> + * Executor state preparation for evaluation of constraint expressions,
> + * indexes and triggers.
> + *
> + * This is based on similar code in copy.c
> + */
> +static EState *
> +create_estate_for_relation(LogicalRepRelMapEntry *rel)
> +{
> + EState *estate;
> + ResultRelInfo *resultRelInfo;
> + RangeTblEntry *rte;
> +
> + estate = CreateExecutorState();
> +
> + rte = makeNode(RangeTblEntry);
> + rte->rtekind = RTE_RELATION;
> + rte->relid = RelationGetRelid(rel->rel);
> + rte->relkind = rel->rel->rd_rel->relkind;
> + estate->es_range_table = list_make1(rte);
> +
> + resultRelInfo = makeNode(ResultRelInfo);
> + InitResultRelInfo(resultRelInfo, rel->rel, 1, 0);
> +
> + estate->es_result_relations = resultRelInfo;
> + estate->es_num_result_relations = 1;
> + estate->es_result_relation_info = resultRelInfo;
> +
> + /* Triggers might need a slot */
> + if (resultRelInfo->ri_TrigDesc)
> + estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);
> +
> + return estate;
> +}

Ugh, we do this for every single change? That's pretty darn heavy.


> +/*
> + * Check if the local 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 8:47 PM, Robert Haas  wrote:

>
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.


Yeah, I wouldn't worry only about lookup speedup, but if it does speed up,
that's a bonus. But the bitmaps seem to win even for memory consumption. As
theory and experiments both show, at 10% dead tuple ratio, bitmaps will win
handsomely.

> In
> theory we could even start with the list of TIDs and switch to the
> bitmap if the TID list becomes larger than the bitmap would have been,
> but I don't know if it's worth the effort.
>
>
Yes, that works too. Or maybe even better, because we already know the
bitmap size requirements, certainly for the tuples collected so far. We
might need to maintain some more stats to further optimise the
representation, but that seems like unnecessary detail at this point.


>
> On the other hand, if we switch to the bitmap as the ONLY possible
> representation, we will lose badly when there are scattered updates -
> e.g. 1 deleted tuple every 10 pages.


Sure. I never suggested that. What I'd suggested was to switch back to the
array representation once we realise bitmaps are not going to work. But I
see it's probably better the other way round.



> So it seems like we probably
> want to have both options.  One tricky part is figuring out how we
> switch between them when memory gets tight; we have to avoid bursting
> above our memory limit while making the switch.


Yes, I was thinking about this problem. Some modelling will be necessary to
ensure that we don't go (much) beyond maintenance_work_mem while switching
representations, which probably means you need to make the switch earlier
than strictly necessary.


> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.
>
>
Yes, seems like a good idea. Another idea that I had in mind is to use some
sort of indirection map where the bitmap for every block or set of blocks
will either be recorded or not, depending on whether a bit is set for the
range. If the bitmap exists, the indirection map will give the offset
into the larger bitmap area. Seems similar to what you described.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-14 Thread Dilip Kumar
On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas  wrote:
> Sure, but you're testing at *really* high client counts here.  Almost
> nobody is going to benefit from a 5% improvement at 256 clients.

I agree with your point, but here we need to consider one more thing:
compared to head we are gaining ~30% with both approaches.

So for comparing these two patches we can consider..

A.  Other workloads (one can be as below)
   -> Load on CLogControlLock at commit (exclusive mode) + Load on
CLogControlLock at Transaction status (shared mode).
   I think we can mix (savepoint + updates)

B. Simplicity of the patch (if both are performing almost equal in all
practical scenarios).

C. Based on the algorithm, whichever seems to be the winner.

I will try to test these patches with other workloads...

>  You
> need to test 64 clients and 32 clients and 16 clients and 8 clients
> and see what happens there.  Those cases are a lot more likely than
> these stratospheric client counts.

I tested with 64 clients as well:
1. On head we are gaining ~15% with both the patches.
2. But group lock vs. granular lock is almost the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Arthur Silva
On Sep 14, 2016 5:18 PM, "Robert Haas"  wrote:
>
> On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee
>  wrote:
> > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per
page
> > bitmap size. Thats about 36 bytes for 8K page. IOW if on an average
there
> > are 6 or more dead tuples per page, bitmap will outperform the current
> > representation, assuming max allocation for bitmap. If we can use
additional
> > estimates to restrict the size to somewhat more conservative value and
then
> > keep overflow area, then probably the break-even happens even earlier
than
> > that. I hope this gives us a good starting point, but let me know if you
> > think it's still a wrong approach to pursue.
>
> Well, it's certainly a bigger change.  I think the big concern is that
> the amount of memory now becomes fixed based on the table size.  So
> one problem is that you have to figure out what you're going to do if
> the bitmap doesn't fit in maintenance_work_mem.  A related problem is
> that it might fit but use more memory than before, which could cause
> problems for some people.  Now on the other hand it could also use
> less memory for some people, and that would be good.
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.
>
> I'm fairly sure that our number one priority should be to minimize the
> number of cases where we need to do multiple scans of the indexes to
> stay within maintenance_work_mem.  If we're satisfied we've met that
> goal, then within that we should try to make VACUUM as fast as
> possible with as little memory usage as possible.  I'm not 100% sure I
> know how to get there, or how much work it's worth expending.  In
> theory we could even start with the list of TIDs and switch to the
> bitmap if the TID list becomes larger than the bitmap would have been,
> but I don't know if it's worth the effort.
>
> /me thinks a bit.
>
> Actually, I think that probably *is* worthwhile, specifically because
> it might let us avoid multiple index scans in cases where we currently
> require them.  Right now, our default maintenance_work_mem value is
> 64MB, which is enough to hold a little over ten million tuples.  It's
> also large enough to hold a bitmap for a 14GB table.  So right now if
> you deleted, say, 100 tuples per page you would end up with an index
> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
> the bitmap representation for such cases would require only one index
> vacuum cycle for every 14GB, more than an order of magnitude
> improvement!
>
> On the other hand, if we switch to the bitmap as the ONLY possible
> representation, we will lose badly when there are scattered updates -
> e.g. 1 deleted tuple every 10 pages.  So it seems like we probably
> want to have both options.  One tricky part is figuring out how we
> switch between them when memory gets tight; we have to avoid bursting
> above our memory limit while making the switch.  And even if our
> memory limit is very high, we want to avoid using memory gratuitously;
> I think we should try to grow memory usage incrementally with either
> representation.
>
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

I'd say it's an idea worth pursuing. It's the base idea behind roaring
bitmaps, arguably the best overall compressed bitmap implementation.


Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()

2016-09-14 Thread Tom Lane
Robert Haas  writes:
> Interesting.  I think that our documentation should only describe the
> way we use unit suffixes in one central place, but other places (like
> pg_size_pretty) could link to that central place.

> I don't believe that there is any general unanimity among users or
> developers about the question of which suffixes denote units
> denominated in units of 2^10 bytes vs. 10^3 bytes.  About once a year,
> somebody makes an argument that we're doing it wrong, but the evidence
> that I've seen is very mixed.  So when people say that there is only
> one right way to do this and we are not in compliance with that one
> right way, I guess I just don't believe it.  Not everybody likes the
> way we do it, but I am fairly sure that if we change it, we'll make
> some currently-unhappy people happy and some currently-happy people
> unhappy.  And the people who don't care but wanted to preserve
> backward compatibility will all be in the latter camp.

That's about my position too: I cannot see that changing this is going
to make things better to a degree that would justify breaking backwards
compatibility.

> However, that is not to say that the documentation couldn't be better.

+1; your idea above seems sound.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 5:22 AM, Thomas Berger  wrote:
> Today, I found the time to read all the mails in this thread, and I think I
> have to explain why we decided to open a bug for this behavior.
>
> On Tuesday, 23. August 2016, 13:30:29 Robert Haas wrote:
>> J. Random User: I'm having a problem!
>> Mailing List: Gee, how big are your tables?
>> J. Random User: Here's some pg_size_pretty output.
>> Mailing List: Gosh, we don't know what that means, what do you have
>> this obscure GUC set to?
>> J. Random User: Maybe I'll just give up on SQL and use MongoDB.
>
> In fact, we had it just the other way around. One of our most critical
> databases had some extreme bloat.
> Some of our internal customers were very confused about the sizes reported
> by the database.
> This confusion led to wrong decisions. (And a long discussion about the
> choice of DBMS, btw.)
>
> I think there is a point missing in this whole discussion, or I just didn't
> see it:
>
> Yeah, the behavior of "kB" is defined in the "postgresql.conf" documentation.
> But no _user_ reads this. There is no link or hint in the documentation of 
> "pg_size_pretty()" [1].

Interesting.  I think that our documentation should only describe the
way we use unit suffixes in one central place, but other places (like
pg_size_pretty) could link to that central place.

I don't believe that there is any general unanimity among users or
developers about the question of which suffixes denote units
denominated in units of 2^10 bytes vs. 10^3 bytes.  About once a year,
somebody makes an argument that we're doing it wrong, but the evidence
that I've seen is very mixed.  So when people say that there is only
one right way to do this and we are not in compliance with that one
right way, I guess I just don't believe it.  Not everybody likes the
way we do it, but I am fairly sure that if we change it, we'll make
some currently-unhappy people happy and some currently-happy people
unhappy.  And the people who don't care but wanted to preserve
backward compatibility will all be in the latter camp.

However, that is not to say that the documentation couldn't be better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 12:13 AM, Kouhei Kaigai  wrote:
> It looks to me pg_buffercache tries to allocate more than 1GB using
> palloc(), when shared_buffers is more than 256GB.
>
> # show shared_buffers ;
>  shared_buffers
> 
>  280GB
> (1 row)
>
> # SELECT buffers, d.datname, coalesce(c.relname, '???')
> FROM (SELECT count(*) buffers, reldatabase, relfilenode
> FROM pg_buffercache group by reldatabase, relfilenode) b
>LEFT JOIN pg_database d ON d.oid = b.reldatabase
>LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
>  WHERE datname = current_database())
>AND b.relfilenode = pg_relation_filenode(c.oid)
>ORDER BY buffers desc;
> ERROR:  invalid memory alloc request size 1174405120
>
> It is a situation to use MemoryContextAllocHuge(), instead of palloc().
> Also, it may need a back patching?

I guess so.  Although it's not very desirable for it to use that much
memory, I suppose if you have a terabyte of shared_buffers you
probably have 4GB of memory on top of that to show what they contain.
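
Something like the following in pg_buffercache_pages.c should do it (a
sketch only; the struct and field names here are from memory and may not
match the code exactly):

-	fctx->record = (BufferCachePagesRec *)
-		palloc(sizeof(BufferCachePagesRec) * NBuffers);
+	fctx->record = (BufferCachePagesRec *)
+		MemoryContextAllocHuge(CurrentMemoryContext,
+							   sizeof(BufferCachePagesRec) * NBuffers);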

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar  wrote:
> 2. Results
> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
> scale factor: 300
> Clients   head(tps)   grouplock(tps)   granular(tps)
> -------   ---------   --------------   -------------
> 128       29367       39326            37421
> 180       29777       37810            36469
> 256       28523       37418            35882
>
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention
> granular  --> 2) Use granular locking model
>
> I will test with 3rd approach also, whenever I get time.
>
> 3. Summary:
> 1. I can see on head we are gaining almost ~30 % performance at higher
> client count (128 and beyond).
> 2. group lock is ~5% better compared to granular lock.

Sure, but you're testing at *really* high client counts here.  Almost
nobody is going to benefit from a 5% improvement at 256 clients.  You
need to test 64 clients and 32 clients and 16 clients and 8 clients
and see what happens there.  Those cases are a lot more likely than
these stratospheric client counts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Claudio Freire
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.

FYI, the reported impact was on CPU time, not runtime. There was no
significant difference in runtime (real time), because my test is
heavily I/O bound.

I tested with a few small tables and there was no significant
difference either, but small tables don't stress the array lookup
anyway so that's expected.

But on the assumption that some systems may be CPU bound during vacuum
(particularly those able to do more than 300-400MB/s sequential I/O),
in those cases the increased or decreased cost of lazy_tid_reaped will
directly correlate to runtime. It's just not the case on any of my systems,
which all run on Amazon and are heavily bandwidth constrained (the fastest
I/O subsystem I can get my hands on does 200MB/s).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Push down more full joins in postgres_fdw

2016-09-14 Thread Robert Haas
On Tue, Sep 13, 2016 at 11:38 PM, Ashutosh Bapat
 wrote:
> On Tue, Sep 13, 2016 at 10:28 PM, Robert Haas  wrote:
>> On Tue, Sep 6, 2016 at 9:07 AM, Ashutosh Bapat
>>  wrote:
>>> That's not true with the alias information. As long as we detect which
>>> relations need subqueries, their RTIs are enough to create unique aliases
>>> e.g. if a relation involves RTIs 1, 2, 3 corresponding subquery can have
>>> alias r123 without conflicting with any other alias.
>>
>> What if RTI 123 is also used in the query?
>
> Good catch. I actually meant some combination of 1, 2 and 3, which is
> unique for a join between r1, r2 and r3.  How about r1_2_3 or
> r1_r2_r3?

Sure, something like that can work, but if you have a big enough join
the identifier might get too long.  I'm not sure why it wouldn't work
to just use the lowest RTI involved in the join, though; the others
won't appear anywhere else at that query level.
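
Roughly, the deparser could do something like this when it needs an alias
for a pushed-down subquery (illustrative only; the function name is not
from the actual patch):

/* use the lowest RTI in the join's relids, e.g. "r3" for RTIs {3,5,7} */
static void
append_subquery_alias(StringInfo buf, Relids relids)
{
	int		lowest_rti = bms_next_member(relids, -1);

	Assert(lowest_rti >= 0);
	appendStringInfo(buf, " r%d", lowest_rti);
}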

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee
 wrote:
> Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page
> bitmap size. Thats about 36 bytes for 8K page. IOW if on an average there
> are 6 or more dead tuples per page, bitmap will outperform the current
> representation, assuming max allocation for bitmap. If we can use additional
> estimates to restrict the size to somewhat more conservative value and then
> keep overflow area, then probably the break-even happens even earlier than
> that. I hope this gives us a good starting point, but let me know if you
> think it's still a wrong approach to pursue.

Well, it's certainly a bigger change.  I think the big concern is that
the amount of memory now becomes fixed based on the table size.  So
one problem is that you have to figure out what you're going to do if
the bitmap doesn't fit in maintenance_work_mem.  A related problem is
that it might fit but use more memory than before, which could cause
problems for some people.  Now on the other hand it could also use
less memory for some people, and that would be good.

I am kind of doubtful about this whole line of investigation because
we're basically trying pretty hard to fix something that I'm not sure
is broken.  I do agree that, all other things being equal, the TID
lookups will probably be faster with a bitmap than with a binary
search, but maybe not if the table is large and the number of dead
TIDs is small, because cache efficiency is pretty important.  But even
if it's always faster, does TID lookup speed even really matter to
overall VACUUM performance? Claudio's early results suggest that it
might, but maybe that's just a question of some optimization that
hasn't been done yet.

I'm fairly sure that our number one priority should be to minimize the
number of cases where we need to do multiple scans of the indexes to
stay within maintenance_work_mem.  If we're satisfied we've met that
goal, then within that we should try to make VACUUM as fast as
possible with as little memory usage as possible.  I'm not 100% sure I
know how to get there, or how much work it's worth expending.  In
theory we could even start with the list of TIDs and switch to the
bitmap if the TID list becomes larger than the bitmap would have been,
but I don't know if it's worth the effort.

/me thinks a bit.

Actually, I think that probably *is* worthwhile, specifically because
it might let us avoid multiple index scans in cases where we currently
require them.  Right now, our default maintenance_work_mem value is
64MB, which is enough to hold a little over ten million tuples.  It's
also large enough to hold a bitmap for a 14GB table.  So right now if
you deleted, say, 100 tuples per page you would end up with an index
vacuum cycle for every ~100,000 pages = 800MB, whereas switching to
the bitmap representation for such cases would require only one index
vacuum cycle for every 14GB, more than an order of magnitude
improvement!

On the other hand, if we switch to the bitmap as the ONLY possible
representation, we will lose badly when there are scattered updates -
e.g. 1 deleted tuple every 10 pages.  So it seems like we probably
want to have both options.  One tricky part is figuring out how we
switch between them when memory gets tight; we have to avoid bursting
above our memory limit while making the switch.  And even if our
memory limit is very high, we want to avoid using memory gratuitously;
I think we should try to grow memory usage incrementally with either
representation.

For instance, one idea to grow memory usage incrementally would be to
store dead tuple information separately for each 1GB segment of the
relation.  So we have an array of dead-tuple-representation objects,
one for every 1GB of the relation.  If there are no dead tuples in a
given 1GB segment, then this pointer can just be NULL.  Otherwise, it
can point to either the bitmap representation (which will take ~4.5MB)
or it can point to an array of TIDs (which will take 6 bytes/TID).
That could handle an awfully wide variety of usage patterns
efficiently; it's basically never worse than what we're doing today,
and when the dead tuple density is high for any portion of the
relation it's a lot better.
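
In code form, the array-of-objects idea could look roughly like this (names
invented here for illustration, not taken from any patch):

typedef struct DeadTuplesSegment
{
	/* exactly one of these is non-NULL once the segment has dead tuples */
	ItemPointer	dt_tids;	/* sorted TID array, 6 bytes per dead tuple */
	bits8	   *dt_bitmap;	/* bitmap, MaxHeapTuplesPerPage bits per heap page */
	int			dt_numtids;	/* number of dead TIDs recorded */
} DeadTuplesSegment;

typedef struct DeadTuplesIndex
{
	int					nsegments;	/* one slot per 1GB of the relation */
	DeadTuplesSegment **segments;	/* NULL slot = no dead tuples there */
} DeadTuplesIndex;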

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Alvaro Herrera
Tom Lane wrote:
> Ashutosh Bapat  writes:
> > While working on partition-wise join, I had to examine Relids objects
> > many times. Printing the Bitmapset::words[] in binary format and then
> > inferring the relids takes time and is error prone.
> 
> FWIW, I generally rely on pprint() to look at planner data structures
> from the debugger.

Also:
http://blog.pgaddict.com/posts/making-debugging-with-gdb-a-bit-easier
This may not address the issue directly, but it's probably very helpful.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Tom Lane
Ashutosh Bapat  writes:
> While working on partition-wise join, I had to examine Relids objects
> many times. Printing the Bitmapset::words[] in binary format and then
> inferring the relids takes time and is error prone.

FWIW, I generally rely on pprint() to look at planner data structures
from the debugger.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Tom Lane
Alvaro Herrera  writes:
> I don't understand.  Why don't you just use "call pprint(the bitmapset)"
> in the debugger?

Bitmapsets aren't Nodes, so pprint doesn't work directly on them.
I usually find that I can pprint some node containing the value(s)
I'm interested in, but maybe that isn't working for Ashutosh's
particular case.
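
For what it's worth, the sort of helper Ashutosh describes can be tiny
(sketched here from his description, not his actual patch):

char *
bms_to_char(const Bitmapset *bms)
{
	StringInfo	str = makeStringInfo();

	outBitmapset(str, bms);		/* emits the "(b 1 2 ...)" style output */
	return str->data;
}

and then "p bms_to_char(some_relids)" works from gdb or lldb.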

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-09-14 Thread Anastasia Lubennikova

One more update.

I added an ORDER BY clause to the regression tests.
It was done as a separate bugfix patch by Tom Lane some time ago,
but it definitely should be included in the patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index d4f9090..99735ce 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
 static void deleteConnection(const char *name);
-static char **get_pkey_attnames(Relation rel, int16 *numatts);
+static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts);
 static char **get_text_array_contents(ArrayType *array, int *numitems);
 static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals);
 static char *get_sql_delete(Relation rel, int *pkattnums, int pknumatts, char **tgt_pkattvals);
@@ -1483,7 +1483,7 @@ PG_FUNCTION_INFO_V1(dblink_get_pkey);
 Datum
 dblink_get_pkey(PG_FUNCTION_ARGS)
 {
-	int16		numatts;
+	int16		indnkeyatts;
 	char	  **results;
 	FuncCallContext *funcctx;
 	int32		call_cntr;
@@ -1509,7 +1509,7 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 		rel = get_rel_from_relname(PG_GETARG_TEXT_P(0), AccessShareLock, ACL_SELECT);
 
 		/* get the array of attnums */
-		results = get_pkey_attnames(rel, &numatts);
+		results = get_pkey_attnames(rel, &indnkeyatts);
 
 		relation_close(rel, AccessShareLock);
 
@@ -1529,9 +1529,9 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 		attinmeta = TupleDescGetAttInMetadata(tupdesc);
 		funcctx->attinmeta = attinmeta;
 
-		if ((results != NULL) && (numatts > 0))
+		if ((results != NULL) && (indnkeyatts > 0))
 		{
-			funcctx->max_calls = numatts;
+			funcctx->max_calls = indnkeyatts;
 
 			/* got results, keep track of them */
 			funcctx->user_fctx = results;
@@ -2021,10 +2021,10 @@ dblink_fdw_validator(PG_FUNCTION_ARGS)
  * get_pkey_attnames
  *
  * Get the primary key attnames for the given relation.
- * Return NULL, and set numatts = 0, if no primary key exists.
+ * Return NULL, and set indnkeyatts = 0, if no primary key exists.
  */
 static char **
-get_pkey_attnames(Relation rel, int16 *numatts)
+get_pkey_attnames(Relation rel, int16 *indnkeyatts)
 {
 	Relation	indexRelation;
 	ScanKeyData skey;
@@ -2034,8 +2034,8 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 	char	  **result = NULL;
 	TupleDesc	tupdesc;
 
-	/* initialize numatts to 0 in case no primary key exists */
-	*numatts = 0;
+	/* initialize indnkeyatts to 0 in case no primary key exists */
+	*indnkeyatts = 0;
 
 	tupdesc = rel->rd_att;
 
@@ -2056,12 +2056,12 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 		/* we're only interested if it is the primary key */
 		if (index->indisprimary)
 		{
-			*numatts = index->indnatts;
-			if (*numatts > 0)
+			*indnkeyatts = index->indnkeyatts;
+			if (*indnkeyatts > 0)
 			{
-result = (char **) palloc(*numatts * sizeof(char *));
+result = (char **) palloc(*indnkeyatts * sizeof(char *));
 
-for (i = 0; i < *numatts; i++)
+for (i = 0; i < *indnkeyatts; i++)
 	result[i] = SPI_fname(tupdesc, index->indkey.values[i]);
 			}
 			break;
diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 7352b29..c024cf3 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -138,9 +138,9 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 		/* we're only interested if it is the primary key and valid */
 		if (index->indisprimary && IndexIsValid(index))
 		{
-			int			numatts = index->indnatts;
+			int indnkeyatts = index->indnkeyatts;
 
-			if (numatts > 0)
+			if (indnkeyatts > 0)
 			{
 int			i;
 
@@ -150,7 +150,7 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 appendStringInfoCharMacro(payload, ',');
 appendStringInfoCharMacro(payload, operation);
 
-for (i = 0; i < numatts; i++)
+for (i = 0; i < indnkeyatts; i++)
 {
 	int			colno = index->indkey.values[i];
 
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 322d8d6..0983c93 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3564,6 +3564,14 @@
   pg_class.relnatts)
  
 
+  
+  indnkeyatts
+  int2
+  
+  The number of key columns in the index. "Key columns" are ordinary
+  index columns in contrast with "included" columns.
+ 
+
  
   indisunique
   bool
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..92bef4a 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
bool        amclusterable;
/* does AM handle predicate locks? */
bool        ampredlocks;
+/* does AM support columns included with clause INCLUDING? */
+bool        amcaninclude;
 /* type of data stored in index, or 

Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Alvaro Herrera
Ashutosh Bapat wrote:
> Hi All,
> While working on partition-wise join, I had to examine Relids objects
> many times. Printing the Bitmapset::words[] in binary format and then
> inferring the relids takes time and is error prone. Instead I wrote a
> function bms_to_char() which allocates a StringInfo, calls
> outBitmapset() to decode Bitmapset as a set of integers and returns
> the string. In order to examine a Relids object all one has to do is
> execute 'p bms_to_char(bms_object) under gdb.

I don't understand.  Why don't you just use "call pprint(the bitmapset)"
in the debugger?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-14 Thread Tom Lane
Rahila Syed  writes:
>> Looking at the other variables hooks, they already emit errors and can
>> deny the effect of a change corresponding to a new value, without
>> informing the caller. Why would autocommit be different?

> These instances where psql_error occurs inside hooks the command is
> successful and the value supplied by user is reinterpreted to some other
> value as user had supplied an unrecognisable value.
> With psql_error_on_autocommit patch what was intended was to make
> the command unsuccessful and keep the previous setting of autocommit.
> Hence having it inside autocommit_hook did not seem appropriate to me.

Nonetheless, asking all callers of SetVariable to deal with such cases
is entirely unmaintainable/unacceptable.  Have you considered expanding
the API for hook functions?  I'm not really sure why we didn't provide a
way for the hooks to reject a setting to begin with.

Actually, it would make a lot more sense UI-wise if attempting to assign a
non-boolean value to a boolean variable resulted in an error and no change
to the variable, instead of what happens now.

Anyway, I'm not very thrilled with the idea that AUTOCOMMIT is so special
that it should have a different behavior than any other built-in psql
variable.  If we make them all throw errors and refuse to change to bad
values, that would be consistent and defensible IMO.  But having
AUTOCOMMIT alone act that way is not a feature, it's a wart.
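
Concretely, the hook signature in psql's variables.h could change along
these lines (just a sketch of the idea, not a worked-out patch):

/* return false to reject newval and leave the variable unchanged */
typedef bool (*VariableAssignHook) (const char *newval);

with SetVariable() only storing the new value (and returning true) when the
hook is absent or returns true.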

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] condition variables

2016-09-14 Thread Robert Haas
On Tue, Sep 13, 2016 at 10:55 PM, Peter Geoghegan  wrote:
> On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas  wrote:
>> Another approach to the problem is to use a latch wait loop.  That
>> almost works.  Interrupts can be serviced, and you can recheck shared
>> memory to see whether the condition for proceeding is satisfied after
>> each iteration of the loop.  There's only one problem: when you do
>> something that might cause the condition to be satisfied for other
>> waiting backends, you need to set their latch - but you don't have an
>> easy way to know exactly which processes are waiting, so how do you
>> call SetLatch?  I originally thought of adding a function like
>> SetAllLatches(ParallelContext *) and maybe that can work, but then I
>> had what I think is a better idea, which is to introduce a notion of
>> condition variables.
>
> I don't see a CF entry for this. Are you planning to work on this
> again soon, Robert?
>
> I have an eye on this patch due to my work on parallel CREATE INDEX.
> It would be nice to have some rough idea of when you intend to commit
> this.

I basically figured I would commit it when and if it became clear that
it'd get good use in some other patch which was on the road to being
committed.  I don't think it needs much work, just the assurance that
it will get some use.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sequences and pg_upgrade

2016-09-14 Thread Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, failed

Thank you for the patch.
As I see, there are no objections in the discussion, and all the patches look clear.

Could you clarify, please, why we dump sequence data in schema-only mode?
+   if (dopt.schemaOnly && dopt.sequence_data)
+   getSequenceData(&dopt, tblinfo, numTables, dopt.oids);

Example: 
postgres=# create table t(i serial, data text);
postgres=# insert into t(data) values ('aaa');
pg_dump -d postgres --sequence-data --schema-only > ../reviews/dump_pg

Then restore it into newdb and add new value.
newdb=# insert into t(data) values ('aaa');
INSERT 0 1
newdb=# select * from t;
 i | data 
---+--
 2 | aaa

I'm not an experienced user, but I thought that while doing a dump/restore
of a database's schema we reset all the data. Why should the table in the
newly created (via pg_restore) database have a non-default sequence value?

I also did some other tests and all of them passed.

One more thing to do is documentation for the new option.
You should update the help() function in pg_dump.c and also add some
notes to pg_dump.sgml and probably to pgupgrade.sgml.

The new status of this patch is: Waiting on Author

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-14 Thread Rahila Syed
Hello,

>Looking at the other variables hooks, they already emit errors and can
>deny the effect of a change corresponding to a new value, without
>informing the caller. Why would autocommit be different?
The only type of psql_error inside hooks is as follows,

 psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n",
   newval, "ECHO", "none");

In these instances where psql_error occurs inside hooks, the command is
successful and the value supplied by the user is reinterpreted as some other
value because the user supplied an unrecognisable value.

With the psql_error_on_autocommit patch, what was intended was to make
the command unsuccessful and keep the previous setting of autocommit.
Hence having it inside autocommit_hook did not seem appropriate to me.

But as you pointed out, the other way of setting autocommit, i.e.

*SELECT 'on' as "AUTOCOMMIT" \gset*, will not be handled by the patch.
So I will change the patch to have the check in the autocommit_hook instead.

This will mean that if *\set AUTOCOMMIT ON* or *SELECT 'on' as "AUTOCOMMIT"
\gset* is run inside a transaction, it will take effect after the current
transaction has ended. An appropriate message will be displayed notifying
the user of this, and the user need not rerun the \set AUTOCOMMIT command.
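
For example, the hook could end up looking roughly like this (a sketch only;
the exact wording of the message is to be decided):

static void
autocommit_hook(const char *newval)
{
	pset.autocommit = ParseVariableBool(newval, "AUTOCOMMIT");

	/* warn if the new setting cannot take effect until the transaction ends */
	if (pset.autocommit &&
		pset.db && PQtransactionStatus(pset.db) == PQTRANS_INTRANS)
		psql_error("AUTOCOMMIT will take effect when the current transaction ends\n");
}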

Thank you,
Rahila Syed



On Thu, Sep 8, 2016 at 9:55 PM, Daniel Verite 
wrote:

> Rahila Syed wrote:
>
> > However, including the check here will require returning the status
> > of command from this hook. i.e if we throw error inside this
> > hook we will need to return false indicating the value has not changed.
>
> Looking at the other variables hooks, they already emit errors and can
> deny the effect of a change corresponding to a new value, without
> informing the caller. Why would autocommit be different?
>
> For example echo_hook does this:
>
> /* ...if the value is in (queries,errors,all,none) then
>assign pset.echo accordingly ... */
> else
> {
>   psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n",
>newval, "ECHO", "none");
>   pset.echo = PSQL_ECHO_NONE;
> }
>
>
> If the user issues \set ECHO FOOBAR, then it produces the above error
> message and makes the same internal change as if \set ECHO none
> had been issued.
>
> But, the value of the variable as seen by the user is still FOOBAR:
>
> \set
> [...other variables...]
> ECHO = 'FOOBAR'
>
> The proposed patch doesn't follow that behavior, as it denies
> a new value for AUTOCOMMIT. You might argue that it's better,
> but on the other side, there are two reasons against it:
>
> - it's not consistent with the other uses of \set that affect psql,
> such as ECHO, ECHO_HIDDEN,  ON_ERROR_ROLLBACK,
>  COMP_KEYWORD_CASE... and even AUTOCOMMIT as
>  \set AUTOCOMMIT FOOBAR is not denied, just reinterpreted.
>
> - it doesn't address the case of another method than \set being used
>  to set the variable, as in : SELECT 'on' as "AUTOCOMMIT" \gset
>  whereas the hook would work in that case.
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [HACKERS] kqueue

2016-09-14 Thread Matteo Beccati

Hi,

On 14/09/2016 00:06, Tom Lane wrote:

I'm inclined to think the kqueue patch is worth applying just on the
grounds that it makes things better on OS X and doesn't seem to hurt
on FreeBSD.  Whether anyone would ever get to the point of seeing
intra-kernel contention on these platforms is hard to predict, but
we'd be ahead of the curve if so.

It would be good for someone else to reproduce my results though.
For one thing, 5%-ish is not that far above the noise level; maybe
what I'm measuring here is just good luck from relocation of critical
loops into more cache-line-friendly locations.


FWIW, I've tested HEAD vs patch on a 2-cpu low end NetBSD 7.0 i386 machine.

HEAD: 1890/1935/1889 tps
kqueue: 1905/1957/1932 tps

no weird surprises, and basically no differences either.


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 5:32 PM, Robert Haas  wrote:

> On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
>  wrote:
> > Another interesting bit about these small tables is that the largest used
> > offset for these tables never went beyond 291 which is the value of
> > MaxHeapTuplesPerPage. I don't know if there is something that prevents
> > inserting more than  MaxHeapTuplesPerPage offsets per heap page and I
> don't
> > know at this point if this gives us upper limit for bits per page (may
> be it
> > does).
>
> From PageAddItemExtended:
>
> /* Reject placing items beyond heap boundary, if heap */
> if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
> {
> elog(WARNING, "can't put more than MaxHeapTuplesPerPage items
> in a heap page");
> return InvalidOffsetNumber;
> }
>
> Also see the comment where MaxHeapTuplesPerPage is defined:
>
>  * Note: with HOT, there could theoretically be more line pointers (not
> actual
>  * tuples) than this on a heap page.  However we constrain the number of
> line
>  * pointers to this anyway, to avoid excessive line-pointer bloat and not
>  * require increases in the size of work arrays.
>
>
Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the
per-page bitmap size. That's about 36 bytes for an 8K page. IOW, if on
average there are 6 or more dead tuples per page, the bitmap will outperform
the current representation, assuming the maximum allocation for the bitmap.
If we can use additional estimates to restrict the size to a somewhat more
conservative value and then keep an overflow area, then probably the
break-even happens even earlier than that. I hope this gives us a good
starting point, but let me know if you think it's still the wrong approach
to pursue.
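
A quick standalone check of those numbers (assuming 8K pages and
MaxHeapTuplesPerPage = 291):

#include <stdio.h>

int
main(void)
{
	int		max_tuples_per_page = 291;						/* 8K page */
	int		bitmap_bytes = (max_tuples_per_page + 7) / 8;	/* rounded up */
	int		tid_bytes = 6;									/* per dead TID */

	printf("bitmap: %d bytes/page, TID array at 6 dead tuples/page: %d bytes\n",
		   bitmap_bytes, 6 * tid_bytes);
	return 0;
}

which prints 37 vs. 36 bytes, i.e. the break-even really is right around 6
dead tuples per page.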

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
 wrote:
> Another interesting bit about these small tables is that the largest used
> offset for these tables never went beyond 291 which is the value of
> MaxHeapTuplesPerPage. I don't know if there is something that prevents
> inserting more than  MaxHeapTuplesPerPage offsets per heap page and I don't
> know at this point if this gives us upper limit for bits per page (may be it
> does).

>From PageAddItemExtended:

/* Reject placing items beyond heap boundary, if heap */
if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
{
elog(WARNING, "can't put more than MaxHeapTuplesPerPage items
in a heap page");
return InvalidOffsetNumber;
}

Also see the comment where MaxHeapTuplesPerPage is defined:

 * Note: with HOT, there could theoretically be more line pointers (not actual
 * tuples) than this on a heap page.  However we constrain the number of line
 * pointers to this anyway, to avoid excessive line-pointer bloat and not
 * require increases in the size of work arrays.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 3:46 PM, Pavan Deolasee 
wrote:

>
>
>  lately I'm using LVM debugger (which probably does not have something
> equivalent),
>

And I was so clueless about lldb's powerful scripting interface. For
example, you can write something like this in bms_utils.py:

import lldb

def print_bms_members (bms):
words = bms.GetChildMemberWithName("words")
nwords = int(bms.GetChildMemberWithName("nwords").GetValue())

ret = 'nwords = {0} bitmap: '.format(nwords,)
for i in range(0, nwords):
ret += hex(int(words.GetChildAtIndex(0, lldb.eNoDynamicValues,
True).GetValue()))

return ret

And then do this while attached to lldb debugger:

Process 99659 stopped
* thread #1: tid = 0x59ba69, 0x0001090b012f
postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at
bitmapset.c:673, queue = 'com.apple.main-thread', stop reason = breakpoint
1.1
frame #0: 0x0001090b012f
postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673
   670 int wordnum,
   671 bitnum;
   672
-> 673 if (x < 0)
   674 elog(ERROR, "negative bitmapset member not allowed");
   675 if (a == NULL)
   676 return bms_make_singleton(x);
(lldb) script
Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
>>> from bms_utils import *
>>> bms = lldb.frame.FindVariable ("a")
>>> print print_bms_members(bms)
nwords = 1 bitmap: 0x200


The complete API reference is available here
http://lldb.llvm.org/python_reference/index.html

Looks like an interesting SoC project to write useful lldb/gdb scripts to
print internal structures for ease of debugging :-)

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-14 Thread Amit Kapila
On Wed, Sep 14, 2016 at 4:36 PM, Ashutosh Sharma  wrote:
> Hi All,
>
> Below is the backtrace for the issue reported in my earlier mail [1].
> From the callstack it looks like we are trying to release  lock on a
> meta page twice in _hash_expandtable().
>

Thanks for the call stack.  I think the below code in the patch is the
culprit.  Here we have already released the meta page lock, and then on
failure we are trying to release it again.

_hash_expandtable()
{
..

/* Release the metapage lock, before completing the split. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
..

if (!buf_nblkno)
{
_hash_relbuf(rel, buf_oblkno);
goto fail;
}
..
fail:
/* We didn't write the metapage, so just drop lock */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
}

This is a problem with the concurrent hash index patch.  I will send the
fix in the next version of the patch.
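
One possible shape of the fix, for reference (an untested sketch; the real
patch may do it differently):

	if (!buf_nblkno)
	{
		_hash_relbuf(rel, buf_oblkno);
		/* metapage lock was already dropped above, so don't goto fail */
		return;
	}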


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-14 Thread Amit Kapila
On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
 wrote:
> On 09/13/2016 07:26 AM, Amit Kapila wrote:
>>
>> Attached, new version of patch which contains the fix for problem
>> reported on write-ahead-log of hash index thread [1].
>>
>
> I have been testing patch in various scenarios, and it has a positive
> performance impact in some cases.
>
> This is especially seen in cases where the values of the indexed column are
> unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.
>

Here, I think it would be better if we had data comparing the hash index
situation with HEAD as well.  What I mean to say is that you are claiming
that after the hash index improvements the SELECT workload is 40-60%
better, but where do we stand as of HEAD?

> UPDATE also sees an improvement.
>

Can you explain this more?  Is it more compared to HEAD or more compared
to btree?  Isn't this contradictory to what the test in the mail below
shows?

> In cases where the indexed column value isn't unique, it takes a long time
> to build the index due to the overflow page creation.
>
> Also in cases where the index column is updated with a high number of
> clients, ala
>
> -- ddl.sql --
> CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
> CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
> CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
> ANALYZE;
>
> -- test.sql --
> \set id random(1,10)
> \set val random(0,10)
> BEGIN;
> UPDATE test SET val = :val WHERE id = :id;
> COMMIT;
>
> w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
> hash, and 10062 tps for b-tree).
>

Thanks for doing the tests.  Have you applied both the concurrent index
and cache-the-meta-page patches for these tests?  So from the above tests,
we can say that after this set of patches read-only workloads will be
significantly improved, even better than btree in quite a few useful
cases.  However, when the indexed column is updated, there is still a
large gap compared to btree (what about the case when the indexed column
is not updated in a read-write transaction, as in our pgbench read-write
transactions; did you by any chance run any such test?).  I think we need
to focus on improving cases where index columns are updated, but it is
better to do that work as a separate patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-14 Thread Ashutosh Sharma
Hi All,

Below is the backtrace for the issue reported in my earlier mail [1].
From the call stack it looks like we are trying to release the lock on a
meta page twice in _hash_expandtable().

(gdb) bt
#0  0x007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799
#1  0x0078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540
#2  0x004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0,
buf=2, from_access=1, to_access=-1) at hashpage.c:331
#3  0x004a722d in _hash_expandtable (rel=0x7f5605b608c0,
metabuf=2) at hashpage.c:995
#4  0x004a316a in _hash_doinsert (rel=0x7f5605b608c0,
itup=0x2ba5030) at hashinsert.c:313
#5  0x004a0d85 in hashinsert (rel=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4,
heapRel=0x7f5605b58250,
checkUnique=UNIQUE_CHECK_NO) at hash.c:248
#6  0x004c5a16 in index_insert (indexRelation=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "",
heap_t_ctid=0x2c2e4f4,
heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204
#7  0x0063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0,
tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000',
specConflict=0x0,
arbiterIndexes=0x0) at execIndexing.c:388
#8  0x0066a1fc in ExecInsert (mtstate=0x2b99e50,
slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0,
onconflict=ONCONFLICT_NONE,
estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481
#9  0x0066b841 in ExecModifyTable (node=0x2b99e50) at
nodeModifyTable.c:1496
#10 0x00645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396
#11 0x006424d9 in ExecutePlan (estate=0x2b99c00,
planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT,
sendTuples=0 '\000',
numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20
) at execMain.c:1567
#12 0x0064087a in standard_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x7f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#14 0x00640751 in ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:284
#15 0x007c331e in ProcessQuery (plan=0x2b873f0,
sourceText=0x2b89bd0 "insert into foo (index) select $1 from
generate_series(1,1)",
params=0x2b89c20, dest=0xd73c20 ,
completionTag=0x7ffdf540a3f0 "") at pquery.c:187
#16 0x007c4a0c in PortalRunMulti (portal=0x2ae7930,
isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20
,
altdest=0xd73c20 , completionTag=0x7ffdf540a3f0 "")
at pquery.c:1303
#17 0x007c4055 in PortalRun (portal=0x2ae7930,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90,
altdest=0x2b5dc90,
completionTag=0x7ffdf540a3f0 "") at pquery.c:815
#18 0x007bfb45 in exec_execute_message (portal_name=0x2b5d880
"", max_rows=9223372036854775807) at postgres.c:1977
#19 0x007c25c7 in PostgresMain (argc=1, argv=0x2b05df8,
dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at
postgres.c:4133
#20 0x00744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260
#21 0x00744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934
#22 0x00740f9c in ServerLoop () at postmaster.c:1691
#23 0x00740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at
postmaster.c:1299
#24 0x006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228
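
For concreteness, here is a minimal illustrative sketch of the kind of
double release the backtrace points at (written against the 9.6-era hash
code; this is not the actual _hash_expandtable() code path, just the
pattern that produces the failure):

#include "postgres.h"
#include "access/hash.h"

/* Illustrative only: releasing the meta page's content lock twice. */
static void
double_release_sketch(Relation rel, Buffer metabuf)
{
    /* first call drops the shared content lock but keeps the pin */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

    /*
     * A second call on a path that has already dropped the lock ends up in
     * LockBuffer() -> LWLockRelease(), which finds that the buffer content
     * lock is no longer held and fails, as in frames #0-#2 above.
     */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
}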

Please let me know for any further inputs.

[1]- 
https://www.postgresql.org/message-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z%2B2Tp8vb2Kmxu%3D6zg%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma  wrote:
> Hi All,
>
> I am getting following error when running the test script shared by
> Jeff -[1] . The error is observed upon executing the test script for
> around 3-4 hrs.
>
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock
> buffer_content 1 is not held
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into
> foo (index) select $1 from generate_series(1,1)
>
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock
> buffer_content 10 is not held
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert
> into foo (index) select $1 from generate_series(1,1)
>
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock
> buffer_content 10 is not held
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert
> into foo (index) select $1 from generate_series(1,1)
>
> [1]- 
> https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com
>
> Please note that i am performing the test on latest patch for
> concurrent hash index and WAL log in hash index shared by Amit
> yesterday.
>
>
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: 

Re: [HACKERS] pgbench - allow to store select results into variables

2016-09-14 Thread Amit Langote
Hi Fabien,

On 2016/09/13 17:41, Fabien COELHO wrote:
> 
> Hello Amit,
> 
>> [...]
>> There still seems to be a change in behavior of the -r option due to the
>> patch. Consider the following example:
>>
>> select a from a where a = 1 \;
>> select a+1 from a where a = 1;
>> ...
>> - statement latencies in milliseconds:
>> 2.889  select a from a where a = 1 ;
> 
> vs
> 
>> select a from a where a = 1 \; \into a
>> select a+1 from a where a = 1; \into b
>> ...
>> 2.093  select a from a where a = 1 ; select a+1 from a where a = 1;
> 
> Yep.
> 
> Note that there is a small logical conundrum in this argument: As the
> script is not the same, especially as into was not possible before,
> strictly speaking there is no behavior "change".

Sure, the scripts are not the same, but it seemed like showing the whole
compound query, whereas previously only part of it was shown, may be an
implementation artifact of \into.

> This said, what you suggest can be done.
> 
> After giving it some thought, I suggest that it is neither needed nor
> desirable. If you want to achieve the initial effect, you just have to put
> the "into a" on the next line:
> 
>   select a from a where a = 1 \;
>   \into a
>   select a+1 from a where a = 1; \into b
> 
> Then you would get the -r cut at the end of the compound command. Thus the
> current version gives full control of what will appear in the summary. If
> I change "\into xxx\n" to mean "also cut here", then there is less control
> over where the cut occurs when into is used.

So it means that the position of \into affects where a compound command gets
cut for the -r display.  I was just wondering if that was unintentional.

>> One more thing I observed which I am not sure if it's a fault of this
>> patch is illustrated below:
>>
>> $ cat /tmp/into.sql
>> \;
>> select a from a where a = 1 \;
>> select a+1 from a where a = 1;
>>
>> $ pgbench -r -n -t 1 -f /tmp/into.sql postgres
>> 
>> - statement latencies in milliseconds:
>> 2.349  ;
>>
>> Note that the compound select statement is nowhere to be seen in the
>> latencies output. The output remains the same even if I use the \into's.
>> What seems to be going on is that the empty statement on the first line
>> (\;) is the only part kept of the compound statement spanning lines 1-3.
> 
> Yes.
> 
> This is really the (debatable) current behavior, and is not affected by
> the patch. The "-r" summary takes the first line of the command, whatever
> it is. In your example the first line is "\;", so you get what you asked
> for, even if it looks rather strange, obviously.

Yep, perhaps it's strange to write a script like that anyway, :)

 +bool        sql_command_in_progress = false;
>> [...]
>> I understood that it refers to what you explain here.  But to me it
>> sounded like the name is referring to the progress of *execution* of a SQL
>> command whereas the code in question is simply expecting to finish
>> *parsing* the SQL command using the next lines.
> 
> Ok. I changed it "sql_command_lexing_in_progress".
> 
>>> The attached patch takes into all your comments but:
>>>  - comment about discarded results...
>>>  - the sql_command_in_progress variable name change
>>>  - the longer message on into at the start of a script
>>
>> The patch seems fine without these, although please consider the concern I
>> raised with regard to the -r option above.
> 
> I have considered it. As the legitimate behavior you suggested can be
> achieved just by putting the into on the next line, ISTM that the current
> proposition gives more control than doing a mandatory cut when into is used.
> 
> Attached is a new version with the boolean renaming.

Thanks.

> The other thing I have considered is whether to implemented a "\gset"
> syntax, as suggested by Pavel and Tom. Bar the aesthetic, the main issue I
> have with it is that it does not work with compound commands, and what I
> want is to get the values out of compound commands... because of my focus
> on latency... so basically "\gset" does not do the job I want... Now I
> recognize that other people would like it, so probably I'll do it anyway
> in another patch.

So, psql's \gset does not work as desired for compound commands (viz. it
is only able to save the result of the last sub-command).  You need to use
\gset with each sub-command separately if no result should be discarded.
Because of undesirable latency characteristics of sending multiple
commands, you want to be able to modify compound command handling such
that every sub-command's result could be saved.  In that regard, it's good
that pgbench does not use PQexec() which only returns the result of the
last sub-command if a compound command was issued through it.  pgbench's
doCustom() currently discards all results by issuing discard_response().
You propose to change things such that results are intercepted in a new
function read_response(), values assigned to intovars corresponding to
each sub-command, and then discarded. The intovars 
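
To illustrate the libpq behavior being relied on here, a minimal
standalone sketch (not pgbench code; the connection string handling and
the query are made up for the example): PQsendQuery() followed by a
PQgetResult() loop yields one PGresult per sub-command, which is what a
read_response()-style function can walk to pick up every value, whereas
PQexec() would have handed back only the last result.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("");   /* rely on PG* environment variables */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    if (!PQsendQuery(conn, "SELECT 1 AS a; SELECT 2 AS b;"))
    {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* one PGresult per sub-command, returned in order */
    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
        {
            /* this is the point where each value would be assigned to its
             * \into variable before the result is discarded */
            printf("sub-command result: %s\n", PQgetvalue(res, 0, 0));
        }
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}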

[HACKERS] Error running custom plugin: “output plugins have to declare the _PG_output_plugin_init symbol”

2016-09-14 Thread valeriof
Hi, I'm kind of new to Postgres and I'm trying to create a custom output
plugin for logical replication (Postgres is version 9.5.4, and I'm building
the project on a Windows 8 64-bit machine, the same machine where the db is
installed).
I started from the code of the sample test_decoding, and I was trying to
simply rebuild it under a new name and install it into Postgres to see if
the module works. Once done, I will start modifying the code.

My project is built in Visual Studio 2013, and the only step I took was to
copy the built DLL into the Postgres lib folder. When I run the command:

postgres=# SELECT * FROM
pg_create_logical_replication_slot('regression_slot', 'my_decoding');

I get an error saying that "output plugins have to declare the
_PG_output_plugin_init symbol".

In my code, I have the function declared as extern:

extern void _PG_init(void);
extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

and the body of the function is defined somewhere below it.
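
For what it's worth, here is what the declaration part looks like in a
compilable form (a sketch only; the PGDLLEXPORT markers are my assumption
about what may be missing on an MSVC build, since a Windows DLL does not
export symbols by default and the in-tree MSVC build exports them via a
generated .def file instead):

#include "postgres.h"
#include "fmgr.h"
#include "replication/output_plugin.h"
#include "replication/logical.h"

PG_MODULE_MAGIC;

/* PGDLLEXPORT marks the symbol for export on MSVC builds (it is a no-op
 * on most other platforms) */
extern PGDLLEXPORT void _PG_init(void);
extern PGDLLEXPORT void _PG_output_plugin_init(OutputPluginCallbacks *cb);

/* the function bodies stay as copied from test_decoding.c */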

The actual code is just a copy of the test_decoding sample:
https://github.com/postgres/postgres/blob/REL9_5_STABLE/contrib/test_decoding/test_decoding.c

I'm not sure if the deployment process is ok (just copying the dll) or if
there is some other step to take. Can anyone shed some light?

thanks -







Re: [HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 3:43 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> Hi All,
> While working on partition-wise join, I had to examine Relids objects
> many times. Printing the Bitmapset::words[] in binary format and then
> inferring the relids takes time and is error prone. Instead I wrote a
> function bms_to_char() which allocates a StringInfo, calls
> outBitmapset() to decode Bitmapset as a set of integers and returns
> the string. In order to examine a Relids object all one has to do is
> execute 'p bms_to_char(bms_object) under gdb.
>
>
Can we not do this as gdb macros? My knowledge is rusty in this area and
lately I'm using the LLVM debugger, lldb (which probably does not have
something equivalent), but I believe gdb allows you to write useful macros.
As a bonus, you can then use them even for inspecting core files.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [PATCH] SortSupport for macaddr type

2016-09-14 Thread Julien Rouhaud
On 26/08/2016 19:44, Brandur wrote:
> Hello,
> 

Hello,

> I've attached a patch to add SortSupport for Postgres' macaddr which has the
> effect of improving the performance of sorting operations for the type. The
> strategy that I employ is very similar to that for UUID, which is to create
> abbreviated keys by packing as many bytes from the MAC address as possible 
> into
> Datums, and then performing fast unsigned integer comparisons while sorting.
> 
> I ran some informal local benchmarks, and for cardinality greater than 100k
> rows, I see a speed up on `CREATE INDEX` of roughly 25% to 65%. (For those
> interested, I put a few more numbers into a small report here [2].)
> 

That's a nice improvement!

> Admittedly, this is not quite as useful as speeding up sorting on a more 
> common
> data type like TEXT or UUID, but the change still seems like a useful
> performance improvement. I largely wrote it as an exercise to familiarize
> myself with the Postgres codebase.
> 
> I'll add an entry into the current commitfest as suggested by the Postgres 
> Wiki
> and follow up here with a link.
> 
> Thanks, and if anyone has feedback or other thoughts, let me know!
> 

I just reviewed your patch.  It applies and compiles cleanly, and the
abbrev feature works as intended.  There's not much to say, since this is
heavily inspired by the uuid SortSupport.  The only really specific part
is the abbrev_converter function, and I don't see any issue with it.
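
To make the strategy concrete, here is a rough sketch of what such a
converter boils down to (an illustration of the approach only, not the
submitted patch; it assumes a 64-bit Datum):

#include "postgres.h"
#include "utils/inet.h"           /* macaddr, DatumGetMacaddrP */
#include "utils/sortsupport.h"

/*
 * Pack the six MAC bytes into the high-order end of the Datum, leaving the
 * two low-order bytes as zero padding, so that plain unsigned comparison
 * of abbreviated keys matches the full macaddr ordering.
 */
static Datum
macaddr_abbrev_convert_sketch(Datum original, SortSupport ssup)
{
    macaddr    *m = DatumGetMacaddrP(original);
    uint64      key;

    key = ((uint64) m->a << 56) |
          ((uint64) m->b << 48) |
          ((uint64) m->c << 40) |
          ((uint64) m->d << 32) |
          ((uint64) m->e << 24) |
          ((uint64) m->f << 16);      /* low two bytes stay zero */

    (void) ssup;                      /* no per-sort state needed here */
    return (Datum) key;               /* assumes SIZEOF_DATUM == 8 */
}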

I have a few trivial comments:

* you used the function name macaddr_cmp_internal(); for uuid the same
function is named uuid_internal_cmp().  Using the same naming pattern is
probably better.

* the function comment on macaddr_abbrev_convert() doesn't mention
specific little-endian handling

* "There will be two bytes of zero padding on the least significant end"

"least significant bits" would be better

* This patch will trigger quite a lot of modifications after a pgindent
run.  Could you try to run pgindent on mac.c before sending an updated
patch?

Best regards.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




[HACKERS] Printing bitmap objects in the debugger

2016-09-14 Thread Ashutosh Bapat
Hi All,
While working on partition-wise join, I had to examine Relids objects
many times. Printing the Bitmapset::words[] in binary format and then
inferring the relids takes time and is error prone. Instead I wrote a
function bms_to_char() which allocates a StringInfo, calls
outBitmapset() to decode the Bitmapset as a set of integers, and returns
the string. In order to examine a Relids object, all one has to do is
execute 'p bms_to_char(bms_object)' under gdb.

Is there a way, this can be included in the code? If it's available in
the code, developers don't have to apply the patch and compile it for
debugging.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 29b7712..b4cae11 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -93,20 +93,22 @@
 	 outNode(str, node->fldname))
 
 /* Write a bitmapset field */
 #define WRITE_BITMAPSET_FIELD(fldname) \
 	(appendStringInfo(str, " :" CppAsString(fldname) " "), \
 	 _outBitmapset(str, node->fldname))
 
 
 #define booltostr(x)  ((x) ? "true" : "false")
 
+char *bms_to_char(const Bitmapset *bms);
+
 
 /*
  * _outToken
  *	  Convert an ordinary string (eg, an identifier) into a form that
  *	  will be decoded back to a plain token by read.c's functions.
  *
  *	  If a null or empty string is given, it is encoded as "<>".
  */
 static void
 _outToken(StringInfo str, const char *s)
@@ -204,20 +206,31 @@ _outBitmapset(StringInfo str, const Bitmapset *bms)
 }
 
 /* for use by extensions which define extensible nodes */
 void
 outBitmapset(StringInfo str, const Bitmapset *bms)
 {
 	_outBitmapset(str, bms);
 }
 
 /*
+ * TODO: remove, used for debugging through gdb.
+ */
+char *
+bms_to_char(const Bitmapset *bms)
+{
+	StringInfo str = makeStringInfo();
+	outBitmapset(str, bms);
+	return str->data;
+}
+
+/*
  * Print the value of a Datum given its type.
  */
 void
 outDatum(StringInfo str, Datum value, int typlen, bool typbyval)
 {
 	Size		length,
 i;
 	char	   *s;
 
 	length = datumGetSize(value, typbyval, typlen);



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 8:47 AM, Pavan Deolasee 
wrote:

>
>>
> Sawada-san offered to reimplement the patch based on what I proposed
> upthread. In the new scheme of things, we will allocate a fixed size bitmap
> of length 2D bits per page where D is average page density of live + dead
> tuples. (The rationale behind multiplying D by a factor of 2 is to consider
> the worst-case scenario where every tuple also has an LP_REDIRECT line
> pointer). The value of D in most real-world, large tables should not go
> much beyond, say 100, assuming 80 bytes wide tuple and 8K blocksize. That
> translates to about 25 bytes/page. So all TIDs with offset less than 2D can
> be represented by a single bit. We augment this with an overflow to track
> tuples which fall outside this limit. I believe this area will be small,
> say 10% of the total allocation.
>
>
So I cooked up the attached patch to track the number of live/dead tuples
found at offsets 1 to MaxOffsetNumber. The idea was to see how many tuples
actually go beyond the threshold of 2D offsets per page. Note that I am
proposing to track the first 2D offsets via the bitmap and the rest via a
regular TID array.
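
To make the intended lookup concrete, here is a rough sketch of the
structure this scheme implies (all names here are hypothetical, not from
the patch): a fixed run of bits per heap page for offsets up to 2D, plus
a small overflow array of plain TIDs for anything beyond that.

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct DeadTidMap
{
    int         bits_per_page;  /* 2 * D, rounded up to a multiple of 8 */
    uint8      *bitmap;         /* nblocks * (bits_per_page / 8) bytes */
    ItemPointerData *overflow;  /* TIDs whose offset exceeds bits_per_page */
    int         noverflow;
} DeadTidMap;

static bool
tid_is_dead(const DeadTidMap *map, BlockNumber blk, OffsetNumber off)
{
    int         i;

    if (off <= map->bits_per_page)
    {
        const uint8 *page_bits =
            map->bitmap + (Size) blk * (map->bits_per_page / 8);
        int         bit = off - 1;

        return (page_bits[bit / 8] & (1 << (bit % 8))) != 0;
    }

    /*
     * The rare beyond-2D case falls back to the overflow area; a real
     * implementation would keep this sorted and binary-search it.
     */
    for (i = 0; i < map->noverflow; i++)
    {
        if (ItemPointerGetBlockNumber(&map->overflow[i]) == blk &&
            ItemPointerGetOffsetNumber(&map->overflow[i]) == off)
            return true;
    }
    return false;
}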

So I ran a pgbench test for 2 hrs with scale factor 500. The autovacuum
scale factor was set to 0.1, i.e. 10%.

Some interesting bits:

postgres=# select relname, n_tup_ins, n_tup_upd, n_tup_hot_upd, n_live_tup,
n_dead_tup, pg_relation_size(relid)/8192 as relsize,
(n_live_tup+n_dead_tup)/(pg_relation_size(relid)/8192) as density from
pg_stat_user_tables ;
     relname      | n_tup_ins | n_tup_upd | n_tup_hot_upd | n_live_tup | n_dead_tup | relsize | density
------------------+-----------+-----------+---------------+------------+------------+---------+---------
 pgbench_tellers  |      5000 |  95860289 |      87701578 |       5000 |          0 |    3493 |       1
 pgbench_branches |       500 |  95860289 |      94158081 |        967 |          0 |    1544 |       0
 pgbench_accounts |      5000 |  95860289 |      93062567 |   51911657 |    3617465 |  865635 |      64
 pgbench_history  |  95860289 |         0 |             0 |   95258548 |          0 |  610598 |     156
(4 rows)

The smaller tellers and branches tables bloat so much that the density as
computed from live + dead tuples falls close to 1 tuple/page. So for such
tables, the idea of 2D bits/page will fail miserably. But I think these
tables are worst-case representatives, and I would be extremely surprised
if we ever found a very large table bloated so much. Even then, this
probably tells us that we can't rely solely on the density measure.

Another interesting bit about these small tables is that the largest used
offset never went beyond 291, which is the value of MaxHeapTuplesPerPage.
I don't know if there is something that prevents inserting more than
MaxHeapTuplesPerPage offsets per heap page, and I don't know at this point
whether this gives us an upper limit for bits per page (maybe it does).

For the pgbench_accounts table, the maximum offset used was 121, though
the bulk of the used offsets were at the start of the page (see attached
graph). The test did not create enough dead tuples to trigger autovacuum
on pgbench_accounts, so I ran a manual vacuum at the end. (There were
about 5% dead tuples in the table by the time the test finished.)

postgres=# VACUUM VERBOSE pgbench_accounts ;
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 2797722 row versions
DETAIL:  CPU 0.00s/9.39u sec elapsed 9.39 sec.
INFO:  "pgbench_accounts": removed 2797722 row versions in 865399 pages
DETAIL:  CPU 0.10s/7.01u sec elapsed 7.11 sec.
INFO:  index "pgbench_accounts_pkey" now contains 5000 row versions in
137099 pages
DETAIL:  2797722 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pgbench_accounts": found 852487 removable, 5000 nonremovable
row versions in 865635 out of 865635 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 802256 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.73s/27.20u sec elapsed 27.98 sec.

tuple count at each offset (offnum:all_tuples:dead_tuples)

For 2797722 dead line pointers, the current representation would have used
2797722 x 6 = 16786332 bytes of memory. The optimal bitmap would have used
121 bits/page x 865399 pages = 13089159 bytes, whereas if we had
provisioned 2D bits/page, assuming D = 64 based on the above calculation,
we would have used 13846384 bytes of memory. This is about 18% less than
the current representation. Of course, we would also have allocated some
space for the overflow region, which will make the difference
smaller/negligible. But the bitmaps would be extremely cheap to look up
during index scans.

Now maybe I got lucky, maybe I did not run the tests long enough (though I
believe that may have worked in favour of the bitmap), maybe mostly
HOT-updated tables are not good candidates for

Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()

2016-09-14 Thread Thomas Berger
Today, I found the time to read all the mails in this thread, and I think
I have to explain why we decided to open a bug for this behavior.

On Tuesday, 23. August 2016, 13:30:29 Robert Haas wrote:
> J. Random User: I'm having a problem!
> Mailing List: Gee, how big are your tables?
> J. Random User: Here's some pg_size_pretty output.
> Mailing List: Gosh, we don't know what that means, what do you have
> this obscure GUC set to?
> J. Random User: Maybe I'll just give up on SQL and use MongoDB.


In fact, we had it just the other way around. One of our most critical
databases had some extreme bloat.
Some of our internal customers were very confused about the sizes reported
by the database.
This confusion led to wrong decisions. (And to a long discussion about the
choice of DBMS, btw.)

I think there is a point missing in this whole discussion, or I just
didn't see it:

Yeah, the behavior of "kB" is defined in the "postgresql.conf" documentation.
But no _user_ reads this. There is no link or hint in the documentation of 
"pg_size_pretty()" [1].



[1] 
https://www.postgresql.org/docs/9.5/static/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE


-- 
Thomas Berger

PostgreSQL DBA
Database Operations

1&1 Telecommunication SE | Ernst-Frey-Straße 10 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-6566
E-Mail: thomas.ber...@1und1.de | Web: www.1und1.de

Hauptsitz Montabaur, Amtsgericht Montabaur, HRB 23963

Vorstand: Markus Huhn, Alessandro Nava, Moritz Roth, Ludger Sieverding, Martin 
Witt
Aufsichtsratsvorsitzender: Michael Scheeren


Member of United Internet




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-14 Thread Ashutosh Sharma
Hi All,

I am getting the following error when running the test script shared by
Jeff [1]. The error is observed after executing the test script for
around 3-4 hrs.

57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock
buffer_content 1 is not held
57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into
foo (index) select $1 from generate_series(1,1)

124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock
buffer_content 10 is not held
124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert
into foo (index) select $1 from generate_series(1,1)

124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock
buffer_content 10 is not held
124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert
into foo (index) select $1 from generate_series(1,1)

[1]- 
https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

Please note that I am performing the test on the latest patches for
concurrent hash index and WAL logging in hash indexes shared by Amit
yesterday.


With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen
 wrote:
> On 09/13/2016 07:41 AM, Amit Kapila wrote:
>>>
>>> README:
>>> +in_complete split flag.  The reader algorithm works correctly, as it
>>> will
>>> scan
>>>
>>> What flag ?
>>>
>>
>> in-complete-split flag which indicates that split has to be finished
>> for that particular bucket.  The value of these flags are
>> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old
>> bucket respectively.  It is set in hasho_flag in special area of page.
>> I have slightly expanded the definition in README, but not sure if it
>> is good idea to mention flags defined in hash.h. Let me know, if still
>> it is unclear or you want some thing additional to be added in README.
>>
>
> I think it is better now.
>
>>> hashxlog.c:
>>>
>>> hash_xlog_move_page_contents
>>> hash_xlog_squeeze_page
>>>
>>> Both have "bukcetbuf" (-> "bucketbuf"), and
>>>
>>> +   if (BufferIsValid(bukcetbuf));
>>>
>>> ->
>>>
>>> +   if (BufferIsValid(bucketbuf))
>>>
>>> and indent following line:
>>>
>>> LockBufferForCleanup(bukcetbuf);
>>>
>>> hash_xlog_delete
>>>
>>> has the "if" issue too.
>>>
>>
>> Fixed all the above cosmetic issues.
>>
>
> I meant there is an extra ';' on the "if" statements:
>
> +   if (BufferIsValid(bukcetbuf)); <--
>
> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
> hash_xlog_delete.
>
>
> Best regards,
>  Jesper
>
>
>
>
>




Re: [HACKERS] WAL consistency check facility

2016-09-14 Thread Kuntal Ghosh
On Wed, Sep 14, 2016 at 11:31 AM, Michael Paquier
 wrote:
> On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh
>  wrote:
>> Master
>> ---
>> - If wal_consistency check is enabled or needs_backup is set in
>> XLogRecordAssemble(), we do a fpw.
>> - If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
>> which in turns set has_image flag while decoding the record.
>> - If a fpw needs to be restored during redo, i.e., needs_backup is true,
>> then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.
>
> Here that should be if wal_consistency is true, no?
Nope. I'll try to explain using some pseudo-code:
XLogRecordAssemble()
{

 include_image = needs_backup || wal_consistency[rmid];
 if (include_image)
 {
  
  set XLogRecordBlockHeader.fork_flags |= BKPBLOCK_HAS_IMAGE;
  if (needs_backup)
set XLogRecordBlockImageHeader.bimg_info
  |=BKPIMAGE_IS_REQUIRED_FOR_REDO;
  
 }
 .
}

XLogReadBufferForRedoExtended()
{
 ..
 if (XLogRecHasBlockImage() && XLogRecHasBlockImageForRedo())
 {
  RestoreBlockImage();
  
  return BLK_RESTORED;
 }
 ..
}

checkConsistency()
{
 
 if (wal_consistency[rmid] && !XLogRecHasBlockImage())
  throw error;
 .
}

*XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
 BKPIMAGE_IS_REQUIRED_FOR_REDO.

For a backup image, any of the following is possible:
1. consistency should be checked.
2. the page should be restored.
3. both 1 and 2.

The consistency check can be controlled by a GUC parameter. But the
standby needs to be told whether an image should be restored; for that,
we have used the new flag.
Suggestions?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Declarative partitioning - another take

2016-09-14 Thread Rajkumar Raghuwanshi
I have continued testing declarative partitioning with the latest patch
and got some more observations, given below.

-- Observation 1 : Getting an overlap error with an EXCLUSIVE START bound
in a range partition.

create table test_range_bound ( a int) partition by range(a);
-- creating a partition to contain records {1,2,3,4}; by default 1 is
-- inclusive and 5 is exclusive
create table test_range_bound_p1 partition of test_range_bound for values
start (1) end (5);
-- now trying to create a partition, explicitly marking the start as
-- exclusive, to contain records {5,6,7}: creating it with START (4)
-- EXCLUSIVE, the range should be 5 to 8, but we get a partition overlap
-- error.
create table test_range_bound_p2 partition of test_range_bound for values
start (4) EXCLUSIVE end (8);
ERROR:  partition "test_range_bound_p2" would overlap partition
"test_range_bound_p1"

-- Observation 2 : able to create a sub-partition outside the range set for
the main table, making it impossible to insert data that satisfies both the
partition and the sub-partition bounds.

create table test_subpart (c1 int) partition by range (c1);
create table test_subpart_p1 partition of test_subpart for values start (1)
end (100) inclusive partition by range (c1);
create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
start (101) end (200);

\d+ test_subpart
 Table "public.test_subpart"
 Column |  Type   | Modifiers | Storage | Stats target | Description
+-+---+-+--+-
 c1 | integer |   | plain   |  |
Partition Key: RANGE (c1)
Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE

\d+ test_subpart_p1
   Table "public.test_subpart_p1"
 Column |  Type   | Modifiers | Storage | Stats target | Description
+-+---+-+--+-
 c1 | integer |   | plain   |  |
Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
Partition Key: RANGE (c1)
Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)

insert into test_subpart values (50);
ERROR:  no partition of relation "test_subpart_p1" found for row
DETAIL:  Failing row contains (50).
insert into test_subpart values (150);
ERROR:  no partition of relation "test_subpart" found for row
DETAIL:  Failing row contains (150).

-- Observation 3 : Getting "cache lookup failed" when selecting from a list
partitioned table whose partition key is an array column.

CREATE TABLE test_array ( i int,j int[],k text[]) PARTITION BY LIST (j);
CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');

INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);

postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
    tableoid   | i |  j  |  k
---------------+---+-----+-----
 test_array_p1 | 1 | {1} | {1}
(1 row)

postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
    tableoid   | i |   j   |  k
---------------+---+-------+-----
 test_array_p2 | 2 | {2,2} | {2}
(1 row)

postgres=# SELECT tableoid::regclass,* FROM test_array;
ERROR:  cache lookup failed for type 0

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On Fri, Sep 9, 2016 at 2:25 PM, Amit Langote 
wrote:

> On 2016/09/06 22:04, Amit Langote wrote:
> > Will fix.
>
> Here is an updated set of patches.
>
> In addition to fixing a couple of bugs reported by Ashutosh and Rajkumar,
> there are a few major changes:
>
> * change the way individual partition bounds are represented internally
>   and the way a collection of partition bounds associated with a
>   partitioned table is exposed to other modules.  Especially list
>   partition bounds which are manipulated more efficiently as discussed
>   at [1].
>
> * \d partitioned_table now shows partition count and \d+ lists partition
>   names and their bounds as follows:
>
> \d t6
>Table "public.t6"
>  Column |   Type| Modifiers
> .---+---+---
>  a  | integer   |
>  b  | character varying |
> Partition Key: LIST (a)
> Number of partitions: 3 (Use \d+ to list them.)
>
> \d+ t6
>Table "public.t6"
>  Column |   Type| Modifiers | Storage  | Stats target |
> Description
> .---+---+---+--+
> --+-
>  a  | integer   |   | plain|  |
>  b  | character varying |   | extended |  |
> Partition Key: LIST (a)
> Partitions: t6_p1 FOR VALUES IN (1, 2, NULL),
> t6_p2 FOR VALUES IN (4, 5),
> t6_p3 FOR VALUES IN (3, 6)
>
> \d+ p
>  Table "public.p"
>  Column | Type | Modifiers | Storage  | Stats target | Description
> 

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-14 Thread Dilip Kumar
On Wed, Sep 14, 2016 at 10:25 AM, Dilip Kumar  wrote:
> I have tested performance with approach 1 and approach 2.
>
> 1. Transaction (script.sql): I have used below transaction to run my
> bench mark, We can argue that this may not be an ideal workload, but I
> tested this to put more load on ClogControlLock during commit
> transaction.
>
> ---
> \set aid random (1,3000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> ---
>
> 2. Results
> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
> scale factor: 300
> Clients   head(tps)   grouplock(tps)   granular(tps)
> -------   ---------   --------------   -------------
>     128       29367            39326           37421
>     180       29777            37810           36469
>     256       28523            37418           35882
>
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention
> granular  --> 2) Use granular locking model
>
> I will test with 3rd approach also, whenever I get time.
>
> 3. Summary:
> 1. I can see that compared to head we are gaining almost ~30% performance
> at higher client counts (128 and beyond).
> 2. group lock is ~5% better compared to granular lock.

Forgot to mention that, this test is on unlogged tables.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal: speeding up GIN build with parallel workers

2016-09-14 Thread Heikki Linnakangas

On 01/17/2016 10:03 PM, Jeff Janes wrote:

On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan  wrote:

On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan  wrote:

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.


I am currently working on a patch that allows B-Tree index builds to
be performed in parallel. I think I'm a week or two away from posting
it.

Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.


I think it would take a lot of changes to tuplesort to make this an
almost-always win.

In the general case each GIN key occurs in many tuples, and the
in-memory rbtree is good at compressing the tid list for each key to
maximize the amount of data that can be in memory (although perhaps it
could be even better, as it doesn't use varbyte encoding on the tid
list the way the posting lists on disk do--it could do so in the bulk
build case, where it receives the tid in order, but not feasibly in
the pending-list clean-up case, where it doesn't get them in order)

When I was testing building an index on a column of text identifiers
where there were a couple million identifiers occurring a few dozen
times each, building it as a gin index (using btree_gin so that plain
text columns could be indexed) was quite a bit faster than building it
as a regular btree index using tuple sort.  I didn't really
investigate this difference, but I assume it was due to the better
memory usage leading to less IO.


I've been wondering about this too, using tuplesort to build GIN 
indexes, for a long time. Surely the highly-optimized quicksort+merge 
code in tuplesort.c is faster than maintaining a red-black tree? Or so I 
thought.


I wrote a quick prototype of using tuplesort.c for GIN index build. I 
tested it with a 500 MB table of integer arrays, so that the sort fit 
completely in memory. It's a lot slower than the current code. It turns 
out eliminating the duplicates early is really really important.


So we want to keep the red-black tree, to eliminate the duplicates. Or 
add that capability to tuplesort.c, which might speed up Sort+Unique and 
aggregates in general, but that's a big effort.


However, I still wonder, if we shouldn't use a merge approach when the 
tree doesn't fit in memory, like tuplesort.c does. Currently, when the 
tree is full, we flush it out to the index, by inserting all the 
entries. That can get quite expensive, I/O-wise. It also generates more 
WAL, compared to writing each page only once.


If we flushed the tree to a tape instead, then we could perhaps use the 
machinery that Peter's parallel B-tree patch is adding to tuplesort.c, 
to merge the tapes. I'm not sure if that works out, but I think it's 
worth some experimentation.



But I do think this with aggregation would be worth it despite the
amount of work involved.  In addition to automatically benefiting from
parallel sorts and any other future sort improvements, I think it
would also generate better indexes.  I have a theory that a problem
with very large gin indexes is that repeatedly building
maintenance_work_mem worth of rbtree entries and then merging them to
disk yields highly fragmented btrees, in which logical adjacent
key-space does not occupy physically nearby leaf pages, which then can
causes problems either with access that follows a pattern (like range
scans, except gin indexes can't really do those anyway) or further
bulk operations.


Yeah, there's that too.

- Heikki





Re: [HACKERS] kqueue

2016-09-14 Thread Michael Paquier
On Wed, Sep 14, 2016 at 3:32 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> From an OSX laptop with -S, -c 1 and -M prepared (9 runs, removed the
>> three best and three worst):
>> - HEAD: 9356/9343/9369
>> - HEAD + patch: 9433/9413/9461.071168
>> This laptop has a lot of I/O overhead... Still there is a slight
>> improvement here as well. Looking at the progress report, per-second
>> TPS gets easier more frequently into 9500~9600 TPS with the patch. So
>> at least I am seeing something.
>
> Which OSX version exactly?

El Capitan 10.11.6. With -s 20 (300MB) and 1GB of shared_buffers, so that
everything is in memory. Actually, re-running the tests now with no VMs
around and no apps, I am getting close to 9650~9700 TPS with the patch and
9300~9400 TPS on HEAD, so that's unlikely to be only noise.
-- 
Michael




Re: [HACKERS] kqueue

2016-09-14 Thread Tom Lane
Michael Paquier  writes:
> From an OSX laptop with -S, -c 1 and -M prepared (9 runs, removed the
> three best and three worst):
> - HEAD: 9356/9343/9369
> - HEAD + patch: 9433/9413/9461.071168
> This laptop has a lot of I/O overhead... Still there is a slight
> improvement here as well. Looking at the progress report, per-second
> TPS gets easier more frequently into 9500~9600 TPS with the patch. So
> at least I am seeing something.

Which OSX version exactly?

regards, tom lane




Re: [HACKERS] WAL consistency check facility

2016-09-14 Thread Michael Paquier
On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh
 wrote:
> Master
> ---
> - If wal_consistency check is enabled or needs_backup is set in
> XLogRecordAssemble(), we do a fpw.
> - If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
> which in turns set has_image flag while decoding the record.
> - If a fpw needs to be restored during redo, i.e., needs_backup is true,
> then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Here that should be if wal_consistency is true, no?

> Standby
> ---
> - In XLogReadBufferForRedoExtended(), if both XLogRecHasBlockImage() and
> XLogRecHasBlockImageForRedo()(added by me*) return true, we restore the
> backup image.
> - In checkConsistency, we only check if XLogRecHasBlockImage() returns true
> when wal_consistency check is enabled for this rmid.

My guess would have been that you no longer need to check
wal_consistency in checkConsistency, making the GUC value used only on
the master node.

> *XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
> BKPIMAGE_IS_REQUIRED_FOR_REDO.

Yes, that's more or less what you should have.
-- 
Michael

