Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-20 Thread Amit Kapila
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes  wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
> wrote:
>>
>>
>> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely Zero because of problems discussed in
>> another related thread [1].  I have also updated README.
>>
>
> with v7 of the concurrent hash patch and v4 of the write-ahead log patch and
> the latest relcache patch (I don't know how important that is to reproducing
> this, I suspect it is not), I once got this error:
>
>
> 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> properly shut down; automatic recovery in progress
> 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed,
> calculated checksum 65067 but expected 21260
> 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
> relation base/16384/17334
> 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>

I think this is an example of the torn-page problem, which seems to be
happening because of the below code in your test.

if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress())
{
    nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
    ereport(FATAL,
            (errcode(ERRCODE_DISK_FULL),
             errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
                    blocknum,
                    relpath(reln->smgr_rnode, forknum),
                    nbytes, BLCKSZ),
             errhint("JJ is screwing with the database.")));
}
else
{
    nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
}

If you are running the above test with JJ_torn_page disabled, then it
is a different matter and we need to investigate it, but I assume you
are running with it enabled.

I think this could happen if the actual change to the page is in the
last 2/3 of the page, which the above code does not write.  The checksum
in the page header, written as part of the partial page write (the first
1/3 of the page), would have accounted for the change you made, whereas
after restart, when the page is read again to apply redo, the checksum
calculation won't include the change in the 2/3 part that never reached disk.
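
To make the effect concrete, here is a standalone toy sketch (not
PostgreSQL code; BLCKSZ and the trivial checksum below are stand-ins for
the real page layout and algorithm) of how a checksum stamped before a
torn write fails verification afterwards:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* stand-in for the real page checksum algorithm; illustrative only */
static uint16_t
toy_checksum(const uint8_t *page)
{
	uint32_t	sum = 0;
	int			i;

	for (i = 2; i < BLCKSZ; i++)	/* skip the 2-byte checksum field itself */
		sum = (sum * 31 + page[i]) & 0xFFFF;
	return (uint16_t) sum;
}

int
main(void)
{
	uint8_t		page[BLCKSZ] = {0};	/* the page as modified in memory */
	uint8_t		disk[BLCKSZ] = {0};	/* the old on-disk contents (all zeros here) */
	uint16_t	cksum, stored, recomputed;

	/* change something in the last 2/3 of the page, then stamp the checksum */
	page[BLCKSZ - 100] = 0xAB;
	cksum = toy_checksum(page);
	memcpy(page, &cksum, sizeof(cksum));

	/* simulated torn write: only the first 1/3 (with the new checksum) reaches disk */
	memcpy(disk, page, BLCKSZ / 3);

	/* at redo, the stored checksum is verified against the mixed new/old page */
	memcpy(&stored, disk, sizeof(stored));
	recomputed = toy_checksum(disk);
	printf("calculated checksum %u but expected %u -> %s\n",
		   recomputed, stored,
		   stored == recomputed ? "ok" : "page verification failed");
	return 0;
}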

Today, Ashutosh shared the logs of his test run where he has shown a
similar problem for a HEAP page.  I think this could happen, though
rarely, for any page with the above kind of test.

Does this explanation explain the reason for the problem you are seeing?

>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>

Okay, but have you ensured in some way that redo is applied to the bitmap page?


Today, while thinking about this problem, I realized that the patch
currently uses REGBUF_NO_IMAGE for the bitmap page because of one of the
problems you reported [1].  That change will fix the problem you
reported, but it will expose bitmap pages to torn-page hazards.  I think
the right fix there is to make pd_lower equal to pd_upper for the bitmap
page, so that full-page writes don't exclude the data in the bitmap page.
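
To illustrate why the pd_lower/pd_upper choice matters, here is a
standalone toy sketch (the struct layout and BLCKSZ are illustrative
assumptions, not the real page header) of hole compression: a standard
full-page image keeps only the bytes below pd_lower and above pd_upper,
so anything left in the "hole" is dropped:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct
{
	uint16_t	pd_lower;		/* end of the "used" area at the page start */
	uint16_t	pd_upper;		/* start of the "used" area at the page end */
	uint8_t		data[BLCKSZ - 4];
} ToyPage;

/* mimic hole compression: keep [0, pd_lower) and [pd_upper, BLCKSZ) only */
static void
take_fpi(const ToyPage *src, ToyPage *dst)
{
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, src->pd_lower);
	if (src->pd_upper < BLCKSZ)
		memcpy((char *) dst + src->pd_upper,
			   (const char *) src + src->pd_upper,
			   BLCKSZ - src->pd_upper);
}

int
main(void)
{
	ToyPage		page, image;

	memset(&page, 0, sizeof(page));
	page.data[100] = 0xFF;		/* a bitmap bit that must survive */

	/* bitmap bytes sitting inside the pd_lower..pd_upper hole are lost */
	page.pd_lower = 24;
	page.pd_upper = BLCKSZ;
	take_fpi(&page, &image);
	printf("bit survives with small pd_lower: %s\n", image.data[100] ? "yes" : "no");

	/* with pd_lower pushed up to pd_upper, the hole is empty and nothing is dropped */
	page.pd_lower = BLCKSZ;
	take_fpi(&page, &image);
	printf("bit survives with pd_lower == pd_upper: %s\n", image.data[100] ? "yes" : "no");
	return 0;
}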

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Error running custom plugin: “output plugins have to declare the _PG_output_plugin_init symbol”

2016-09-20 Thread Ashutosh Bapat
On Tue, Sep 20, 2016 at 4:32 PM, valeriof  wrote:
> Hi Ashutosh,
> Thank you for your answer. In the end I realized that the PGDLLEXPORT
> keyword was missing from the function definitions.
>
> As a side question, what are the options for debugging the plugin while it's
> executing? I've seen a debug plugin for Postgres but it seems more for
> SQL functions and stored procedures. Is it possible to attach to the process
> from the Visual Studio debugger?
>

I have never used Visual Studio, but you might find something useful
at 
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tracking wait event for latches

2016-09-20 Thread Michael Paquier
On Wed, Sep 21, 2016 at 1:03 PM, Thomas Munro
 wrote:
> Yeah but that's at run time.  I meant you could help developers
> discover ASAP if they add a new item to one place but not the other
> with a compile time assertion:
> const char *
> GetWaitEventIdentifier(uint16 eventId)
> {
> StaticAssertStmt(lengthof(WaitEventNames) == WE_LAST_TYPE + 1,
>  "WaitEventNames must match WaitEventIdentifiers");
> if (eventId > WE_LAST_TYPE)
> return "???";
> return WaitEventNames[eventId];
> }

Ah, OK, good idea. I had AssertStmt in mind, not StaticAssertStmt.

> You missed a couple that are hiding inside #ifdef WIN32:
>
> From pgstat.c:
> +   EVENT_PGSTAT_MAIN);
>
> From syslogger.c:
> + EVENT_SYSLOGGER_MAIN);

Oops. Fixed those ones and checked the builds on WIN32.
-- 
Michael
diff --git a/contrib/postgres_fdw/connection.c 
b/contrib/postgres_fdw/connection.c
index 8ca1c1c..7cecef8 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -496,7 +496,7 @@ pgfdw_get_result(PGconn *conn, const char *query)
 			wc = WaitLatchOrSocket(MyLatch,
 								   WL_LATCH_SET | WL_SOCKET_READABLE,
 								   PQsocket(conn),
-								   -1L);
+								   -1L, WE_EXTENSION);
 			ResetLatch(MyLatch);
 
 			CHECK_FOR_INTERRUPTS();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0776428..5185d65 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -679,6 +679,18 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   buffer in question.
  
 
+
+ 
+  EventSet: The server process is waiting on one or more
+  sockets, a timer or an inter-process latch.  The wait happens in a
+  WaitEventSet, PostgreSQL's portable
+  I/O multiplexing abstraction using boolean variables letting a
+  process sleep until it is set.  It supports several types of
+  operations, like postmaster death handling, timeout and socket
+  activity lookup, and is a useful replacement for sleep
+  where signal handling matters.
+ 
+

   
  
@@ -1085,6 +1097,139 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
  BufferPin
  Waiting to acquire a pin on a buffer.
 
+
+ EventSet
+ ArchiverMain
+ Waiting in main loop of the archiver process.
+
+
+ AutoVacuumMain
+ Waiting in main loop of autovacuum launcher process.
+
+
+ BaseBackupThrottle
+ Waiting during base backup when throttling activity.
+
+
+ BgWorkerShutdown
+ Waiting for background worker to shut down.
+
+
+ BgWorkerStartup
+ Waiting for background worker to start up.
+
+
+ BgWriterHibernate
+ Waiting in background writer process, hibernating.
+
+
+ BgWriterMain
+ Waiting in main loop of background writer process background 
worker.
+
+
+ CheckpointerMain
+ Waiting in main loop of checkpointer process.
+
+
+ ExecuteGather
+ Waiting for activity from child process when executing 
Gather node.
+
+
+ Extension
+ Waiting in the code path of an extension, should be used by 
custom plugins and modules
+
+
+ MessageQueueInternal
+ Waiting for other process to be attached in shared message 
queue.
+
+
+ MessageQueuePutMessage
+ Waiting to put new message in shared message queue.
+
+
+ MessageQueueReceive
+ Waiting to receive bytes in shared message queue.
+
+
+ MessageQueueSend
+ Waiting to send bytes in shared message queue.
+
+
+ ParallelFinish
+ Waiting for parallel workers to finish computing.
+
+
+ PgSleep
+ Waiting in process that called pg_sleep.
+
+
+ PgStatMain
+ Waiting in main loop of the statistics collector 
process.
+
+
+ ProcSignal
+ Waiting for signal from another backend.
+
+
+ ProcSleep
+ Waiting for a specific lock.
+
+
+ RecoveryApplyDelay
+ Waiting to apply WAL at recovery because it is delayed.
+
+
+ RecoveryWalAll
+ Waiting for WAL from any kind 

Re: [HACKERS] Tracking wait event for latches

2016-09-20 Thread Thomas Munro
On Wed, Sep 21, 2016 at 3:40 PM, Michael Paquier
 wrote:
> On Wed, Sep 21, 2016 at 8:13 AM, Thomas Munro
>  wrote:
>> It looks like this array wants to be in alphabetical order, but it
>> isn't quite.  Also, perhaps a compile time assertion about the size of
>> the array matching EVENT_LAST_TYPE could be useful?
>
> In GetWaitEventIdentifier()? I'd think that just returning ??? would
> have been fine if there is a non-matching call.

Yeah but that's at run time.  I meant you could help developers
discover ASAP if they add a new item to one place but not the other
with a compile time assertion:

const char *
GetWaitEventIdentifier(uint16 eventId)
{
StaticAssertStmt(lengthof(WaitEventNames) == WE_LAST_TYPE + 1,
 "WaitEventNames must match WaitEventIdentifiers");
if (eventId > WE_LAST_TYPE)
return "???";
return WaitEventNames[eventId];
}

>> +1 from me too for avoiding the overly general term 'event'.  It does
>> seem a little odd to leave the enumerators names as EVENT_...  though;
>> shouldn't these be WAIT_EVENT_... or WE_...?  Or perhaps you could
>> consider WaitPointIdentifier and WP_SECURE_READ or
>> WaitEventPointIdentifier and WEP_SECURE_READ, if you buy my earlier
>> argument that what we are really naming here is point in the code
>> where we wait, not the events we're waiting for.  Contrast with
>> LWLocks where we report the lock that you're waiting for, not the
>> place in the code where you're waiting for that lock.
>
> Well, WE_ if I need to make a choice for something other than EVENT_.

You missed a couple that are hiding inside #ifdef WIN32:

>From pgstat.c:
+   EVENT_PGSTAT_MAIN);

>From syslogger.c:
+ EVENT_SYSLOGGER_MAIN);

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tracking wait event for latches

2016-09-20 Thread Michael Paquier
On Wed, Sep 21, 2016 at 8:13 AM, Thomas Munro
 wrote:
> + 
> +  EventSet: The server process is waiting on a socket
> +  or a timer. This wait happens in a latch, an inter-process facility
> +  using boolean variables letting a process sleep until it is set.
> +  Latches support several type of operations, like postmaster death
> +  handling, timeout and socket activity lookup, and are a useful
> +  replacement for sleep where signal handling matters.
> + 
>
> This paragraph seems a bit confused.  I suggest something more like
> this:  "The server process is waiting for one or more sockets, a timer
> or an interprocess latch.  The wait happens in a WaitEventSet,
> PostgreSQL's portable IO multiplexing abstraction."

OK, I have tweaked the paragraph as follows using your suggestion:
+
+ 
+  EventSet: The server process is waiting on one or more
+  sockets, a timer or an inter-process latch.  The wait happens in a
+  WaitEventSet, PostgreSQL's portable
+  I/O multiplexing abstraction using boolean variables letting a
+  process sleep until it is set.  It supports several types of
+  operations, like postmaster death handling, timeout and socket
+  activity lookup, and is a useful replacement for sleep
+  where signal handling matters.
+ 
+

> On Tue, Aug 23, 2016 at 7:01 PM, Michael Paquier
>  wrote:
>>> On Mon, Aug 22, 2016 at 12:09 PM, Alexander Korotkov
>>>  wrote:
 Would it be better to use an array here?
>>
>> Okay, I have switched to an array
>
> I looked at various macro tricks[1] but they're all pretty unpleasant!
>  +1 for the simple array with carefully managed order.  About that
> order...
> [1] 
> http://stackoverflow.com/questions/147267/easy-way-to-use-variables-of-enum-types-as-string-in-c

Yes, I recall bumping on this one, or something really similar to that...
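
For reference, the kind of macro trick being referred to is an X-macro
list that generates the enum and the name array from a single source, so
they cannot drift apart.  Purely illustrative (and only a subset of
entries shown); the patch keeps the plain array instead:

/* one list, expanded twice with different definitions of X() */
#define WAIT_EVENT_LIST \
	X(EVENT_ARCHIVER_MAIN,  "ArchiverMain") \
	X(EVENT_PG_SLEEP,       "PgSleep") \
	X(EVENT_SYSLOGGER_MAIN, "SysLoggerMain")

typedef enum WaitEventIdentifier
{
#define X(sym, name) sym,
	WAIT_EVENT_LIST
#undef X
} WaitEventIdentifier;

static const char *const WaitEventNames[] = {
#define X(sym, name) name,
	WAIT_EVENT_LIST
#undef X
};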

> +const char *const WaitEventNames[] = {
> [...]
> +};
>
> It looks like this array wants to be in alphabetical order, but it
> isn't quite.  Also, perhaps a compile time assertion about the size of
> the array matching EVENT_LAST_TYPE could be useful?

In GetWaitEventIdentifier()? I'd think that just returning ??? would
have been fine if there is a non-matching call.

> +typedef enum WaitEventIdentifier
> +{
> [...]
> +} WaitEventIdentifier;
>
> This is also nearly but not exactly in alphabetical order
> (EVENT_PROC_SIGNAL comes after EVENT_PROC_SLEEP for example).  But
> it's not currently possible to have the strings and the enumerators
> both in alphabetical order because they're not the same, even
> accounting for camel-case to upper-case transliteration.  I think at
> least one of them should be sorted.  Shouldn't they match fully and
> then *both* be sorted alphabetically?  For example
> "MessageQueueInternal" doesn't match EVENT_MQ_WAIT_INTERNAL.  Then
> there are some cases where I'd expect underscores for consistency with
> other enumerators and with the corresponding camel-case strings: you
> have EVENT_WAL_SENDER_MAIN, but EVENT_WALWRITER_MAIN.

Not wrong..

> The documentation is in a slightly different order again but also not
> exactly alphabetical: for example ProcSleep is listed before
> ProcSignal.

Right.

> Sorry if this all sounds terribly picky but I think we should try to
> be strictly systematic here.

No worries about that, it matters a lot for this patch. The user-facing
documentation is what should drive the decision-making, I think. So let's
order the names, and adapt the enum depending on that. I have done so
after double-checking both lists, and added a comment for anybody
updating that in the future.

 EventIdentifier seems too general name for me, isn't it?  Could we name it
 WaitEventIdentifier? Or WaitEventId for shortcut?
>>
>> ... And WaitEventIdentifier.
>
> +1 from me too for avoiding the overly general term 'event'.  It does
> seem a little odd to leave the enumerators names as EVENT_...  though;
> shouldn't these be WAIT_EVENT_... or WE_...?  Or perhaps you could
> consider WaitPointIdentifier and WP_SECURE_READ or
> WaitEventPointIdentifier and WEP_SECURE_READ, if you buy my earlier
> argument that what we are really naming here is point in the code
> where we wait, not the events we're waiting for.  Contrast with
> LWLocks where we report the lock that you're waiting for, not the
> place in the code where you're waiting for that lock.

Well, WE_ if I need to make a choice for something other than EVENT_.

> On the other hand, if we could *accurately* report whether it's a
> "Latch", "Socket" or "Latch|Socket" that we're waiting for, it might
> be cool to do that instead.  One way to do that would be to use up
> several class IDs:  WAIT_EVENT_LATCH, WAIT_EVENT_LATCH_OR_SOCKET,
> WAIT_EVENT_SOCKET (or perhaps WAIT_EVENT_LATCH | WAIT_EVENT_SOCKET,

Re: [HACKERS] Logical Replication WIP

2016-09-20 Thread Steve Singer

On Tue, 20 Sep 2016, Peter Eisentraut wrote:


On 9/18/16 4:17 PM, Steve Singer wrote:




I think if we want to prevent the creation of subscriptions that point
to self, then we need to create a magic token when the postmaster starts
and check for that when we connect.  So more of a running-instance
identifier instead of a instance-on-disk identifier.

The other option is that we just allow it and make it more robust.


I think we should go with the second option for now. I feel that the effort
is better spent making sure that initial syncs that don't subscribe
(for whatever reasons) can be aborted, instead of trying to build a concept of
node identity before we really need it.


Steve



--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-20 Thread Dilip Kumar
On Tue, Sep 20, 2016 at 9:15 AM, Dilip Kumar  wrote:
> +1
>
> My test are under run, I will post it soon..

I have some more results now

8 socket machine
10 min run(median of 3 run)
synchronous_commit=off
scal factor = 300
share buffer= 8GB

test1: Simple update(pgbench)

Clients Head  GroupLock
 32   45702   45402
 64   46974   51627
128  35056   55362


test2: TPCB (pgbench)

Clients Head   GroupLock
 32   27969   27765
 64   33140   34786
128  21555   38848

Summary:
--
At 32 clients there is no gain; I think at this workload the CLOG lock is not a problem.
At 64 clients we can see ~10% gain with simple update and ~5% with TPCB.
At 128 clients we can see > 50% gain.
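
For reference, computed from the tables above:
 simple update:  64 clients: 51627 vs 46974 ~ +9.9%;  128 clients: 55362 vs 35056 ~ +57.9%
 TPCB:           64 clients: 34786 vs 33140 ~ +5.0%;  128 clients: 38848 vs 21555 ~ +80.2%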

Currently I have tested with synchronous_commit=off; later I can try
with it on. I can also test at 80 clients; I think we will see some
significant gain at that client count as well, but I haven't tested it
yet.

Given the above results, what do we think?  Should we continue our testing?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix Error Message for allocate_recordbuf() Failure

2016-09-20 Thread Michael Paquier
On Wed, Sep 21, 2016 at 10:25 AM, Michael Paquier
 wrote:
> On Wed, Sep 21, 2016 at 12:32 AM, Robert Haas  wrote:
>> For what it's worth, I think it's fine.  Good error messages are a useful 
>> thing.
>>
>> More generally, I think the whole patch looks good and should be committed.
>
> Hm. I'd think that it is still more portable to just issue a WARNING
> message in palloc_extended() when MCXT_ALLOC_NO_OOM then. Other code
> paths could benefit from that as well, and the patch proposed does
> nothing for the other places calling it. I am fine to write a patch
> for this purpose if you agree on that.

Or in short the attached. All the other callers of palloc_extended
benefit from that.
-- 
Michael
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 5cf388f..52b9c1b 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -973,6 +973,10 @@ palloc_extended(Size size, int flags)
 					 errmsg("out of memory"),
 					 errdetail("Failed on request of size %zu.", size)));
 	}
+	ereport(WARNING,
+			(errcode(ERRCODE_OUT_OF_MEMORY),
+			 errmsg("out of memory"),
+			 errdetail("Failed on request of size %zu.", size)));
 	return NULL;
 }
 
diff --git a/src/common/fe_memutils.c b/src/common/fe_memutils.c
index 58c5c4c..0d44ef6 100644
--- a/src/common/fe_memutils.c
+++ b/src/common/fe_memutils.c
@@ -30,9 +30,9 @@ pg_malloc_internal(size_t size, int flags)
 	tmp = malloc(size);
 	if (tmp == NULL)
 	{
+		fprintf(stderr, _("could not allocate %zu bytes of memory\n"), size);
 		if ((flags & MCXT_ALLOC_NO_OOM) == 0)
 		{
-			fprintf(stderr, _("out of memory\n"));
 			exit(EXIT_FAILURE);
 		}
 		return NULL;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)

2016-09-20 Thread Haribabu Kommi
On Thu, Sep 15, 2016 at 6:01 AM, Robert Haas  wrote:

> On Fri, Sep 2, 2016 at 2:33 AM, Haribabu Kommi 
> wrote:
> > On Wed, Aug 31, 2016 at 3:19 AM, Alvaro Herrera <
> alvhe...@2ndquadrant.com>
> > wrote:
> >> Haribabu Kommi wrote:
> >>
> >>> Apart from the above, here are the following list of command tags that
> >>> are generated in the code, I took only the first word of the command
> tag
> >>> just to see how many categories present. The number indicates the
> >>> subset of operations or number of types it is used. Like create table,
> >>> create function and etc.
> >>
> >> Sounds about right.  I suppose all those cases that you aggregated here
> >> would expand to full tags in the actual code.  I furthermore suppose
> >> that some of these could be ignored, such as the transaction ones and
> >> things like load, lock, move, fetch, discard, deallocate (maybe lump
> >> them all together under "other", or some other rough categorization, as
> >> Tom suggests).
> >
> > Following is the pg_stat_sql view with the SQL categories that I
> considered
> > that are important. Rest of the them will be shown under others category.
> >
> > postgres=# \d pg_stat_sql
> >  View "pg_catalog.pg_stat_sql"
> >  Column  |   Type   | Modifiers
> > -+--+---
> >  inserts | bigint   |
> >  deletes | bigint   |
> >  updates | bigint   |
> >  selects | bigint   |
> >  declare_cursors | bigint   |
> >  closes  | bigint   |
> >  creates | bigint   |
> >  drops   | bigint   |
> >  alters  | bigint   |
> >  imports | bigint   |
> >  truncates   | bigint   |
> >  copies  | bigint   |
> >  grants  | bigint   |
> >  revokes | bigint   |
> >  clusters| bigint   |
> >  vacuums | bigint   |
> >  analyzes| bigint   |
> >  refreshs| bigint   |
> >  locks   | bigint   |
> >  checkpoints | bigint   |
> >  reindexes   | bigint   |
> >  deallocates | bigint   |
> >  others  | bigint   |
> >  stats_reset | timestamp with time zone |
> >
> >
> > If any additions/deletions, I can accommodate them.
>
> I think it is not a good idea to make the command names used here the
> plural forms of the command tags.  Instead of "inserts", "updates",
> "imports", etc. just use "INSERT", "UPDATE", "IMPORT".  That's simpler
> and less error prone - e.g. you won't end up with things like
> "refreshs", which is not a word.


Thanks for your suggestion. I also thought of changing the name while
originally writing "refreshs" as a column name of the view. A small
restriction with changing the names to command names is that the user
needs to execute the query as follows.

select pg_stat_sql.select from pg_stat_sql;

Updated patch is attached with the corrections. Apart from the name
change, I added documentation and tests.

Regards,
Hari Babu
Fujitsu Australia


pg_stat_sql_2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix Error Message for allocate_recordbuf() Failure

2016-09-20 Thread Michael Paquier
On Wed, Sep 21, 2016 at 12:32 AM, Robert Haas  wrote:
> For what it's worth, I think it's fine.  Good error messages are a useful 
> thing.
>
> More generally, I think the whole patch looks good and should be committed.

Hm. I'd think that it is still more portable to just issue a WARNING
message in palloc_extended() when MCXT_ALLOC_NO_OOM then. Other code
paths could benefit from that as well, and the patch proposed does
nothing for the other places calling it. I am fine to write a patch
for this purpose if you agree on that.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] get_home_path: use HOME

2016-09-20 Thread Rudolf Gavlas
2016-09-20 18:35 GMT+02:00, Tom Lane :
> Rudolf Gavlas  writes:
>> The usage of HOME environment variable (if set) is IMO the right,
>> standard and faster way to get_home_path().
>
> Can you provide some evidence for that claim?  I can believe "faster"
> but the rest sounds like wishful thinking.

1) NetBSD glob(3)
http://netbsd.gw.com/cgi-bin/man-cgi?glob+3+NetBSD-current
ENVIRONMENT
HOME  If defined, used as the home directory of the current user in
tilde expansions.

2) BIND
https://nxr.netbsd.org/xref/src/external/bsd/bind/dist/bin/dig/dig.c#1765

3) less
https://nxr.netbsd.org/xref/src/external/bsd/less/dist/cmdbuf.c#1403
(https://nxr.netbsd.org/xref/src/external/bsd/less/dist/decode.c#533)

4) NetBSD sh(1)
http://netbsd.gw.com/cgi-bin/man-cgi?sh+1+NetBSD-current
ENVIRONMENT
HOME  Set automatically by login(1) from the user's login directory in
the password file (passwd(5)).  This environment variable also
functions as the default argument for the cd built-in.

5) bash(1) (version 4.3.39)
Shell Variables
The following variables are used by the shell.  In some cases, bash
assigns a default value to a variable; these cases are noted below.
HOME   The home directory of the current user; the default argument
for the cd builtin command.  The value of this variable is also used
when performing tilde expansion.

6) OpenLDAP
https://nxr.netbsd.org/xref/src/external/bsd/openldap/dist/libraries/libldap/init.c#331

I've just grabbed what I have at hand, the list could go on ...

>> I work in an environment, where servers are administered by people
>> with different user names and identical uid (0).
>
> I think what you have there is an incredibly badly-designed system that
> can be expected to break outside software (eg, Postgres).  If we take
> this patch, what's to stop someone from complaining that we broke *their*
> badly-designed system that abuses the HOME variable?  I'm pretty hesitant
> to touch code that's worked the same way for a decade or two on such a
> basis.

I don't think this system is incredibly bad. But that's off-topic.

If you think that using the value of the HOME variable as the user's home
directory is a bad idea, I won't argue with that; I've already expressed
my opinion. The real problem here is using the home directory of
user A as the home directory for user B. That's clearly a bug, and if you
want to solve it without using HOME, I am fine with that.
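
For what it's worth, a minimal standalone sketch of the HOME-first lookup
being discussed (not the actual patch; error handling simplified, and the
buffer length is assumed to be greater than zero):

#include <pwd.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* prefer $HOME, fall back to the password database for the current uid */
bool
get_home_path_sketch(char *ret_path, size_t len)
{
	const char *home = getenv("HOME");
	struct passwd *pw;

	if (home != NULL && home[0] != '\0')
	{
		strncpy(ret_path, home, len - 1);
		ret_path[len - 1] = '\0';
		return true;
	}

	pw = getpwuid(getuid());
	if (pw == NULL)
		return false;
	strncpy(ret_path, pw->pw_dir, len - 1);
	ret_path[len - 1] = '\0';
	return true;
}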

r.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] get_home_path: use HOME

2016-09-20 Thread Rudolf Gavlas
2016-09-20 18:55 GMT+02:00, Alvaro Herrera :
> Rudolf Gavlas wrote:
>
>> I work in an environment, where servers are administered by people
>> with different user names and identical uid (0).
>
> So everyone is superuser there?  That sounds, um, unorthodox.

Yes, the administrators of the servers, that means people responsible
for installing, configuring and running all of the software on the
servers day and night are superusers there. I am quite surprised it
may sound unorthodox. I am only used to unix environment though. What
is the orthodox way of doing that, btw?

r.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exclude schema during pg_restore

2016-09-20 Thread Peter Eisentraut
On 9/19/16 3:23 PM, Michael Banck wrote:
> Version 2 attached.

Committed, thanks.

I added the new option to the help output in pg_restore.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-20 Thread Bruce Momjian
On Mon, Sep 19, 2016 at 03:50:38PM -0400, Robert Haas wrote:
> It will probably have some bugs, but they probably won't be worse than
> the status quo:
> 
> WARNING: hash indexes are not WAL-logged and their use is discouraged
> 
> Personally, I think it's outright embarrassing that we've had that
> limitation for years; it boils down to "hey, we have this feature but
> it doesn't work", which is a pretty crummy position for the world's
> most advanced open-source database to take.

No question.  We inherited the technical debt of hash indexes 20 years
ago and haven't really solved it yet.  We keep making incremental
improvements, which keeps it from being removed, but hash is still far
behind other index types.

> > I'm rather unenthused about having a hash index implementation that's
> > mildly better in some corner cases, but otherwise doesn't have much
> > benefit. That'll mean we'll have to step up our user education a lot,
> > and we'll have to maintain something for little benefit.
> 
> If it turns out that it has little benefit, then we don't really need
> to step up our user education.  People can just keep using btree like

The big problem is people coming from other databases and assuming our
hash indexes have the same benefits over btree that exist in some other
database software.  The 9.5 warning at least helps with that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less expensive pg_buffercache on big shmem

2016-09-20 Thread Tomas Vondra

On 09/02/2016 11:01 AM, Robert Haas wrote:

On Fri, Sep 2, 2016 at 8:49 AM, Andres Freund  wrote:

On 2016-09-02 08:31:42 +0530, Robert Haas wrote:

I wonder whether we ought to just switch from the consistent method to
the semiconsistent method and call it good.


+1. I think, before long, we're going to have to switch away from having
locks & partitions in the first place. So I don't see a problem relaxing
this. It's not like that consistency really buys you anything...  I'd
even consider not using any locks.


I think we certainly want to lock the buffer header, because otherwise
we might get a torn read of the buffer tag, which doesn't seem good.
But it's not obvious to me that there's any point in taking the lock
on the buffer mapping partition; I'm thinking that doesn't really do
anything unless we lock them all, and we all seem to agree that's
going too far.


+1 from me to only locking the buffer headers. IMHO that's perfectly 
fine for the purpose of this extension.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less expensive pg_buffercache on big shmem

2016-09-20 Thread Tomas Vondra

On 09/05/2016 06:19 PM, Ivan Kartyshov wrote:

On 09/03/2016 05:04 AM, Tomas Vondra wrote:

This patch needs a rebase, as 06d7fd6e bumped the version to 1.2.

Thank you for a valuable hint.



So, will we get a rebased patch? I see the patch is back in 'needs 
review' but there's no new version.



> If we will replace consistent method, then we should replace it with the
> partially consistent method (called "nonconsistent") because:
> 1) it's based on fast spinlocks (it's not fully up to its name, though)

In other words, it does exactly the thing proposed up-thread, i.e.
locking only buffer headers.

What do you mean by fast spinlocks? And why aren't they up to the name?


Not they (spinlocks), but the name “nonconsistent” is somewhat
misleading. At the moment we can't implement concurrent shared memory
access without locks in general, so most inconsistent method that has
been proposed was the "nonconsistent" one. But roughly speaking
*nonconsistent* is not as such by the name, because it contains a
locking mechanism (spinlocks).



I'm a bit confused by the naming, but my understanding is that 
nonconsistent only acquires (spin)locks on individual buffers, so it 
does not have a consistent view on shared_buffers as a whole, but data 
about individual buffers are consistent. I think that's fine and 
perfectly acceptable for this extension.



> 2) it's *almost* the fastest one (the less time needed for execution of
> method, the less data core will change and as a consequence the more
> consistent snapshot will be)

I'm not sure I understand what you're saying here? What do you mean by
"almost the fastest one"?


I mean the fastest one that has been proposed ("consistent"|
"semiconsistent"| "nonconsistent").


I'm a bit confused by the results you reported before, i.e. that

1) nonconsistent is 10% faster than consistent method
2) semiconsistent is 5-times slower than nonconsistent method

What does that mean? Are you refering to duration of the queries reading
data from pg_buffercache, or to impact on the other queries?


Here I mean "working duration time".



I have no idea what "working duration time" is.


How can semiconsistent be 5x slower than nonconsistent? Wouldn't that
make it significantly slower than the consistent method?


Firstly, when we want to measure the quality of pg_buffercache, we must
measure several values:
1) Execution time (duration of the queries reading data from
pg_buffercache)
2) How it influences the system (and other queries) during its work

Secondly, the semiconsistent is slower than nonconsistent and consistent
method, but it makes less influence on other queries then consistent.

Thirdly, it is slower because it locks the partitions of shared memory
in a different way than in consistent or nonconsistent methods.
The semi-consistent strategy implies that only one buffer is locked at a
time. Therefore it has no significant effect on the system, but it takes
more time.



It'd be really useful if you could provide actual numbers, explain what 
metrics you compare and how. I'm sure it makes sense to you but it's 
utterly confusing for everyone else, and it also makes it impossible to 
reproduce the benchmark.


I've looked at the code (applied on top of e3b607c) and I see this in 
pg_buffercache_pages:


for (j = 0; j < num_partit; j++)
{
if (snaptype == CONS_SNAP)
for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
LWLockAcquire(BufMappingPartitionLockByIndex(i),
  LW_SHARED);
else if (snaptype == SEMICONS_SNAP)
LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED);

for (i = 0; i < NBuffers; i++)
{
  ... walk through all shared buffers ...
}
 }

How is the SEMICONS_SNAP case supposed to only scan the currently locked
partition, when it simply scans all shared buffers for SEMICONS_SNAP?
That seems outright broken; it should scan only the currently locked
partition.


Secondly, I see this bit added to the loop over buffers:

if (bufHdr->tag.forkNum == -1)
{
fctx->record[i].blocknum = InvalidBlockNumber;
continue;
}

and I have no idea why this is needed (when it was not needed before).

The patch also breaks a few comments, because it adds code between the 
comment and the related code. For example the (likely unnecessary) bit 
of code changes this:


/* Lock each buffer header before inspecting. */
buf_state = LockBufHdr(bufHdr);

into this

/* Lock each buffer header before inspecting. */
if (bufHdr->tag.forkNum == -1)
{
fctx->record[i].blocknum = InvalidBlockNumber;
continue;
}
buf_state = LockBufHdr(bufHdr);

which is confusing (and I'd argue broken).

Similarly when releasing the locks, the original comment/code block 
looks like this:


  /*
   * And release locks.  We do this in reverse order for two reasons:
   * (1) Anyone else who needs more than one of the locks 

Re: [HACKERS] Tracking wait event for latches

2016-09-20 Thread Thomas Munro
+ 
+  EventSet: The server process is waiting on a socket
+  or a timer. This wait happens in a latch, an inter-process facility
+  using boolean variables letting a process sleep until it is set.
+  Latches support several type of operations, like postmaster death
+  handling, timeout and socket activity lookup, and are a useful
+  replacement for sleep where signal handling matters.
+ 

This paragraph seems a bit confused.  I suggest something more like
this:  "The server process is waiting for one or more sockets, a timer
or an interprocess latch.  The wait happens in a WaitEventSet,
PostgreSQL's portable IO multiplexing abstraction."

On Tue, Aug 23, 2016 at 7:01 PM, Michael Paquier
 wrote:
>> On Mon, Aug 22, 2016 at 12:09 PM, Alexander Korotkov
>>  wrote:
>>> Would it be better to use an array here?
>
> Okay, I have switched to an array

I looked at various macro tricks[1] but they're all pretty unpleasant!
 +1 for the simple array with carefully managed order.  About that
order...

+const char *const WaitEventNames[] = {
+ "ArchiverMain",
+ "AutoVacuumMain",
+ "BaseBackupThrottle",
+ "BgWorkerStartup",
+ "BgWorkerShutdown",
+ "BgWriterMain",
+ "BgWriterHibernate",
+ "CheckpointerMain",
+ "ExecuteGather",
+ "Extension",
+ "MessageQueuePutMessage",
+ "MessageQueueSend",
+ "MessageQueueReceive",
+ "MessageQueueInternal",
+ "ParallelFinish",
+ "PgStatMain",
+ "ProcSleep",
+ "ProcSignal",
+ "PgSleep",
+ "SecureRead",
+ "SecureWrite",
+ "SSLOpenServer",
+ "SyncRep",
+ "SysLoggerMain",
+ "RecoveryApplyDelay",
+ "RecoveryWalAll",
+ "RecoveryWalStream",
+ "WalReceiverWaitStart",
+ "WalReceiverMain",
+ "WalSenderMain",
+ "WalSenderWaitForWAL",
+ "WalSenderWriteData"
+ "WalWriterMain",
+};

It looks like this array wants to be in alphabetical order, but it
isn't quite.  Also, perhaps a compile time assertion about the size of
the array matching EVENT_LAST_TYPE could be useful?

+typedef enum WaitEventIdentifier
+{
+ EVENT_ARCHIVER_MAIN,
+ EVENT_AUTOVACUUM_MAIN,
+ EVENT_BASEBACKUP_THROTTLE,
+ EVENT_BGWORKER_STARTUP,
+ EVENT_BGWORKER_SHUTDOWN,
+ EVENT_BGWRITER_MAIN,
+ EVENT_BGWRITER_HIBERNATE,
+ EVENT_CHECKPOINTER_MAIN,
+ EVENT_EXECUTE_GATHER,
+ EVENT_EXTENSION,
+ EVENT_MQ_PUT_MESSAGE,
+ EVENT_MQ_SEND_BYTES,
+ EVENT_MQ_RECEIVE_BYTES,
+ EVENT_MQ_WAIT_INTERNAL,
+ EVENT_PARALLEL_WAIT_FINISH,
+ EVENT_PGSTAT_MAIN,
+ EVENT_PROC_SLEEP,
+ EVENT_PROC_SIGNAL,
+ EVENT_PG_SLEEP,
+ EVENT_SECURE_READ,
+ EVENT_SECURE_WRITE,
+ EVENT_SSL_OPEN_SERVER,
+ EVENT_SYNC_REP,
+ EVENT_SYSLOGGER_MAIN,
+ EVENT_RECOVERY_APPLY_DELAY,
+ EVENT_RECOVERY_WAL_ALL,
+ EVENT_RECOVERY_WAL_STREAM,
+ EVENT_WAL_RECEIVER_WAIT_START,
+ EVENT_WAL_RECEIVER_MAIN,
+ EVENT_WAL_SENDER_WRITE_DATA,
+ EVENT_WAL_SENDER_MAIN,
+ EVENT_WAL_SENDER_WAIT_WAL,
+ EVENT_WALWRITER_MAIN
+} WaitEventIdentifier;

This is also nearly but not exactly in alphabetical order
(EVENT_PROC_SIGNAL comes after EVENT_PROC_SLEEP for example).  But
it's not currently possible to have the strings and the enumerators
both in alphabetical order because they're not the same, even
accounting for camel-case to upper-case transliteration.  I think at
least one of them should be sorted.  Shouldn't they match fully and
then *both* be sorted alphabetically?  For example
"MessageQueueInternal" doesn't match EVENT_MQ_WAIT_INTERNAL.  Then
there are some cases where I'd expect underscores for consistency with
other enumerators and with the corresponding camel-case strings: you
have EVENT_WAL_SENDER_MAIN, but EVENT_WALWRITER_MAIN.

The documentation is in a slightly different order again but also not
exactly alphabetical: for example ProcSleep is listed before
ProcSignal.

Sorry if this all sounds terribly picky but I think we should try to
be strictly systematic here.

>>> EventIdentifier seems too general name for me, isn't it?  Could we name it
>>> WaitEventIdentifier? Or WaitEventId for shortcut?
>
> ... And WaitEventIdentifier.

+1 from me too for avoiding the overly general term 'event'.  It does
seem a little odd to leave the enumerators names as EVENT_...  though;
shouldn't these be WAIT_EVENT_... or WE_...?  Or perhaps you could
consider WaitPointIdentifier and WP_SECURE_READ or
WaitEventPointIdentifier and WEP_SECURE_READ, if you buy my earlier
argument that what we are really naming here is point in the code
where we wait, not the events we're waiting for.  Contrast with
LWLocks where we report the lock that you're waiting for, not the
place in the code where you're waiting for that lock.

On Wed, Aug 3, 2016 at 1:31 AM, Michael Paquier
 wrote:
> Finally, I have changed the patch so as it does not track "Latch"
> events, but "EventSet" events instead, per the suggestion of Thomas.
> "WaitEventSet" is too close to wait_event in my taste so I shortened
> the suggeston.

This is good, because showing "Latch" when we were really waiting for
a socket 

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-20 Thread Tomas Vondra

Hi,

On 09/19/2016 09:10 PM, Robert Haas wrote:
>

It's possible that the effect of this patch depends on the number of
sockets. EDB test machine cthulhu has 8 sockets, and power2 has 4
sockets. I assume Dilip's tests were run on one of those two,
although he doesn't seem to have mentioned which one. Your system is
probably 2 or 4 sockets, which might make a difference. Results
might also depend on CPU architecture; power2 is, unsurprisingly, a
POWER system, whereas I assume you are testing x86. Maybe somebody
who has access should test on hydra.pg.osuosl.org, which is a
community POWER resource. (Send me a private email if you are a known
community member who wants access for benchmarking purposes.)



Yes, I'm using x86 machines:

1) large but slightly old
- 4 sockets, e5-4620 (so a bit old CPU, 32 cores in total)
- kernel 3.2.80

2) smaller but fresh
- 2 sockets, e5-2620 v4 (newest type of Xeons, 16 cores in total)
- kernel 4.8.0


Personally, I find the results so far posted on this thread
thoroughly unimpressive. I acknowledge that Dilip's results appear
to show that in a best-case scenario these patches produce a rather
large gain. However, that gain seems to happen in a completely
contrived scenario: astronomical client counts, unlogged tables, and
a test script that maximizes pressure on CLogControlLock. If you
have to work that hard to find a big win, and tests under more
reasonable conditions show no benefit, it's not clear to me that it's
really worth the time we're all spending benchmarking and reviewing
this, or the risk of bugs, or the damage to the SLRU abstraction
layer. I think there's a very good chance that we're better off
moving on to projects that have a better chance of helping in the
real world.


I'm posting results from two types of workloads - traditional r/w 
pgbench and Dilip's transaction. With synchronous_commit on/off.


Full results (including script driving the benchmark) are available 
here, if needed:


https://bitbucket.org/tvondra/group-clog-benchmark/src

It'd be good if someone could try reproduce this on a comparable 
machine, to rule out my stupidity.



2 x e5-2620 v4 (16 cores, 32 with HT)
=

On the "smaller" machine the results look like this - I have only tested 
up to 64 clients, as higher values seem rather uninteresting on a 
machine with only 16 physical cores.


These are averages of 5 runs, where the min/max for each group are 
within ~5% in most cases (see the "spread" sheet). The "e5-2620" sheet 
also shows the numbers as % compared to master.



 dilip / sync=off         1       4       8      16      32      64
 -------------------------------------------------------------------
 master                4756   17672   35542   57303   74596   82138
 granular-locking      4745   17728   35078   56105   72983   77858
 no-content-lock       4646   17650   34887   55794   73273   79000
 group-update          4582   17757   35383   56974   74387   81794

 dilip / sync=on          1       4       8      16      32      64
 -------------------------------------------------------------------
 master                4819   17583   35636   57437   74620   82036
 granular-locking      4568   17816   35122   56168   73192   78462
 no-content-lock       4540   17662   34747   55560   73508   79320
 group-update          4495   17612   35474   57095   74409   81874

 pgbench / sync=off       1       4       8      16      32      64
 -------------------------------------------------------------------
 master                3791   14368   27806   43369   54472   62956
 granular-locking      3822   14462   27597   43173   56391   64669
 no-content-lock       3725   14212   27471   43041   55431   63589
 group-update          3895   14453   27574   43405   56783   62406

 pgbench / sync=on        1       4       8      16      32      64
 -------------------------------------------------------------------
 master                3907   14289   27802   43717   56902   62916
 granular-locking      3770   14503   27636   44107   55205   63903
 no-content-lock       3772   14111   27388   43054   56424   64386
 group-update          3844   14334   27452   43621   55896   62498

There's pretty much no improvement at all - most of the results are 
within 1-2% of master, in both directions. Hardly a win.


Actually, with 1 client there seems to be ~5% regression, but it might 
also be noise and verifying it would require further testing.



4 x e5-4620 v1 (32 cores, 64 with HT)
=

These are averages of 10 runs, and there are a few strange things here.

Firstly, for Dilip's workload the results get much (much) worse between 
64 and 128 clients, for some reason. I suspect this might be due to 
fairly old kernel (3.2.80), so I'll reboot the machine with 4.5.x kernel 
and try again.


Secondly, the min/max differences get much larger than the ~5% on the 

Re: [HACKERS] gratuitous casting away const

2016-09-20 Thread Mark Dilger

> On Sep 20, 2016, at 1:06 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>> Would patches to not cast away const be considered?
> 
> In general, yes, but I'm not especially in favor of something like this:
> 
>> bool
>> PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
>> -   Item newtup, Size newsize)
>> +   const char *newtup, Size 
>> newsize)
>> {
> 
> since that seems to be discarding type information in order to add
> "const"; does not seem like a net benefit from here.

The following seems somewhere in between, with ItemPointer
changing to const ItemPointerData *.  I expect you would not care
for this change, but thought I'd check to see where you draw the line:


diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index 71c64e4..903b01f 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -244,7 +244,7 @@ ginInsertBAEntries(BuildAccumulator *accum,
 static int
 qsortCompareItemPointers(const void *a, const void *b)
 {
-   int res = ginCompareItemPointers((ItemPointer) a, (ItemPointer) b);
+   int res = ginCompareItemPointers((const ItemPointerData *) a, (const ItemPointerData *) b);

/* Assert that there are no equal item pointers being sorted */
Assert(res != 0);
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index bf589ab..2e5a7dff 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -968,7 +968,7 @@ extern ItemPointer ginMergeItemPointers(ItemPointerData *a, 
uint32 na,
  * so we want this to be inlined.
  */
 static inline int
-ginCompareItemPointers(ItemPointer a, ItemPointer b)
+ginCompareItemPointers(const ItemPointerData *a, const ItemPointerData *b)
 {
	uint64  ia = (uint64) a->ip_blkid.bi_hi << 32 | (uint64) a->ip_blkid.bi_lo << 16 | a->ip_posid;
	uint64  ib = (uint64) b->ip_blkid.bi_hi << 32 | (uint64) b->ip_blkid.bi_lo << 16 | b->ip_posid;




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] needlessly casting away const in localtime.c

2016-09-20 Thread Tom Lane
Mark Dilger  writes:
> Friends, per the recent thread "gratuitous casting away const", the
> assignment on line 1247 of localtime.c has const lvalue and rvalue,
> yet casts through (char *) rather than (const char *).  Fix attached.

If you want to propose cosmetic improvements in localtime.c, or the
majority of the other stuff in src/timezone, you need to send it to
the IANA folk.  Otherwise it'll just be lost next time we sync with
them.  See src/timezone/README.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] needlessly casting away const in localtime.c

2016-09-20 Thread Mark Dilger
Friends, per the recent thread "gratuitous casting away const", the
assignment on line 1247 of localtime.c has const lvalue and rvalue,
yet casts through (char *) rather than (const char *).  Fix attached.

Mark Dilger



localtime.c.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-09-20 Thread Andres Freund
On 2016-09-20 16:32:46 -0400, Robert Haas wrote:
> > Requiring a non-default compile time or even just cluster creation time
> > option for tuning isn't something worth expanding energy on imo.
> 
> I don't agree.  The latency requirements on an archive_command when
> you're churning out 16MB files multiple times per second are insanely
> tight, and saying that we shouldn't increase the size because it's
> better to go redesign a bunch of other things that will eventually
> *maybe* remove the need for archive_command does not seem like a
> reasonable response.

Oh, I'm on board with increasing the default size a bit. A different
default size isn't a non-default compile time option anymore though, and
I don't think 1GB is a reasonable default.

Running multiple archive_commands concurrently - pretty easy to
implement - isn't the same as removing the need for archive_command. I'm
pretty sure that continuously, and if necessary concurrently, archiving a
bunch of 64MB files is going to work better than irregularly
creating / transferring 1GB files.


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 4:25 PM, Andres Freund  wrote:
>> Even with a 1GB segment size, some of them will fill multiple files
>> per minute.  At the current limit of 64MB, a few of them would still
>> fill more than one file per second.  That is not sane.
>
> I doubt generating much larger files actually helps a lot there. I bet
> you a patch review that 1GB files are going to regress in pretty much
> every situation; especially when taking latency into account.

Well, you have a point: let's find out.  Suppose we create a cluster
that generates WAL very quickly, and then try different WAL segment
sizes and see what works out best.  Maybe something like: create N
relatively small tables, with 100 or so tuples in each one.  Have N
backends, each assigned one of those tables, and it just updates all
the rows over and over in a tight loop.  Or feel free to suggest
something else.

> I think what's actually needed for that is:
> - make it easier to implement archiving via streaming WAL; i.e. make
>   pg_receivexlog actually usable
> - make archiving parallel
> - decouple WAL write & fsyncing granularity from segment size
>
> Requiring a non-default compile time or even just cluster creation time
> option for tuning isn't something worth expanding energy on imo.

I don't agree.  The latency requirements on an archive_command when
you're churning out 16MB files multiple times per second are insanely
tight, and saying that we shouldn't increase the size because it's
better to go redesign a bunch of other things that will eventually
*maybe* remove the need for archive_command does not seem like a
reasonable response.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-09-20 Thread Andres Freund
On 2016-09-20 16:18:02 -0400, Robert Haas wrote:
> On Tue, Sep 20, 2016 at 4:09 PM, Andres Freund  wrote:
> > That sounds way too big to me. WAL file allocation would trigger pretty
> > massive IO storms during zeroing, max_wal_size is going to be hard to
> > tune, the amounts of dirty data during bulk loads is going to be very
> > hard to control.  If somebody wants to do something like this they
> > better be well informed enough to override a #define.
> 
> EnterpriseDB has customers generating multiple TB of WAL per day.

Sure, that's kind of common.


> Even with a 1GB segment size, some of them will fill multiple files
> per minute.  At the current limit of 64MB, a few of them would still
> fill more than one file per second.  That is not sane.

I doubt generating much larger files actually helps a lot there. I bet
you a patch review that 1GB files are going to regress in pretty much
every situation; especially when taking latency into account.
I think what's actually needed for that is:
- make it easier to implement archiving via streaming WAL; i.e. make
  pg_receivexlog actually usable
- make archiving parallel
- decouple WAL write & fsyncing granularity from segment size

Requiring a non-default compile time or even just cluster creation time
option for tuning isn't something worth expanding energy on imo.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 4:09 PM, Andres Freund  wrote:
> On 2016-09-20 16:05:44 -0400, Robert Haas wrote:
>> On Tue, Sep 20, 2016 at 2:49 PM, Peter Eisentraut
>>  wrote:
>> > On 8/24/16 9:31 PM, Robert Haas wrote:
>> >> I'd like to propose that we increase the default WAL segment size,
>> >> which is currently 16MB.
>> >
>> > While the discussion about the best default value is ongoing, maybe we
>> > should at least *allow* some larger sizes, for testing out.  Currently,
>> > configure says "Allowed values are 1,2,4,8,16,32,64.".  What might be a
>> > good new upper limit?
>
> I'm doubtful it's worth increasing this.
>
>> 1GB?
>
> That sounds way too big to me. WAL file allocation would trigger pretty
> massive IO storms during zeroing, max_wal_size is going to be hard to
> tune, the amounts of dirty data during bulk loads is going to be very
> hard to control.  If somebody wants to do something like this they
> better be well informed enough to override a #define.

EnterpriseDB has customers generating multiple TB of WAL per day.
Even with a 1GB segment size, some of them will fill multiple files
per minute.  At the current limit of 64MB, a few of them would still
fill more than one file per second.  That is not sane.
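
For a rough sense of scale (assuming, purely for illustration, 10 TB of
WAL per day):

    10 TB/day                 ~ 121 MB/s of WAL
    121 MB/s / 16 MB/segment  ~ 7.6 segment switches per second
    121 MB/s / 64 MB/segment  ~ 1.9 segment switches per second
    121 MB/s / 1 GB/segment   ~ one segment switch every ~8.5 seconds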

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-09-20 Thread Andres Freund
On 2016-09-20 16:05:44 -0400, Robert Haas wrote:
> On Tue, Sep 20, 2016 at 2:49 PM, Peter Eisentraut
>  wrote:
> > On 8/24/16 9:31 PM, Robert Haas wrote:
> >> I'd like to propose that we increase the default WAL segment size,
> >> which is currently 16MB.
> >
> > While the discussion about the best default value is ongoing, maybe we
> > should at least *allow* some larger sizes, for testing out.  Currently,
> > configure says "Allowed values are 1,2,4,8,16,32,64.".  What might be a
> > good new upper limit?

I'm doubtful it's worth increasing this.


> 1GB?

That sounds way too big to me. WAL file allocation would trigger pretty
massive IO storms during zeroing, max_wal_size is going to be hard to
tune, the amounts of dirty data during bulk loads is going to be very
hard to control.  If somebody wants to do something like this they
better be well informed enough to override a #define.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rename max_parallel_degree?

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 3:31 PM, Peter Eisentraut
 wrote:
> On 9/20/16 11:47 AM, Robert Haas wrote:
 >> #define BGWORKER_CLASS_MASK   0x00f0
 >> #define BGWORKER_CLASS_PARALLEL  0x0010
 >> /* add additional bgworker classes here */
>>> >
>>> > Unless we have a mechanism that makes use of this somehow, this attempt
>>> > at categorizing might be premature.
>> My main reason for wanting is that I suspect there will be a similar
>> desire to limit the number of replication workers at some point.
>
> Would that work for something developed as an extension?  Would we all
> have to agree on non-conflicting numbers?  Maybe a string tag would work
> better?

No, I'm assuming that the classes would be built-in.  A string tag
seems like over-engineering to me, particularly because the postmaster
needs to switch on the tag, and we need to be very careful about the
degree to which the postmaster trusts the contents of shared memory.
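
For concreteness, a toy standalone sketch of how a numeric class would be
consumed (the mask and class values are the ones quoted above; the flags
value and everything else is made up for illustration):

#include <stdio.h>

#define BGWORKER_CLASS_MASK      0x00f0
#define BGWORKER_CLASS_PARALLEL  0x0010

int
main(void)
{
	int		bgw_flags = 0x0011;	/* hypothetical flags word carrying the parallel class */

	if ((bgw_flags & BGWORKER_CLASS_MASK) == BGWORKER_CLASS_PARALLEL)
		printf("worker counts against the parallel-worker limit\n");
	else
		printf("worker is unclassified\n");
	return 0;
}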

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gratuitous casting away const

2016-09-20 Thread Tom Lane
Mark Dilger  writes:
> Would patches to not cast away const be considered?

In general, yes, but I'm not especially in favor of something like this:

>  bool
>  PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
> -   Item newtup, Size newsize)
> +   const char *newtup, Size newsize)
>  {

since that seems to be discarding type information in order to add
"const"; does not seem like a net benefit from here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 2:49 PM, Peter Eisentraut
 wrote:
> On 8/24/16 9:31 PM, Robert Haas wrote:
>> I'd like to propose that we increase the default WAL segment size,
>> which is currently 16MB.
>
> While the discussion about the best default value is ongoing, maybe we
> should at least *allow* some larger sizes, for testing out.  Currently,
> configure says "Allowed values are 1,2,4,8,16,32,64.".  What might be a
> good new upper limit?

1GB?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] gratuitous casting away const

2016-09-20 Thread Mark Dilger
Friends,

There are places in the code that cast away const for no apparent
reason; they can be fixed like so:



diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..fe2cfc2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -409,8 +409,8 @@ static int
 reorderqueue_cmp(const pairingheap_node *a, const pairingheap_node *b,
 void *arg)
 {
-   ReorderTuple *rta = (ReorderTuple *) a;
-   ReorderTuple *rtb = (ReorderTuple *) b;
+   const ReorderTuple *rta = (const ReorderTuple *) a;
+   const ReorderTuple *rtb = (const ReorderTuple *) b;
IndexScanState *node = (IndexScanState *) arg;
 
return -cmp_orderbyvals(rta->orderbyvals, rta->orderbynulls,



There are also places which appear to cast away const due to using
typedefs that don't include const, which makes for a messier fix,
like so:



diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 73aa0c0..9e157a3 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1037,7 +1037,7 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  */
 bool
 PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
-   Item newtup, Size newsize)
+   const char *newtup, Size newsize)
 {
PageHeader  phdr = (PageHeader) page;
ItemId  tupid;
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..cd97a1a 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -431,7 +431,7 @@ extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
 extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
 extern void PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offset);
 extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
-   Item newtup, Size newsize);
+   const char *newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);



Are these castings away of const consistent with the project's coding standards?
Would patches to not cast away const be considered?

Thanks,

Mark Dilger

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rename max_parallel_degree?

2016-09-20 Thread Peter Eisentraut
On 9/20/16 11:47 AM, Robert Haas wrote:
>>> >> #define BGWORKER_CLASS_MASK   0x00f0
>>> >> #define BGWORKER_CLASS_PARALLEL  0x0010
>>> >> /* add additional bgworker classes here */
>> >
>> > Unless we have a mechanism that makes use of this somehow, this attempt
>> > at categorizing might be premature.
> My main reason for wanting it is that I suspect there will be a similar
> desire to limit the number of replication workers at some point.

Would that work for something developed as an extension?  Would we all
have to agree on non-conflicting numbers?  Maybe a string tag would work
better?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE

2016-09-20 Thread Peter Eisentraut
On 9/20/16 3:04 PM, David Fetter wrote:
> On Fri, Sep 09, 2016 at 09:57:21AM -0400, Peter Eisentraut wrote:
>> Review of the patch in the commit fest:
>>
>> - The documentation is a bit incorrect about the ways to load this
>>   module.  shared_preload_libraries is not necessary.  session_ and
>>   local_ (with prep) should also work.
> 
> Would you be so kind as to describe how you got
> local_preload_libraries to work?  I'm stuck on getting Makefile to
> realize that the hook should be installed in $libdir/plugins rather
> than $libdir itself.

There is no Makefile support for that, and AFAICT, that particular
feature is kind of obsolescent.  I wouldn't worry about it too much.
The main point was, there are multiple ways to load modules.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE

2016-09-20 Thread David Fetter
On Fri, Sep 09, 2016 at 09:57:21AM -0400, Peter Eisentraut wrote:
> Review of the patch in the commit fest:
> 
> - The documentation is a bit incorrect about the ways to load this
>   module.  shared_preload_libraries is not necessary.  session_ and
>   local_ (with prep) should also work.

Would you be so kind as to describe how you got
local_preload_libraries to work?  I'm stuck on getting Makefile to
realize that the hook should be installed in $libdir/plugins rather
than $libdir itself.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] get_home_path: use HOME

2016-09-20 Thread Alvaro Herrera
Rudolf Gavlas wrote:
> 2016-09-20 18:55 GMT+02:00, Alvaro Herrera :
> > Rudolf Gavlas wrote:
> >
> >> I work in an environment, where servers are administered by people
> >> with different user names and identical uid (0).
> >
> > So everyone is superuser there?  That sounds, um, unorthodox.
> 
> Yes, the administrators of the servers, that is, the people responsible
> for installing, configuring and running all of the software on the
> servers day and night, are superusers there. I am quite surprised it
> may sound unorthodox. I am only used to unix environments, though. What
> is the orthodox way of doing that, btw?

In my view of the world, each of the admins would have a regular user,
with the privilege of running commands as superuser using something like
"sudo" (including running a shell).

get_home_path is psql's code.  I would expect client connections to come
from regular users, as it is considered risky to run all code with
elevated privileges, anyway.

As I recall, if you tried to start the postgres server using a superuser
account you would quickly find out that it completely refuses to start.
I suppose it works because some start script su's to the postgres
unprivileged account to run pg_ctl.  (Windows is an exception to this,
where it used to be customary to run servers using administrator
privileges, where instead of outright refusing to run, pg_ctl would drop
all privileges first.)

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "Some tests to cover hash_index"

2016-09-20 Thread Alvaro Herrera

Why not use generate_series() queries to insert the appropriate number
of tuples, instead of a handful of INSERT lines each time?  Since each
insert is a separate transaction, that would probably be faster.

Why do you have a plpgsql function just to create a cursor?  Wouldn't it
be simpler to create the cursor in an SQL statement?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2016-09-20 Thread Peter Eisentraut
On 8/24/16 9:31 PM, Robert Haas wrote:
> I'd like to propose that we increase the default WAL segment size,
> which is currently 16MB.

While the discussion about the best default value is ongoing, maybe we
should at least *allow* some larger sizes, for testing out.  Currently,
configure says "Allowed values are 1,2,4,8,16,32,64.".  What might be a
good new upper limit?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-20 Thread Claudio Freire
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>
>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>> the other thread, with these patches, that'd be nice. These patches are
>> effectively a replacement for
>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>> would be much appreciated too, of course.
>>
>> Attached are new versions. Compared to last set, they contain a few comment
>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>> that were completely unused.
>
>
> Will do so

Well, here they are, the results.

ODS format only (unless you've got issues opening the ODS).

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.


logtape_preload_timings.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-20 Thread Jesper Pedersen

On 09/16/2016 02:42 AM, Amit Kapila wrote:

Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1].  I have also updated README.



Thanks.

This needs a rebase against the CHI v8 [1] patch.

[1] 
https://www.postgresql.org/message-id/CAA4eK1+X=8sud1uczdzne3d9cgi9kw+kjxp2tnw7sx5w8pl...@mail.gmail.com


Best regards,
 Jesper



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-09-20 Thread Jesper Pedersen

Hi,

On 09/20/2016 01:46 PM, Ashutosh Sharma wrote:

I am getting a fatal error along with some warnings when trying to
apply the v3 patch using the "git apply" command.

[ashu@localhost postgresql]$ git apply
0001-pageinspect-Hash-index-support_v3.patch
0001-pageinspect-Hash-index-support_v3.patch:29: trailing whitespace.
  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
0001-pageinspect-Hash-index-support_v3.patch:36: trailing whitespace.
DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \
0001-pageinspect-Hash-index-support_v3.patch:37: trailing whitespace.
pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
0001-pageinspect-Hash-index-support_v3.patch:38: trailing whitespace.
pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
0001-pageinspect-Hash-index-support_v3.patch:39: trailing whitespace.
pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
fatal: git apply: bad git-diff - expected /dev/null on line 46



Which platform are you on?

The development branch is @ 
https://github.com/jesperpedersen/postgres/tree/pageinspect_hash



Also, I think the USAGE for hash_metap() is referring to
hash_metap_bytea(). Please correct it. I have just started reviewing
the patch and will keep posting my comments as the review progresses.


Fixed in v4.

Thanks for the review.

Best regards,
 Jesper




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-09-20 Thread Jesper Pedersen

On 09/20/2016 12:45 PM, Jeff Janes wrote:

Is the 2nd "1" in this call needed?

SELECT * FROM hash_page_stats(get_raw_page('mytab_index', 1), 1)

As far as I can tell, that argument is only used to stuff into the output
field "blkno", it is not used to instruct the interpretation of the raw
page itself.  It doesn't seem worthwhile to have the parameter that only
echos back to the user what the user already put in (twice).  The only
existing funtions which take the blkno argument are those that don't use
the get_raw_page style.

Also, should we document what the single letter values mean in the
hash_page_stats.type column?  It is not obvious that 'i' means bitmap, for
example.



Adjusted in v4. Code/doc will need an update once the CHI patch goes in.

Best regards,
 Jesper

>From 1f27a2bb28cc6dfea9cba015d7cceab768f67d0a Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 5 Aug 2016 10:16:32 -0400
Subject: [PATCH] pageinspect: Hash index support

---
 contrib/pageinspect/Makefile  |  10 +-
 contrib/pageinspect/hashfuncs.c   | 403 ++
 contrib/pageinspect/pageinspect--1.5--1.6.sql |  63 
 contrib/pageinspect/pageinspect--1.5.sql  | 279 --
 contrib/pageinspect/pageinspect--1.6.sql  | 338 +
 contrib/pageinspect/pageinspect.control   |   2 +-
 doc/src/sgml/pageinspect.sgml | 104 +++
 7 files changed, 914 insertions(+), 285 deletions(-)
 create mode 100644 contrib/pageinspect/hashfuncs.c
 create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql
 delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql
 create mode 100644 contrib/pageinspect/pageinspect--1.6.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index a98237e..c73d728 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -2,13 +2,13 @@
 
 MODULE_big	= pageinspect
 OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
-		  brinfuncs.o ginfuncs.o $(WIN32RES)
+		  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \
-	pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \
-	pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \
-	pageinspect--unpackaged--1.0.sql
+DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \
+	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
+	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
+	pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
 ifdef USE_PGXS
diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c
new file mode 100644
index 000..6efbf22
--- /dev/null
+++ b/contrib/pageinspect/hashfuncs.c
@@ -0,0 +1,403 @@
+/*
+ * hashfuncs.c
+ *		Functions to investigate the content of HASH indexes
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pageinspect/hashfuncs.c
+ */
+
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+
+PG_FUNCTION_INFO_V1(hash_metap);
+PG_FUNCTION_INFO_V1(hash_page_items);
+PG_FUNCTION_INFO_V1(hash_page_stats);
+
+/* 
+ * structure for single hash page statistics
+ * 
+ */
+typedef struct HashPageStat
+{
+	uint32		live_items;
+	uint32		dead_items;
+	uint32		page_size;
+	uint32		max_avail;
+	uint32		free_size;
+	uint32		avg_item_size;
+	char		type;
+
+	/* opaque data */
+	BlockNumber hasho_prevblkno;
+	BlockNumber hasho_nextblkno;
+	Bucket		hasho_bucket;
+	uint16		hasho_flag;
+	uint16		hasho_page_id;
+} HashPageStat;
+
+
+/*
+ * Verify that the given bytea contains a HASH page, or die in the attempt.
+ * A pointer to the page is returned.
+ */
+static Page
+verify_hash_page(bytea *raw_page)
+{
+	Page		page;
+	int			raw_page_size;
+	HashPageOpaque pageopaque;
+
+	raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+	if (raw_page_size != BLCKSZ)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d",
+		   BLCKSZ, raw_page_size)));
+
+	page = VARDATA(raw_page);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	if (pageopaque->hasho_page_id != HASHO_PAGE_ID)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a HASH page"),
+ errdetail("Expected %08x, got %08x.",
+		   HASHO_PAGE_ID, pageopaque->hasho_page_id)));
+
+	return page;
+}
+
+/* -
+ * GetHashPageStatistics()
+ *
+ * Collect statistics of single hash page
+ * -
+ */
+static void
+GetHashPageStatistics(Page page, HashPageStat *stat)
+{
+	PageHeader	phdr = (PageHeader) page;
+	OffsetNumber 

Re: [HACKERS] pageinspect: Hash index support

2016-09-20 Thread Ashutosh Sharma
Hi,

I am getting a fatal error along with some warnings when trying to
apply the v3 patch using the "git apply" command.

[ashu@localhost postgresql]$ git apply
0001-pageinspect-Hash-index-support_v3.patch
0001-pageinspect-Hash-index-support_v3.patch:29: trailing whitespace.
  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
0001-pageinspect-Hash-index-support_v3.patch:36: trailing whitespace.
DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \
0001-pageinspect-Hash-index-support_v3.patch:37: trailing whitespace.
pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
0001-pageinspect-Hash-index-support_v3.patch:38: trailing whitespace.
pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
0001-pageinspect-Hash-index-support_v3.patch:39: trailing whitespace.
pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
fatal: git apply: bad git-diff - expected /dev/null on line 46

Also, I think the USAGE for hash_metap() is referring to
hash_metap_bytea(). Please correct it. I have just started reviewing
the patch and will keep posting my comments as the review progresses.
Thanks.

With Regards,
Ashutosh Sharma.
EnterpriseDB: http://www.enterprisedb.com



On Tue, Sep 20, 2016 at 10:15 PM, Jeff Janes  wrote:
> On Tue, Sep 20, 2016 at 5:40 AM, Jesper Pedersen
>  wrote:
>>
>> On 09/20/2016 03:19 AM, Michael Paquier wrote:
>>>
>>> You did not get right the comments from Alvaro upthread. The following
>>> functions are added with this patch:
>>>  function hash_metap(text)
>>>  function hash_metap_bytea(bytea)
>>>  function hash_page_items(text,integer)
>>>  function hash_page_items_bytea(bytea)
>>>  function hash_page_stats(text,integer)
>>>  function hash_page_stats_bytea(bytea,integer)
>>>
>>> Now the following set of functions would be sufficient:
>>> function hash_metapage_info(bytea)
>>> function hash_page_items(bytea)
>>> function hash_page_stats(bytea)
>>> The last time pageinspect has been updated, when BRIN functions have
>>> been added, it has been discussed to just use (bytea) as an argument
>>> interface and just rely on get_raw_page() to get the pages wanted, so
>>> I think that we had better stick with that and keep things simple.
>>>
>>
>> Yes, I know, Alvaro and you voted for the bytea methods, and Jeff asked
>> for both.
>>
>> Attached is v3 with only the bytea based methods.
>>
>> Alvaro, Michael and Jeff - Thanks for the review !
>
>
>
> Is the 2nd "1" in this call needed?
>
> SELECT * FROM hash_page_stats(get_raw_page('mytab_index', 1), 1)
>
> As far as I can tell, that argument is only used to stuff into the output
> field "blkno", it is not used to instruct the interpretation of the raw page
> itself.  It doesn't seem worthwhile to have the parameter that only echos
> back to the user what the user already put in (twice).  The only existing
> funtions which take the blkno argument are those that don't use the
> get_raw_page style.
>
> Also, should we document what the single letter values mean in the
> hash_page_stats.type column?  It is not obvious that 'i' means bitmap, for
> example.
>
> Cheers,
>
> Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvements in psql hooks for variables

2016-09-20 Thread Daniel Verite
Ashutosh Bapat wrote:

> You may want to add this to the November commitfest
> https://commitfest.postgresql.org/11/.

Done. It's at https://commitfest.postgresql.org/11/799/

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rename max_parallel_degree?

2016-09-20 Thread Petr Jelinek
On 20/09/16 17:47, Robert Haas wrote:
> On Tue, Sep 20, 2016 at 11:27 AM, Peter Eisentraut
>  wrote:
>> On 9/19/16 2:18 PM, Robert Haas wrote:
>>> I suggest that we make it part of bgw_flags, but use a bitmask for it,
>>> like this:
>>>
>>> #define BGWORKER_CLASS_MASK   0x00f0
>>> #define BGWORKER_CLASS_PARALLEL  0x0010
>>> /* add additional bgworker classes here */
>>
>> Unless we have a mechanism that makes use of this somehow, this attempt
>> at categorizing might be premature.
> 
> My main reason for wanting it is that I suspect there will be a similar
> desire to limit the number of replication workers at some point.
> 

Yes there definitely is desire to use this for replication workers as
well. The logical replication launcher currently handles the limit by
itself but I'd definitely prefer to reuse something more generic.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/Postgr

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 1:03 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, May 26, 2016 at 7:44 PM, Michael Paquier
>>  wrote:
>>> Yeah, it is definitely better to register it. And I have switched the
>>> patch as ready for committer just now.
>
>> Committed and back-patched to 9.4, where DSM was introduced.
>
> ISTM both the previous coding and this version can fail for no good
> reason; that is, what if GetLastError() happens to return one of these
> error codes as a result of some unrelated failure from before this
> subroutine is entered?  That is, wouldn't it be a good idea to
> do SetLastError(0) before calling CreateFileMapping?  Or is the latter
> guaranteed to do that on success?  I don't see that stated in its
> man page.

I'll leave it to people who know more about Windows than I do to opine
on that.  I suspect it's not too seriously broken because we've
managed to cut two (almost three) major releases since this code was
written without any complaints about that.  But there may well be
something that can be tightened up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 12:53 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Yeah, random() is the wrong thing.  It should use PostmasterRandom().
>> Fixed to do that instead.
>
> I am not very happy about this patch; have you considered the security
> implications of what you just did?

Hmm. No.

> If you haven't, I'll tell you:
> you just made the postmaster's selection of "random" cancel keys and
> password salts a lot more predictable.  Formerly, the srandom() seed
> for those depended on both the postmaster start time and the time of
> the first connection request, but this change removes the first
> connection request from the equation.  If you know the postmaster start
> time --- which we will happily tell any asker --- it will not take too
> many trials to find the seed that's in use.

Realistically, in some large percentage of the real-world installs,
that's not going to take too many trials anyway.  People don't
generally start a postmaster so that they can NOT connect to it, and
there are plenty of real-world installations where you can count on
the first connection happening in well under 1s.  I'd suggest that if
you're relying on that time being a secret for anything very
security-critical, you're already in trouble.

> I'd be the first to agree that this point is inadequately documented
> in the code, but PostmasterRandom should be reserved for its existing
> security-related uses, not exposed to the world for (ahem) random other
> uses.

So, we could have dsm_postmaster_startup() seed the random number
generator itself, and then let PostmasterRandom() override the seed
later.  Like maybe:

struct timeval tv;
gettimeofday(&tv, NULL);
srandom(tv.tv_sec);
...
dsm_control_handle = random();

dsm_postmaster_startup() doesn't care very much about whether an
adversary can predict the chosen DSM control segment ID, but it
doesn't want to keep picking the same one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.85140

2016-09-20 Thread Tom Lane
Robert Haas  writes:
> On Thu, May 26, 2016 at 7:44 PM, Michael Paquier
>  wrote:
>> Yeah, it is definitely better to register it. And I have switched the
>> patch as ready for committer just now.

> Committed and back-patched to 9.4, where DSM was introduced.

ISTM both the previous coding and this version can fail for no good
reason; that is, what if GetLastError() happens to return one of these
error codes as a result of some unrelated failure from before this
subroutine is entered?  That is, wouldn't it be a good idea to
do SetLastError(0) before calling CreateFileMapping?  Or is the latter
guaranteed to do that on success?  I don't see that stated in its
man page.
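
For illustration, a minimal sketch of that idea, assuming only the plain
Win32 API (the helper name and error handling here are hypothetical, not the
actual dsm_impl.c code):

#include <windows.h>

/* Clear any stale last-error value before the call, so a leftover code from
 * earlier work cannot be mistaken for a result of CreateFileMapping itself. */
static HANDLE
create_named_segment(const char *name, DWORD size)
{
    HANDLE      hmap;

    SetLastError(0);
    hmap = CreateFileMappingA(INVALID_HANDLE_VALUE,   /* backed by the pagefile */
                              NULL,                   /* default security */
                              PAGE_READWRITE,
                              0, size,
                              name);
    if (hmap != NULL && GetLastError() == ERROR_ALREADY_EXISTS)
    {
        /* the name was already in use; treat it as a collision */
        CloseHandle(hmap);
        return NULL;
    }
    return hmap;
}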

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] get_home_path: use HOME

2016-09-20 Thread Alvaro Herrera
Rudolf Gavlas wrote:

> I work in an environment, where servers are administered by people
> with different user names and identical uid (0).

So everyone is superuser there?  That sounds, um, unorthodox.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-20 Thread Jeff Janes
On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
wrote:

>
> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
> that, I have changed _hash_alloc_buckets() to initialize the page
> instead of making it completely Zero because of problems discussed in
> another related thread [1].  I have also updated README.
>
>
with v7 of the concurrent has patch and v4 of the write ahead log patch and
the latest relcache patch (I don't know how important that is to
reproducing this, I suspect it is not), I once got this error:


38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
interrupted; last known up at 2016-09-19 16:25:49 PDT
38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
properly shut down; automatic recovery in progress
38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
failed, calculated checksum 65067 but expected 21260
38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
relation base/16384/17334
38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T


The original page with the invalid checksum is:

$ od 16384_17334_9
000 32 00 015420 077347 020404 00 30 017760
020 017760 020004 00 00 00 00 00 00
040 00 00 00 00 00 00 00 00
*
0017760 07 02 17 17 07 00 02 177600
002

If I ignore the checksum failure and re-start the system, the page gets
restored to be a bitmap page.

Cheers,

Jeff


Re: [HACKERS] Use of SizeOfIptrData - is that obsolete?

2016-09-20 Thread Pavan Deolasee
On Tue, Sep 20, 2016 at 8:34 PM, Tom Lane  wrote:

> Pavan Deolasee  writes:
> > I happened to notice this comment in src/include/storage/itemptr.h
>
> >  * Note: because there is an item pointer in each tuple header and index
> >  * tuple header on disk, it's very important not to waste space with
> >  * structure padding bytes.  The struct is designed to be six bytes long
> >  * (it contains three int16 fields) but a few compilers will pad it to
> >  * eight bytes unless coerced.  We apply appropriate persuasion where
> >  * possible, and to cope with unpersuadable compilers, we try to use
> >  * "SizeOfIptrData" rather than "sizeof(ItemPointerData)" when computing
> >  * on-disk sizes.
> >  */
>
> > Is that now obsolete?
>
> Realistically, because struct HeapTupleHeaderData contains a field of
> type ItemPointerData, it's probably silly to imagine that we can save
> anything if the compiler can't be persuaded to believe that
> sizeof(ItemPointerData) is 6.  It may well be that the structure pragmas
> work on everything that wouldn't natively believe that anyway.
>

Yeah, that's what I thought, and the rest of the code seems to make that
assumption as well. The attached patch removes the last remaining reference to
SizeOfIptrData and also removes the macro and the associated comment. TBH I
couldn't fully trace how the TID array gets generated in nodeTidscan.c, but
I'm fairly confident that this should not break anything because this was
the only remaining reference.

While the htup.h refactoring happened in 9.5, I don't see any point in
back-patching this.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_remove_itemptr_size.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission denied”

2016-09-20 Thread Tom Lane
Robert Haas  writes:
> Yeah, random() is the wrong thing.  It should use PostmasterRandom().
> Fixed to do that instead.

I am not very happy about this patch; have you considered the security
implications of what you just did?  If you haven't, I'll tell you:
you just made the postmaster's selection of "random" cancel keys and
password salts a lot more predictable.  Formerly, the srandom() seed
for those depended on both the postmaster start time and the time of
the first connection request, but this change removes the first
connection request from the equation.  If you know the postmaster start
time --- which we will happily tell any asker --- it will not take too
many trials to find the seed that's in use.

I'd be the first to agree that this point is inadequately documented
in the code, but PostmasterRandom should be reserved for its existing
security-related uses, not exposed to the world for (ahem) random other
uses.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-09-20 Thread Jeff Janes
On Tue, Sep 20, 2016 at 5:40 AM, Jesper Pedersen  wrote:

> On 09/20/2016 03:19 AM, Michael Paquier wrote:
>
>> You did not get right the comments from Alvaro upthread. The following
>> functions are added with this patch:
>>  function hash_metap(text)
>>  function hash_metap_bytea(bytea)
>>  function hash_page_items(text,integer)
>>  function hash_page_items_bytea(bytea)
>>  function hash_page_stats(text,integer)
>>  function hash_page_stats_bytea(bytea,integer)
>>
>> Now the following set of functions would be sufficient:
>> function hash_metapage_info(bytea)
>> function hash_page_items(bytea)
>> function hash_page_stats(bytea)
>> The last time pageinspect has been updated, when BRIN functions have
>> been added, it has been discussed to just use (bytea) as an argument
>> interface and just rely on get_raw_page() to get the pages wanted, so
>> I think that we had better stick with that and keep things simple.
>>
>>
> Yes, I know, Alvaro and you voted for the bytea methods, and Jeff asked
> for both.
>
> Attached is v3 with only the bytea based methods.
>
> Alvaro, Michael and Jeff - Thanks for the review !
>


Is the 2nd "1" in this call needed?

SELECT * FROM hash_page_stats(get_raw_page('mytab_index', 1), 1)

As far as I can tell, that argument is only used to stuff into the output
field "blkno", it is not used to instruct the interpretation of the raw
page itself.  It doesn't seem worthwhile to have the parameter that only
echos back to the user what the user already put in (twice).  The only
existing funtions which take the blkno argument are those that don't use
the get_raw_page style.

Also, should we document what the single letter values mean in the
hash_page_stats.type column?  It is not obvious that 'i' means bitmap, for
example.

Cheers,

Jeff


Re: [HACKERS] [PATCH] get_home_path: use HOME

2016-09-20 Thread Tom Lane
Rudolf Gavlas  writes:
> The usage of HOME environment variable (if set) is IMO the right,
> standard and faster way to get_home_path().

Can you provide some evidence for that claim?  I can believe "faster"
but the rest sounds like wishful thinking.

> I work in an environment, where servers are administered by people
> with different user names and identical uid (0).

I think what you have there is an incredibly badly-designed system that
can be expected to break outside software (eg, Postgres).  If we take
this patch, what's to stop someone from complaining that we broke *their*
badly-designed system that abuses the HOME variable?  I'm pretty hesitant
to touch code that's worked the same way for a decade or two on such a
basis.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission denied”

2016-09-20 Thread Robert Haas
On Thu, Oct 15, 2015 at 11:32 PM, Amit Kapila  wrote:
> On Thu, Oct 15, 2015 at 8:35 PM, Dmitry Vasilyev 
> wrote:
>>
>> I think that function dsm_impl_windows() with EACCES error should not
>> do ereport() with FATAL level. It works, but it is likely to cause an
>> infinite loop if the user receives an EACCES error.
>>
>
> Currently we are using error level ERROR for creating dsm during
> postmaster startup, which is not right; rather, we should use error
> level LOG.  Can you please try the attached patch and see
> if the issue is fixed for you?
>
> Another somewhat related point is that currently we are using the random()
> function to ensure a unique name for the dsm, and it seems to me that
> it is always going to generate the same number on the first invocation (at least
> that's what is happening on Windows), which is why you are seeing the
> error.

Yeah, random() is the wrong thing.  It should use PostmasterRandom().
Fixed to do that instead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618

2016-09-20 Thread Robert Haas
On Thu, May 26, 2016 at 7:44 PM, Michael Paquier
 wrote:
>> Thanks for reviewing the patch.  I have added the entry for this patch in
>> next CF (https://commitfest.postgresql.org/10/636/),  feel free to mark it
>> as Ready for committer if you think patch is good.
>
> Yeah, it is definitely better to register it. And I have switched the
> patch as ready for committer just now.

Committed and back-patched to 9.4, where DSM was introduced.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] get_home_path: use HOME

2016-09-20 Thread Rudolf Gavlas
Hi,

I work in an environment, where servers are administered by people
with different user names and identical uid (0). The attached patch
fixes a bug exposed in such environments: where the logic of
retrieving a personal configuration file relies solely on
get_home_path(), the different users are forced to share the file of
the first user with a given uid.

The usage of HOME environment variable (if set) is IMO the right,
standard and faster way to get_home_path().

r.
diff --git a/src/port/path.c b/src/port/path.c
index 7bf7cbc..33cb790 100644
--- a/src/port/path.c
+++ b/src/port/path.c
@@ -807,15 +807,24 @@ bool
 get_home_path(char *ret_path)
 {
 #ifndef WIN32
-	char		pwdbuf[BUFSIZ];
-	struct passwd pwdstr;
-	struct passwd *pwd = NULL;
-
-	(void) pqGetpwuid(geteuid(), &pwdstr, pwdbuf, sizeof(pwdbuf), &pwd);
-	if (pwd == NULL)
-		return false;
-	strlcpy(ret_path, pwd->pw_dir, MAXPGPATH);
-	return true;
+	char		*envhome = getenv("HOME");
+	if (envhome != NULL && strlen(envhome) > 0)
+	{
+		strlcpy(ret_path, envhome, MAXPGPATH);
+		return true;
+	}
+	else
+	{
+		char		pwdbuf[BUFSIZ];
+		struct passwd pwdstr;
+		struct passwd *pwd = NULL;
+
+		(void) pqGetpwuid(geteuid(), &pwdstr, pwdbuf, sizeof(pwdbuf), &pwd);
+		if (pwd == NULL)
+			return false;
+		strlcpy(ret_path, pwd->pw_dir, MAXPGPATH);
+		return true;
+	}
 #else
 	char	   *tmppath;
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rename max_parallel_degree?

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 11:27 AM, Peter Eisentraut
 wrote:
> On 9/19/16 2:18 PM, Robert Haas wrote:
>> I suggest that we make it part of bgw_flags, but use a bitmask for it,
>> like this:
>>
>> #define BGWORKER_CLASS_MASK   0x00f0
>> #define BGWORKER_CLASS_PARALLEL  0x0010
>> /* add additional bgworker classes here */
>
> Unless we have a mechanism that makes use of this somehow, this attempt
> at categorizing might be premature.

My main reason for wanting it is that I suspect there will be a similar
desire to limit the number of replication workers at some point.

However, that could be wrong.
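
To make the intent concrete, here is a rough sketch of how such class bits
could be tested; BGWORKER_CLASS_MASK and BGWORKER_CLASS_PARALLEL are the
values proposed above, while BGWORKER_CLASS_REPLICATION and the limit check
itself are purely hypothetical:

#include <stdbool.h>

#define BGWORKER_CLASS_MASK        0x00f0
#define BGWORKER_CLASS_PARALLEL    0x0010
#define BGWORKER_CLASS_REPLICATION 0x0020   /* hypothetical future class */

/* Return true if starting one more worker with these flags would exceed a
 * per-class limit.  Only the class bits are inspected; the other bgw_flags
 * bits are ignored. */
static bool
bgworker_class_limit_exceeded(int bgw_flags, int active_in_class, int class_limit)
{
    if ((bgw_flags & BGWORKER_CLASS_MASK) != BGWORKER_CLASS_REPLICATION)
        return false;           /* no limit applies to other classes */
    return active_in_class >= class_limit;
}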

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Odd system-column handling in postgres_fdw join pushdown patch

2016-09-20 Thread Robert Haas
On Fri, Jul 1, 2016 at 3:10 AM, Etsuro Fujita
 wrote:
> On 2016/04/14 4:57, Robert Haas wrote:
>>
>> 1. For a regular FDW scan, zero the xmin, xmax, and cid of the tuple
>> before returning it from postgres_fdw, so that we don't expose the
>> datum-tuple fields.   I can't see any reason this isn't safe, but I
>> might be missing something.
>
> I noticed odd behavior of this in EvalPlanQual.  Consider:
>
> -- session 1
> postgres=# create extension postgres_fdw;
> CREATE EXTENSION
> postgres=# create server fs foreign data wrapper postgres_fdw options
> (dbname 'postgres');
> CREATE SERVER
> postgres=# create user mapping for public server fs;
> CREATE USER MAPPING
> postgres=# create table t1 (a int, b int);
> CREATE TABLE
> postgres=# create table t2 (a int, b int);
> CREATE TABLE
> postgres=# insert into t1 values (1, 1);
> INSERT 0 1
> postgres=# insert into t2 values (1, 1);
> INSERT 0 1
> postgres=# create foreign table ft1 (a int, b int) server fs options
> (table_name 't1');
> CREATE FOREIGN TABLE
> postgres=# select xmin, xmax, cmin, * from ft1;
>   xmin | xmax | cmin | a | b
> --+--+--+---+---
>  0 |0 |0 | 1 | 1
> (1 row)
>
> -- session 2
> postgres=# begin;
> BEGIN
> postgres=# update t2 set a = a;
> UPDATE 1
>
> -- session 1
> postgres=# select ft1.xmin, ft1.xmax, ft1.cmin, ft1.* from ft1, t2 for
> update;
>
> -- session 2
> postgres=# commit;
> COMMIT
>
> -- session 1 (will show the following)
>   xmin |xmax| cmin  | a | b
> --++---+---+---
>128 | 4294967295 | 16398 | 1 | 1
> (1 row)
>
> The values of xmin, xmax, and cmin are not 0!  The reason for that is that
> we don't zero these values in a test tuple stored by
> EvalPlanQualFetchRowMarks for EvalPlanQual re-evaluations.
>
> That cleanup applies to the file_fdw foreign table case as well, so I think
> xmin, xmax, and cid in tuples from such tables should be set to 0, too.  And
> ISTM that it's better that the core (ie, ForeignNext) supports doing that,
> rather than each FDW doing that work.  That would also minimize the overhead
> because ForeignNext does that if needed.  Please find attached a patch.

If you want this to be considered, you'll need to rebase it and submit
it to the next CommitFest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix Error Message for allocate_recordbuf() Failure

2016-09-20 Thread Robert Haas
On Mon, Jul 11, 2016 at 12:04 AM, Michael Paquier
 wrote:
> On Sat, Jul 9, 2016 at 2:58 AM, Shoaib Lari  wrote:
>> Besides making the error message more informative, we had to modify
>> allocate_recordbuf() to return the actual number of bytes that were being
>> allocated.
>
> -   report_invalid_record(state, "record length %u at %X/%X too long",
> - total_len,
> - (uint32) (RecPtr >> 32), (uint32) RecPtr);
> +   report_invalid_record(state,
> + "cannot allocate %u bytes for record
> length %u at %X/%X",
> + newSizeOut, total_len, (uint32) (RecPtr >> 32),
> + (uint32) RecPtr);
>
> It does not look like a good idea to me to complicate the interface of
> allocate_recordbuf just to make one error message more verbose, ...

For what it's worth, I think it's fine.  Good error messages are a useful thing.

More generally, I think the whole patch looks good and should be committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-20 Thread Peter Eisentraut
On 9/18/16 4:17 PM, Steve Singer wrote:
> When I create my subscriber database by doing a physical backup of the 
> publisher cluster (with cp before I add any data) then I am unable to 
> connect and subscribe.
> ie
> initdb ../data
> cp -r ../data ../data2
> ./postgres -D ../data
> ./postgres -D ../data2
> 
> This makes sense when I look at your code, but it might not be what we want

I think if we want to prevent the creation of subscriptions that point
to self, then we need to create a magic token when the postmaster starts
and check for that when we connect.  So more of a running-instance
identifier instead of an instance-on-disk identifier.

The other option is that we just allow it and make it more robust.
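
Purely as a sketch of that direction (all names here are hypothetical, not an
existing API), the token could be generated once at postmaster start and
compared by the subscription code at connect time:

#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical running-instance identifier, generated once at startup. */
static uint64_t running_instance_token;

static void
init_running_instance_token(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    /* mix start time, microseconds and pid; enough to tell two concurrently
     * running instances apart, not meant to be secret */
    running_instance_token = ((uint64_t) tv.tv_sec << 32) ^
        ((uint64_t) tv.tv_usec << 12) ^ (uint64_t) getpid();
}

/* A remote end that reports our own token back to us points to self. */
static int
connection_points_to_self(uint64_t remote_token)
{
    return remote_token == running_instance_token;
}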

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rename max_parallel_degree?

2016-09-20 Thread Peter Eisentraut
On 9/19/16 2:18 PM, Robert Haas wrote:
> I suggest that we make it part of bgw_flags, but use a bitmask for it,
> like this:
> 
> #define BGWORKER_CLASS_MASK   0x00f0
> #define BGWORKER_CLASS_PARALLEL  0x0010
> /* add additional bgworker classes here */

Unless we have a mechanism that makes use of this somehow, this attempt
at categorizing might be premature.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "Some tests to cover hash_index"

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 8:52 AM, Amit Kapila  wrote:
> I think you have a point, but I am not sure it is worth adding a
> separate file.  It might be tricky to choose which file to add new
> hash index tests to. Does anybody else have an opinion on this point?

I think all the tests should be added to hash_index.sql.  A new file
doesn't seem warranted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-20 Thread Bruce Momjian
On Thu, Sep 15, 2016 at 11:11:41AM +0530, Amit Kapila wrote:
> I think it is possible without breaking pg_upgrade, if we match all
> items of a page at once (and save them as a local copy), rather than
> matching item-by-item as we do now.  We are already doing something similar for
> btree; refer to the explanation of BTScanPosItem and BTScanPosData in
> nbtree.h.

FYI, pg_upgrade has code to easily mark indexes as invalid and create a
script the user can run to recreate the indexes as valid.  I have
received no complaints when this was used.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] more parallel query documentation

2016-09-20 Thread Peter Eisentraut
On 9/19/16 1:22 PM, Robert Haas wrote:
> On Fri, Sep 16, 2016 at 4:28 PM, Alvaro Herrera
>  wrote:
>> I agree it should be added.  I suggest that it could even be added to
>> the 9.6 docs, if you can make it.
> 
> Here's a patch.  I intend to commit this pretty quickly unless
> somebody objects, and also to backpatch it into 9.6.  I'm sure it's
> not perfect, but imperfect documentation is better than no
> documentation.

Looks reasonable to me.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use of SizeOfIptrData - is that obsolete?

2016-09-20 Thread Tom Lane
Pavan Deolasee  writes:
> I happened to notice this comment in src/include/storage/itemptr.h

>  * Note: because there is an item pointer in each tuple header and index
>  * tuple header on disk, it's very important not to waste space with
>  * structure padding bytes.  The struct is designed to be six bytes long
>  * (it contains three int16 fields) but a few compilers will pad it to
>  * eight bytes unless coerced.  We apply appropriate persuasion where
>  * possible, and to cope with unpersuadable compilers, we try to use
>  * "SizeOfIptrData" rather than "sizeof(ItemPointerData)" when computing
>  * on-disk sizes.
>  */

> Is that now obsolete?

Realistically, because struct HeapTupleHeaderData contains a field of
type ItemPointerData, it's probably silly to imagine that we can save
anything if the compiler can't be persuaded to believe that
sizeof(ItemPointerData) is 6.  It may well be that the structure pragmas
work on everything that wouldn't natively believe that anyway.
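
As a standalone illustration of the point (this is not the real itemptr.h
declaration, just three 16-bit fields in a demo struct):

#include <stdint.h>
#include <stdio.h>

typedef struct DemoItemPointer
{
    uint16_t    bi_hi;          /* high half of the block number */
    uint16_t    bi_lo;          /* low half of the block number */
    uint16_t    offsetNumber;   /* item offset within the block */
} DemoItemPointer;

int
main(void)
{
    /* Most compilers print 6 here without any pragma; a compiler that pads
     * the struct to 8 bytes is what the persuasion in itemptr.h targets. */
    printf("sizeof(DemoItemPointer) = %zu\n", sizeof(DemoItemPointer));
    return 0;
}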

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 10:58 AM, Tom Lane  wrote:
> Maybe it would be better to fix the rule against workers
> invoking their own parallel queries.

That rule does have the advantage of preventing us from having one
user backend launch N^2 workers.  I don't think it would be that much
work to fix it, but the results might be pretty exciting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 9:24 AM, Amit Kapila  wrote:
> I think the point here is that for any case where there is a count of rows
> to be selected, we disable parallelism.  There are many genuine cases
> like select count(*) into cnt ... which will run to completion, but as
> plpgsql passes the row count as 1 or 2, it doesn't enter parallel mode.
> There are a couple of other cases like that.  Do you see a reason for not
> enabling parallelism for such cases?

If we can somehow know that the rowcount which is passed is greater
than or equal to the actual number of rows which will be generated,
then it's fine to enable parallelism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-09-20 Thread Robert Haas
On Tue, Sep 20, 2016 at 2:20 AM, Tsunakawa, Takayuki
 wrote:
> There's no apparent evidence to indicate the cause, but I can guess a few
> reasons.  Do you think these are correct, and should we fix PostgreSQL? (I
> think so.)

I think that we shouldn't start changing things based on guesses about
what the problem is, even if they're fairly smart guesses.  The thing
to do would be to construct a test rig, crash the server repeatedly,
and add debugging instrumentation to figure out where the time is
actually going.

I do think your theory about the stats collector might be worth
pursuing.  It seems that the stats collector only responds to SIGQUIT,
ignoring SIGTERM.  Making it do a clean shutdown on SIGTERM and a fast
exit on SIGQUIT seems possibly worthwhile.
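
A rough sketch of the kind of split meant here (illustrative only, not the
actual pgstat.c handlers or names):

#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested = 0;

/* SIGTERM: let the main loop write out the final stats file, then exit. */
static void
handle_sigterm(int signo)
{
    shutdown_requested = 1;
}

/* SIGQUIT: immediate exit, skipping the final stats file write. */
static void
handle_sigquit(int signo)
{
    _exit(2);
}

static void
install_stats_signal_handlers(void)
{
    signal(SIGTERM, handle_sigterm);
    signal(SIGQUIT, handle_sigquit);
}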

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-20 Thread Tom Lane
Amit Kapila  writes:
> On Mon, Sep 19, 2016 at 11:39 PM, Robert Haas  wrote:
>> However, I think the chances of supporting parallel query for queries
>> not executed to completion any time in the near future are very poor.

> I think the point here is that for any case where there is a count of rows
> to be selected, we disable parallelism.  There are many genuine cases
> like select count(*) into cnt ... which will run to completion, but as
> plpgsql passes the row count as 1 or 2, it doesn't enter parallel mode.
> There are a couple of other cases like that.  Do you see a reason for not
> enabling parallelism for such cases?

The other problem that would have to be confronted here is nesting,
ie it would only be OK for a plpgsql function to invoke a parallel
query if it wasn't itself being executed by a parallel worker ---
or maybe even if it's being executed by the leader process but there's
an active Gather somewhere else in the calling query's plan tree.
(Not sure about the implementation's properties for that case.)
We'd have to decide whether we want plancache to track both parallel
and nonparallel plans for plpgsql queries.  Do-able no doubt but
pretty ugly.  Maybe it would be better to fix the rule against workers
invoking their own parallel queries.

However that's probably moot unless the not-executing-to-completion
issue can be solved.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-09-20 Thread Robert Haas
On Mon, Sep 19, 2016 at 9:54 PM, Craig Ringer  wrote:
>> You appear to have attached the wrong patch set.
>
> Whoops, multitasking fail.
>
> Sorry for the late response, hospitals are "fun".

I did some cleanup of 0001 (see attached) and was all set to commit it
when I realized what I think is a problem: holding XidGenLock doesn't
seem to help with the race condition between this function and CLOG
truncation, because vac_truncate_clog() updates the shared memory
limits AFTER performing the truncation.  If the order of those
operations were reversed, we'd be fine, because it would get stuck
trying to update the shared memory limits and wouldn't be able to
truncate until it did - and conversely, if it updated the shared
memory limits before we examined them, that would be OK, too, because
we'd be sure not to consult the pages that are about to be truncated.
As it is, though, I don't see that there's any real interlock here.
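
In other words, something along these lines in vac_truncate_clog() is what I
would expect to close the window (a sketch from memory; the helper names may
not match the source exactly):

/*
 * Sketch only: publish the new shared-memory limits before truncating,
 * so a reader holding XidGenLock either sees the advanced limit or can
 * still read the pages that are about to go away.
 */
SetTransactionIdLimit(frozenXID, oldestxid_datoid);	/* limits first */
TruncateCLOG(frozenXID);								/* then truncate */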

(BTW, the stuff you moved from clog.c to clog.h doesn't actually need
to be moved; one of the changes I made here was to undo that.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


txid-status-rmh.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-20 Thread Stas Kelvich
> 
> On 20 Sep 2016, at 09:40, Michael Paquier  wrote:
> 

Thanks for digging into this.

> +   StartupCLOG();
> +   StartupSUBTRANS(checkPoint.oldestActiveXid);
> +   RecoverPreparedFromFiles();
> [...]
>/*
> -* Startup commit log and subtrans only.  MultiXact and commit
> -* timestamp have already been started up and other SLRUs are not
> -* maintained during recovery and need not be started yet.
> -*/
> -   StartupCLOG();
> -   StartupSUBTRANS(oldestActiveXID);
> Something is definitely wrong in this patch if we begin to do that.

Putting that before actual WAL replay is just following historical order of 
events.
Prepared files are pieces of WAL that happened before checkpoint, so ISTM 
there is no conceptual problem in restoring their state before replay.

Moreover, I think this approach is better than the old one.
There is a kind of circular dependency between StartupSUBTRANS() and restoring
old prepared transactions: StartupSUBTRANS() requires oldestActiveXID, but you
can get that only after recovering the 2PC files, and that recovery requires a
working SUBTRANS.

In the old code that was resolved by two passes through the 2PC files: the
first one finds oldestActiveXid but doesn't restore the transactions' state;
then subtrans is started, and only after that is RecoverPreparedTransactions()
called.  That logic was repeated three times in xlog.c with the help of the
following functions:
PrescanPreparedTransactions() — 3 calls
StandbyRecoverPreparedTransactions() — 2 calls
RecoverPreparedTransactions() — 1 call

Now, since we know that 2PC files are written only at checkpoint, we can boil
all those cases down to one: get oldestActiveXid from the checkpoint and
restore the prepared state before WAL replay.  So there is only one call of
RecoverPreparedFromFiles(), plus one call of
CleanupPreparedTransactionsOnPITR() in the PITR case.  And each of them does
only what its name states, without side effects like advancing nextXid as in
previous versions.

So, to summarise, I think keeping the old interfaces here is the bigger evil,
because in each of the mentioned functions we would need to deal with both the
in-memory and the on-file 2PC state.

> There is no need to move those two calls normally, and I think that we
> had better continue to do that only for hot standbys just to improve
> 2PC handling…

We can simulate the old interface; it's not a big problem.  But that will
complicate the 2PC code and keep useless code in xlog.c, because that code was
written on the assumption that a 2PC file can be created before a checkpoint,
which is no longer true.

> CleanupPreparedTransactionsOnPITR() cleans up pg_twophase, but isn't
> it necessary to do something as well for what is in memory?

Yes, that is necessary.  Thanks, I will fix it, and probably add a prepared
transaction to the PITR test.

> I have been thinking about this patch quite a bit, and the approach taken
> looks really over-complicated to me.  Basically, what we are trying to do
> here is to reuse, as much as possible, the non-recovery code paths during
> recovery, and then try to keep a bunch of things like the SUBTRANS
> structures, CLOG, XID handling and PGXACT in sync so as to handle the 2PC
> information in memory correctly.  I am coming to think that this is *really*
> bug-prone and is really going to make future maintenance hard.

With this patch we are reusing one infrastructure for normal work, recovery
and replay.  I don't think we would gain much reliability by splitting that
into different code paths.

> Taking one step back, and I know that I am a noisy one, but looking at
> this patch is making me aware of the fact that it is trying more and
> more to patch things around the more we dig into issues, and I'd
> suspect that trying to adapt for example sub-transaction and clog
> handling is just the tip of the iceberg and that there are more things
> that need to be taken care of if we continue to move on with this
> approach.

Again, it isn't patching around to fix issues; it's entirely possible to keep
the old interface.  It may be that the current approach is wrong because of
some aspect I didn't think of, but so far I don't see good counterarguments.

> Coming to the point: why don't we simplify things? In short:

> - Just use a static list and allocate a chunk of shared memory as
> needed. DSM could come into play here to adapt to the size of a 2PC
> status file, this way there is no need to allocate a chunk of memory
> with an upper bound. Still we could use a hardcoded upper bound of
> say 2k with max_prepared_transactions, and just write the 2PC status
> file to disk if its size is higher than what can stick into memory.
> - save 2PC information in memory instead of writing it to a 2PC file
> when XLOG_XACT_PREPARE shows up.
> - flush information to disk when there is a valid restart point in
> RecoveryRestartPoint(). We would need as well to tweak
> PrescanPreparedTransactions accordingly than everything we are trying

Re: [HACKERS] Declarative partitioning - another take

2016-09-20 Thread Robert Haas
On Thu, Sep 15, 2016 at 4:53 AM, Amit Langote
 wrote:
> [ new patches ]

Re-reviewing 0001.

+  partexprs
+  pg_node_tree

This documentation doesn't match pg_partitioned_table.h, which has
partexprsrc and partexprbin.  I don't understand why it's a good idea
to have both, and there seem to be no comments or documentation
supporting that choice anywhere.

+  The optional PARTITION BY clause specifies a method of
+  partitioning the table and the corresponding partition key.  Table
+  thus created is called partitioned table.  Key
+  consists of an ordered list of column names and/or expressions when
+  using the RANGE method, whereas only a single column or
+  expression can be specified when using the LIST method.
+  The type of a key column or an expression must have an associated
+  btree operator class or one must be specified along with the column
+  or the expression.

Both of the sentences in this paragraph that do not begin with "the"
need to begin with "the".  (In my experience, it's generally a feature
of English as spoken in India that connecting words like "the" and "a"
are sometimes left out where non-Indian speakers of English would
include them, so it would be good to watch out for this issue in
general.)

Also, I think this should be rephrased a bit to be more clear about
how the partitioning key works, like this: The optional
PARTITION BY clause specifies a method of
partitioning the table.  The table thus created is called a
partitioned table.  The parenthesized list of
expressions forms the partitioning key for the
table.  When using range partitioning, the partitioning key can include
multiple columns or expressions, but for list partitioning, the
partitioning key must consist of a single column or expression.  If no
btree operator class is specified when creating a partitioned table,
the default btree operator class for the datatype will be used.  If
there is none, an error will be reported.

+case RELKIND_PARTITIONED_TABLE:
 options = heap_reloptions(classForm->relkind, datum, false);

Why?  None of the reloptions that pertain to heap seem relevant to a
relkind without storage.

But, ah, do partitioned tables have storage?  I mean, do we end up
with an empty file, or no relfilenode at all?  Can I CLUSTER, VACUUM,
etc. a partitioned table?  It would seem cleaner for the parent to
have no relfilenode at all, but I'm guessing we might need some more
changes for that to work out.

+pg_collation.h pg_range.h pg_transform.h pg_partitioned_table.h\

Whitespace.  Also, here and elsewhere, how about using alphabetical
order, or anyway preserving it insofar as the existing list is
alphabetized?

+/* Remove NO INHERIT flag if rel is a partitioned table */
+if (is_no_inherit &&
+rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ereport(ERROR,
+(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ errmsg("cannot add NO INHERIT constraint to
partitioned table \"%s\"",
+ RelationGetRelationName(rel))));

The code and the comment disagree.  I think the code is right and the
comment should be adjusted to say something like /* Partitioned tables
do not have storage, so a NO INHERIT constraint makes no sense. */

+ * IDENTIFICATION
+ *src/backend/utils/misc/partition.c

Wrong.

+} KeyTypeCollInfo;

I don't like this type name very much.  Can we get "Part" in there someplace?

It doesn't seem to be very well-designed, either.  The number of
entries in each array is determined by the partnatts flag in
PartitionKeyData, which has also got various other arrays whose
lengths are determined by partnatts.  Why do we have some arrays in
one structure and some arrays in another structure?  Would it hurt
anything to merge everything into one structure?  Or could
PartitionKeyData include a field of type KeyTypeCollInfo rather than
KeyTypeCollInfo *, saving one pointer reference every place we access
this data?
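
For instance, something of this shape seems simpler (illustrative only; the
field names are approximations of the patch's, not a concrete proposal):

typedef struct PartitionKeyData
{
	char		strategy;		/* list or range partitioning */
	int16		partnatts;		/* number of key columns/expressions */

	AttrNumber *partattrs;		/* attribute numbers, 0 for an expression */
	List	   *partexprs;		/* expression trees */
	Oid		   *partopfamily;	/* btree operator families */
	Oid		   *partopcintype;	/* opclass declared input types */

	/* formerly KeyTypeCollInfo, one entry per key column: */
	Oid		   *parttypid;		/* key column type OIDs */
	int32	   *parttypmod;		/* key column typmods */
	Oid		   *parttypcoll;	/* key column collations */
} PartitionKeyData;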

+/* Allocate in the supposedly short-lived working context */

Why supposedly?

+datum = fastgetattr(tuple, Anum_pg_partitioned_table_partattrs,
+RelationGetDescr(catalog),
+&isnull);

Isn't the point of putting the fixed-length fields first that we can
use GETSTRUCT() here?  And even for partattrs as the first
variable-length thing?
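
That is, I would expect something like the following to work for the leading
fixed-length fields (a sketch; the Form_* and Anum_* names here are guesses at
what the catalog header calls them):

Form_pg_partitioned_table form;
int16		partnatts;
Datum		exprs_datum;
bool		isnull;

form = (Form_pg_partitioned_table) GETSTRUCT(tuple);
/* fixed-length leading fields come straight from the struct */
partnatts = form->partnatts;

/* only the variable-length trailing fields need the slow path */
exprs_datum = fastgetattr(tuple, Anum_pg_partitioned_table_partexprs,
						  RelationGetDescr(catalog), &isnull);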

+/*
+ * Run the expressions through eval_const_expressions. This is
+ * not just an optimization, but is necessary, because eventually
+ * the planner will be comparing them to similarly-processed qual
+ * clauses, and may fail to detect valid matches without this.
+ * We don't bother with canonicalize_qual, however.
+ */

I'm a bit confused by this, because I would think this processing
ought to have been done before storing anything in the system
catalogs.  I don't see why it should be necessary 

Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-20 Thread Amit Kapila
On Mon, Sep 19, 2016 at 11:39 PM, Robert Haas  wrote:
> On Sat, Sep 17, 2016 at 11:54 PM, Amit Kapila  wrote:
>> In general, I think we should support the cases as required (or
>> written) by you from plpgsql or SQL functions.  We need more work to
>> support such cases.  There are probably two ways of supporting them:
>> we can build some intelligence into plpgsql execution such that it can
>> recognise such queries and allow them to use parallelism, or we need
>> to think about enabling parallelism for cases where we don't run the
>> plan to completion.  Most of the use cases from plpgsql or SQL
>> functions fall into the latter category, as they don't generally run
>> the plan to completion.
>
> I think there's certainly more work that could be done to teach
> PL/pgsql about cases where the query will run to completion.  I didn't
> work very hard to make sure we covered all of those; there are
> probably several different cases where parallelism could be safely
> enabled but currently isn't.  Also, I didn't do anything at all to
> update the other PLs, and that would be good, too.
>
> However, I think the chances of supporting parallel query for queries
> not executed to completion any time in the near future are very poor.
>

I think the point here is that for any case where there is a count of rows
to be selected, we disable parallelism.  There are many genuine cases, like
select count(*) into cnt ..., which will run to completion, but because
plpgsql passes a row count of 1 or 2, they don't enter parallel mode.  There
are a couple of other cases like that.  Do you see a reason for not enabling
parallelism for such cases?
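
For reference, the behaviour I am describing comes from a check of roughly
this shape on the executor side (a simplified paraphrase, not the exact
source): parallel mode is used only when the caller asks for the whole result
set, and plpgsql's SELECT ... INTO passes a small row count so that it can
detect "query returned more than one row".

/* simplified paraphrase of the executor-side decision, not the real code */
static bool
can_use_parallel_mode(PlannedStmt *stmt, uint64 count)
{
	/* count == 0 means "fetch everything", i.e. run to completion */
	return stmt->parallelModeNeeded && count == 0;
}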

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "Some tests to cover hash_index"

2016-09-20 Thread Amit Kapila
On Mon, Sep 19, 2016 at 8:44 PM, Mithun Cy  wrote:
> On Sat, Aug 6, 2016 at 9:41 AM, Amit Kapila  wrote:
>> I wonder why you have included a new file for these tests, why can't be
>> these added to existing hash_index.sql.
> The tests in hash_index.sql did not cover overflow pages; the above tests
> were mainly for them.
>

The name of the file, hash_index_split, suggests it focuses on splits.  Are
the tests focused more on overflow pages or on splits of the hash index?

> So I thought having a separate test file could help with enabling/disabling
> them in schedule files when we do not want them running, as they take
> slightly more time.  If you think otherwise, I will reconsider and add the
> tests to hash_index.sql.
>

I think you have a point, but I am not sure it is worth adding a separate
file.  It might be tricky to choose which file to add new hash index tests
to.  Does anybody else have an opinion on this point?

Can you check how much time it takes compared to the btree or brin index
tests?

I am facing below diff with your new patch.

***
*** 1,4 
! --
  -- Cause some overflow insert and splits.
  --
  CREATE TABLE hash_split_heap (keycol INT);
--- 1,4 
! --
  -- Cause some overflow insert and splits.
  --
  CREATE TABLE hash_split_heap (keycol INT);

==

There is an extra space in the expected file, which is leading to the above failure.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-09-20 Thread Jesper Pedersen

On 09/20/2016 03:19 AM, Michael Paquier wrote:

You did not get Alvaro's comments upthread right. The following
functions are added with this patch:
 function hash_metap(text)
 function hash_metap_bytea(bytea)
 function hash_page_items(text,integer)
 function hash_page_items_bytea(bytea)
 function hash_page_stats(text,integer)
 function hash_page_stats_bytea(bytea,integer)

Now the following set of functions would be sufficient:
function hash_metapage_info(bytea)
function hash_page_items(bytea)
function hash_page_stats(bytea)
The last time pageinspect has been updated, when BRIN functions have
been added, it has been discussed to just use (bytea) as an argument
interface and just rely on get_raw_page() to get the pages wanted, so
I think that we had better stick with that and keep things simple.



Yes, I know, Alvaro and you voted for the bytea methods, and Jeff asked 
for both.


Attached is v3 with only the bytea-based methods.

Alvaro, Michael and Jeff - thanks for the review!

Best regards,
 Jesper


From 0aff82ccb40f00efe9e48cacef9c8a45c1327da2 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 5 Aug 2016 10:16:32 -0400
Subject: [PATCH] pageinspect: Hash index support

---
 contrib/pageinspect/Makefile  |  10 +-
 contrib/pageinspect/hashfuncs.c   | 408 ++
 contrib/pageinspect/pageinspect--1.5--1.6.sql |  64 
 contrib/pageinspect/pageinspect--1.5.sql  | 279 --
 contrib/pageinspect/pageinspect--1.6.sql  | 339 +
 contrib/pageinspect/pageinspect.control   |   2 +-
 doc/src/sgml/pageinspect.sgml | 100 +++
 7 files changed, 917 insertions(+), 285 deletions(-)
 create mode 100644 contrib/pageinspect/hashfuncs.c
 create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql
 delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql
 create mode 100644 contrib/pageinspect/pageinspect--1.6.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index a98237e..c73d728 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -2,13 +2,13 @@
 
 MODULE_big	= pageinspect
 OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
-		  brinfuncs.o ginfuncs.o $(WIN32RES)
+		  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \
-	pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \
-	pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \
-	pageinspect--unpackaged--1.0.sql
+DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \
+	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
+	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
+	pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
 ifdef USE_PGXS
diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c
new file mode 100644
index 000..4d2cd2a
--- /dev/null
+++ b/contrib/pageinspect/hashfuncs.c
@@ -0,0 +1,408 @@
+/*
+ * hashfuncs.c
+ *		Functions to investigate the content of HASH indexes
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pageinspect/hashfuncs.c
+ */
+
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+
+PG_FUNCTION_INFO_V1(hash_metap);
+PG_FUNCTION_INFO_V1(hash_page_items);
+PG_FUNCTION_INFO_V1(hash_page_stats);
+
+/* ----------------
+ * structure for single hash page statistics
+ * ----------------
+ */
+ */
+typedef struct HashPageStat
+{
+	uint32		blkno;
+	uint32		live_items;
+	uint32		dead_items;
+	uint32		page_size;
+	uint32		max_avail;
+	uint32		free_size;
+	uint32		avg_item_size;
+	char		type;
+
+	/* opaque data */
+	BlockNumber hasho_prevblkno;
+	BlockNumber hasho_nextblkno;
+	Bucket		hasho_bucket;
+	uint16		hasho_flag;
+	uint16		hasho_page_id;
+} HashPageStat;
+
+
+/*
+ * Verify that the given bytea contains a HASH page, or die in the attempt.
+ * A pointer to the page is returned.
+ */
+static Page
+verify_hash_page(bytea *raw_page)
+{
+	Page		page;
+	int			raw_page_size;
+	HashPageOpaque pageopaque;
+
+	raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+	if (raw_page_size != BLCKSZ)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d",
+		   BLCKSZ, raw_page_size)));
+
+	page = VARDATA(raw_page);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	if (pageopaque->hasho_page_id != HASHO_PAGE_ID)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a HASH page"),
+ errdetail("Expected %08x, got %08x.",
+		   HASHO_PAGE_ID, pageopaque->hasho_page_id)));
+
+	return page;
+}
+
+/* -
+ 

Re: [HACKERS] kqueue

2016-09-20 Thread Matteo Beccati

Hi,

On 16/09/2016 05:11, Thomas Munro wrote:

Still no change measurable on my laptop.  Keith, would you be able to
test this on your rig and see if it sucks any less than the last one?


I've tested kqueue-v6.patch on the Celeron NetBSD machine, and the numbers
were consistently lower by about 5-10% vs a fairly recent HEAD (same as in my
last pgbench runs).



Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-20 Thread Alex Ignatov


On 18.09.2016 06:54, Amit Kapila wrote:

On Fri, Sep 16, 2016 at 8:48 PM, Alex Ignatov  wrote:


On 16.09.2016 16:50, Amit Kapila wrote:



Can you try by setting force_parallel_mode = off;?  I think it is
sending the whole function execution to worker due to
force_parallel_mode.




No changes:



Okay, it just slipped my mind that we don't support parallel queries for SQL
statement execution (or statements executed via exec_stmt_execsql) from
plpgsql.  For a detailed explanation of why that is not feasible, you can
refer to one of my earlier e-mails [1] on a similar topic.  I think if we can
somehow get the results via a PERFORM statement, then it could be possible to
use parallelism via plpgsql.

However, you can use it via SQL functions; an example is below:

set min_parallel_relation_size =0;
set parallel_tuple_cost=0;
set parallel_setup_cost=0;

Load 'auto_explain';
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;
set auto_explain.log_nested_statements = true;

create table test_plpgsql(c1 int, c2 char(1000));
insert into test_plpgsql values(generate_series(1,10),'aaa');

create or replace function parallel_test_set_sql() returns
setof bigint as $$
select count(*) from test_plpgsql;
$$language sql PARALLEL SAFE STRICT STABLE;

Then execute function as: select * from parallel_test_set_sql();  You
can see below plan if auto_explain module is loaded.

Finalize Aggregate  (cost=14806.85..14806.86 rows=1 width=8) (actual tim
e=1094.966..1094.967 rows=1 loops=1)
  ->  Gather  (cost=14806.83..14806.84 rows=2 width=8) (actual time=472.
216..1094.943 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
->  Partial Aggregate  (cost=14806.83..14806.84 rows=1 width=8)
(actual time=177.867..177.868 rows=1 loops=3)
  ->  Parallel Seq Scan on test_plpgsql  (cost=0.00..14702.6
7 rows=41667 width=0) (actual time=0.384..142.565 rows=3 loops=3)
CONTEXT:  SQL function "parallel_test_set_sql" statement 1
LOG:  duration: 2965.040 ms  plan:
Query Text: select * from parallel_test_set_sql();
Function Scan on parallel_test_set_sql  (cost=0.25..10.25 rows=1000 widt
h=8) (actual time=2538.620..2776.955 rows=1 loops=1)


In general, I think we should support the cases as required (or written) by
you from plpgsql or SQL functions.  We need more work to support such cases.
There are probably two ways of supporting them: we can build some
intelligence into plpgsql execution such that it can recognise such queries
and allow them to use parallelism, or we need to think about enabling
parallelism for cases where we don't run the plan to completion.  Most of the
use cases from plpgsql or SQL functions fall into the latter category, as
they don't generally run the plan to completion.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1K8kaO_jRk42-o2rmhSRbKV-3mR%2BiNVcONLdbcSXW5TfQ%40mail.gmail.com



Thank you for your suggestion! That works.

But what can we do with this function:
create or replace function parallel_test_sql(t int) returns setof bigint as
$$
   select count(*) from (select a,b,c,d,e,sum(bytes) from test where a>= $1 
group by a,b,c,d,e)t;
$$ language sql PARALLEL SAFE STRICT STABLE;

explain (analyze,buffers) select  * from  parallel_test_sql(2);

"Function Scan on parallel_test_sql  (cost=0.25..10.25 rows=1000 width=8) (actual 
time=2410.789..2410.790 rows=1 loops=1)"
"  Buffers: shared hit=63696"
"Planning time: 0.082 ms"
"Execution time: 2410.841 ms"

2016-09-20 14:09:04 MSK [13037]: [75-1] user=ipdr,db=ipdr,app=pgAdmin III - 
Query Tool,client=127.0.0.1 LOG:  duration: 2410.135 ms  plan:
Query Text:
   select count(*) from (select a,b,c,d,e,sum(bytes) from test where 
a>= $1 group by a,b,c,d,e)t;

Aggregate  (cost=230701.42..230701.43 rows=1 width=8)
  ->  HashAggregate  (cost=230363.59..230513.74 rows=15015 width=28)
Group Key: test.a, test.b, test.c, test.d, test.e
->  Seq Scan on test  (cost=0.00..188696.44 rows=372 
width=20)
  Filter: (a >= $1)


No parallelism again.  It looks like the Filter: (a >= $1) breaks parallelism.



Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error running custom plugin: “output plugins have to declare the _PG_output_plugin_init symbol”

2016-09-20 Thread valeriof
Hi Ashutosh,
Thank you for your answer.  In the end I realized that the PGDLLEXPORT
keyword was missing from the function definitions.
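
For anyone who hits the same thing, this is the shape of what fixed it for me
(a minimal example with placeholder stub callbacks, not my real plugin code):

#include "postgres.h"
#include "replication/output_plugin.h"
#include "replication/logical.h"

PG_MODULE_MAGIC;

/* the missing piece: export the entry point explicitly for MSVC builds */
extern PGDLLEXPORT void _PG_output_plugin_init(OutputPluginCallbacks *cb);

static void
stub_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
}

static void
stub_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
			Relation relation, ReorderBufferChange *change)
{
}

static void
stub_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
			XLogRecPtr commit_lsn)
{
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->begin_cb = stub_begin;
	cb->change_cb = stub_change;
	cb->commit_cb = stub_commit;
}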

As a side question, what are the options for debugging the plugin while it's
executing?  I've seen a debug plugin for Postgres, but it seems to be more for
SQL functions and stored procedures.  Is it possible to attach to the process
from the Visual Studio debugger?

Thanks,
Valerio



--
View this message in context: 
http://postgresql.nabble.com/Error-running-custom-plugin-output-plugins-have-to-declare-the-PG-output-plugin-init-symbol-tp5921145p5921898.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-09-20 Thread Amit Kapila
On Tue, Sep 20, 2016 at 10:51 AM, Amit Kapila  wrote:
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
>  wrote:
>> 28.08.2016 09:13, Amit Kapila:
>>
>> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
>>  wrote:
>>
>>
>> So the patch is correct.
>> We can go further and remove this index_truncate_tuple() call, because
>> the first key of any inner (or root) page doesn't need any key at all.
>>
>
> Anyway, I think truncation happens if the page is at leaf level and
> that is ensured by check, so I think we can't remove this:
> + if (indnkeyatts != indnatts && P_ISLEAF(pageop))
>
>
> -- I have one more question regarding this truncate high-key concept.
> I think if the high key is truncated, then during insertion, for cases
> like the one below it moves to the next page, whereas the current page
> needs to be split.
>
> Assume index on c1,c2,c3 and c2,c3 are including columns.
>
> Actual high key on leaf Page X -
> 3, 2 , 2
> Truncated high key on leaf Page X
> 3
>
> New insertion key
> 3, 1, 2
>
> Now, I think for such cases during insertion, if page X doesn't have
> enough space, it will move to the next page, whereas ideally it should
> split the current page.  Refer to _bt_findinsertloc() for this logic.
>

Basically, what I wanted to know here is whether we maintain ordering for
keys with respect to the included columns while storing them (in the above
example, do we ensure that 3,1,2 is always stored before 3,2,2)?

>
>
> -- I am getting Assertion failure when I use this patch with database
> created with a build before this patch.  However, if I create a fresh
> database it works fine.  Assertion failure details are as below:
>

I have tried this test on my Windows machine only.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-09-20 Thread Heikki Linnakangas

On 03/17/2016 01:43 AM, Peter Geoghegan wrote:

I attach a revision, that makes all the changes that Heikki suggested,
except one. As already noted several times, following this suggestion
would have added a bug. Alvaro preferred my original approach here in
any case. I refer to my original approach of making the new
UNIQUE_CHECK_SPECULATIVE case minimally different from the existing
UNIQUE_CHECK_PARTIAL case currently used for deferred unique
constraints and speculative insertion, as opposed to making the new
UNIQUE_CHECK_SPECULATIVE "like CHECK_UNIQUE_YES, but return FALSE
instead of throwing an error on conflict". That was broken because
CHECK_UNIQUE_YES waits for the outcome of an xact, which
UNIQUE_CHECK_PARTIAL never does, and so UNIQUE_CHECK_SPECULATIVE must
never do.

Any and all waits happen in the first phase of speculative insertion,
and never the seconds. I could give a complicated explanation for why,
involving a deadlock scenario, but a simple explanation will do: it
has always worked that way, and was tested to work that way.

Feedback from Heikki led to these changes for this revision:

* The use of arguments within ExecInsert() was simplified.

* More concise AM documentation.

The docs essentially describe two new concepts:

- What unique index insertion needs to know about speculative
insertion in general. This doesn't just apply to speculative inserters
themselves, of course.

- What speculative insertion is. Why it exists (why we don't just wait
on xact). In other words, what "unprincipled deadlocks" are, and how
they are avoided (from a relatively high level).

I feel like I have a responsibility to make sure that this mechanism
is well documented, especially given that not that many people were
involved in its design. It's possible that no more than the 3 original
authors of UPSERT fully understand speculative insertion -- it's easy
to miss some of the subtleties.


Thanks for working on this, and sorry for disappearing and dropping this 
on the floor earlier. This doesn't apply anymore, thanks to 9c810a2e. 
Shouldn't be hard to fix.


I was in support of this earlier in general, even though I had some 
questions on the details. But now that I look at the patch again, I'm 
not so sure anymore. I don't think this actually makes things clearer. 
It adds new cases that the code needs to deal with. Index AM writers now 
have to care about the difference between a UNIQUE_CHECK_PARTIAL and 
UNIQUE_CHECK_SPECULATIVE. You can have the exact same implementation for 
both, but at the very least the index AM author needs to read the 
paragraph you added to the docs to understand the difference. That adds 
some cognitive load. Likewise, in ExecInsertIndexTuples(), this makes 
the deferred-index case work slightly differently from speculative 
insertion. It's not a big difference, but it again forces you to keep 
one more scenario in mind, when reading the code.


So overall, I think we should just drop this. Maybe a comment somewhere 
would be in order, to point out that ExecInsertIndexTuples() and 
index_insert might perform some unnecessary work, by inserting index 
tuples for a doomed heap tuple, if a speculative insertion fails. But 
that's all.
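
Maybe something as simple as this, near the speculative-insertion call of
ExecInsertIndexTuples() (the wording is only a suggestion):

/*
 * Note that if the speculative insertion ultimately fails because another
 * session inserted the conflicting key first, ExecInsertIndexTuples() and
 * index_insert() may already have done some unnecessary work: index
 * entries inserted for the doomed heap tuple are simply left behind, to
 * be cleaned up later like those of any other dead tuple.
 */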


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IF (NOT) EXISTS in psql-completion

2016-09-20 Thread Michael Paquier
On Mon, Sep 19, 2016 at 6:11 PM, Pavel Stehule  wrote:
> I am thinking the commit's description should be inside the README

Horiguchi-san, your patch has some whitespace issues; you may want to
run git diff --check.  Here are some things I have spotted:
src/bin/psql/tab-complete.c:1074: trailing whitespace.
+"MATERIALIZED VIEW",
src/bin/psql/tab-complete.c:2621: trailing whitespace.
+   COMPLETE_WITH_QUERY(Query_for_list_of_roles,

This set of patches moves psql tab completion into better shape, particularly
0001, which removes the legendary huge if/else if chain and just makes the
routine return immediately on a keyword match.  Things could be shortened a
bit more by, for example, not refactoring the tab macros, because they are
only needed in tab-complete.c.  The other patches introduce further
improvements to the existing infrastructure, but that's a lot of machinery
just for adding IF [NOT] EXISTS, to be honest.

Testing a bit, I have noticed that, for example, after typing "create table
if", attempting tab completion does not bring up "not exists".  I suspect
that the other commands fail at that as well.
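
For example, I would expect a rule of roughly this shape to fire there
(paraphrasing the tab-complete.c idiom, not quoting the patch):

/* sketch of the kind of rule I would expect to match at that point */
else if (TailMatches3("CREATE", "TABLE", "IF"))
	COMPLETE_WITH_CONST("NOT EXISTS");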
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Use of SizeOfIptrData - is that obsolete?

2016-09-20 Thread Pavan Deolasee
Hi,

I happened to notice this comment in src/include/storage/itemptr.h

 * Note: because there is an item pointer in each tuple header and index
 * tuple header on disk, it's very important not to waste space with
 * structure padding bytes.  The struct is designed to be six bytes long
 * (it contains three int16 fields) but a few compilers will pad it to
 * eight bytes unless coerced.  We apply appropriate persuasion where
 * possible, and to cope with unpersuadable compilers, we try to use
 * "SizeOfIptrData" rather than "sizeof(ItemPointerData)" when computing
 * on-disk sizes.
 */

Is that now obsolete?  I mean, I can still find one reference in
src/backend/executor/nodeTidscan.c, which uses SizeOfIptrData instead of
sizeof(ItemPointerData), but all other places, such as heapam_xlog.h, now use
the sizeof operator.  That was changed in 2c03216d831160b, when we moved a
bunch of xlog-related stuff from htup.h to this new file.  It is hard to tell
whether there were other users before that and whether they were all dropped
in this one commit or in various commits leading up to it.
Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pageinspect: Hash index support

2016-09-20 Thread Michael Paquier
On Tue, Sep 20, 2016 at 1:11 AM, Jesper Pedersen
 wrote:
> This version adds the overloaded get_raw_page based methods. However, I'm
> not verifying the page other than page_id, as hasho_flag can contain
> multiple flags in Amit's patches. And, I don't see having different page ids
> as a benefit - at least not part of this patch.
>
> We should probably just have one of the method sets, so I'll send a v3 with
> the set voted for.
>
> I kept the 'data' field as is, for now.

$ git diff master --check
contrib/pageinspect/hashfuncs.c:758: space before tab in indent.
+  values);
It's always good to run that to check the format of a patch before sending it.

You did not get Alvaro's comments upthread right. The following
functions are added with this patch:
 function hash_metap(text)
 function hash_metap_bytea(bytea)
 function hash_page_items(text,integer)
 function hash_page_items_bytea(bytea)
 function hash_page_stats(text,integer)
 function hash_page_stats_bytea(bytea,integer)

Now the following set of functions would be sufficient:
function hash_metapage_info(bytea)
function hash_page_items(bytea)
function hash_page_stats(bytea)
The last time pageinspect has been updated, when BRIN functions have
been added, it has been discussed to just use (bytea) as an argument
interface and just rely on get_raw_page() to get the pages wanted, so
I think that we had better stick with that and keep things simple.

Note: the patch checks if a superuser is calling the new functions,
which is a good thing.

I am switching the patch back to "waiting on author".  As far as I can see,
the patch is in rather good shape; you just need to adapt the docs and shave
off the parts that are not needed for the final result.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-20 Thread Michael Paquier
On Sat, Sep 17, 2016 at 2:45 AM, Stas Kelvich  wrote:
> Looking through the old version I've noted a few things that were bothering me:
>
> * In case of crash replay PREPARE redo accesses SUBTRANS, but 
> StartupSUBTRANS() isn’t called yet
> in StartupXLOG().
> * Several functions in twophase.c have to walk through both PGPROC and 
> pg_twophase directory.
>
> Now I have slightly changed the order of things in StartupXLOG() so that the
> twophase state is loaded from file and StartupSUBTRANS() is called before
> actual recovery starts. So in all other functions we can be sure that the
> files were already loaded into memory.
>
> Also, since 2PC transactions are now dumped to files only at checkpoint, we
> can get rid of PrescanPreparedTransactions() — all the necessary info can
> reside in the checkpoint itself. I’ve changed the behaviour of the
> oldestActiveXid write in the checkpoint, and that’s probably a discussable
> topic, but ISTM that it simplifies the recovery logic a lot in both
> twophase.c and xlog.c. More comments on that are in the patch itself.

Finally I am back on this one..

First, be more careful about typos in the comments of your code.  Just
reading through the patch I found quite a few places that need to be fixed;
"cosequent", "unavalable" and "SUBTRANCE" are some examples among many
others.

+   StartupCLOG();
+   StartupSUBTRANS(checkPoint.oldestActiveXid);
+   RecoverPreparedFromFiles();
[...]
/*
-* Startup commit log and subtrans only.  MultiXact and commit
-* timestamp have already been started up and other SLRUs are not
-* maintained during recovery and need not be started yet.
-*/
-   StartupCLOG();
-   StartupSUBTRANS(oldestActiveXID);
Something is definitely wrong in this patch if we begin to do that.
There is no need to move those two calls normally, and I think that we
had better continue to do that only for hot standbys just to improve
2PC handling...

CleanupPreparedTransactionsOnPITR() cleans up pg_twophase, but isn't
it necessary to do something as well for what is in memory?

I have been thinking about this patch quite a bit, and the approach taken
looks really over-complicated to me.  Basically, what we are trying to do
here is to reuse, as much as possible, the non-recovery code paths during
recovery, and then try to keep a bunch of things like the SUBTRANS
structures, CLOG, XID handling and PGXACT in sync so as to handle the 2PC
information in memory correctly.  I am coming to think that this is *really*
bug-prone and is really going to make future maintenance hard.

Taking one step back, and I know that I am a noisy one, but looking at
this patch is making me aware of the fact that it is trying more and
more to patch things around the more we dig into issues, and I'd
suspect that trying to adapt for example sub-transaction and clog
handling is just the tip of the iceberg and that there are more things
that need to be taken care of if we continue to move on with this
approach. Coming to the point: why don't we simplify things? In short:
- Just use a static list and allocate a chunk of shared memory as
needed. DSM could come into play here to adapt to the size of a 2PC
status file, this way there is no need to allocate a chunk of memory
with an upper bound. Still we could use a hardcoded upper bound of
say 2k with max_prepared_transactions, and just write the 2PC status
file to disk if its size is higher than what can stick into memory.
- save 2PC information in memory instead of writing it to a 2PC file
when XLOG_XACT_PREPARE shows up.
- flush information to disk when there is a valid restart point in
RecoveryRestartPoint(). We would also need to tweak
PrescanPreparedTransactions() accordingly, rather than everything we are
trying to do here.
That would be far simpler than what's proposed here, and we'd
still keep all the performance benefits.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-09-20 Thread Tsunakawa, Takayuki
Hello,

Please let me ask you about possible causes of a certain problem, slow
shutdown of the postmaster when a backend crashes, and whether PostgreSQL
should be fixed.

Our customer is using 64-bit PostgreSQL 9.2.8 on RHEL 6.4.  Yes, the PostgreSQL 
version is rather old but there's no relevant bug fix in later 9.2.x releases.


PROBLEM
==

One backend process (postgres) for an application session crashed due to a
segmentation fault and dumped a core file.  The cause is a bug in
pg_dbms_stats.  Note also that restart_after_crash is off, to make failover
happen.

The problem here is that postmaster took as long as 15 seconds to terminate 
after it had detected a crashed backend.  The messages were output as follows:

At 20:12:35.004:
LOG:  server process (PID 31894) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: DELETE...(snip)
LOG:  terminating any other active server processes

From 20:12:35.013 to 20:12:39.074, the following message was output 80 times.

FATAL:  the database system is in recovery mode

At 20:12:50, the custom monitoring system detected the death of postmaster as
a result of running "pg_ctl status".

That's it.  You may say the following message should also have been emitted,
but it was not.  This is because we commented out the ereport() call in
quickdie() in tcop/postgres.c.  That ereport() call can hang depending on the
timing, which is fixed in 9.4.

WARNING:  terminating connection because of crash of another server process

The customer insists that PostgreSQL takes longer to shut down than expected, 
which risks exceeding their allowed failover time.


CAUSE
==

There's no apparent evidence to indicate the cause, but I could guess a few
reasons.  Do you think these are correct and that we should fix PostgreSQL?
(I think so.)

1) postmaster should close the listening ports earlier
As cited above, for 4 seconds postmaster created 80 dead-end child processes
which just output "FATAL:  the database system is in recovery mode".  This
indicates that postmaster is busy handling re-connection requests from
disconnected applications, preventing it from reaping dead children as fast as
possible.  This is a waste, because postmaster is only going to shut down
anyway.

I think the listening ports should be closed in HandleChildCrash() when the
condition "(RecoveryError || !restart_after_crash)" is true.


2) make stats collector terminate immediately
The stats collector seems to write the permanent stats file even when it
receives SIGQUIT.  But that's useless, because the stats file is reset during
recovery.  And Tom claimed that writing the stats file can take a long time:

https://www.postgresql.org/message-id/11800.1455135...@sss.pgh.pa.us


3) Anything else?
While postmaster is in the PM_WAIT_DEAD_END state, it leaves the listening
ports open but doesn't call select()/accept().  As a result, incoming
connection requests accumulate in the sockets' listen queues.  Could the OS
have a bug that slows process termination when the listen queue is not empty?


Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Calculation of param_source_rels in add_paths_to_joinrel

2016-09-20 Thread Ashutosh Bapat
Hi,
There's code in add_paths_to_joinrel() which computes the set of target
relations that should overlap the parameterization of any proposed join
path.

 120 /*
 121  * Decide whether it's sensible to generate parameterized
paths for this
 122  * joinrel, and if so, which relations such paths should
require.  There
 123  * is usually no need to create a parameterized result path
unless there
 124  * is a join order restriction that prevents joining one of
our input rels
 125  * directly to the parameter source rel instead of joining to the other
 126  * input rel.  (But see allow_star_schema_join().)  This restriction
 127  * reduces the number of parameterized paths we have to deal with at
 128  * higher join levels, without compromising the quality of
the resulting
 129  * plan.  We express the restriction as a Relids set that
must overlap the
 130  * parameterization of any proposed join path.
 131  */

The calculation that follows is based on joinrel->relids (the base rels
covered by the join) and the SpecialJoinInfo list in PlannerInfo.  It is not
based on the specific combination of relations being joined or on the paths
being generated.  We should probably do this computation once, store the
result in the joinrel, and use it multiple times.  That way we can avoid
computing the same set again and again for every pair of joining relations
and their order.  Is there any reason why we don't do this?
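
A hypothetical sketch of the shape I have in mind (the field name and helper
are mine, standing in for the block of code moved out of
add_paths_to_joinrel()):

/* in build_join_rel(), once joinrel->relids is known: */
joinrel->param_source_rels = compute_param_source_rels(root, joinrel);

/* in add_paths_to_joinrel(), instead of recomputing for every pair: */
extra.param_source_rels = joinrel->param_source_rels;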

The attached patch moves this code to build_join_rel() and uses the result in
add_paths_to_joinrel().  make check-world does not show any failures.

If this change is acceptable, we might actually remove
param_source_rels from JoinPathExtraData and directly access it from
joinrel in try_nestloop_path(), try_mergejoin_path() and
try_hashjoin_path().

Also, the way this code has been written, the declaration of the variable
sjinfo masks an earlier declaration with the same name.  I am not sure whether
that's intentional, but maybe we should use another variable name for the
inner sjinfo.  I have not included that change in the patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


pg_param_source_rels.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers