Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello Aidan,


If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?


I tried that by setting:
  vm.dirty_expire_centisecs = 100
  vm.dirty_writeback_centisecs = 100

So the kernel should start writing dirty buffers at most 2s after they are 
dirtied, if I understood the doc correctly, instead of at most 35s.


The result is that with a 5000s 25tps pretty small load (the system can do 
300tps with the default configuration), I lost 2091 (1.7%) of the 
transactions, that is, they were beyond the 200ms schedule limit. Another 
change is that overall the lost transactions are more spread out than without 
this setting, although there are still stretches of unresponsiveness.


So although the situation is significantly better, it is still far from 
good with the much reduced value I tried.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is this code safe?

2014-08-27 Thread Tom Lane
Pavan Deolasee  writes:
> Can some kind of compiler optimisation reorder things such that the "else
> if" expression is evaluated using the old, uninitialised value of optval?

Any such behavior is patently illegal per the C standard.  Not that that's
always stopped compiler writers; but I think you might be looking at a
compiler bug.  We'd need more context about the exact platform, compiler,
etc.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Is this code safe?

2014-08-27 Thread Pavan Deolasee
Someone reported an issue on the XL mailing list about the following code in
fe-connect.c failing on the PowerPC platform:

    if (getsockopt(conn->sock, SOL_SOCKET, SO_ERROR,
                   (char *) &optval, &optlen) == -1)
    {
        appendPQExpBuffer(&conn->errorMessage,
                          libpq_gettext("could not get socket error status: %s\n"),
                          SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
        goto error_return;
    }
    else if (optval != 0)
    {
        /*
         * When using a nonblocking connect, we will typically see
         * connect failures at this point, so provide a friendly
         * error message.
         */
        connectFailureMessage(conn, optval);
        pqDropConnection(conn);

        /*
         * If more addresses remain, keep trying, just as in the
         * case where connect() returned failure immediately.
         */
        if (conn->addr_cur->ai_next != NULL)
        {
            conn->addr_cur = conn->addr_cur->ai_next;
            conn->status = CONNECTION_NEEDED;
            goto keep_going;
        }
        goto error_return;
    }

TBH the reported failure is in another component of XC code which is
borrowed/copied from the above portion, so it could very well be an
XC-specific issue. But the reporter claims that initialising optval to 0
where it's declared, or commenting out the "else if" part, fixes his
problem. I found that strange because the preceding getsockopt() should
have initialised optval anyway.
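
(For reference, the reported workaround amounts to something like the
following -- a minimal, self-contained sketch of the pattern, not the
actual libpq code:)

#include <sys/socket.h>

/* Sketch: check the outcome of a nonblocking connect() on 'sock'. */
static int
check_connect_result(int sock)
{
    int         optval = 0;     /* defensively initialised, per the reported workaround */
    socklen_t   optlen = sizeof(optval);

    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, (char *) &optval, &optlen) == -1)
        return -1;              /* getsockopt() itself failed; inspect errno */

    return optval;              /* 0 on success, otherwise the pending connect() errno */
}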

Can some kind of compiler optimisation reorder things such that the "else
if" expression is evaluated using the old, uninitialised value of optval?
Or are these branches evaluated sequentially, so there is no chance that
the "else if" expression would fail to see the new value of optval set by
getsockopt()?

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Dense index?

2014-08-27 Thread Peter Geoghegan
On Wed, Aug 27, 2014 at 10:16 PM, Tatsuo Ishii  wrote:
> I know that the alignment is required for faster memory access, but
> sometimes we may want to save disk space for indexes to save I/O, because
> these days customers want to handle huge numbers of rows. To make indexes
> more "dense", can we add an option, something like "dense index", to
> align index tuples by using INTALIGN rather than MAXALIGN?

I think we should certainly consider it.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dense index?

2014-08-27 Thread Tatsuo Ishii
> Only if you want it to crash hard on most non-Intel architectures.

Of course some CPU architectures prohibit non-word-boundary access,
and we would need to do one of the following:

1) do not allow the option to be used on such architectures

2) work hard to use a temporary buffer which is properly aligned

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dense index?

2014-08-27 Thread Tom Lane
Tatsuo Ishii  writes:
> I know that the alignment is required for faster memory access, but
> sometimes we may want to save disk space for indexes to save I/O, because
> these days customers want to handle huge numbers of rows. To make indexes
> more "dense", can we add an option, something like "dense index", to
> align index tuples by using INTALIGN rather than MAXALIGN?

Only if you want it to crash hard on most non-Intel architectures.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Dense index?

2014-08-27 Thread Tatsuo Ishii
While looking into pg_filedump output of an int4 btree index, it struck
me that in leaf pages about 25% of the page is wasted because of 8-byte
alignment (MAXALIGN on a 64-bit architecture): an index tuple consists of
an 8-byte tuple header + 4-byte key + 4 bytes of alignment padding, thus
4/(8+4+4) = 25% is wasted.
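
To put numbers on that, here is the arithmetic spelled out as a small
standalone sketch (ALIGN_UP mirrors the usual round-up-to-a-multiple
alignment computation; 8 stands in for MAXALIGN on a 64-bit box, 4 for
INTALIGN):

#include <stdio.h>
#include <stdint.h>

/* round len up to the next multiple of align (align must be a power of two) */
#define ALIGN_UP(align, len) \
    (((uintptr_t) (len) + ((align) - 1)) & ~((uintptr_t) ((align) - 1)))

int
main(void)
{
    size_t      raw = 8 + 4;    /* 8-byte index tuple header + 4-byte int4 key */

    printf("MAXALIGN: %zu bytes per tuple\n", (size_t) ALIGN_UP(8, raw));   /* 16 -> 4 bytes padding */
    printf("INTALIGN: %zu bytes per tuple\n", (size_t) ALIGN_UP(4, raw));   /* 12 -> no padding */
    return 0;
}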

I know that the alignment is required for faster memory access, but
sometimes we may want to save disk space for indexes to save I/O, because
these days customers want to handle huge numbers of rows. To make indexes
more "dense", can we add an option, something like "dense index", to
align index tuples by using INTALIGN rather than MAXALIGN?

I feel like this had been discussed before but I couldn't find the
discussion.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing plpgsql.o Symbols on OS X

2014-08-27 Thread Ashesh Vashi
Please add -arch x86_64 to your LDFLAGS and CFLAGS in your makefile.

--

Thanks & Regards,

Ashesh Vashi
EnterpriseDB INDIA: Enterprise PostgreSQL Company

http://www.linkedin.com/in/asheshvashi

On Wed, Aug 27, 2014 at 9:29 PM, David E. Wheeler 
wrote:

> Hackers,
>
> I’m trying to build Pavel’s plpgsql_check against the 9.4 beta on OS X
> 10.9, but get these errors:
>
> make
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
> -Wformat-security -fno-strict-aliasing -fwrapv
> -I/usr/local/pgsql/lib/pgxs/src/makefiles/../../src/pl/plpgsql/src -bundle
> -multiply_defined suppress -o plpgsql_check.so plpgsql_check.o
> -L/usr/local/pgsql/lib -L/usr/local/lib  -L/usr/local/lib
> -Wl,-dead_strip_dylibs   -bundle_loader /usr/local/pgsql/bin/postgres
> Undefined symbols for architecture x86_64:
>  "_exec_get_datum_type", referenced from:
>  _check_target in plpgsql_check.o
>  "_plpgsql_build_datatype", referenced from:
>  _check_stmt in plpgsql_check.o
>  "_plpgsql_compile", referenced from:
>  _check_plpgsql_function in plpgsql_check.o
>  "_plpgsql_parser_setup", referenced from:
>  _prepare_expr in plpgsql_check.o
>  "_plpgsql_stmt_typename", referenced from:
>  _put_error in plpgsql_check.o
> ld: symbol(s) not found for architecture x86_64
> clang: error: linker command failed with exit code 1 (use -v to see
> invocation)
> make: *** [plpgsql_check.so] Error 1
>
> Which is odd, because plpgsql_check.c includes plpgsql.h, and those
> symbols do appear to be in plpgsql.so:
>
> $ nm /usr/local/pgsql/lib/plpgsql.so | grep _exec_get_datum_type
> f110 T _exec_get_datum_type
> f380 T _exec_get_datum_type_info
>
> So, uh, what gives? Do I need to something extra to get it to properly
> find plpgsql.so?
>
> Thanks,
>
> David
>
>


Re: [HACKERS] [TODO] Track number of files ready to be archived in pg_stat_archiver

2014-08-27 Thread Michael Paquier
On Thu, Aug 28, 2014 at 7:37 AM, Julien Rouhaud
 wrote:
>
> Attached v2 patch implements this approach. All the work is still done
> in pg_stat_get_archiver, as I don't think that having a specific
> function for that information would be really interesting.


Please be sure to add that to the next commitfest. This is a most
welcome feature for this system view.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specifying the unit in storage parameter

2014-08-27 Thread Michael Paquier
On Wed, Aug 27, 2014 at 10:59 PM, Alvaro Herrera 
wrote:

> Not necessarily, because it's harmless.  It's there for purely
> aesthetical reasons, so it's your choice whether to add it or not.
> Having it there is slightly easier on somebody reading the code,
> perhaps.


On my side, that's up to you, Fujii-san. The patch does what it states; I
only think that this extra 0 should be added either everywhere or nowhere.
Not mandatory either: drop test_param_unit in the regression tests after
running the test queries.
Regards,
-- 
Michael


Re: [HACKERS] Btree internal node data?

2014-08-27 Thread Peter Geoghegan
On Wed, Aug 27, 2014 at 7:08 PM, Tatsuo Ishii  wrote:
> While looking into a btree internal page using pg_filedump against an
> int4 index generated by pgbench, I noticed that only item 2 has length 8,
> which indicates that the index tuple has only a tuple header and no
> index data. In my understanding this indicates that the item is used
> to represent a down link to a page. The question is, why is that item 2,
> not 1? I thought the index tuple representing a down link was always item 1.
> Is this a sign that something has gone wrong?


No. On a non-rightmost page, the "high key" item is physically first
(which is a bit odd, because it serves as a high-bound invariant on
the items that the page stores, but it's convenient to do it that way
for other reasons). On an internal page (that is also non-rightmost),
the second item (which is the first "real" item - i.e. the item which
P_FIRSTDATAKEY() returns) is just placeholder garbage. The reason for
that is noted above _bt_compare():

 * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
 * "minus infinity": this routine will always claim it is less than the
 * scankey.  The actual key value stored (if any, which there probably isn't)
 * does not matter.  This convention allows us to implement the Lehman and
 * Yao convention that the first down-link pointer is before the first key.
 * See backend/access/nbtree/README for details.
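
If you prefer SQL over pg_filedump, the pageinspect extension shows the
same thing; a sketch, assuming the index comes from Tatsuo's pgbench test
and using the block number from his dump:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- On an internal, non-rightmost page, itemoffset 1 is the high key and
-- itemoffset 2 is the "minus infinity" downlink whose key bytes are ignored.
SELECT itemoffset, ctid, itemlen, data
FROM bt_page_items('pgbench_accounts_pkey', 3);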

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Btree internal node data?

2014-08-27 Thread Tatsuo Ishii
While looking into a btree internal page using pg_filedump against an
int4 index generated by pgbench, I noticed that only item 2 has length 8,
which indicates that the index tuple has only a tuple header and no
index data. In my understanding this indicates that the item is used
to represent a down link to a page. The question is, why is that item 2,
not 1? I thought the index tuple representing a down link was always item 1.
Is this a sign that something has gone wrong?

Block 3 
 -
 Block Offset: 0x6000   Offsets: Lower 1164 (0x048c)
 Block: Size 8192  Version 4    Upper 3624 (0x0e28)
 LSN:  logid  2 recoff 0x1550a608  Special  8176 (0x1ff0)
 Items:  285  Free Space: 2460
 Checksum: 0x  Prune XID: 0x  Flags: 0x ()
 Length (including item array): 1164

 -- 
 Item   1 -- Length:   16  Offset: 3624 (0x0e28)  Flags: NORMAL
 Item   2 -- Length:8  Offset: 8168 (0x1fe8)  Flags: NORMAL
 Item   3 -- Length:   16  Offset: 8152 (0x1fd8)  Flags: NORMAL
 Item   4 -- Length:   16  Offset: 8136 (0x1fc8)  Flags: NORMAL
 Item   5 -- Length:   16  Offset: 8120 (0x1fb8)  Flags: NORMAL
[snip]
 Item 281 -- Length:   16  Offset: 3704 (0x0e78)  Flags: NORMAL
 Item 282 -- Length:   16  Offset: 3688 (0x0e68)  Flags: NORMAL
 Item 283 -- Length:   16  Offset: 3672 (0x0e58)  Flags: NORMAL
 Item 284 -- Length:   16  Offset: 3656 (0x0e48)  Flags: NORMAL
 Item 285 -- Length:   16  Offset: 3640 (0x0e38)  Flags: NORMAL

 -
 BTree Index Section:
  Flags: 0x ()
  Blocks: Previous (0)  Next (289)  Level (1)  CycleId (0)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [BUGS] Re: BUG #9555: pg_dump for tables with inheritance recreates the table with the wrong order of columns

2014-08-27 Thread Noah Misch
On Wed, Aug 27, 2014 at 11:24:53AM -0400, Tom Lane wrote:
> On Wed, Aug 27, 2014 at 10:40:53AM -0400, Bruce Momjian wrote:
> > I looked at this issue from March and I think we need to do something. 
> > In summary, the problem is that tables using inheritance can be dumped
> > and reloaded with columns in a different order from the original
> > cluster.
> 
> Yeah ... this has been a well-understood issue for a dozen years, and
> pg_dump goes to considerable trouble to get it right.

pg_dump goes to trouble to preserve attislocal but not to preserve inherited
column order.  Hence this thread about pg_dump getting column order wrong.

> > I think we have several options:
> > 
> > 1.  document this behavior
> 
> That one.

+1; certainly reasonable as a first step.

> > 2.  have ALTER TABLE INHERIT issue a warning about future reordering

That warning would summarize as "WARNING: this object is now subject to a
known bug".  -0; I'm not strongly opposed, but it's not our norm.

> > 3.  use the pg_dump binary-upgrade code when such cases happen

+1.  We have the convention that, while --binary-upgrade can inject catalog
hacks, regular pg_dump uses standard, documented DDL.  I like that convention
on general aesthetic grounds and for its benefit to non-superusers.  Let's
introduce the DDL needed to fix this bug while preserving that convention,
namely DDL to toggle attislocal.
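
(Whatever spelling that DDL ends up with -- purely hypothetical syntax,
for illustration only; nothing like this exists today:)

-- hypothetical commands to flip a column's attislocal flag
ALTER TABLE child ALTER COLUMN f1 ADD INHERITED;   -- attislocal = false
ALTER TABLE child ALTER COLUMN f1 SET LOCAL;       -- attislocal = true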

> > My crude approach for #3 would be for pg_dump to loop over the columns
> > and, where pg_attribute.attinhcount == 0, check to see if there is a
> > matching column name in any inherited table.

That doesn't look right.  attinhcount is essentially a cache; it shall equal
the number of parents having a matching column.  The approach we use in binary
upgrade mode ought to suffice.

> > Will such tables load fine
> > because pg_dump binary-upgrade mode doesn't do any data loading?

We're now talking about changes to pg_dump's normal (non-binary-upgrade) mode,
right?  pg_dump always gives COPY a column list, so I don't expect trouble on
the data load side of things.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] re-reading SSL certificates during server reload

2014-08-27 Thread Robert Haas
On Wed, Aug 27, 2014 at 6:40 AM, Magnus Hagander  wrote:
> On Wed, Aug 27, 2014 at 11:56 AM, Alexey Klyukin  wrote:
>> Greetings,
>>
>> Is there a strong reason to disallow reloading server key and cert files
>> during the PostgreSQL reload?
>
> Key and cert files are loaded in the postmaster. We'd need to change
> that.

Why?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-27 Thread Alvaro Herrera
Thomas Munro wrote:

> Thanks, I hadn't seen this, I should have checked the archives better.
> I have actually already updated my patch to handle EvalPlanQualFetch
> with NOWAIT and SKIP LOCKED with isolation specs, see attached.  I
> will compare with Craig's and see if I screwed anything up... of
> course I am happy to merge and submit a new patch on top of Craig's if
> it's going to be committed.

Thanks, please rebase.

> I haven't yet figured out how to get get into a situation where
> heap_lock_updated_tuple_rec waits.

Well, as I think I said in the first post where I mentioned this, maybe there
is no such situation.  In any case, like the EvalPlanQualFetch issue, we
can fix it later if we find it.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FOR [SHARE|UPDATE] NOWAIT may still block in EvalPlanQualFetch

2014-08-27 Thread Alvaro Herrera
Craig Ringer wrote:
> FOR SHARE|UPDATE NOWAIT will still block if they have to follow a ctid
> chain because the call to EvalPlanQualFetch doesn't take a param for
> noWait, so it doesn't know not to block if the updated row can't be locked.

Applied with some further editorialization.


In another thread, I wrote in response to Thomas Munro:

> I tried Craig's patch with your test case and found that it stalls in
> XactLockTableWait inside EPQFetch because it doesn't throw an error in
> the noWait case before waiting.  I think I will fix that and push,
> including both test cases.

Done.  I also made the noWait case try a ConditionalXactLockTableWait
before raising an error, because there's a small chance that the
updating transaction aborts just before we check; in this case there is
no need to raise an error.
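
That is, the noWait path now does roughly this (a sketch of the logic, not
the committed code verbatim; xwait and relation come from the surrounding
EvalPlanQualFetch context):

/* Only raise the NOWAIT error if the updater is in fact still running. */
if (!ConditionalXactLockTableWait(xwait))
    ereport(ERROR,
            (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
             errmsg("could not obtain lock on row in relation \"%s\"",
                    RelationGetRelationName(relation))));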

> I am wondering about backpatching Craig's fix.  It looks to me like it
> should be backpatched as far back as NOWAIT exists, but that was in 8.1
> and we haven't ever gotten a complaint until Craig's report AFAIK, which
> I understand wasn't coming from a user finding a problem but rather some
> new development.  So I hesitate.

On further reflection, the reason users don't complain is that it's
quite difficult to notice that there is an issue.  And on the other
hand, since they are already specifying NOWAIT in the query, surely they
must already be expecting an error to be raised; it's not like this
introduces a new failure mode.  My inclination would be to backpatch the
fix all the way back.

However, due to time constraints (because it fails to apply cleanly
to older branches) and due to lack of user complaints, I only
backpatched to 9.4.  We can always revisit that, if there's demand.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing plpgsql.o Symbols on OS X

2014-08-27 Thread Tom Lane
"David E. Wheeler"  writes:
> Hackers,
> I’m trying to build Pavel’s plpgsql_check against the 9.4 beta on OS X 10.9, 
> but get these errors:

> make
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
> -Wformat-security -fno-strict-aliasing -fwrapv  
> -I/usr/local/pgsql/lib/pgxs/src/makefiles/../../src/pl/plpgsql/src -bundle 
> -multiply_defined suppress -o plpgsql_check.so plpgsql_check.o 
> -L/usr/local/pgsql/lib -L/usr/local/lib  -L/usr/local/lib 
> -Wl,-dead_strip_dylibs   -bundle_loader /usr/local/pgsql/bin/postgres
> Undefined symbols for architecture x86_64:
>  "_exec_get_datum_type", referenced from:
>  _check_target in plpgsql_check.o
>  "_plpgsql_build_datatype", referenced from:
>  _check_stmt in plpgsql_check.o
>  "_plpgsql_compile", referenced from:
>  _check_plpgsql_function in plpgsql_check.o
>  "_plpgsql_parser_setup", referenced from:
>  _prepare_expr in plpgsql_check.o
>  "_plpgsql_stmt_typename", referenced from:
>  _put_error in plpgsql_check.o
> ld: symbol(s) not found for architecture x86_64
> clang: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make: *** [plpgsql_check.so] Error 1

> Which is odd, because plpgsql_check.c includes plpgsql.h, and those symbols 
> do appear to be in plpgsql.so:

> $ nm /usr/local/pgsql/lib/plpgsql.so | grep _exec_get_datum_type
> f110 T _exec_get_datum_type
> f380 T _exec_get_datum_type_info

Yeah, but plpgsql.so is mentioned nowhere on your command line.

I'm not too sure about the dynamic-linking rules on OS X, but I'd not be
surprised if you need to provide a reference to plpgsql.so in its final
installed location (ie, a reference to it in the build tree may appear to
link and then fail at runtime).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TODO] Track number of files ready to be archived in pg_stat_archiver

2014-08-27 Thread Julien Rouhaud
Le 25/08/2014 19:00, Gilles Darold a écrit :
> Le 21/08/2014 10:17, Julien Rouhaud a écrit :
>> Hello,
>>
>> Attached patch implements the following TODO item :
>>
>> Track number of WAL files ready to be archived in pg_stat_archiver
>>
>> However, it will track the total number of any file ready to be
>> archived, not only WAL files.
>>
>> Please let me know what you think about it.
>>
>> Regards.
> 
> Hi,
> 
> Maybe looking at the archive ready count will be more efficient if it is
> done in the view definition through a function. This will avoid any
> issue with incrementing/decrementing archiverStats.ready_count, and the
> patch will be simpler. Also I don't think we need another memory
> allocation for that; the counter information is always available as the
> number of .ready files in the archive_status directory, and the call to
> pg_stat_archiver doesn't need high-speed performance.
> 
> For example, have a new function called
> pg_stat_get_archive_ready_count() that does the same as what you add
> into pgstat_read_statsfiles(), and define pg_stat_archiver as follows:
> 
> CREATE VIEW pg_stat_archiver AS
> SELECT
> s.archived_count,
> s.last_archived_wal,
> s.last_archived_time,
> s.failed_count,
> s.last_failed_wal,
> s.last_failed_time,
> pg_stat_get_archive_ready_count() as ready_count,
> s.stats_reset
> FROM pg_stat_get_archiver() s;
> 
> The function pg_stat_get_archive_ready_count() will also be available
> for any other querying.
> 

Indeed, this approach should be more efficient. It also avoids unexpected
results, e.g. if someone has the bad idea to remove a .ready file from the
pg_xlog/archive_status directory.

Attached v2 patch implements this approach. All the work is still done
in pg_stat_get_archiver, as I don't think that having a specific
function for that information would be really interesting.
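
For what it's worth, once the patch is in, the new counter reads like any
other column of the view, e.g.:

SELECT archived_count, failed_count, ready_count, stats_reset
FROM pg_stat_archiver;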
-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
*** a/doc/src/sgml/monitoring.sgml
--- b/doc/src/sgml/monitoring.sgml
***
*** 728,733  postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
--- 728,738 
Time of the last failed archival operation
   
   
+   ready_count
+   bigint
+   Number of files waiting to be archived
+  
+  
stats_reset
timestamp with time zone
Time at which these statistics were last reset
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***
*** 697,702  CREATE VIEW pg_stat_archiver AS
--- 697,703 
  s.failed_count,
  s.last_failed_wal,
  s.last_failed_time,
+ s.ready_count,
  s.stats_reset
  FROM pg_stat_get_archiver() s;
  
*** a/src/backend/utils/adt/pgstatfuncs.c
--- b/src/backend/utils/adt/pgstatfuncs.c
***
*** 15,25 
--- 15,27 
  #include "postgres.h"
  
  #include "access/htup_details.h"
+ #include "access/xlog_internal.h"
  #include "catalog/pg_type.h"
  #include "funcapi.h"
  #include "libpq/ip.h"
  #include "miscadmin.h"
  #include "pgstat.h"
+ #include "storage/fd.h"
  #include "utils/builtins.h"
  #include "utils/inet.h"
  #include "utils/timestamp.h"
***
*** 1737,1752  Datum
  pg_stat_get_archiver(PG_FUNCTION_ARGS)
  {
  	TupleDesc	tupdesc;
! 	Datum		values[7];
! 	bool		nulls[7];
  	PgStat_ArchiverStats *archiver_stats;
  
  	/* Initialise values and NULL flags arrays */
  	MemSet(values, 0, sizeof(values));
  	MemSet(nulls, 0, sizeof(nulls));
  
  	/* Initialise attributes information in the tuple descriptor */
! 	tupdesc = CreateTemplateTupleDesc(7, false);
  	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "archived_count",
  	   INT8OID, -1, 0);
  	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "last_archived_wal",
--- 1739,1758 
  pg_stat_get_archiver(PG_FUNCTION_ARGS)
  {
  	TupleDesc	tupdesc;
! 	Datum		values[8];
! 	bool		nulls[8];
  	PgStat_ArchiverStats *archiver_stats;
+ 	char		XLogArchiveStatusDir[MAXPGPATH];
+ 	DIR		   *rldir;
+ 	struct		dirent *rlde;
+ 	int			ready_count;
  
  	/* Initialise values and NULL flags arrays */
  	MemSet(values, 0, sizeof(values));
  	MemSet(nulls, 0, sizeof(nulls));
  
  	/* Initialise attributes information in the tuple descriptor */
! 	tupdesc = CreateTemplateTupleDesc(8, false);
  	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "archived_count",
  	   INT8OID, -1, 0);
  	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "last_archived_wal",
***
*** 1759,1765  pg_stat_get_archiver(PG_FUNCTION_ARGS)
  	   TEXTOID, -1, 0);
  	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_failed_time",
  	   TIMESTAMPTZOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
  	   TIMESTAMPTZOID, -1, 0);
  
  	BlessTupleDesc(tupdesc);
--- 1765,1773 
  	   TEXTOID, -1, 0);
  	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_failed_time",
  	   TIMESTAMPTZOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "ready_count",
! 	   INT8OID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 8,

Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 6:46 PM, Merlin Moncure  wrote:
>
> Yeah: I was overthinking it.   My mind was on parallel processing of
> the aggregate (which is not what Pavel was proposing) because that
> just happens to be what I'm working on currently -- using dblink to
> decompose various aggregates and distribute the calculation across
> servers.  "Wouldn't it be nice to have the server do that itself", I
> impulsively thought.

But you'd have part of it too, because then you'd have semantically
independent parallel nodes in the plan that do some meaningful data
wrangling and emit little output, whereas the previous plan did not do
much with the data and emitted loads of rows as a result. This is a big
preliminary step toward parallel execution, really.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Merlin Moncure
On Wed, Aug 27, 2014 at 3:27 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> associative bit just makes it easier (which is important of course!).
>> mean for example can be pushed down if the 'pushed down' aggregates
> > return the count to the "reaggregator" so that you can weight the
>> final average.  that's a lot more complicated though.
>
> The real question is what you're expecting to get out of such an
> "optimization".  If the aggregate has to visit all rows then it's
> not apparent to me that any win emerges from the extra complication.
>
> We do already have optimization of min/max across inheritance trees,
> and that's certainly a win because you don't have to visit all rows.
>
> regression=# create table pp(f1 int unique);
> CREATE TABLE
> regression=# create table cc(unique(f1)) inherits(pp);
> CREATE TABLE
> regression=# create table cc2(unique(f1)) inherits(pp);
> CREATE TABLE
> regression=# explain select max(f1) from pp;
>  QUERY PLAN
> 
>  Result  (cost=0.51..0.52 rows=1 width=0)
>InitPlan 1 (returns $0)
>  ->  Limit  (cost=0.46..0.51 rows=1 width=4)
>->  Merge Append  (cost=0.46..267.71 rows=4777 width=4)
>  Sort Key: pp.f1
>  ->  Index Only Scan Backward using pp_f1_key on pp  
> (cost=0.12..8.14 rows=1 width=4)
>Index Cond: (f1 IS NOT NULL)
>  ->  Index Only Scan Backward using cc_f1_key on cc  
> (cost=0.15..85.94 rows=2388 width=4)
>Index Cond: (f1 IS NOT NULL)
>  ->  Index Only Scan Backward using cc2_f1_key on cc2  
> (cost=0.15..85.94 rows=2388 width=4)
>Index Cond: (f1 IS NOT NULL)
>  Planning time: 0.392 ms
> (12 rows)
>
> That doesn't currently extend to the GROUP BY case unfortunately.

Yeah: I was overthinking it.   My mind was on parallel processing of
the aggregate (which is not what Pavel was proposing) because that
just happens to be what I'm working on currently -- using dblink to
decompose various aggregates and distribute the calculation across
servers.  "Wouldn't it be nice to have the server do that itself", I
impulsively thought.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Tomas Vondra
On 27 Srpen 2014, 21:41, Merlin Moncure wrote:
> On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule 
>>
>> Are there some plans to use partitioning for aggregation?
>
> Besides min/max, what other aggregates (mean/stddev come to mind)
> would you optimize and how would you determine which ones could be?
> Where is that decision made?
>
> For example, could user defined aggregates be pushed down if you had a
> reaggregation routine broken out from the main one?

I think that what Pavel suggests is that when you are aggregating by

GROUP BY x

and 'x' happens to be used for partitioning (making it impossible for
groups from different partitions to overlap), then it's perfectly fine to
perform the aggregation per partition, and just append the results.

If you need sorted output, you can sort the results (assuming the
cardinality of the output is much lower than the actual data).

This "append first, then aggregate" may be the cause for switch to sort
(because of fear that the amount of group will exceed work_mem), while we
could just as fine process each partition by hash aggregate separately.
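
For illustration, with Pavel's example schema that is essentially what you
can already write by hand today (assuming each value of d lives in exactly
one partition):

-- one arm per partition; add "SELECT d, max(a) FROM ONLY x1 GROUP BY d"
-- if the parent itself can hold rows
SELECT d, max(a) FROM x_1 GROUP BY d
UNION ALL
SELECT d, max(a) FROM x_2 GROUP BY d
UNION ALL
SELECT d, max(a) FROM x_3 GROUP BY d
ORDER BY d;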

Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Heikki Linnakangas

On 08/28/2014 12:03 AM, Kevin Grittner wrote:
> Heikki Linnakangas  wrote:
>> I suggest adding a new hook to the ParseState struct, (p_rangevar_hook
>> ?). The planner calls it whenever it sees a reference to a table, and
>> the hook function returns back some sort of placeholder reference to the
>> tuplestore. With variables, the hook returns a Param node, and at
>> execution time, the executor calls the paramFetch hook to fetch the
>> value of the param. For relations/tuplestores, I guess we'll need to
>> invent something like a Param node, but for holding information about
>> the relation. Like your TsrData struct, but without the pointer to the
>> tuplestore. At execution time, in the SPI_execute call, you pass the
>> pointer to the tuplestore in the ParamListInfo struct, like you pass
>> parameter values.
>>
>> Does this make sense?
>
> I see your point, but SPI first has to be made aware of the
> tuplestores and their corresponding names and TupleDesc structures.
> Does it make sense to keep the SPI_register_tuplestore() and
> SPI_unregister_tuplestore() functions for the client side of the
> API, and pass things along to the parse analysis through execution
> phases using the techniques you describe?

Sorry, I didn't understand that. What do you mean by "first", and the
"client side of the API"? I don't see any need for the
SPI_register_tuplestore() and SPI_unregister_tuplestore() functions
if you use the hooks.

>> In essence, make the relations work like PL/pgSQL
>> variables do. If you squint a little, the new/old relation is a variable
>> from the function's point of view, and a parameter from the
>> planner/executor's point of view. It's just a variable/parameter that
>> holds a set of tuples, instead of a single Datum.
>
> I don't have to squint that hard -- I've always been comfortable
> with the definition of a table as a relation variable, and it's not
> too big a stretch to expand that to a tuplestore.  ;-)  In fact, I
> will be surprised if someone doesn't latch onto this to create a
> new "declared temporary table" that only exists within the scope of
> a compound statement (i.e., a BEGIN/END block).  You would DECLARE
> them just like you would a scalar variable in a PL, and they would
> have the same scope.

Yeah, that would be cool :-).

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Stephen Frost
* Kevin Grittner (kgri...@ymail.com) wrote:
> Heikki Linnakangas  wrote:
> > In essence, make the relations work like PL/pgSQL
> > variables do. If you squint a little, the new/old relation is a variable
> > from the function's point of view, and a parameter from the
> > planner/executor's point of view. It's just a variable/parameter that
> > holds a set of tuples, instead of a single Datum.
> 
> I don't have to squint that hard -- I've always been comfortable
> with the definition of a table as a relation variable, and it's not
> too big a stretch to expand that to a tuplestore.  ;-)  In fact, I
> will be surprised if someone doesn't latch onto this to create a
> new "declared temporary table" that only exists within the scope of
> a compound statement (i.e., a BEGIN/END block).  You would DECLARE
> them just like you would a scalar variable in a PL, and they would
> have the same scope.
> 
> I'll take a look at doing this in the next couple days, and see
> whether doing it that way is as easy as it seems on the face of it.

(not following this very closely, but saw this...)

Yes, please? :)

Thanks!

Stephen




Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Kevin Grittner
Jim Nasby  wrote:

> Something to keep in mind is that users will definitely think about NEW/OLD as
> tables. I suspect that it won't be long after release before someone asks
> why they can't create an index on it. :)

I'm comfortable saying "No" to that.  But it's a good point -- I'll 
review error checking and documentation to make sure that it is 
clear.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Kevin Grittner
Heikki Linnakangas  wrote:
> On 08/27/2014 02:26 AM, Kevin Grittner wrote:

>> spi-tuplestore-registry allows tuplestores, with associated name
>> and TupleDesc, to be registered with the current SPI connection.
>> Queries planned or executed on that connection will recognize the
>> name as a tuplestore relation.  It doesn't care who is registering
>> the tuplestores or what happens to them.  It doesn't depend on
>> anything else.
>> 5 files changed, 445 insertions(+)
>>
>> ...
>>
>> plpgsql-after-trigger-transition-tables takes the tuplestores from
>> TriggerData and registers them with SPI before trigger planning and
>> execution.  It depends on the trigger-transition-tables and
>> spi-tuplestore-registry patches to build, and won't do anything
>> useful at run time without the executor-tuplestore-relations patch.
>> 3 files changed, 37 insertions(+), 11 deletions(-)
>
> This is a surprising way to expose the NEW/OLD relations to the
> planner/executor. The problem is the same as with making PL/pgSQL
> variables available to the planner/executor in queries within a PL/pgSQL
> function, and the solution we have for that is the "parser hooks" you
> pass to SPI_prepare_params. This tuplestore registry is a different
> solution to the same problem - we could've implemented parameters with a
> registry like this as well. Let's not mix two different designs.
>
> I suggest adding a new hook to the ParseState struct, (p_rangevar_hook
> ?). The planner calls it whenever it sees a reference to a table, and
> the hook function returns back some sort of placeholder reference to the
> tuplestore. With variables, the hook returns a Param node, and at
> execution time, the executor calls the paramFetch hook to fetch the
> value of the param. For relations/tuplestores, I guess we'll need to
> invent something like a Param node, but for holding information about
> the relation. Like your TsrData struct, but without the pointer to the
> tuplestore. At execution time, in the SPI_execute call, you pass the
> pointer to the tuplestore in the ParamListInfo struct, like you pass
> parameter values.
>
> Does this make sense?

I see your point, but SPI first has to be made aware of the
tuplestores and their corresponding names and TupleDesc structures.
Does it make sense to keep the SPI_register_tuplestore() and
SPI_unregister_tuplestore() functions for the client side of the
API, and pass things along to the parse analysis through execution
phases using the techniques you describe?  That would eliminate the
need for the SPI_get_caller_tuplestore() function and the
parse_tuplestore.[ch] files, and change how the data is fetched in
parse analysis and execution phases, but that seems fairly minimal
-- there are exactly three places that would need to call the new
hooks where the patch is now getting the information from the
registry.

> In essence, make the relations work like PL/pgSQL
> variables do. If you squint a little, the new/old relation is a variable
> from the function's point of view, and a parameter from the
> planner/executor's point of view. It's just a variable/parameter that
> holds a set of tuples, instead of a single Datum.

I don't have to squint that hard -- I've always been comfortable
with the definition of a table as a relation variable, and it's not
too big a stretch to expand that to a tuplestore.  ;-)  In fact, I
will be surprised if someone doesn't latch onto this to create a
new "declared temporary table" that only exists within the scope of
a compound statement (i.e., a BEGIN/END block).  You would DECLARE
them just like you would a scalar variable in a PL, and they would
have the same scope.
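
(Purely to illustrate that idea -- hypothetical syntax, nothing that exists
today:)

-- hypothetical: a relation variable scoped to the block, declared like a scalar
DO $$
DECLARE
    changed_rows TABLE (id int, val text);
BEGIN
    INSERT INTO changed_rows SELECT id, val FROM some_table WHERE val IS NOT NULL;
    -- changed_rows would be usable as a relation anywhere inside this block
END
$$;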

I'll take a look at doing this in the next couple days, and see
whether doing it that way is as easy as it seems on the face of it.

Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
2014-08-27 22:27 GMT+02:00 Tom Lane :

> Merlin Moncure  writes:
> > associative bit just makes it easier (which is important of course!).
> > mean for example can be pushed down if the 'pushed down' aggregates
> > return the count to the "reaggregator" so that you can weight the
> > final average.  that's a lot more complicated though.
>
> The real question is what you're expecting to get out of such an
> "optimization".  If the aggregate has to visit all rows then it's
> not apparent to me that any win emerges from the extra complication.
>

I expect to remove the hashing or sorting part of the aggregation. It can
reduce the aggregation to a seq scan only.

Pavel


>
> We do already have optimization of min/max across inheritance trees,
> and that's certainly a win because you don't have to visit all rows.
>
> regression=# create table pp(f1 int unique);
> CREATE TABLE
> regression=# create table cc(unique(f1)) inherits(pp);
> CREATE TABLE
> regression=# create table cc2(unique(f1)) inherits(pp);
> CREATE TABLE
> regression=# explain select max(f1) from pp;
>  QUERY PLAN
>
> 
>  Result  (cost=0.51..0.52 rows=1 width=0)
>InitPlan 1 (returns $0)
>  ->  Limit  (cost=0.46..0.51 rows=1 width=4)
>->  Merge Append  (cost=0.46..267.71 rows=4777 width=4)
>  Sort Key: pp.f1
>  ->  Index Only Scan Backward using pp_f1_key on pp
> (cost=0.12..8.14 rows=1 width=4)
>Index Cond: (f1 IS NOT NULL)
>  ->  Index Only Scan Backward using cc_f1_key on cc
> (cost=0.15..85.94 rows=2388 width=4)
>Index Cond: (f1 IS NOT NULL)
>  ->  Index Only Scan Backward using cc2_f1_key on cc2
> (cost=0.15..85.94 rows=2388 width=4)
>Index Cond: (f1 IS NOT NULL)
>  Planning time: 0.392 ms
> (12 rows)
>
> That doesn't currently extend to the GROUP BY case unfortunately.
>
> regards, tom lane
>


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-08-27 Thread Pavel Stehule
Hi

I chose \? xxx because it is related to psql features. I implemented these
commands:

\? options
\? variables

Comments?

Regards

Pavel



2014-08-26 13:48 GMT+02:00 Andres Freund :

> On 2014-08-26 13:44:16 +0200, Pavel Stehule wrote:
> > 2014-08-26 13:30 GMT+02:00 Petr Jelinek :
> >
> > > On 26/08/14 13:20, Andres Freund wrote:
> > >
> > >>
> > >> I'm looking at committing this, but I wonder: Shouldn't this be
> > >> accessible from inside psql as well? I.e. as a backslash command?
> > >>
> > >>
> > > +1
> > >
> >
> > have you idea about command name?  \?+
>
> Some ideas:
>
> \hv
> \help-variables
> \? set
> \? variables
>
>
> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>
commit 1c00470629f31d61886ff9ae95a6855693f358c8
Author: Pavel Stehule 
Date:   Wed Aug 27 22:47:07 2014 +0200

access to help_variables and usage from psql via psql command

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 74d4618..6d3189d 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -568,6 +568,15 @@ EOF
   
 
 
+
+  --help-variables
+  
+  
+  Show help about psql variables,
+  and exit.
+  
+  
+
   
  
 
@@ -2572,10 +2581,12 @@ testdb=> \setenv LESS -imx4F
 
 
   
-\?
+\? [ options | variables ]
 
 
-Shows help information about the backslash commands.
+Shows help information about the backslash commands.  This command can have a
+option "variables" or "options" to take help for psql configuration variables
+or psql command line options.
 
 
   
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index e27ff8c..12cbb20 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1503,7 +1503,19 @@ exec_command(const char *cmd,
 
 	/* \? -- slash command help */
 	else if (strcmp(cmd, "?") == 0)
-		slashUsage(pset.popt.topt.pager);
+	{
+		char	   *opt0 = psql_scan_slash_option(scan_state,
+	OT_NORMAL, NULL, false);
+
+		if (!opt0)
+			slashUsage(pset.popt.topt.pager);
+		else if (strcmp(opt0, "variables") == 0)
+			help_variables(pset.popt.topt.pager);
+		else if (strcmp(opt0, "options") == 0)
+			usage(pset.popt.topt.pager);
+		else
+			slashUsage(pset.popt.topt.pager);
+	}
 
 #if 0
 
diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c
index f8f000f..5e7953d 100644
--- a/src/bin/psql/help.c
+++ b/src/bin/psql/help.c
@@ -46,11 +46,12 @@
 #define ON(var) (var ? _("on") : _("off"))
 
 void
-usage(void)
+usage(unsigned short int pager)
 {
 	const char *env;
 	const char *user;
 	char	   *errstr;
+	FILE	   *output;
 
 	/* Find default user, in case we need it. */
 	user = getenv("PGUSER");
@@ -64,77 +65,82 @@ usage(void)
 		}
 	}
 
-	printf(_("psql is the PostgreSQL interactive terminal.\n\n"));
-	printf(_("Usage:\n"));
-	printf(_("  psql [OPTION]... [DBNAME [USERNAME]]\n\n"));
+	output = PageOutput(59, pager);
+
+	fprintf(output, _("psql is the PostgreSQL interactive terminal.\n\n"));
+	fprintf(output, _("Usage:\n"));
+	fprintf(output, _("  psql [OPTION]... [DBNAME [USERNAME]]\n\n"));
 
-	printf(_("General options:\n"));
+	fprintf(output, _("General options:\n"));
 	/* Display default database */
 	env = getenv("PGDATABASE");
 	if (!env)
 		env = user;
-	printf(_("  -c, --command=COMMANDrun only single command (SQL or internal) and exit\n"));
-	printf(_("  -d, --dbname=DBNAME  database name to connect to (default: \"%s\")\n"), env);
-	printf(_("  -f, --file=FILENAME  execute commands from file, then exit\n"));
-	printf(_("  -l, --list   list available databases, then exit\n"));
-	printf(_("  -v, --set=, --variable=NAME=VALUE\n"
-			 "   set psql variable NAME to VALUE\n"));
-	printf(_("  -V, --versionoutput version information, then exit\n"));
-	printf(_("  -X, --no-psqlrc  do not read startup file (~/.psqlrc)\n"));
-	printf(_("  -1 (\"one\"), --single-transaction\n"
+	fprintf(output, _("  -c, --command=COMMANDrun only single command (SQL or internal) and exit\n"));
+	fprintf(output, _("  -d, --dbname=DBNAME  database name to connect to (default: \"%s\")\n"), env);
+	fprintf(output, _("  -f, --file=FILENAME  execute commands from file, then exit\n"));
+	fprintf(output, _("  -l, --list   list available databases, then exit\n"));
+	fprintf(output, _("  -v, --set=, --variable=NAME=VALUE\n"
+			 "   set psql variable NAME to VALUE e.g.: -v ON_ERROR_STOP=1\n"));
+	fprintf(output, _("  -V, --versionoutput version information, then exit\n"));
+	fprintf(output, _("  -X, --no-psqlrc  do not read startup file (~/.psqlrc)\n"));
+	fprintf(output, _("  -1 (\"one\"), --single-transaction\n"
 			 "   execute as a single transaction (if non-interact

Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Tom Lane
Merlin Moncure  writes:
> associative bit just makes it easier (which is important of course!).
> mean for example can be pushed down if the 'pushed down' aggregates
> return the count to the "reaggregator" so that you can weight the
> final average.  that's a lot more complicated though.

The real question is what you're expecting to get out of such an
"optimization".  If the aggregate has to visit all rows then it's
not apparent to me that any win emerges from the extra complication.

We do already have optimization of min/max across inheritance trees,
and that's certainly a win because you don't have to visit all rows.

regression=# create table pp(f1 int unique);
CREATE TABLE
regression=# create table cc(unique(f1)) inherits(pp);
CREATE TABLE
regression=# create table cc2(unique(f1)) inherits(pp);
CREATE TABLE
regression=# explain select max(f1) from pp;
 QUERY PLAN 


 Result  (cost=0.51..0.52 rows=1 width=0)
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.46..0.51 rows=1 width=4)
   ->  Merge Append  (cost=0.46..267.71 rows=4777 width=4)
 Sort Key: pp.f1
 ->  Index Only Scan Backward using pp_f1_key on pp  
(cost=0.12..8.14 rows=1 width=4)
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan Backward using cc_f1_key on cc  
(cost=0.15..85.94 rows=2388 width=4)
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan Backward using cc2_f1_key on cc2  
(cost=0.15..85.94 rows=2388 width=4)
   Index Cond: (f1 IS NOT NULL)
 Planning time: 0.392 ms
(12 rows)

That doesn't currently extend to the GROUP BY case unfortunately.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-27 Thread Alvaro Herrera
Thomas Munro wrote:
> On 27 August 2014 17:18, Alvaro Herrera  wrote:
> > Thomas Munro wrote:

> >> Yes it does, thanks Alvaro and Craig.  I think the attached spec
> >> reproduces the problem using that trick, ie shows NOWAIT blocking,
> >> presumably in EvalPlanQualFetch (though I haven't stepped through it
> >> with a debugger yet).  I'm afraid I'm out of Postgres hacking cycles
> >> for a few days, but next weekend I should have a new patch that fixes
> >> this by teaching EvalPlanQualFetch about wait policies, with isolation
> >> tests for NOWAIT and SKIP LOCKED.
> >
> > Hmm, http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com
> 
> Thanks, I hadn't seen this, I should have checked the archives better.
> I have actually already updated my patch to handle EvalPlanQualFetch
> with NOWAIT and SKIP LOCKED with isolation specs, see attached.  I
> will compare with Craig's and see if I screwed anything up... of
> course I am happy to merge and submit a new patch on top of Craig's if
> it's going to be committed.

I tried Craig's patch with your test case and found that it stalls in
XactLockTableWait inside EPQFetch because it doesn't throw an error in
the noWait case before waiting.  I think I will fix that and push,
including both test cases.  Then we can see about rebasing your patch.

I am wondering about backpatching Craig's fix.  It looks to me like it
should be backpatched as far back as NOWAIT exists, but that was in 8.1
and we haven't ever gotten a complaint until Craig's report AFAIK, which
I understand wasn't coming from a user finding a problem but rather some
new development.  So I hesitate.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Merlin Moncure
On Wed, Aug 27, 2014 at 2:46 PM, Claudio Freire  wrote:
> On Wed, Aug 27, 2014 at 4:41 PM, Merlin Moncure  wrote:
>> On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule  
>> wrote:
>>> Hi
>>>
>>> one user asked about using a partitioning for faster aggregates queries.
>>>
>>> I found so there is not any optimization.
>>>
>>> create table x1(a int, d date);
>>> create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
>>> create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
>>> create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);
>>>
>>> When I have this schema, then optimizer try to do
>>>
>>> postgres=# explain verbose select max(a) from x1 group by d order by d;
>>>QUERY PLAN
>>> 
>>>  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
>>>Output: max(x1.a), x1.d
>>>Group Key: x1.d
>>>->  Sort  (cost=684.79..706.19 rows=8561 width=8)
>>>  Output: x1.d, x1.a
>>>  Sort Key: x1.d
>>>  ->  Append  (cost=0.00..125.60 rows=8561 width=8)
>>>->  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
>>>  Output: x1.d, x1.a
>>>->  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
>>> width=8)
>>>  Output: x_1.d, x_1.a
>>>->  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
>>> width=8)
>>>  Output: x_2.d, x_2.a
>>>->  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
>>> width=8)
>>>  Output: x_3.d, x_3.a
>>>->  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
>>> width=8)
>>>  Output: x_4.d, x_4.a
>>>  Planning time: 0.333 ms
>>>
>>> It can be reduced to:
>>>
>>>   sort by d
>>>   Append
>>>Aggegate (a), d
>>>  seq scan from x_1
>>>Aggregate (a), d
>>>  seq scan from x_2
>>>
>>> Are there some plans to use partitioning for aggregation?
>>
>> Besides min/max, what other aggregates (mean/stddev come to mind)
>> would you optimize and how would you determine which ones could be?
>> Where is that decision made?
>
>
> You can't with mean and stddev, only with associative aggregates.

associative bit just makes it easier (which is important of course!).
mean for example can be pushed down if the 'pushed down' aggregates
return the count to the "reaggregator" so that you can weight the
final average.  that's a lot more complicated though.
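
For example, a per-group mean could be re-aggregated from per-partition
partials like this (a sketch against Pavel's example schema):

SELECT d, sum(part_sum)::numeric / sum(part_count) AS avg_a
FROM (
    SELECT d, sum(a) AS part_sum, count(a) AS part_count FROM x_1 GROUP BY d
    UNION ALL
    SELECT d, sum(a) AS part_sum, count(a) AS part_count FROM x_2 GROUP BY d
) partials
GROUP BY d
ORDER BY d;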

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
2014-08-27 21:46 GMT+02:00 Claudio Freire :

> On Wed, Aug 27, 2014 at 4:41 PM, Merlin Moncure 
> wrote:
> > On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule 
> wrote:
> >> Hi
> >>
> >> one user asked about using a partitioning for faster aggregates queries.
> >>
> >> I found so there is not any optimization.
> >>
> >> create table x1(a int, d date);
> >> create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
> >> create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
> >> create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);
> >>
> >> When I have this schema, then optimizer try to do
> >>
> >> postgres=# explain verbose select max(a) from x1 group by d order by d;
> >>QUERY PLAN
> >>
> 
> >>  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
> >>Output: max(x1.a), x1.d
> >>Group Key: x1.d
> >>->  Sort  (cost=684.79..706.19 rows=8561 width=8)
> >>  Output: x1.d, x1.a
> >>  Sort Key: x1.d
> >>  ->  Append  (cost=0.00..125.60 rows=8561 width=8)
> >>->  Seq Scan on public.x1  (cost=0.00..0.00 rows=1
> width=8)
> >>  Output: x1.d, x1.a
> >>->  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
> >> width=8)
> >>  Output: x_1.d, x_1.a
> >>->  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
> >> width=8)
> >>  Output: x_2.d, x_2.a
> >>->  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
> >> width=8)
> >>  Output: x_3.d, x_3.a
> >>->  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
> >> width=8)
> >>  Output: x_4.d, x_4.a
> >>  Planning time: 0.333 ms
> >>
> >> It can be reduced to:
> >>
> >>   sort by d
> >>   Append
> >>Aggegate (a), d
> >>  seq scan from x_1
> >>Aggregate (a), d
> >>  seq scan from x_2
> >>
> >> Are there some plans to use partitioning for aggregation?
> >
> > Besides min/max, what other aggregates (mean/stddev come to mind)
> > would you optimize and how would you determine which ones could be?
> > Where is that decision made?
>
>
> You can't with mean and stddev, only with associative aggregates.
>
> That's min, max, sum, bit_and, bit_or, bool_and, bool_or, count.
>

I don't think so.

I have partitions by X, and my query has a GROUP BY X clause,

so I can calculate any aggregate.

Pavel


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
2014-08-27 21:41 GMT+02:00 Merlin Moncure :

> On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule 
> wrote:
> > Hi
> >
> > one user asked about using a partitioning for faster aggregates queries.
> >
> > I found so there is not any optimization.
> >
> > create table x1(a int, d date);
> > create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
> > create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
> > create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);
> >
> > When I have this schema, then optimizer try to do
> >
> > postgres=# explain verbose select max(a) from x1 group by d order by d;
> >QUERY PLAN
> >
> 
> >  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
> >Output: max(x1.a), x1.d
> >Group Key: x1.d
> >->  Sort  (cost=684.79..706.19 rows=8561 width=8)
> >  Output: x1.d, x1.a
> >  Sort Key: x1.d
> >  ->  Append  (cost=0.00..125.60 rows=8561 width=8)
> >->  Seq Scan on public.x1  (cost=0.00..0.00 rows=1
> width=8)
> >  Output: x1.d, x1.a
> >->  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
> > width=8)
> >  Output: x_1.d, x_1.a
> >->  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
> > width=8)
> >  Output: x_2.d, x_2.a
> >->  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
> > width=8)
> >  Output: x_3.d, x_3.a
> >->  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
> > width=8)
> >  Output: x_4.d, x_4.a
> >  Planning time: 0.333 ms
> >
> > It can be reduced to:
> >
> >   sort by d
> >   Append
> >Aggegate (a), d
> >  seq scan from x_1
> >Aggregate (a), d
> >  seq scan from x_2
> >
> > Are there some plans to use partitioning for aggregation?
>
> Besides min/max, what other aggregates (mean/stddev come to mind)
> would you optimize and how would you determine which ones could be?
> Where is that decision made?
>

I think all aggregates are possible.

When you have partitions by column X, you have natural sets partitioned by X,

so you can directly calculate any aggregate on any column when the GROUP BY
clause is "GROUP BY X".

Isn't that so?

Probably some similar optimizations are possible when you have "GROUP BY
X, Y" -- at a minimum you have more sets, and you can do aggregations on
smaller sets.

Pavel


>
> For example, could user defined aggregates be pushed down if you had a
> reaggregation routine broken out from the main one?
>
> merlin
>


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 4:41 PM, Merlin Moncure  wrote:
> On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule  
> wrote:
>> Hi
>>
>> one user asked about using partitioning for faster aggregate queries.
>>
>> I found that there is no such optimization.
>>
>> create table x1(a int, d date);
>> create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
>> create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
>> create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);
>>
>> When I have this schema, the optimizer tries to do
>>
>> postgres=# explain verbose select max(a) from x1 group by d order by d;
>>QUERY PLAN
>> 
>>  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
>>Output: max(x1.a), x1.d
>>Group Key: x1.d
>>->  Sort  (cost=684.79..706.19 rows=8561 width=8)
>>  Output: x1.d, x1.a
>>  Sort Key: x1.d
>>  ->  Append  (cost=0.00..125.60 rows=8561 width=8)
>>->  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
>>  Output: x1.d, x1.a
>>->  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
>> width=8)
>>  Output: x_1.d, x_1.a
>>->  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
>> width=8)
>>  Output: x_2.d, x_2.a
>>->  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
>> width=8)
>>  Output: x_3.d, x_3.a
>>->  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
>> width=8)
>>  Output: x_4.d, x_4.a
>>  Planning time: 0.333 ms
>>
>> It can be reduced to:
>>
>>   sort by d
>>   Append
>>Aggregate (a), d
>>  seq scan from x_1
>>Aggregate (a), d
>>  seq scan from x_2
>>
>> Are there some plans to use partitioning for aggregation?
>
> Besides min/max, what other aggregates (mean/stddev come to mind)
> would you optimize and how would you determine which ones could be?
> Where is that decision made?


You can't with mean and stddev, only with associative aggregates.

That's min, max, sum, bit_and, bit_or, bool_and, bool_or, count.
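
(For illustration only -- an editorial sketch in plain C, not code from any patch in this thread: avg() is not associative over its raw inputs, but its partial state (sum, count) does combine across partitions, which is the "reaggregation routine" idea raised elsewhere in the thread.)

#include <stdio.h>

typedef struct
{
    double sum;
    long   count;
} AvgState;

/* combine two per-partition partial states into one */
static AvgState
avg_combine(AvgState a, AvgState b)
{
    AvgState r = { a.sum + b.sum, a.count + b.count };

    return r;
}

int
main(void)
{
    AvgState x_1 = { 120.0, 40 };   /* partial state from partition x_1 */
    AvgState x_2 = {  75.0, 25 };   /* partial state from partition x_2 */
    AvgState total = avg_combine(x_1, x_2);

    printf("avg = %g\n", total.sum / total.count);
    return 0;
}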


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Merlin Moncure
On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule  wrote:
> Hi
>
> one user asked about using partitioning for faster aggregate queries.
>
> I found that there is no such optimization.
>
> create table x1(a int, d date);
> create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
> create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
> create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);
>
> When I have this schema, the optimizer tries to do
>
> postgres=# explain verbose select max(a) from x1 group by d order by d;
>QUERY PLAN
> 
>  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
>Output: max(x1.a), x1.d
>Group Key: x1.d
>->  Sort  (cost=684.79..706.19 rows=8561 width=8)
>  Output: x1.d, x1.a
>  Sort Key: x1.d
>  ->  Append  (cost=0.00..125.60 rows=8561 width=8)
>->  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
>  Output: x1.d, x1.a
>->  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
> width=8)
>  Output: x_1.d, x_1.a
>->  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
> width=8)
>  Output: x_2.d, x_2.a
>->  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
> width=8)
>  Output: x_3.d, x_3.a
>->  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
> width=8)
>  Output: x_4.d, x_4.a
>  Planning time: 0.333 ms
>
> It can be reduced to:
>
>   sort by d
>   Append
>Aggregate (a), d
>  seq scan from x_1
>Aggregate (a), d
>  seq scan from x_2
>
> Are there some plans to use partitioning for aggregation?

Besides min/max, what other aggregates (mean/stddev come to mind)
would you optimize and how would you determine which ones could be?
Where is that decision made?

For example, could user defined aggregates be pushed down if you had a
reaggregation routine broken out from the main one?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
Hi

one user asked about using partitioning for faster aggregate queries.

I found that there is no such optimization.

create table x1(a int, d date);
create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);

When I have this schema, the optimizer tries to do

postgres=# explain verbose select max(a) from x1 group by d order by d;
   QUERY
PLAN

 GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
   Output: max(x1.a), x1.d
   Group Key: x1.d
   ->  Sort  (cost=684.79..706.19 rows=8561 width=8)
 Output: x1.d, x1.a
 Sort Key: x1.d
 ->  Append  (cost=0.00..125.60 rows=8561 width=8)
   ->  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
 Output: x1.d, x1.a
   ->  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
width=8)
 Output: x_1.d, x_1.a
   ->  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
width=8)
 Output: x_2.d, x_2.a
   ->  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
width=8)
 Output: x_3.d, x_3.a
   ->  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
width=8)
 Output: x_4.d, x_4.a
 Planning time: 0.333 ms

It can be reduced to:

  sort by d
  Append
   Aggregate (a), d
 seq scan from x_1
   Aggregate (a), d
 seq scan from x_2

Are there some plans to use partitioning for aggregation?

Regards

Pavel


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-27 Thread Arthur Silva
On Wed, Aug 27, 2014 at 1:09 AM, Arthur Silva  wrote:

> It won't be faster by any means, but it should definitely be incorporated
> if any format changes are made (like Tom already suggested).
>
> I think it's important we gather at least 2 more things before making any
> calls:
> * Josh tests w/ cache aware patch, which should confirm cache aware is
> indeed preferred
> * Tests with toast hacked to use lz4 instead, which might ease any
> decisions
>
>
> --
> Arthur Silva
>
>
>
> On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan  wrote:
>
>> On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva 
>> wrote:
>> > The difference is small but it's definitely faster, which makes sense
>> since
>> > cache line misses are probably slightly reduced.
>> > As in the previous runs, I ran the query a dozen times and took the
>> average
>> > after excluding runs with a high deviation.
>>
>> I'm not surprised that it hasn't beaten HEAD. I haven't studied the
>> problem in detail, but I don't think that the "cache awareness" of the
>> new revision is necessarily a distinct advantage.
>>
>> --
>> Peter Geoghegan
>>
>
>
I'm attaching a quick-n-dirty patch that uses lz4 compression instead of
pglz in case someone wants to experiment with it. Seems to work in my test
env, I'll make more tests when I get home.

PS: gotta love gmail's fixed default of top-posting...


lz4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-27 Thread Tom Lane
Jim Nasby  writes:
> On 8/26/14, 8:40 AM, Heikki Linnakangas wrote:
>> Just so everyone is on the same page on what kind of queries this helps 
>> with, here are some examples from the added regression tests:
>> 
> -- Test join removals for semi and anti joins
> CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY, val INT);
> CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT REFERENCES b(id));
> -- should remove semi join to b
> EXPLAIN (COSTS OFF)
> SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
> 
> SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);

> I also fail to see a use for examples that are that silly *unless* we're 
> talking machine-generated SQL, but I suspect that normally uses JOINS.

> Where I would expect this to be useful is in cases where we can pre-evaluate 
> some other condition in the subqueries to make the subqueries useless (ie: 
> SELECT id FROM b WHERE 1=1), or where the condition could be passed through 
> (ie: SELECT id FROM b WHERE id=42). Another possibility would be if there's a 
> condition in the subquery that could trigger constraint elimination.

Unless I'm misunderstanding something, pretty much *any* WHERE restriction
in the subquery would defeat this optimization, since it would no longer
be certain that there was a match to an arbitrary outer-query row.  So
it seems unlikely to me that this would fire in enough real-world cases
to be worth including.  I am definitely not a fan of carrying around
deadwood in the planner.

If the majority of the added code is code that will be needed for
less-bogus optimizations, it might be all right; but I'd kind of want to
see the less-bogus optimizations working first.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-27 Thread Jim Nasby

On 8/27/14, 1:38 AM, Andres Freund wrote:

It occurs to me that it'd also be nice to have some
>stats available on how this is performing; perhaps a dtrace probe for
>whenever we overflow to the hash table, and one that shows maximum
>usage for a statement? (Presumably that's not much extra code or
>overhead...)

I don't use dtrace, so *I* won't do that. Personally I just dynamically
add probes using "perf probe" when I need to track something like this.


Yeah, I didn't mean dtrace directly; don't we have some macro that equates to 
dtrace or perf-probe depending on architecture?


I don't see how you could track maximum usage without more
complications/slowdowns than warranted.


I was thinking we'd only show maximum if we overflowed, but maybe it's still 
too much overhead in that case.

In any case, I was thinking this would be trivial to add now, but if it's not 
then someone can do it when there's actual need.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function to know last log write timestamp

2014-08-27 Thread Jim Nasby

On 8/27/14, 7:33 AM, Fujii Masao wrote:

On Tue, Aug 19, 2014 at 1:07 AM, Robert Haas  wrote:

On Fri, Aug 15, 2014 at 7:17 AM, Fujii Masao  wrote:

On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund  wrote:

On 2014-08-14 14:37:22 -0400, Robert Haas wrote:

On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund  wrote:

On 2014-08-14 14:19:13 -0400, Robert Haas wrote:

That's about the idea. However, what you've got there is actually
unsafe, because shmem->counter++ is not an atomic operation.  It reads
the counter (possibly even as two separate 4-byte loads if the counter
is an 8-byte value), increments it inside the CPU, and then writes the
resulting value back to memory.  If two backends do this concurrently,
one of the updates might be lost.
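
A minimal standalone illustration of that read-modify-write hazard (plain C, not PostgreSQL source; the GCC/Clang builtin below merely stands in for whatever atomic fetch-add primitive one would actually use):

#include <stdint.h>

static uint64_t shared_counter;     /* imagine this lives in shared memory */

void
unsafe_increment(void)
{
    /*
     * Compiles to: load, add 1, store.  Two backends interleaving these
     * three steps can both store the same value, losing one increment.
     */
    shared_counter++;
}

void
atomic_increment(void)
{
    /* the whole load-add-store happens as one indivisible step */
    __sync_fetch_and_add(&shared_counter, 1);
}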


All these are only written by one backend, so it should be safe. Note
that that coding pattern, just without memory barriers, is all over
pgstat.c


Ah, OK.  If there's a separate slot for each backend, I agree that it's safe.

We should probably add barriers to pgstat.c, too.


Yea, definitely. I think this is rather borked on "weaker"
architectures. It's just that the consequences of an out of date/torn
value are rather low, so it's unlikely to be noticed.

Imo we should encapsulate the changecount modifications/checks somehow
instead of repeating the barriers, Asserts, comments et al everywhere.


So what about applying the attached patch first, which adds the macros
to load and store the changecount with the memory barriers, and changes
pgstat.c to use them. Maybe this patch needs to be back-patched to at least 9.4?

After applying the patch, I will rebase the pg_last_xact_insert_timestamp
patch and post it again.


That looks OK to me on a relatively-quick read-through.  I was
initially a bit worried about this part:

   do
   {
! pgstat_increment_changecount_before(beentry);
   } while ((beentry->st_changecount & 1) == 0);

pgstat_increment_changecount_before is an increment followed by a
write barrier.  This seemed like funny coding to me at first because the
while-test isn't protected by any sort of barrier.  But now I think
it's correct, because there's only one process that can possibly write
to that data, and that's the one that is making the test, and it had
certainly better see its own modifications in program order no matter
what.

I wouldn't object to back-patching this to 9.4 if we were earlier in
the beta cycle, but at this point I'm more inclined to just put it in
9.5.  If we get an actual bug report about any of this, we can always
back-patch the fix at that time.  But so far that seems mostly
hypothetical, so I think the less-risky course of action is to give
this a longer time to bake before it hits an official release.


Sounds reasonable. So, barring any objection, I will apply the patch
only to the master branch.


It's probably worth adding a comment explaining why it's safe to do this 
without a barrier...
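
For reference, a self-contained sketch of the changecount ("seqlock"-style) protocol under discussion -- the names are made up, the payload is a single int, and the comments only mark where pgstat's barrier macros would sit; it is not the actual pgstat.c code:

#include <stdint.h>

typedef struct
{
    volatile uint32_t changecount;  /* odd while an update is in progress */
    int               xact_start;   /* stand-in for the real payload fields */
} BackendEntry;

/* writer: only the single owning backend ever calls this */
static void
entry_update(BackendEntry *e, int new_start)
{
    e->changecount++;               /* now odd: tells readers to retry */
    /* write barrier would go here */
    e->xact_start = new_start;
    /* write barrier would go here */
    e->changecount++;               /* even again: consistent snapshot */
}

/* reader: loop until the counter is even and unchanged around the copy */
static int
entry_read(const BackendEntry *e)
{
    uint32_t before;
    uint32_t after;
    int      copy;

    do
    {
        before = e->changecount;
        /* read barrier would go here */
        copy = e->xact_start;
        /* read barrier would go here */
        after = e->changecount;
    } while (before != after || (before & 1) != 0);

    return copy;
}
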
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Sequence Scan doubts

2014-08-27 Thread Jim Nasby

On 8/24/14, 6:22 AM, Haribabu Kommi wrote:

Yes, we are mainly targeting CPU-limited sequential scans. For this
reason I want the worker to handle the predicates as well, not just
read the tuples from disk.


In that case, I would suggest focusing on parallel execution of conditions 
regardless of where they show up in the query plan. In my experience, they 
often have nothing to do with a seqscan.

Here's a real-world example. We have a view that pivots our applications 
accounting journal into a ledger. The expensive part of the view is this:

sum(
CASE
WHEN b.tag::text = 'installment_principal'::text THEN b.type_cd -- 
type_cd is either 1, 0, or -1
ELSE 0::numeric
END
) * transaction_amount AS installment_principal

The view with this pivot has about 100 of these case statements. Frequently we 
only reference a few of them, but anytime we need to refer to 20+ the 
evaluation of that expression gets VERY cpu-expensive compared to the rest of 
the query.

The other thing I would look at before seqscan filters is join processing and 
bitmap index combining (ie: ANDing together the results of several bitmap 
index scans). Those are things that can be very CPU intensive even when doing 
simple equality comparisons.

BTW, it's also possible that these cases would be good fits for GPU parallel 
execution.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread David Fetter
On Wed, Aug 27, 2014 at 11:51:40AM -0500, Jim Nasby wrote:
> On 8/27/14, 2:23 AM, Heikki Linnakangas wrote:
> >Does this make sense? In essence, make the relations work like
> >PL/pgSQL variables do. If you squint a little, the new/old relation
> >is a variable from the function's point of view, and a parameter
> >from the planner/executor's point of view. It's just a
> >variable/parameter that holds a set of tuples, instead of a single
> >Datum.
> 
> Something to keep in mind is that users will definitely think about
> NEW/OLD as tables. I suspect that it won't be long after release
> before someone asks why they can't create an index on it. :)

Continuing with this digression, that request seems more likely with
views and foreign tables, given that they persist across statements.
I'm given to understand that other systems have at least the former.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Petr Jelinek

On 27/08/14 18:53, Andres Freund wrote:

On 2014-08-26 23:04:48 -0500, Jim Nasby wrote:

On 8/26/14, 8:45 PM, Michael Paquier wrote:

Hi all,

As mentioned here, we support multiple logging format:
http://www.postgresql.org/docs/devel/static/runtime-config-logging.html
Now what about a json format logging with one json object per log entry?

A single json entry would need more space than a csv one as we need to
track the field names with their values. Also, there is always the
argument that if an application needs json-format logs, it could use
csvlog on Postgres-side and do the transformation itself. But wouldn't
it be a win for application or tools if such an option is available
in-core?

Note that I am not planning to work on that in a close future, but it
would be a good TODO item for beginners if it is worth pursuing.


Perhaps instead of doing this in-core it would be better to make log handling more 
extensible? I'm thinking add a specific "binary" format and an external tool 
that can parse that and do whatever the user wants with it. That means we don't have to 
keep adding more complexity to the internal log handling (which already has the risk of 
being a bottleneck), while allowing maximum user flexibility.


There's a logging hook. Most of this should be doable from there.



Yes, as demonstrated by https://github.com/mpihlak/pg_logforward
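
For anyone curious, a minimal sketch of what such a module looks like, using the emit_log_hook extension point (the module and function names here are made up, and a real implementation -- like pg_logforward above -- would do proper JSON escaping and routing instead of a bare fprintf):

#include "postgres.h"
#include "fmgr.h"
#include "utils/elog.h"

PG_MODULE_MAGIC;

void        _PG_init(void);

static emit_log_hook_type prev_emit_log_hook = NULL;

/* called once for every message that reaches the server log */
static void
sketch_emit_log(ErrorData *edata)
{
    /* emit only severity and primary message; no escaping is done here */
    fprintf(stderr, "{\"elevel\": %d, \"message\": \"%s\"}\n",
            edata->elevel, edata->message ? edata->message : "");

    /* keep any previously installed hook working */
    if (prev_emit_log_hook)
        prev_emit_log_hook(edata);
}

void
_PG_init(void)
{
    prev_emit_log_hook = emit_log_hook;
    emit_log_hook = sketch_emit_log;
}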


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-27 Thread Jim Nasby

On 8/26/14, 8:40 AM, Heikki Linnakangas wrote:

Just so everyone is on the same page on what kind of queries this helps with, 
here are some examples from the added regression tests:


-- Test join removals for semi and anti joins
CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY, val INT);
CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT REFERENCES b(id));
-- should remove semi join to b
EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE b_id IN(SELECT id FROM b);



SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);


I also fail to see a use for examples that are that silly *unless* we're 
talking machine-generated SQL, but I suspect that normally uses JOINS.

Where I would expect this to be useful is in cases where we can pre-evaluate 
some other condition in the subqueries to make the subqueries useless (ie: 
SELECT id FROM b WHERE 1=1), or where the condition could be passed through 
(ie: SELECT id FROM b WHERE id=42). Another possibility would be if there's a 
condition in the subquery that could trigger constraint elimination.

Those are the real world cases I'd expect to see from anything reasonably sane (an 
adjective that doesn't always apply to some of the users I have to support...) My $0.01 
on the burden of carrying the "useless" tests and code around is that it 
doesn't seem like all that much overhead...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 19:00:12 +0200, Fabien COELHO wrote:
> 
> >off:
> >
> >$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 
> >200
> >number of skipped transactions: 1345 (6.246 %)
> >
> >on:
> >
> >$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 
> >200
> >number of skipped transactions: 1 (0.005 %)
> 
> >That machine is far from idle right now, so the noise is pretty high.
> 
> What is the OS and FS? Could it be XFS?

That's linux v3.17-rc2 + ext4.

> >But rather nice initial results.
> 
> Indeed, I can confirm:
> 
> I did 5000s 25tps tests:
>  - Off: 8002 transactions lost (6.3%)
>  - On: 158 transactions "lost" (0.12%).
> 
> Although it is still 13 times larger than the 12 (0.01%) lost with my every
> 0.2s CHECKPOINT hack, it is nevertheless much much better than before!
> 
> The bad news, under pgbench unthrottled load, the tps is divided by 2 (300
> -> 150, could have been worse), *BUT* is also much smoother, the tps is not
> going to 0, but stay in 50-100 range before the next spike.

Yea, I'm not surprised. With a sensible (aka larger) checkpoint_timeout
the performance penalty isn't that big, but it's there. That's why I
think (as mentioned to Heikki nearby) it needs to be combined with
sorting during the checkpoint phase.

> I'm wondering about the order of operations. It seems to me that you sync
> just after giving back a buffer.

Yep. Was just a rather quick patch...

> Maybe it would be better to pipeline it,
> that is something like:
> 
>   round 0:
> send buffers 0
> sleep?
> 
>   round N:
> sync buffers N-1
> send buffers N
> sleep?
> 
>   final N sync:
> sync buffer N

Yes, I think we're going to need to leave a bit more room for write
combining and such here. But I think it's going to be better to issue
flushes for several buffers together - just not after each write(). To
be really beneficial it needs sorted output though.

> I have not found how to control the checkpoint pacing interval, if there is
> such a thing. With a 200ms lag limit on pgbench, it would be nice if it is
> less than 200ms.

Not sure what you mean.

> I found this old thread "Add basic checkpoint sync spreading" by Greg Smith
> and Simon Riggs, dating from 2010:
> http://www.postgresql.org/message-id/4ce07548.4030...@2ndquadrant.com
> https://commitfest.postgresql.org/action/patch_view?id=431 which ends up
> "returned with feedback".

I didn't really like the unapplied remainder of what was proposed in
there.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Fabien COELHO



[...]


Yeah, something like that. I don't think it would be necessary to set 
statement_timeout, you can inject that in your script or postgresql.conf if 
you want. I don't think aborting a transaction that's already started is 
necessary either. You could count it as LATE, but let it finish first.


I've implemented something along these simplified lines. The latency is 
not limited as such, but slow (over the limit) queries are counted and 
reported.


--
Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..96e5fb9 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64		latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
+	int64		latency_late; /* late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
+	int64		latency_late;
 } TResult;
 
 /*
@@ -367,6 +383,10 @@ usage(void)
 		 "  -f, --file=FILENAME  read transaction script from FILENAME\n"
 		   "  -j, --jobs=NUM   number of threads (default: 1)\n"
 		   "  -l, --logwrite transaction times to log file\n"
+		   "  -L, --limit=NUM  count transactions lasting more than NUM ms.\n"
+		   "   under throttling (--rate), transactions behind schedule\n"
+		   "   more than NUM ms are skipped, and those finishing more\n"
+		   "   than NUM ms after their scheduled start are counted.\n"
 		   "  -M, --protocol=simple|extended|prepared\n"
 		   "   protocol for submitting queries (default: simple)\n"
 		   "  -n, --no-vacuum  do not run VACUUM before tests\n"
@@ -1016,6 +1036,24 @@ top:
 
 		thread->throttle_trigger += wait;
 
+		if (latency_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - latency_limit)
+			{
+/* if too far behind, this slot is skipped, and we
+ * iterate till the next nearly on time slot.
+ */
+int64 wait = (int64) (throttle_delay *
+	1.00055271703 * -log(getrand(thread, 1, 1) / 1.0));
+thread->throttle_trigger += wait;
+thread->throttle_latency_skipped ++;
+			}
+		}
+
 		st->until = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
@@ -1080,13 +1118,17 @@ top:
 			thread->exec_count[cnum]++;
 		}
 
-		/* transaction finished: record latency under progress or throttling */
-		if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+		/* transaction finished: record latency under progress or throttling,
+		 * or if latency limit is set
+		 */
+		if ((progress || throttle_delay || latency_limit) &&
+			commands[st->state + 1] == NULL)
 		{
 			instr_time	diff;
-			int64		latency;
+			int64		latency, now;
 
 			INSTR_TIME_SET_CURRENT(diff);
+			now = INSTR_TIME_GET_MICROSEC(diff);
 			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
 			latency = INSTR_TIME_GET_MICROSEC(diff);
 			st->txn_latencies += latency;
@@ -1099,6 +1141,19 @@ top:
 			 * would take 256 hours.
 			 */
 			st->txn_sqlats += latency * latency;
+
+			/* record over the limit transactions if needed.
+			 */
+			if (latency_limit)
+			{
+/* Under throttling, late means wrt to the initial schedule,
+ * not the actual transaction start
+ */
+if (throttle_delay)
+	latency = now - thread->throttle_trigger;
+if (latency > latency_limit)
+	thread->latency_late++;
+			}
 		}
 
 		/*
@@ -1294,7 +1349,7 @@ top:
 	}
 
 	/* Record transaction start time under logging, progress or throttling */
-	if ((logfile || progress || throttle_delay) && st->state == 0)
+	if ((logfile || progress || throttle_delay || latency_limit) && st->state == 0)
 		INSTR_TIME_SET_CURRENT(st->txn_begin);
 
 	/* Record statement start time if per-command latencies are requested */
@@ -2351,7 +2406,8 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads,

Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-27 Thread Thomas Munro
On 27 August 2014 17:18, Alvaro Herrera  wrote:
> Thomas Munro wrote:
>> On 25 August 2014 02:57, Alvaro Herrera  wrote:
>> > Thomas Munro wrote:
>> >> The difficulty of course will be testing all these racy cases 
>> >> reproducibly...
>> >
>> > Does this help?
>> > http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com
>> > The useful trick there is forcing a query to get its snapshot and then
>> > go to sleep before actually doing anything, by way of an advisory lock.
>>
>> Yes it does, thanks Alvaro and Craig.  I think the attached spec
>> reproduces the problem using that trick, ie shows NOWAIT blocking,
>> presumably in EvalPlanQualFetch (though I haven't stepped through it
>> with a debugger yet).  I'm afraid I'm out of Postgres hacking cycles
>> for a few days, but next weekend I should have a new patch that fixes
>> this by teaching EvalPlanQualFetch about wait policies, with isolation
>> tests for NOWAIT and SKIP LOCKED.
>
> Hmm, http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com

Thanks, I hadn't seen this, I should have checked the archives better.
I have actually already updated my patch to handle EvalPlanQualFetch
with NOWAIT and SKIP LOCKED with isolation specs, see attached.  I
will compare with Craig's and see if I screwed anything up... of
course I am happy to merge and submit a new patch on top of Craig's if
it's going to be committed.

I haven't yet figured out how to get into a situation where
heap_lock_updated_tuple_rec waits.

Best regards,
Thomas Munro
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 231dc6a..0469705 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -45,7 +45,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressioncount | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
 [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
-[ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT ] [...] ]
+[ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
 
@@ -1283,7 +1283,7 @@ FETCH { FIRST | NEXT } [ count ] {
 The locking clause has the general form
 
 
-FOR lock_strength [ OF table_name [, ...] ] [ NOWAIT ]
+FOR lock_strength [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ]
 
 
 where lock_strength can be one of
@@ -1359,11 +1359,17 @@ KEY SHARE
 

 To prevent the operation from waiting for other transactions to commit,
-use the NOWAIT option.  With NOWAIT, the statement
-reports an error, rather than waiting, if a selected row
-cannot be locked immediately.  Note that NOWAIT applies only
-to the row-level lock(s) — the required ROW SHARE
-table-level lock is still taken in the ordinary way (see
+use either the NOWAIT or SKIP LOCKED
+option.  With NOWAIT, the statement reports an error, rather
+than waiting, if a selected row cannot be locked immediately.
+With SKIP LOCKED, any selected rows that cannot be
+immediately locked are skipped.  Skipping locked rows provides an
+inconsistent view of the data, so this is not suitable for general purpose
+work, but can be used to avoid lock contention with multiple consumers
+accessing a queue-like table.  Note that NOWAIT
+and SKIP LOCKED apply only to the row-level lock(s)
+— the required ROW SHARE table-level lock is
+still taken in the ordinary way (see
 ).  You can use
 
 with the NOWAIT option first,
@@ -1386,14 +1392,14 @@ KEY SHARE

 

-Multiple locking
-clauses can be written if it is necessary to specify different locking
-behavior for different tables.  If the same table is mentioned (or
-implicitly affected) by more than one locking clause,
-then it is processed as if it was only specified by the strongest one.
-Similarly, a table is processed
-as NOWAIT if that is specified in any of the clauses
-affecting it.
+Multiple locking clauses can be written if it is necessary to specify
+different locking behavior for different tables.  If the same table is
+mentioned (or implicitly affected) by more than one locking clause, then
+it is processed as if it was only specified by the strongest one.
+Similarly, a table is processed as NOWAIT if that is specified
+in any of the clauses affecting it.  Otherwise, it is processed
+as SKIP LOCKED if that is specified in any of the
+clauses affecting it.

 

@@ -1930,9 +1936,9 @@ SELECT distributors.* WHERE distributors.name = 'Westward';
 PostgreSQL allows it in any SELECT
 query as well as in sub-SELECTs, but this is an extension.
 The FOR NO KEY UPDATE, FOR SHARE and
-FOR KEY SHARE variants,
-as well as the NOWAIT option,
-do not appear in the standard.
+FOR KEY SHARE variants, as well as the NOWAIT
+and SKIP LOCKED options, do not appear in the
+standard.

Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Fabien COELHO



off:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
number of skipped transactions: 1345 (6.246 %)

on:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
number of skipped transactions: 1 (0.005 %)



That machine is far from idle right now, so the noise is pretty high.


What is the OS and FS? Could it be XFS?


But rather nice initial results.


Indeed, I can confirm:

I did 5000s 25tps tests:
 - Off: 8002 transactions lost (6.3%)
 - On: 158 transactions "lost" (0.12%).

Although it is still 13 times larger than the 12 (0.01%) lost with my 
every 0.2s CHECKPOINT hack, it is nevertheless much much better than 
before!


The bad news, under pgbench unthrottled load, the tps is divided by 2 (300 
-> 150, could have been worse), *BUT* is also much smoother, the tps is 
not going to 0, but stay in 50-100 range before the next spike.


I'm wondering about the order of operations. It seems to me that you sync 
just after giving back a buffer. Maybe it would be better to pipeline it, 
that is something like:


  round 0:
send buffers 0
sleep?

  round N:
sync buffers N-1
send buffers N
sleep?

  final N sync:
sync buffer N

I have not found how to control the checkpoint pacing interval, if there 
is such a thing. With a 200ms lag limit on pgbench, it would be nice if it 
is less than 200ms.


I found this old thread "Add basic checkpoint sync spreading" by Greg 
Smith and Simon Riggs, dating from 2010: 
http://www.postgresql.org/message-id/4ce07548.4030...@2ndquadrant.com 
https://commitfest.postgresql.org/action/patch_view?id=431 which ends up 
"returned with feedback".


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Andres Freund
On 2014-08-26 23:04:48 -0500, Jim Nasby wrote:
> On 8/26/14, 8:45 PM, Michael Paquier wrote:
> >Hi all,
> >
> >As mentioned here, we support multiple logging format:
> >http://www.postgresql.org/docs/devel/static/runtime-config-logging.html
> >Now what about a json format logging with one json object per log entry?
> >
> >A single json entry would need more space than a csv one as we need to
> >track the field names with their values. Also, there is always the
> >argument that if an application needs json-format logs, it could use
> >csvlog on Postgres-side and do the transformation itself. But wouldn't
> >it be a win for application or tools if such an option is available
> >in-core?
> >
> >Note that I am not planning to work on that in a close future, but it
> >would be a good TODO item for beginners if it is worth pursuing.
> 
> Perhaps instead of doing this in-core it would be better to make log handling 
> more extensible? I'm thinking add a specific "binary" format and an external 
> tool that can parse that and do whatever the user wants with it. That means 
> we don't have to keep adding more complexity to the internal log handling 
> (which already has the risk of being a bottleneck), while allowing maximum 
> user flexibility.

There's a logging hook. Most of this should be doable from there.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Jim Nasby

On 8/27/14, 2:23 AM, Heikki Linnakangas wrote:

Does this make sense? In essence, make the relations work like PL/pgSQL 
variables do. If you squint a little, the new/old relation is a variable from 
the function's point of view, and a parameter from the planner/executor's point 
of view. It's just a variable/parameter that holds a set of tuples, instead of 
a single Datum.


Something to keep in mind is that users will definitely think about NEW/OLD as 
tables. I suspect that it won't be long after release before someone asks why 
they can't create an index on it. :)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Jim Nasby

On 8/26/14, 8:45 PM, Michael Paquier wrote:

Hi all,

As mentioned here, we support multiple logging format:
http://www.postgresql.org/docs/devel/static/runtime-config-logging.html
Now what about a json format logging with one json object per log entry?

A single json entry would need more space than a csv one as we need to
track the field names with their values. Also, there is always the
argument that if an application needs json-format logs, it could use
csvlog on Postgres-side and do the transformation itself. But wouldn't
it be a win for application or tools if such an option is available
in-core?

Note that I am not planning to work on that in a close future, but it
would be a good TODO item for beginners if it is worth pursuing.


Perhaps instead of doing this in-core it would be better to make log handling more 
extensible? I'm thinking add a specific "binary" format and an external tool 
that can parse that and do whatever the user wants with it. That means we don't have to 
keep adding more complexity to the internal log handling (which already has the risk of 
being a bottleneck), while allowing maximum user flexibility.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> On 08/27/2014 04:20 PM, Andres Freund wrote:
> >On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> >>>I think a somewhat smarter version of the explicit flushes in the
> >>>hack^Wpatch I posted nearby is going to more likely to be successful.
> >>
> >>
> >>That path is "dangerous" (as in, may not work as intended) if the
> >>filesystem doesn't properly understand range flushes (ehem, like
> >>ext3).
> >
> >The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
> >guaranteeing durability. And - afaik - not implemented in a file system
> >specific manner. It just initiates writeback for individual pages. It
> >doesn't cause barrier, journal flushes or anything to be issued. That's
> >still done by the fsync() later.
> >
> >The big disadvantage is that it's an OS-specific solution, but I don't
> >think we're going to find anything that isn't in this area.
> 
> I've been thinking for a long time that we should interleave the writes and
> the fsyncs. That still forces up to 1GB of dirty buffers to disk at once,
> causing a spike, but at least not more than that.

I think there are considerable benefits to sorting checkpoint io by file
and offset in that file. Obviously the likelihood of sequential IO is
higher; but there's also less chance that other processes write out
dirty buffers that have to be flushed out by the fsync() in a drive-by
manner.

I don't think it's good enough to solve the problem Fabien is talking
about though. 1GB is a heck of a lot of IO to submit at once. That'll
cause latency issues unless you have a write-back controller with more
than 1GB of cache.

So I think we need both: control over the amount of dirty data in the
kernel *and* sorted writeouts. To the point that I've been tinkering
with converting buftable.c into a radix tree. That'd allow us to
efficiently scan all buffers of a filenode in order. Useful for
checkpoints, but also for lots of other things. Unfortunately our buffer
tags are freakishly huge, making the worst case memory requirements and
the depth of the tree quite bad.

> A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
> them out in order 
> (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
> The performance impact of that was inconclusive, but one thing that it
> allows nicely is to interleave the fsyncs, so that you write all the buffers
> for one file, then fsync it, then next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly large
> amount of memory, and making a large allocation in the checkpointer might
> cause an out-of-memory, which would be bad.
> 
> I don't think anyone's seriously worked on this area since. If the impact on
> responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.

It's a major problem imo.

What I'd been thinking of is to checkpoint writeout in batches. Collect
100k buffers, sort them, write them out. Go to the next 100k.
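
To make the batching idea concrete, a standalone sketch (not from any posted patch) of sorting one batch of dirty-buffer tags by tablespace, relation, fork and block before writing; the struct below is a simplified stand-in for a real buffer tag:

#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    uint32_t tblspc;        /* tablespace */
    uint32_t relnode;       /* relation file node */
    uint32_t forknum;       /* main/fsm/vm fork */
    uint32_t blocknum;      /* block within the relation */
    int      buf_id;        /* which shared buffer holds the page */
} CkptSortItem;

static int
ckpt_buforder_cmp(const void *pa, const void *pb)
{
    const CkptSortItem *a = pa;
    const CkptSortItem *b = pb;

    if (a->tblspc != b->tblspc)
        return a->tblspc < b->tblspc ? -1 : 1;
    if (a->relnode != b->relnode)
        return a->relnode < b->relnode ? -1 : 1;
    if (a->forknum != b->forknum)
        return a->forknum < b->forknum ? -1 : 1;
    if (a->blocknum != b->blocknum)
        return a->blocknum < b->blocknum ? -1 : 1;
    return 0;
}

/* sort one batch; the caller then writes the buffers in this order and can
 * issue flush hints / fsyncs per file as it crosses file boundaries */
static void
sort_checkpoint_batch(CkptSortItem *items, size_t nitems)
{
    qsort(items, nitems, sizeof(CkptSortItem), ckpt_buforder_cmp);
}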

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I bet it's not fundamentally changing the number of transactions that
don't make the deadline - there are more than enough dirty buffers in one
1GB segment to cause issues. But I think it might already be a
significant benefit for peak throughput *and* latency if you combine it
with my approach of initiating writeout to disk during the whole
sync. My patch can slow things down considerably in the worst case by
causing superfluous random IO, which your patch should alleviate.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 04:20 PM, Andres Freund wrote:

On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is going to more likely to be successful.



That path is "dangerous" (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).


The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
guaranteeing durability. And - afaik - not implemented in a file system
specific manner. It just initiates writeback for individual pages. It
doesn't cause barrier, journal flushes or anything to be issued. That's
still done by the fsync() later.

The big disadvantage is that it's an OS-specific solution, but I don't
think we're going to find anything that isn't in this area.
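
To make that concrete, a minimal Linux-only sketch of the write / flush-hint / fsync-later sequence (not the posted patch, just the system calls involved; the function names are made up and the file descriptor is assumed to be opened elsewhere):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int
write_block_with_flush_hint(int fd, const char *buf, size_t len, off_t offset)
{
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;

    /*
     * Ask the kernel to start writeback of just this range now, so dirty
     * data does not pile up until the final fsync().  This gives no
     * durability guarantee by itself.
     */
    return sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
}

static int
checkpoint_sync_file(int fd)
{
    /* durability still comes from the fsync issued at the end */
    return fsync(fd);
}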


I've been thinking for a long time that we should interleave the writes 
and the fsyncs. That still forces up to 1GB of dirty buffers to disk at 
once, causing a spike, but at least not more than that. Also, the 
scheduling of a spread checkpoint is currently a bit bogus; we don't 
take into account the time needed for the fsync phase.


A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and 
write them out in order 
(http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp). 
The performance impact of that was inconclusive, but one thing that it 
allows nicely is to interleave the fsyncs, so that you write all the 
buffers for one file, then fsync it, then next file and so on. IIRC the 
biggest worry with that patch was that sorting the buffers requires a 
fairly large amount of memory, and making a large allocation in the 
checkpointer might cause an out-of-memory, which would be bad.


I don't think anyone's seriously worked on this area since. If the 
impact on responsiveness or performance is significant, I'm pretty sure 
the OOM problem could be alleviated somehow.


For the kicks, I wrote a quick & dirty patch for interleaving the 
fsyncs, see attached. It works by repeatedly scanning the buffer pool, 
writing buffers belonging to a single relation segment at a time. I 
would be interested to hear how this performs in your test case.


- Heikki

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 938c554..0f2e4e0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1217,6 +1217,12 @@ BufferSync(int flags)
 	int			num_to_write;
 	int			num_written;
 	int			mask = BM_DIRTY;
+	RelFileNode target_rnode = { 0, 0, 0 };
+	ForkNumber	target_forkNum = InvalidForkNumber;
+	int			target_segno = 0;
+	bool		has_target = false;
+	int			outer_num_to_scan = 0;
+	int			outer_buf_id = 0;
 
 	/* Make sure we can handle the pin inside SyncOneBuffer */
 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1281,10 +1287,30 @@ BufferSync(int flags)
 	buf_id = StrategySyncStart(NULL, NULL);
 	num_to_scan = NBuffers;
 	num_written = 0;
-	while (num_to_scan-- > 0)
+
+	for (;;)
 	{
 		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
 
+		if (num_to_scan == 0)
+		{
+			if (has_target)
+			{
+/*
+ * We have finished writing out buffers belonging to this
+ * relation segment. fsync it now.
+ */
+mdsync_seg(target_rnode, target_forkNum, target_segno);
+
+/* continue the outer scan where we left */
+num_to_scan = outer_num_to_scan;
+buf_id = outer_buf_id;
+has_target = false;
+			}
+			else
+break; /* all done! */
+		}
+
 		/*
 		 * We don't need to acquire the lock here, because we're only looking
 		 * at a single bit. It's possible that someone else writes the buffer
@@ -1299,7 +1325,36 @@ BufferSync(int flags)
 		 */
 		if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+			RelFileNode this_rnode = bufHdr->tag.rnode;
+			ForkNumber	this_forkNum = bufHdr->tag.forkNum;
+			int			this_segno = bufHdr->tag.blockNum / (RELSEG_SIZE / BLCKSZ);
+			bool		skip = false;
+
+			if (has_target)
+			{
+if (!RelFileNodeEquals(this_rnode, target_rnode) ||
+	this_forkNum != target_forkNum ||
+	this_segno != target_segno)
+{
+	/*
+	 * This buffer belongs to another relation than the one
+	 * we're targeting right now. We'll revisit it later.
+	 */
+	skip = true;
+}
+			}
+			else
+			{
+target_rnode = this_rnode;
+target_forkNum = this_forkNum;
+target_segno = this_segno;
+has_target = true;
+/* remember where we left the outer scan */
+outer_buf_id = buf_id;
+outer_num_to_scan = num_to_scan;
+			}
+
+			if (!skip && SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
 			{
 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
 BgWriterStats.m_buf_written_checkpoints++;
@@ -1328,6 +1383,7 @@ BufferSync(int flags)
 
 		if (++buf_id >= NBuffers)
 			buf_id = 0;
+		num_to_scan--;
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/st

Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-27 Thread Alvaro Herrera
Thomas Munro wrote:
> On 25 August 2014 02:57, Alvaro Herrera  wrote:
> > Thomas Munro wrote:
> >> The difficulty of course will be testing all these racy cases 
> >> reproducibly...
> >
> > Does this help?
> > http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com
> > The useful trick there is forcing a query to get its snapshot and then
> > go to sleep before actually doing anything, by way of an advisory lock.
> 
> Yes it does, thanks Alvaro and Craig.  I think the attached spec
> reproduces the problem using that trick, ie shows NOWAIT blocking,
> presumably in EvalPlanQualFetch (though I haven't stepped through it
> with a debugger yet).  I'm afraid I'm out of Postgres hacking cycles
> for a few days, but next weekend I should have a new patch that fixes
> this by teaching EvalPlanQualFetch about wait policies, with isolation
> tests for NOWAIT and SKIP LOCKED.

Hmm, http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com


-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160


Stephen Frost wrote:

> To try to clarify that a bit, as it comes across as rather opaque even
> on my re-reading, consider a case where you can't have the
> "credit_card_number" field ever exported to an audit or log file, but
> you're required to log all other changes to a table.  Then consider that
> such a situation extends to individual INSERT or UPDATE commands- you
> need the command logged, but you can't have the contents of that column
> in the log file.

Perhaps you need a better example. Storing raw credit cards in the database 
is a bad idea (and potential PCI violation); audit/log files are only one 
of the many ways things can leak out. Encrypting sensitive columns is a 
solution that solves your auditing problem, and works on all current versions 
of Postgres. :)

> Our current capabilities around logging and auditing are dismal

No arguments there.

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201408271200
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAlP+AKgACgkQvJuQZxSWSsjf7gCg00BwRbwRi/UPrHBs1RdfWX/I
TRsAn2CDrG/ycetKOQFbn/4rnSSYPz9j
=Ju0B
-END PGP SIGNATURE-




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Missing plpgsql.o Symbols on OS X

2014-08-27 Thread David E. Wheeler
Hackers,

I’m trying to build Pavel’s plpgsql_check against the 9.4 beta on OS X 10.9, 
but get these errors:

make
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv  
-I/usr/local/pgsql/lib/pgxs/src/makefiles/../../src/pl/plpgsql/src -bundle 
-multiply_defined suppress -o plpgsql_check.so plpgsql_check.o 
-L/usr/local/pgsql/lib -L/usr/local/lib  -L/usr/local/lib 
-Wl,-dead_strip_dylibs   -bundle_loader /usr/local/pgsql/bin/postgres
Undefined symbols for architecture x86_64:
 "_exec_get_datum_type", referenced from:
 _check_target in plpgsql_check.o
 "_plpgsql_build_datatype", referenced from:
 _check_stmt in plpgsql_check.o
 "_plpgsql_compile", referenced from:
 _check_plpgsql_function in plpgsql_check.o
 "_plpgsql_parser_setup", referenced from:
 _prepare_expr in plpgsql_check.o
 "_plpgsql_stmt_typename", referenced from:
 _put_error in plpgsql_check.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [plpgsql_check.so] Error 1

Which is odd, because plpgsql_check.c includes plpgsql.h, and those symbols do 
appear to be in plpgsql.so:

$ nm /usr/local/pgsql/lib/plpgsql.so | grep _exec_get_datum_type
f110 T _exec_get_datum_type
f380 T _exec_get_datum_type_info

So, uh, what gives? Do I need to do something extra to get it to properly find 
plpgsql.so?

Thanks,

David



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-27 Thread Arthur Silva
On 26/08/2014 09:16, "Fujii Masao"  wrote:
>
> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed 
wrote:
> > Hello,
> > Thank you for comments.
> >
> >>Could you tell me where the patch for "single block in one run" is?
> > Please find attached patch for single block compression in one run.
>
> Thanks! I ran the benchmark using pgbench and compared the results.
> I'd like to share the results.
>
> [RESULT]
> Amount of WAL generated during the benchmark. Unit is MB.
>
>             Multiple    Single
> off         202.0       201.5
> on          6051.0      6053.0
> pglz        3543.0      3567.0
> lz4         3344.0      3485.0
> snappy      3354.0      3449.5
>
> Latency average during the benchmark. Unit is ms.
>
>             Multiple    Single
> off         19.1        19.0
> on          55.3        57.3
> pglz        45.0        45.9
> lz4         44.2        44.7
> snappy      43.4        43.3
>
> These results show that FPW compression is really helpful for decreasing
> the WAL volume and improving the performance.
>
> The compression ratio by lz4 or snappy is better than that by pglz. But
> it's difficult to conclude which of lz4 or snappy is better, according to these
> results.
>
> ISTM that compression-of-multiple-pages-at-a-time approach can compress
> WAL more than compression-of-single-... does.
>
> [HOW TO BENCHMARK]
> Create pgbench database with scale factor 1000.
>
> Change the data type of the column "filler" on each pgbench table
> from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
> gen_random_uuid() in order to avoid empty column, e.g.,
>
>  alter table pgbench_accounts alter column filler type text using
> gen_random_uuid()::text
>
> After creating the test database, run the pgbench as follows. The
> number of transactions executed during benchmark is almost same
> between each benchmark because -R option is used.
>
>   pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared
>
> checkpoint_timeout is 5min, so it's expected that checkpoint was
> executed at least two times during the benchmark.
>
> Regards,
>
> --
> Fujii Masao
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

It'd be interesting to check avg cpu usage as well.


Re: [HACKERS] [BUGS] Re: BUG #9555: pg_dump for tables with inheritance recreates the table with the wrong order of columns

2014-08-27 Thread Tom Lane
Bruce Momjian  writes:
> I looked at this issue from March and I think we need to do something. 
> In summary, the problem is that tables using inheritance can be dumped
> and reloaded with columns in a different order from the original
> cluster.

Yeah ... this has been a well-understood issue for a dozen years, and
pg_dump goes to considerable trouble to get it right.

> I think we have several options:

> 1.  document this behavior

That one.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 06:08 PM, Fabien COELHO wrote:

I've submitted this "simple" lag limit version because being able to
measure quickly and simply (un)responsiveness seems like a good idea,
especially given the current state of things.


Ok, fair enough. I don't think doing a "latency limit" would be significantly
harder, but I can't force you. I'll mark this as Returned with Feedback then.


Hmmm. I can distinguish just the two cases. Rather mark it as "waiting on
author", I may give it a go.


Feel free to mark it as such if you think you can get a new version 
posted in the next few days.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Fabien COELHO



As for an actual "latency limit" under throttling, this is significantly
more tricky and invasive to implement... ISTM that it would mean:
[...] 


Yeah, something like that. I don't think it would be necessary to set 
statement_timeout, you can inject that in your script or postgresql.conf if 
you want. I don't think aborting a transaction that's already started is 
necessary either. You could count it as LATE, but let it finish first.


If you remove all difficult cases from the spec, it is obviously much 
simpler to implement:-) It seems that your simplified version of "latency 
limit" would be just to distinguish LATE from ONTIME among the committed 
ones, compared to the current version, and not to actually limit the 
latency, which is the tricky part.



I've submitted this "simple" lag limit version because being able to
measure quickly and simply (un)responsiveness seems like a good idea,
especially given the current state of things.


Ok, fair enough. I don't think doing a "latency limit" would be significantly 
harder, but I can't force you. I'll mark this as Returned with Feedback then.


Hmmm. I can distinguish just the two cases. Rather mark it as "waiting on 
author", I may give it a go.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila  wrote:
> Today, while working on updating the patch to improve locking
> I found that as now we are going to have a new process, we need
> a separate latch in StrategyControl to wakeup that process.
> Another point is I think it will be better to protect
> StrategyControl->completePasses with victimbuf_lck rather than
> freelist_lck, as when we are going to update it we will already be
> holding the victimbuf_lck and it doesn't make much sense to release
> the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable.  I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer.  I don't see much point
in fine-tuning the locking regimen until that's done.
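
For readers following along, the division of labour being proposed would look
roughly like the sketch below; this is illustrative only -- field and lock
names follow the thread rather than the eventual patch, and PostgreSQL's
slock_t and Latch types are assumed:

    typedef struct
    {
        /* freelist_lck: protects the freeNext links and most strategy state */
        slock_t     freelist_lck;
        int         firstFreeBuffer;
        int         lastFreeBuffer;
        Latch      *bgreclaimerLatch;   /* wakes the new reclaim process */

        /* victimbuf_lck: protects only the clock-sweep position
         * (plus, per the suggestion above, completePasses) */
        slock_t     victimbuf_lck;
        int         nextVictimBuffer;
        uint32      completePasses;
    } StrategyControlSketch;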

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane  wrote:
> Amit Kapila  writes:
>> On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas  wrote:
 I think you should get rid of BufFreelistLock completely and just
 decide that freelist_lck will protect the freeNext links, plus
 everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
 will protect nextVictimBuffer and nothing else.
>
>> Another point is I think it will be better to protect
>> StrategyControl->completePasses with victimbuf_lck rather than
>> freelist_lck, as when we are going to update it we will already be
>> holding the victimbuf_lck and it doesn't make much sense to release
>> the victimbuf_lck and reacquire freelist_lck to update it.
>
> I'm rather concerned by this cavalier assumption that we can protect
> fields a,b,c with one lock and fields x,y,z in the same struct with some
> other lock.
>
> A minimum requirement for that to work safely at all is that the fields
> are of atomically fetchable/storable widths; which might be okay here
> but it's a restriction that bears thinking about (and documenting).
>
> But quite aside from safety, the fields are almost certainly going to
> be in the same cache line which means contention between processes that
> are trying to fetch or store them concurrently.  For a patch whose sole
> excuse for existence is to improve performance, that should be a very
> scary concern.
>
> (And yes, I realize these issues already affect the freelist.  Perhaps
> that's part of the reason we have performance issues with it.)

False sharing is certainly a concern that has crossed my mind while
looking at Amit's work, but the performance numbers he's posted
upthread are stellar.  Maybe we can squeeze some additional
performance out of this by padding out the cache lines, but it's
probably minor compared to the gains he's already seeing.  I think we
should focus on trying to lock in those gains, and then we can
consider what further things we may want to do after that.  If it
turns out that structure-padding is among those things, that's easy
enough to do as a separate patch.
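
Should padding turn out to be worthwhile later, the fix is typically small; a
sketch, assuming a 64-byte cache line and purely illustrative field names:

    #define CACHE_LINE_SIZE 64              /* assumption; typical on x86-64 */

    struct padded_counters
    {
        volatile int counter_a;                          /* written by process A */
        char         pad[CACHE_LINE_SIZE - sizeof(int)]; /* keep B off A's line  */
        volatile int counter_b;                          /* written by process B */
    };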

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hardening pg_upgrade

2014-08-27 Thread Robert Haas
On Wed, Aug 27, 2014 at 10:13 AM, Bruce Momjian  wrote:
> On Wed, Aug 27, 2014 at 09:54:11AM -0400, Robert Haas wrote:
>> Having said that, there are obviously advantages for our users if we
>> don't get too crazy about requiring that.  I've used products in the
>> past where to get from version 3 to version 11 you have to upgrade
>> from 3 to 5, then 5 to 7, then 7 to 9, and then 9 to 11.  That's
>> somewhat understandable from the vendor's point of view, but it's not
>> a lot of fun, and I think we should definitely avoid imposing those
>> kinds of requirements on our users.
>>
>> What we're talking about here is much milder than that.  For the
>> pg_upgrade case, you can upgrade from 8.3 to any of 8.4, 9.0, 9.1,
>> 9.2, 9.3, and 9.4.  You only need to do a two-step upgrade if you want
>> to leapfrog more than 6 major release versions.  That seems like a
>> wide-enough window that it shouldn't inconvenience many people.  For
>> the pg_dump case, you can upgrade from 7.2 or 7.3 to 7.4, 8.0, 8.1,
>> 8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, or 9.4; that is, 11 or 12 major
>> releases.  The number of people who want to skip more than a dozen
>> releases in a single upgrade should be very small, and we might
> counsel those people that they'd be better off with a step-wise
>> upgrade for other reasons - like the application-level compatibility
>> breaks we've made over the years - anyway.
>
> Two things --- first, removing 8.3 support in pg_upgrade allowed me to
> remove lots of dead code, so it was a win.  Second, I think you need to
> look at the time span from old to new versions to understand if a
> double-step release is reasonable.  If that 3-5-7-9 release step spans
> two years, it is too short --- if it spans 15 years, it is probably fine
> as few people would wait 15 years to upgrade.

Right, I agree with all of that and was not intending to dispute any of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 8:14 AM, Fujii Masao  wrote:
> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed  wrote:
>> Hello,
>> Thank you for comments.
>>
>>>Could you tell me where the patch for "single block in one run" is?
>> Please find attached patch for single block compression in one run.
>
> Thanks! I ran the benchmark using pgbench and compared the results.
> I'd like to share the results.
>
> [RESULT]
> Amount of WAL generated during the benchmark. Unit is MB.
>
>            Multiple    Single
> off           202.0     201.5
> on           6051.0    6053.0
> pglz         3543.0    3567.0
> lz4          3344.0    3485.0
> snappy       3354.0    3449.5
>
> Latency average during the benchmark. Unit is ms.
>
>            Multiple    Single
> off            19.1      19.0
> on             55.3      57.3
> pglz           45.0      45.9
> lz4            44.2      44.7
> snappy         43.4      43.3
>
> These results show that FPW compression is really helpful for decreasing
> the WAL volume and improving the performance.

Yeah, those look like good numbers.  What happens if you run it at
full speed, without -R?
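
For whoever reruns this, one way to measure the WAL volume around an
unthrottled run is sketched below (pre-9.5 function names; the literal passed
to pg_xlog_location_diff is the value noted before the run):

    -- in psql, before the run: note the current insert location
    SELECT pg_current_xlog_insert_location();

    -- from the shell: same command as above, minus -R
    --   pgbench -c 64 -j 64 -r -T 900 -M prepared

    -- in psql, after the run: amount of WAL generated during the run
    SELECT pg_size_pretty(
             pg_xlog_location_diff(pg_current_xlog_insert_location(),
                                   'X/XXXXXXXX'));  -- paste the noted value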

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello,


If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?


I can try, when I have finished with the current round of testing.

Note that, as Andres put it, it currently "sucks". Having to tinker with 
linux kernel parameters just to handle a small load without being offline 
10% of the time does not look very good, so even if it works, ISTM that a 
pg side solution is desirable.
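
For reference, the kernel-side backlog those tunables govern can be watched
directly while a test runs, e.g.:

  # dirty and writeback page totals, refreshed every 2 seconds
  watch -n 2 'grep -E "^(Dirty|Writeback):" /proc/meminfo'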


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [BUGS] Re: BUG #9555: pg_dump for tables with inheritance recreates the table with the wrong order of columns

2014-08-27 Thread Bruce Momjian
On Mon, Mar 17, 2014 at 07:12:12PM -0400, Noah Misch wrote:
> On Fri, Mar 14, 2014 at 12:33:04PM -0300, Alvaro Herrera wrote:
> > Tom Lane wrote:
> > > Alvaro Herrera  writes:
> > > > I wonder if the real fix here is to have ALTER / INHERIT error out of
> > > > the columns in B are not a prefix of those in A.
> > > 
> > > Years ago, we sweated quite a lot of blood to make these cases work.
> > > I'm not thrilled about throwing away all that effort because one person
> > > doesn't like the behavior.
> 
> Agreed.  That also makes the current pg_dump behavior a bug.  Column order
> matters; pg_dump is failing to recreate a semantically-equivalent database.
> 
> > Hm, well in that case it makes sense to consider the original
> > suggestion: if the columns in the parent are not a prefix of those of
> > the child, use ALTER INHERIT after creating both tables rather than
> > CREATE TABLE INHERITS.
> > 
> > It'd be a lot of new code in pg_dump though.  I am not volunteering ...
> 
> "pg_dump --binary-upgrade" already gets this right.  Perhaps it won't take too
> much code to make dumpTableSchema() reuse that one part of its binary-upgrade
> approach whenever the columns of B are not a prefix of those in A.

[thread moved to hackers]

I looked at this issue from March and I think we need to do something. 
In summary, the problem is that tables using inheritance can be dumped
and reloaded with columns in a different order from the original
cluster.  What is basically happening is that these queries:

CREATE TABLE A(a int, b int, c int);
CREATE TABLE B(a int, c int);
ALTER TABLE A INHERIT B;

cause pg_dump to generate this:

CREATE TABLE b (
a integer,
c integer
);
CREATE TABLE a (
a integer,
b integer,
c integer
)
INHERITS (b);

which issues these warnings when run:

NOTICE:  merging column "a" with inherited definition
NOTICE:  merging column "c" with inherited definition

and produces this table "a":

test2=> \d a
   Table "public.a"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
--> c   | integer |
 b      | integer |

Notice the column reordering.  The logic is that a CREATE TABLE INHERITS
should place the inherited parent columns _first_.  This can't be done
by ALTER TABLE INHERIT because the table might already contain data.

I think we have several options:

1.  document this behavior
2.  have ALTER TABLE INHERIT issue a warning about future reordering
3.  use the pg_dump binary-upgrade code when such cases happen

My crude approach for #3 would be for pg_dump to loop over the columns
and, where pg_attribute.attinhcount == 0, check to see if there is a
matching column name in any inherited table.  Will such tables load fine
because pg_dump binary-upgrade mode doesn't do any data loading?
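
For reference, the binary-upgrade-style output for the example above amounts
to creating both tables with their full column lists and adding the
inheritance link afterwards, which preserves the child's column order; this is
a sketch of the idea rather than exact pg_dump output:

    CREATE TABLE b (
        a integer,
        c integer
    );

    CREATE TABLE a (
        a integer,
        b integer,
        c integer
    );

    ALTER TABLE a INHERIT b;   -- columns matched by name, a's order untouched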

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:32:19 -0400, Aidan Van Dyk wrote:
> On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO  wrote:
> 
> >
> > Hello Andres,
> >
> >  [...]
> >>
> >> I think you're misunderstanding how spread checkpoints work.
> >>
> >
> > Yep, definitely:-) On the other hand I thought I was seeking something
> > "simple", namely correct latency under small load, that I would expect out
> > of the box.
> >
> > What you describe is reasonable, and is more or less what I was hoping
> > for, although I thought that bgwriter was involved from the start and
> > checkpoint would only do what is needed in the end. My mistake.
> >
> >
> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

Well. For one, that's an OS-specific global tunable that requires root to
adjust. For another, we actually do want some buffering: If a backend
writes out a buffer's data itself (happens very frequently) it *should*
get buffered... So I don't think a process independent tunable is going
to do the trick.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 02:37 PM, Fabien COELHO wrote:

As for an actual "latency limit" under throttling, this is significantly
more tricky and invasive to implement... ISTM that it would mean:

   - if the tx is not started and the latency is already consumed, SKIP++.

   - if the tx is after its schedule start time but under latency, then
 start it, and maybe inject a "SET TIMEOUT...".

   - if a tx is being processed but reaches its latency limit (after
 schedule start time), abort it coldly, ROLLBACK++ (well if the tx is
 really started, there could also be shell commands and \set stuff in a
 pgbench script, which means "started" is not really started, so it would
 be INTERRUPT++ if no BEGIN was sent).

   - if a tx is finished but the final commit returned after the latency
 deadline, you cannot abort it anymore but it is late nevertheless,
 LATE++.


Yeah, something like that. I don't think it would be necessary to set 
statement_timeout, you can inject that in your script or postgresql.conf 
if you want. I don't think aborting a transaction that's already started 
is necessary either. You could count it as LATE, but let it finish first.



This is doable but far beyond my current needs. Moreover, I'm not sure
that such a patch would pass because of invasiveness and complexity, so it
could be a total loss of time.


Ok, but *why* are you doing a "lag limit", and not a "latency limit"?


Because it is much simpler (see above) and is enough for testing pg
responsiveness issue, which is my current objective, and models some
client timeout behavior.


Under what circumstances is the lag limit a more useful setting?


It is not "more" useful" per se, it is what I'm using to test pg
unresponsivness with a simple to define and interpret measure wrt
throttling.

If I would do "latency limit" under throttling, it would be (1) more time
to develop, more complex, more invasive in the code (see above, + also the
implementation when not under throttling), (2) more complex to interpret,
with at least 5 possible outcomes (skipped, interrupted, committed on
time, committed but late, aborted), (3) this added information would not
be useful to me.

I've submitted this "simple" lag limit version because being able to
measure quickly and simply (un)responsiveness seems like a good idea,
especially given the current state of things.


Ok, fair enough. I don't think doing a "latency limit" would be 
significantly harder, but I can't force you. I'll mark this as Returned 
with Feedback then.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Aidan Van Dyk
On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO  wrote:

>
> Hello Andres,
>
>  [...]
>>
>> I think you're misunderstanding how spread checkpoints work.
>>
>
> Yep, definitely:-) On the other hand I thought I was seeking something
> "simple", namely correct latency under small load, that I would expect out
> of the box.
>
> What you describe is reasonable, and is more or less what I was hoping
> for, although I thought that bgwriter was involved from the start and
> checkpoint would only do what is needed in the end. My mistake.
>
>
If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?

>
a.


-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a
slave.


Re: [HACKERS] Hardening pg_upgrade

2014-08-27 Thread Bruce Momjian
On Wed, Aug 27, 2014 at 09:54:11AM -0400, Robert Haas wrote:
> Having said that, there are obviously advantages for our users if we
> don't get too crazy about requiring that.  I've used products in the
> past where to get from version 3 to version 11 you have to upgrade
> from 3 to 5, then 5 to 7, then 7 to 9, and then 9 to 11.  That's
> somewhat understandable from the vendor's point of view, but it's not
> a lot of fun, and I think we should definitely avoid imposing those
> kinds of requirements on our users.
> 
> What we're talking about here is much milder than that.  For the
> pg_upgrade case, you can upgrade from 8.3 to any of 8.4, 9.0, 9.1,
> 9.2, 9.3, and 9.4.  You only need to do a two-step upgrade if you want
> to leapfrog more than 6 major release versions.  That seems like a
> wide-enough window that it shouldn't inconvenience many people.  For
> the pg_dump case, you can upgrade from 7.2 or 7.3 to 7.4, 8.0, 8.1,
> 8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, or 9.4; that is, 11 or 12 major
> releases.  The number of people who want to skip more than a dozen
> releases in a single upgrade should be very small, and we might
> council those people that they'd be better off with a step-wise
> upgrade for other reasons - like the application-level compatibility
> breaks we've made over the years - anyway.

Two things --- first, removing 8.3 support in pg_upgrade allowed me to
remove lots of dead code, so it was a win.  Second, I think you need to
look at the time span from old to new versions to understand if a
double-step release is reasonable.  If that 3-5-7-9 release step spans
two years, it is too short --- if it spans 15 years, it is probably fine
as few people would wait 15 years to upgrade.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-27 Thread Alvaro Herrera
Andres Freund wrote:
> On 2014-08-26 22:19:47 -0400, Tom Lane wrote:
> > Andres Freund  writes:

> > I would say that the issue most deserving of performance testing is your
> > sizing of the linear-search array --- it's not obvious that 8 is a good
> > size.
> 
> It's about the size of a cacheline on all common architectures, that's
> how I found it. I don't think it makes a very big difference whether we
> make it 4 or 12, but outside of that range I think it'll be unlikely to
> be beneficial. The regression tests never go above three or four pins or
> so currently, so I think that's a number unlikely to regularly be
> crossed in practice.

FWIW scanning a minmax index will keep three pages pinned IIRC
(metapage, current revmap page, current regular page).

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-08-27 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> Hmm, I'm worried that may be an API contract violation.

> Indeed it is.  You could get away with it if you check the
> EXEC_FLAG_EXPLAIN_ONLY flag before doing anything with visible
> side-effects, but it's still pretty ugly.

Actually, there's another problem there.  What of UPDATE or DELETE with a
LIMIT clause, which is something that seems to be coming down the pike:
https://commitfest.postgresql.org/action/patch_view?id=1550

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication commands and log_statements

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 7:17 AM, Fujii Masao  wrote:
> On Wed, Aug 20, 2014 at 1:14 PM, Michael Paquier
>  wrote:
>> On Wed, Aug 20, 2014 at 2:06 AM, Robert Haas  wrote:
>>>
>>> On Sat, Aug 16, 2014 at 10:27 AM, Amit Kapila 
>>> wrote:
>>> > I think ideally it would have been better if we could have logged
>>> > replication commands under separate log_level, but as still there
>>> > is no consensus on extending log_statement and nobody is even
>>> > willing to pursue, it seems okay to go ahead and log these under
>>> > 'all' level.
>>>
>>> I think the consensus is clearly for a separate GUC.
>>
>> +1.
>
> Okay. Attached is the updated version of the patch which I posted before.
> This patch follows the consensus and adds separate parameter
> "log_replication_command".

Looks fine to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-08-27 Thread Tom Lane
Robert Haas  writes:
> On Mon, Aug 25, 2014 at 8:58 AM, Albe Laurenz  wrote:
>> Reading the code, I noticed that the pushed down UPDATE or DELETE statement 
>> is executed
>> during postgresBeginForeignScan rather than during 
>> postgresIterateForeignScan.
>> It probably does not matter, but is there a reason to do it different from 
>> the normal scan?

> Hmm, I'm worried that may be an API contract violation.

Indeed it is.  You could get away with it if you check the
EXEC_FLAG_EXPLAIN_ONLY flag before doing anything with visible
side-effects, but it's still pretty ugly.
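
For concreteness, the check referred to is the usual test at the top of the
BeginForeignScan callback; a sketch in which everything except the flag test
is illustrative:

    static void
    postgresBeginForeignScan(ForeignScanState *node, int eflags)
    {
        /* EXPLAIN without ANALYZE: do nothing that has visible side effects */
        if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
            return;

        /* ... connection setup etc. would follow here ... */
    }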

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specifying the unit in storage parameter

2014-08-27 Thread Alvaro Herrera
Fujii Masao wrote:
> On Tue, Aug 26, 2014 at 3:27 AM, Alvaro Herrera
>  wrote:
> > Fujii Masao wrote:
> >> On Thu, Aug 21, 2014 at 4:20 PM, Michael Paquier
> >>  wrote:
> >
> >> > Looking at the patch, the parameter "fillfactor" in the category
> >> > RELOPT_KIND_HEAP (the first element in intRelOpts of reloptions.c) is
> >> > not updated with the new field. It is only a one-line change.
> >> > @@ -97,7 +97,7 @@ static relopt_int intRelOpts[] =
> >> > "Packs table pages only to this percentage",
> >> > RELOPT_KIND_HEAP
> >> > },
> >> > -   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100
> >> > +   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100, 0
> >> > },
> >>
> >> Oh, good catch. I wonder why I did such a mistake...
> >
> > Uninitialized elements at end of struct are filled with zeroes.
> 
> Yeah, that's the reason why I could not notice the problem at compile time.

Right -- it's not something the compiler would warn you about.
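
As a miniature illustration of the C rule in play here (a sketch, not project
code): trailing members left out of an initializer are zero-filled.

    struct opt { int def; int min; int max; int flags; };

    /* flags is not mentioned, so it is implicitly initialized to 0 */
    static const struct opt fillfactor_like = { 100, 10, 100 };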

> > We do
> > have other examples of this -- for instance, config_generic in the guc.c
> > tables are almost always only 5 members long even though the struct is
> > quite a bit longer than that.  Most entries do not even have "flags" set.
> 
> So you imply that the trailing zero (which the patch adds as flag)
> in the reloption struct should be dropped?

Not necessarily, because it's harmless.  It's there for purely
aesthetic reasons, so it's your choice whether to add it or not.
Having it there is slightly easier on somebody reading the code,
perhaps.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 8:58 AM, Albe Laurenz  wrote:
> Reading the code, I noticed that the pushed down UPDATE or DELETE statement 
> is executed
> during postgresBeginForeignScan rather than during postgresIterateForeignScan.
> It probably does not matter, but is there a reason to do it different from 
> the normal scan?

Hmm, I'm worried that may be an API contract violation.  ISTM that we
might initialize nodes that we never read from - they can show up in
the EXPLAIN-plan as (never executed) - and things that aren't executed
shouldn't do work, especially work that permanently modifies data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hardening pg_upgrade

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 3:04 PM, Tom Lane  wrote:
> Bruce Momjian  writes:
>> OK, I will move in the direction of removing 8.3 support and use a
>> single query to pull schema information.   I was hesitant to remove 8.3
>> support as I know we have kept pg_dump support all the way back to 7.0,
>> but it seems pg_upgrade need not have the same version requirements.
>
> Not really related, but ... I've been thinking that it's time to rip out
> pg_dump's support for server versions before 7.3 or 7.4.  That would let
> us get rid of a lot of klugy code associated with the lack of schemas
> and dependency info in the older versions.  It's possible that we should
> move the cutoff even further --- I've not looked closely at how much could
> be removed by dropping versions later than 7.3.
>
> Aside from the question of how much old code could be removed, there's the
> salient point of how do we test pg_dump against such old branches?  The
> further back you go the harder it is to even build PG on modern platforms,
> and the less likely it will work (I note for example that pre-8.0
> configure doesn't try to use -fwrapv, let alone some of the other switches
> we've found necessary on recent gcc).  I've usually tested pg_dump patches
> against old servers by running them against builds I have in captivity on
> my old HPPA box ... but once that dies, I'm *not* looking forward to
> trying to rebuild 7.x on my current machines.

I think it's fine for us to start requiring two-step upgrades beyond a
certain point, and I think removing pg_dump support for pre-7.4
versions of the server is very reasonable.  I also think removing
pg_upgrade support for 8.3 is reasonable.  Many products require
multi-step upgrades when crossing multiple release versions, and I
think we can, too.

Having said that, there are obviously advantages for our users if we
don't get too crazy about requiring that.  I've used products in the
past where to get from version 3 to version 11 you have to upgrade
from 3 to 5, then 5 to 7, then 7 to 9, and then 9 to 11.  That's
somewhat understandable from the vendor's point of view, but it's not
a lot of fun, and I think we should definitely avoid imposing those
kinds of requirements on our users.

What we're talking about here is much milder than that.  For the
pg_upgrade case, you can upgrade from 8.3 to any of 8.4, 9.0, 9.1,
9.2, 9.3, and 9.4.  You only need to do a two-step upgrade if you want
to leapfrog more than 6 major release versions.  That seems like a
wide-enough window that it shouldn't inconvenience many people.  For
the pg_dump case, you can upgrade from 7.2 or 7.3 to 7.4, 8.0, 8.1,
8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, or 9.4; that is, 11 or 12 major
releases.  The number of people who want to skip more than a dozen
releases in a single upgrade should be very small, and we might
counsel those people that they'd be better off with a step-wise
upgrade for other reasons - like the application-level compatibility
breaks we've made over the years - anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER SYSTEM RESET?

2014-08-27 Thread Fujii Masao
On Mon, Aug 25, 2014 at 1:34 PM, Amit Kapila  wrote:
> On Wed, Jul 30, 2014 at 9:11 AM, Amit Kapila 
> wrote:
>> I have verified the patch and found that it works well for
>> all scenarios.  A few minor suggestions:
>>
>> 1.
>> !values to the postgresql.auto.conf file.
>> !Setting the parameter to DEFAULT, or using the
>> !RESET variant, removes the configuration entry
>> from
>>
>> It would be better if we can write a separate line for RESET ALL
>> as is written in case of both Alter Database and Alter Role in their
>> respective documentation.
>>
>> 2.
>> ! %type  generic_set set_rest set_rest_more generic_reset
>> reset_rest SetResetClause FunctionSetResetClause
>>
>> Good to break it into 2 lines.
>>
>> 3. I think we can add some text on top of function
>> AlterSystemSetConfigFile() to explain functionality w.r.t reset all.
>
> I have updated the patch to address the above points.
>
> I will mark this patch as "Ready For Committer" as most of the
> review comments were already addressed by Vik and the remaining ones
> I have addressed in the attached patch.

The patch looks good to me. One minor comment is; probably you need to
update the tab-completion code.
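
For anyone following the thread, the behaviour under review boils down to the
following; the parameter names are only examples:

    ALTER SYSTEM SET work_mem = '64MB';  -- writes an entry to postgresql.auto.conf
    ALTER SYSTEM RESET work_mem;         -- removes that entry again (this patch)
    ALTER SYSTEM RESET ALL;              -- removes every entry in postgresql.auto.conf
    SELECT pg_reload_conf();             -- changes take effect on reload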

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:20 AM, Andres Freund  wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>> > I think a somewhat smarter version of the explicit flushes in the
>> > hack^Wpatch I posted nearby is more likely to be successful.
>>
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
> guaranteeing durability. And - afaik - not implemented in a file system
> specific manner. It just initiates writeback for individual pages. It
> doesn't cause barrier, journal flushes or anything to be issued. That's
> still done by the fsync() later.
>
> The big disadvantage is that it's an OS-specific solution, but I don't
> think we're going to find anything that isn't in this area.

I guess it should work then.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Code bug or doc bug?

2014-08-27 Thread David G Johnston
Bruce Momjian wrote
> On Wed, Aug 27, 2014 at 09:05:41AM -0400, Robert Haas wrote:
>> Another idea is to have a command that you can run, while connected to
>> a particular database, that updates the default tablespace for that
>> database without actually moving any data on disk - i.e. it sets
>> pg_database.dattablespace, and then updates every pg_class row where
> reltablespace = 0 to the old default tablespace, and every pg_class row
>> where reltablespace = the new tablespace ID to 0.  Then you can move
>> individual relations afterwards if you feel like it.  But that might
>> still require a lot of locks, and I think we also have a limitation
>> that some relations (the mapped ones?) have to be in the database's
>> default tablespace, which obviously wouldn't work here.
>> 
>> So it's a tricky problem.
> 
> Is there a doc patch to make here?

1. Last sentence change suggestion: "The target tablespace must be empty."

2. Based on Robert's comments it sounds like a "You cannot change the
default tablespace of the current database." comment should be added as
well.

Side note: I have no clue what the "mapped relations" Robert refers to
are...

If the locking problem is unsolvable, which seems to be the only realistic
reason why updating pg_class cannot be done somewhere in the process, could
we make it so that the same physical tablespace location can have multiple
pointers?  The problem here would be that a subsequent move would only grab
those relations that are in the current tablespace by default and would
leave the ones that were present originally - unless they get moved in the
interim to the default tablespace (in this case by changing their oid to 0
manually first).

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Code-bug-or-doc-bug-tp5816052p5816550.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Set new system identifier using pg_resetxlog

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 4:06 PM, Heikki Linnakangas
 wrote:
> I didn't understand this one. But it seems like the obvious solution is to
> not use the consumer's system identifier as the slot name. Or rename it
> afterwards.

You can't use the consumer's system identifier as the slot name,
because you have to create the slot before you create the consumer.
But you could rename it afterwards, or just use some other naming
convention entirely, which is why I'm -0.25 on this whole proposal.
What the 2ndQuadrant folks are proposing is not unreasonable (which is
why I'm only -0.25) but it opens an (admittedly small) can of worms
that I see no real need to open.
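
For context, a slot's name is simply whatever string its creator passes in, so
a naming convention can be applied up front; there is no rename command as
such, the closest equivalent being drop-and-recreate (which forfeits the old
slot's retained WAL position). A sketch, with illustrative slot names:

    SELECT * FROM pg_create_physical_replication_slot('standby_eu_1');

    -- "renaming" = drop and recreate under the new name
    SELECT pg_drop_replication_slot('standby_eu_1');
    SELECT * FROM pg_create_physical_replication_slot('standby_eu_1_new');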

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> > I think a somewhat smarter version of the explicit flushes in the
> > hack^Wpatch I posted nearby is more likely to be successful.
> 
> 
> That path is "dangerous" (as in, may not work as intended) if the
> filesystem doesn't properly understand range flushes (ehem, like
> ext3).

The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
guaranteeing durability. And - afaik - not implemented in a file system
specific manner. It just initiates writeback for individual pages. It
doesn't cause barrier, journal flushes or anything to be issued. That's
still done by the fsync() later.

The big disadvantage is that it's an OS-specific solution, but I don't
think we're going to find anything that isn't in this area.
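
For readers unfamiliar with the call, the pattern is roughly the following; a
minimal Linux-only sketch:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Ask the kernel to start writeback of a dirty file range now, without
     * waiting for completion and without any durability guarantee; the later
     * fsync() still does the journal/barrier work.
     */
    static void
    hint_writeback(int fd, off_t offset, off_t nbytes)
    {
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }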

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Code bug or doc bug?

2014-08-27 Thread Bruce Momjian
On Wed, Aug 27, 2014 at 09:05:41AM -0400, Robert Haas wrote:
> Another idea is to have a command that you can run, while connected to
> a particular database, that updates the default tablespace for that
> database without actually moving any data on disk - i.e. it sets
> pg_database.dattablespace, and then updates every pg_class row where
> reltablespace = 0 to the old default tablespace, and every pg_class row
> where reltablespace = the new tablespace ID to 0.  Then you can move
> individual relations afterwards if you feel like it.  But that might
> still require a lot of locks, and I think we also have a limitation
> that some relations (the mapped ones?) have to be in the database's
> default tablespace, which obviously wouldn't work here.
> 
> So it's a tricky problem.

Is there a doc patch to make here?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 1:35 AM, Andrew Gierth
 wrote:
> If you look at the latest patch post, there's a small patch in it that
> does nothing but unreserve the keywords and fix ruleutils to make
> deparse/parse work. The required fix to ruleutils is an example of
> violating your "four kinds of keywords" principle, but quoting
> keywords still works.

I think it would be intolerable to lose the ability to quote keywords.
That could easily create situations where there's no reasonable way to
dump an older database in such a fashion that it can be reloaded into
a newer database.  So it's good that you avoided that.

The "four kinds of keywords" principle is obviously much less
absolute.  We've talked before about introducing additional categories
of keywords, and that might be a good thing to do for one reason or
another.  But I think it's not good to do it in a highly idiosyncratic
way: I previously proposed reserving concurrently only when it follows
CREATE INDEX, and not in any other context, but Tom argued that it had
to become a type_func_name_keyword since users would be confused to
find that concurrently (but not any other keyword) needed quoting
there.  In retrospect, I tend to think he probably had it right.
There is a good amount of third-party software out there that tries to
be smart about quoting PostgreSQL keywords - for example, pgAdmin has
code for that, or did last I looked - so by making things more
complicated, we run the risk not only of bugs in our own software but
also bugs in other people's software, as well as user confusion.  So I
still think the right solution is probably to reserve CUBE across the
board, and not just in the narrowest context that we can get away
with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:15 AM, Andres Freund  wrote:
> On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
>> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO  wrote:
>> >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>> >> that causes the problem, not the writes themselves.
>> >
>> >
>> > Hmmm. My (poor) understanding is that fsync would work fine if everything
>> > was already written beforehand:-) that is it has nothing to do but assess
>> > that all is already written. If there is remaining write work, it starts
>> > doing it "now" with the disastrous effects I'm complaining about.
>> >
>> > When I say "pacing does not work", I mean that things were not written out
>> > to disk by the OS, it does not mean that pg did not ask for it.
>> >
>> > However it does not make much sense for an OS scheduler to wait several
>> > minutes with tens of thousands of pages to write and do nothing about it...
>> > So I'm wondering.
>>
>> Maybe what's needed, is to slightly tweak checkpoint logic to give the
>> kernel some time to flush buffers.
>>
>> Correct me if I'm wrong, but the checkpointer does the sync right
>> after the reads. Of course there will be about 30s-worth of
>> accumulated writes (it's the default amount of time the kernel holds
>> on to dirty buffers).
>>
>> Perhaps it should be delayed a small time, say 30s, to let the kernel
>> do the writing on its own.
>
> The kernel *starts* to write out pages after 30s, it doesn't finish
> doing so. So I don't think that's going to work.
>
> I think a somewhat smarter version of the explicit flushes in the
> hack^Wpatch I posted nearby is more likely to be successful.


That path is "dangerous" (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO  wrote:
> >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
> >> that causes the problem, not the writes themselves.
> >
> >
> > Hmmm. My (poor) understanding is that fsync would work fine if everything
> > was already written beforehand:-) that is it has nothing to do but assess
> > that all is already written. If there is remaining write work, it starts
> > doing it "now" with the disastrous effects I'm complaining about.
> >
> > When I say "pacing does not work", I mean that things were not written out
> > to disk by the OS, it does not mean that pg did not ask for it.
> >
> > However it does not make much sense for an OS scheduler to wait several
> > minutes with tens of thousands of pages to write and do nothing about it...
> > So I'm wondering.
> 
> Maybe what's needed, is to slightly tweak checkpoint logic to give the
> kernel some time to flush buffers.
> 
> Correct me if I'm wrong, but the checkpointer does the sync right
> after the reads. Of course there will be about 30s-worth of
> accumulated writes (it's the default amount of time the kernel holds
> on to dirty buffers).
> 
> Perhaps it should be delayed a small time, say 30s, to let the kernel
> do the writing on its own.

The kernel *starts* to write out pages after 30s, it doesn't finish
doing so. So I don't think that's going to work.

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is more likely to be successful.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:10 AM, Claudio Freire  wrote:
> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO  wrote:
>>> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>>> that causes the problem, not the writes themselves.
>>
>>
>> Hmmm. My (poor) understanding is that fsync would work fine if everything
>> was already written beforehand:-) that is it has nothing to do but assess
>> that all is already written. If there is remaining write work, it starts
>> doing it "now" with the disastrous effects I'm complaining about.
>>
>> When I say "pacing does not work", I mean that things were not written out
>> to disk by the OS, it does not mean that pg did not ask for it.
>>
>> However it does not make much sense for an OS scheduler to wait several
>> minutes with tens of thousands of pages to write and do nothing about it...
>> So I'm wondering.
>
> Maybe what's needed, is to slightly tweak checkpoint logic to give the
> kernel some time to flush buffers.
>
> Correct me if I'm wrong, but the checkpointer does the sync right
> after the reads. Of course there will be about 30s-worth of
> accumulated writes (it's the default amount of time the kernel holds
> on to dirty buffers).
>
> Perhaps it should be delayed a small time, say 30s, to let the kernel
> do the writing on its own.


Errata: just after the writes :-p


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO  wrote:
>> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>> that causes the problem, not the writes themselves.
>
>
> Hmmm. My (poor) understanding is that fsync would work fine if everything
> was already written beforehand:-) that is it has nothing to do but assess
> that all is already written. If there is remaining write work, it starts
> doing it "now" with the disastrous effects I'm complaining about.
>
> When I say "pacing does not work", I mean that things were not written out
> to disk by the OS, it does not mean that pg did not ask for it.
>
> However it does not make much sense for an OS scheduler to wait several
> minutes with tens of thousands of pages to write and do nothing about it...
> So I'm wondering.

Maybe what's needed, is to slightly tweak checkpoint logic to give the
kernel some time to flush buffers.

Correct me if I'm wrong, but the checkpointer does the sync right
after the reads. Of course there will be about 30s-worth of
accumulated writes (it's the default amount of time the kernel holds
on to dirty buffers).

Perhaps it should be delayed a small time, say 30s, to let the kernel
do the writing on its own.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Code bug or doc bug?

2014-08-27 Thread Robert Haas
On Sun, Aug 24, 2014 at 6:26 PM, Josh Berkus  wrote:
> Quoth our docs
> (http://www.postgresql.org/docs/9.3/static/sql-alterdatabase.html):
>
> "The fourth form changes the default tablespace of the database. Only
> the database owner or a superuser can do this; you must also have create
> privilege for the new tablespace. This command physically moves any
> tables or indexes in the database's old default tablespace to the new
> tablespace. Note that tables and indexes in non-default tablespaces are
> not affected."
>
> Yet:
>
> jberkus=# alter database phc set tablespace ssd;
> ERROR:  some relations of database "phc" are already in tablespace "ssd"
> HINT:  You must move them back to the database's default tablespace
> before using this command.
>
> Aside from being a stupid limitation (I need to copy the tables back to
> the old tablespace so that I can recopy them to the new one?), the above
> seems to be in direct contradiction to the docs.

I think that it works OK to move objects from tablespace A to table B
while there are also objects in tablespace C, where B != C, but not to
move objects from tablespace A to tablespace B while there are already
objects in tablespace B.  So I think the documentation is right as far
as it goes, but there's an undocumented limitation there.

The reasons for the limitation are:

1. We can't move a database while there are users connected to it.
This means that we can't modify any of the data in the database in the
process of relocating it.  In particular, we can't update its copy of
pg_class.

2. By convention, pg_class.reltablespace = 0 when the relation is in
the database's default tablespace, and only contains a non-zero OID
when the relation is in some other tablespace.  This is what lets this
feature work at all: the pg_class.reltablespace value for every
relation we're moving is guaranteed to be 0 before the move, and is
still correctly valued as 0 after the move.  But it also means there
can't be any relations from that database in the new tablespace,
because any such relations would need pg_class.reltablespace to get
updated from the OID of that tablespace to 0.

I don't see any easy way to lift this limitation.  If it were possible
to move a database while users are connected to it, then of course you
could connect to the database to move it and update pg_class, but
you'd have to take an exclusive lock on every relation in the database
simultaneously, which might blow out the lock table, deadlock against
other sessions, and other messy things.

Another idea is to have a command that you can run, while connected to
a particular database, that updates the default tablespace for that
database without actually moving any data on disk - i.e. it sets
pg_database.dattablespace, and then updates every pg_class row where
reltablespace = 0 to the old default tablespace, and every pg_class row
where reltablespace = the new tablespace ID to 0.  Then you can move
individual relations afterwards if you feel like it.  But that might
still require a lot of locks, and I think we also have a limitation
that some relations (the mapped ones?) have to be in the database's
default tablespace, which obviously wouldn't work here.

So it's a tricky problem.
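
A practical note for anyone hitting the error from the first message: while
connected to the database, the relations that already live outside its default
tablespace -- the ones that would have to be moved back first -- can be listed
with a query along these lines:

    SELECT c.relname, t.spcname
    FROM pg_class c
    JOIN pg_tablespace t ON t.oid = c.reltablespace
    WHERE c.reltablespace <> 0;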

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] re-reading SSL certificates during server reload

2014-08-27 Thread Stephen Frost
* Magnus Hagander (mag...@hagander.net) wrote:
> That's certainly an issue. Potentially bigger ones are that you cannot
> replace an expired certificate or CRL without a restart.

+100.  I had forgotten about that issue- but it definitely sucks. :(

> Some of this is going to have to be at least partially reworked anyway
> in the work that Heikki has been doing to support non-openssl
> libraries. Making a change like this at the same time is probably a
> good idea.

Agreed.

Thanks,

Stephen




Re: [HACKERS] Function to know last log write timestamp

2014-08-27 Thread Fujii Masao
On Tue, Aug 19, 2014 at 1:07 AM, Robert Haas  wrote:
> On Fri, Aug 15, 2014 at 7:17 AM, Fujii Masao  wrote:
>> On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund  
>> wrote:
>>> On 2014-08-14 14:37:22 -0400, Robert Haas wrote:
 On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund  
 wrote:
 > On 2014-08-14 14:19:13 -0400, Robert Haas wrote:
 >> That's about the idea. However, what you've got there is actually
 >> unsafe, because shmem->counter++ is not an atomic operation.  It reads
 >> the counter (possibly even as two separate 4-byte loads if the counter
 >> is an 8-byte value), increments it inside the CPU, and then writes the
 >> resulting value back to memory.  If two backends do this concurrently,
 >> one of the updates might be lost.
 >
 > All these are only written by one backend, so it should be safe. Note
 > that that coding pattern, just without memory barriers, is all over
 > pgstat.c

 Ah, OK.  If there's a separate slot for each backend, I agree that it's 
 safe.

 We should probably add barriers to pgstat.c, too.
>>>
>>> Yea, definitely. I think this is rather borked on "weaker"
>>> architectures. It's just that the consequences of an out of date/torn
>>> value are rather low, so it's unlikely to be noticed.
>>>
>>> Imo we should encapsulate the changecount modifications/checks somehow
>>> instead of repeating the barriers, Asserts, comments et al everywhere.
>>
>> So what about applying the attached patch first, which adds the macros
>> to load and store the changecount with the memory barries, and changes
>> pgstat.c use them. Maybe this patch needs to be back-patch to at least 9.4?
>>
>> After applying the patch, I will rebase the pg_last_xact_insert_timestamp
>> patch and post it again.
>
> That looks OK to me on a relatively-quick read-through.  I was
> initially a bit worried about this part:
>
>   do
>   {
> ! pgstat_increment_changecount_before(beentry);
>   } while ((beentry->st_changecount & 1) == 0);
>
> pgstat_increment_changecount_before is an increment followed by a
> write barrier.  This seemed like funny coding to me at first because
> while-test isn't protected by any sort of barrier.  But now I think
> it's correct, because there's only one process that can possibly write
> to that data, and that's the one that is making the test, and it had
> certainly better see its own modifications in program order no matter
> what.
>
> I wouldn't object to back-patching this to 9.4 if we were earlier in
> the beta cycle, but at this point I'm more inclined to just put it in
> 9.5.  If we get an actual bug report about any of this, we can always
> back-patch the fix at that time.  But so far that seems mostly
> hypothetical, so I think the less-risky course of action is to give
> this a longer time to bake before it hits an official release.

Sounds reasonable. So, barring any objection, I will apply the patch
only to the master branch.
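
For readers not steeped in pgstat.c, the protocol those macros wrap is
essentially a seqlock; a rough sketch with illustrative names rather than the
actual macros from the patch:

    /* writer side: a single updater per entry */
    entry->st_changecount++;
    pg_write_barrier();              /* publish the odd count before the data */
    /* ... modify the entry's fields ... */
    pg_write_barrier();              /* publish the data before the even count */
    entry->st_changecount++;

    /* reader side: retry until a stable, even count is observed */
    int before, after;
    do {
        before = entry->st_changecount;
        pg_read_barrier();
        /* ... copy the entry's fields to local storage ... */
        pg_read_barrier();
        after = entry->st_changecount;
    } while (before != after || (before & 1) != 0);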

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Sequence Scan doubts

2014-08-27 Thread Robert Haas
On Thu, Aug 21, 2014 at 2:47 AM, Haribabu Kommi
 wrote:
> Implementation of "Parallel Sequence Scan"
>
> Approach:
>
> 1."Parallel Sequence Scan" can achieved by using the background
> workers doing the job of actual sequence scan including the
> qualification check also.
>
> 2. Planner generates the parallel scan plan by checking the possible
> criteria of the table to do a parallel scan and generates the tasks
> (range of blocks).
>
> 3. In the executor Init phase, try to copy the necessary data required
> by the workers and start the workers.
>
> 4. In the executor run phase, just get the tuples which are sent by
> the workers and process them further in the plan node execution.

Well, this is something I've thought quite a bit about already.  Many
of my thoughts on parallelism are here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Although the page title is parallel sort, many of the concerns are
applicable to parallelism of any sort.

I posted some patches containing some of the necessary infrastructure here:

http://archives.postgresql.org/message-id/CA+Tgmoam66dTzCP8N2cRcS6S6dBMFX+JMba+mDf68H=kakn...@mail.gmail.com

I seem to have forgotten to add that message to the CommitFest.  Crap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency & bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello Amit,

I see there is some merit in your point, which is to make the bgwriter more
useful than it is in its current form. I can see three top-level points to
think about, where improvement in any of them might improve the current
situation:



a. Scanning of buffer pool to find the dirty buffers that can
be flushed.


Yep, that could be more aggressive, or the aggressiveness could be made
into an adjustable parameter. There is a comment about that in the source.
However, I tested setting the round to 1s instead of 120s, and it had no
positive effect on my test.



b. Deciding on what is criteria to flush a buffer


Indeed.

For instance, when the IO load is low, there is no reason not to send out
some buffers; it is a free lunch even if it must be done again later. If
the load is high, that is another matter. So it would mean being able to
decide whether the current IO load is low or not. It could be different on
different disks... Hmmm, not that simple.



c. Sync of buffers


Yes. That is more or less the effect of my rough approach of calling
CHECKPOINT every 0.2 seconds.  Andres Freund just implemented a quick
Linux-specific patch which does that within CHECKPOINT pacing, and which
greatly improves the situation, although it could still be a little
better.



[separate xlog test]


I do not have a setup available for that right now. I'm not sure this would
be an issue at low loads.


--
Fabien.




Re: [HACKERS] After switching primary server while using replication slot.

2014-08-27 Thread Fujii Masao
On Fri, Aug 22, 2014 at 11:29 PM, Andres Freund  wrote:
> Hi,
>
> On 2014-08-20 13:14:30 -0400, Robert Haas wrote:
>> On Tue, Aug 19, 2014 at 6:25 AM, Fujii Masao  wrote:
>> > On Mon, Aug 18, 2014 at 11:16 PM, Sawada Masahiko  
>> > wrote:
>> >> Hi all,
>> >> After switching the primary server while using replication slots, a
>> >> standby server may not be able to connect to the new primary server.
>> >> Imagine this situation: the primary server has two ASYNC standby
>> >> servers, each using its own replication slot.
>> >> One standby (A) applies WAL without problems, but the other
>> >> standby (B) has stopped after connecting to the primary server
>> >> (or WAL sending to it is badly delayed).
>> >>
>> >> In this situation, standby (B) has not received WAL segment files
>> >> while it was stopped.
>> >> The primary server cannot remove WAL segments which have not been
>> >> received by all standbys, so it has to keep the WAL segment files
>> >> which have not yet been received by every standby.
>> >> But standby (A) can perform checkpoints itself, and so it can
>> >> recycle its WAL segments.
>> >> As a result, the number of WAL segments on each server differs
>> >> (standby (A) keeps fewer WAL files than the primary server).
>> >> After the primary server crashes and standby (A) is promoted to primary,
>> >> we can try to connect standby (B) to standby (A) as a new standby server.
>> >> But it will fail, because standby (A) might not have the WAL
>> >> segment files that standby (B) requires.
>> >
>> > This sounds like a valid concern.
>> >
>> >> To resolve this situation, I think that we should make the master server
>> >> notify all standby servers about the removal of WAL segments,
>> >> and have the standby servers recycle their WAL segment files based on that
>> >> information.
>
> I think that'll end up being really horrible, at least if done in an
> obligatory fashion. In a cascaded setup it's really sensible to only
> retain WAL on the intermediate nodes. Consider e.g. a setup - rather
> common these days actually - where there's a master somewhere and then a
> cascading standby on each continent feeding off to further nodes on that
> continent. You don't want to retain WAL on each continent (or on the
> primary) just because one node somewhere is down for maintenance.
>
>
> If you really want something like this we should probably add the
> infrastructure for one standby to maintain a replication slot on another
> standby server. So, if you have a setup like:
>
>          A
>         / \
>        /   \
>       B     C
>      / \   / \
>     ..  .. ..  ..
>
> B and C can coordinate so that they keep enough WAL for each other. You can
> actually easily write an external tool for that today. Just create a
> replication slot on B for C and the other way round, and have a tool
> update them once a minute or so.
>
> I'm not sure if we want that builtin.
>
>> >> Thought?
>> >
>> > How does the server recycle WAL files after it's promoted from the
>> > standby to master?
>> > Does it do that as it likes? If yes, your approach would not be enough.
>> >
>> > The approach prevents unexpected removal of WAL files while the standby
>> > is running. But after the standby is promoted to master, it might recycle
>> > needed WAL files immediately. So another standby may still fail to retrieve
>> > the required WAL file after the promotion.
>> >
>> > ISTM that, in order to address this, we might need to log all the 
>> > replication
>> > slot activities and replicate them to the standby. I'm not sure if this
>> > breaks the design of replication slot at all, though.
>
> Yes, that'd break it. You can't WAL log anything on a standby, and
> replication slots can be modified on standbys.

So the current solution for the problem Sawada reported is perhaps to increase
wal_keep_segments on the standby to a high enough value.

Regards,

-- 
Fujii Masao




[HACKERS] Simplify calls of pg_class_aclcheck when multiple modes are used

2014-08-27 Thread Michael Paquier
Hi all,

In a couple of code paths we do the following to check permissions on an
object:
if (pg_class_aclcheck(relid, userid, ACL_USAGE) != ACLCHECK_OK &&
pg_class_aclcheck(relid, userid, ACL_UPDATE) != ACLCHECK_OK)
ereport(ERROR, blah);

Wouldn't it be better to simplify that with a single call to
pg_class_aclcheck, combining the modes that need to be checked? In
the case above, the call to pg_class_aclcheck would become:
if (pg_class_aclcheck(relid, userid,
 ACL_USAGE | ACL_UPDATE) != ACLCHECK_OK)
ereport(ERROR, blah);

That's not a critical thing, but it would save some cycles. Patch is
attached.
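
(One assumption this relies on, noted here for clarity: pg_class_aclcheck() with a multi-bit mask succeeds when *any* of the requested privilege bits is held, i.e. ACLMASK_ANY semantics, which is what preserves the original "either privilege is enough" behavior. Roughly, as I understand it:)

#include "utils/acl.h"

static AclResult
pg_class_aclcheck_sketch(Oid table_oid, Oid roleid, AclMode mode)
{
	/* succeed if at least one of the bits in 'mode' is granted */
	if (pg_class_aclmask(table_oid, roleid, mode, ACLMASK_ANY) != 0)
		return ACLCHECK_OK;
	return ACLCHECK_NO_PRIV;
}
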
Regards,
-- 
Michael
From e6b23e537d223e4bdb3abada2d761e630c8b27c0 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 27 Aug 2014 20:45:31 +0900
Subject: [PATCH] Minimize calls of pg_class_aclcheck to minimum necessary

In a couple of code paths, pg_class_aclcheck is called in succession with
multiple different modes set. This patch combines those modes into a
single call of this function, slightly reducing the overhead of
permission checking.
---
 src/backend/commands/sequence.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 3b89dd0..6d5f65b 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -546,8 +546,8 @@ nextval_internal(Oid relid)
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, &elm, &seqrel);
 
-	if (pg_class_aclcheck(elm->relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK &&
-		pg_class_aclcheck(elm->relid, GetUserId(), ACL_UPDATE) != ACLCHECK_OK)
+	if (pg_class_aclcheck(elm->relid, GetUserId(),
+		  ACL_USAGE | ACL_UPDATE) != ACLCHECK_OK)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied for sequence %s",
@@ -759,8 +759,8 @@ currval_oid(PG_FUNCTION_ARGS)
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, &elm, &seqrel);
 
-	if (pg_class_aclcheck(elm->relid, GetUserId(), ACL_SELECT) != ACLCHECK_OK &&
-		pg_class_aclcheck(elm->relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK)
+	if (pg_class_aclcheck(elm->relid, GetUserId(),
+		  ACL_SELECT | ACL_USAGE) != ACLCHECK_OK)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied for sequence %s",
@@ -801,8 +801,8 @@ lastval(PG_FUNCTION_ARGS)
 	/* nextval() must have already been called for this sequence */
 	Assert(last_used_seq->last_valid);
 
-	if (pg_class_aclcheck(last_used_seq->relid, GetUserId(), ACL_SELECT) != ACLCHECK_OK &&
-		pg_class_aclcheck(last_used_seq->relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK)
+	if (pg_class_aclcheck(last_used_seq->relid, GetUserId(),
+		  ACL_SELECT | ACL_USAGE) != ACLCHECK_OK)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied for sequence %s",
-- 
2.1.0




Re: [HACKERS] Specifying the unit in storage parameter

2014-08-27 Thread Fujii Masao
On Tue, Aug 26, 2014 at 3:27 AM, Alvaro Herrera
 wrote:
> Fujii Masao wrote:
>> On Thu, Aug 21, 2014 at 4:20 PM, Michael Paquier
>>  wrote:
>
>> > Looking at the patch, the parameter "fillfactor" in the category
>> > RELOPT_KIND_HEAP (the first element in intRelOpts of reloptions.c) is
>> > not updated with the new field. It is only a one-line change.
>> > @@ -97,7 +97,7 @@ static relopt_int intRelOpts[] =
>> > "Packs table pages only to this percentage",
>> > RELOPT_KIND_HEAP
>> > },
>> > -   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100
>> > +   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100, 0
>> > },
>>
>> Oh, good catch. I wonder why I made such a mistake...
>
> Uninitialized elements at end of struct are filled with zeroes.

Yeah, that's the reason why I could not notice the problem at compile time.

> We do
> have other examples of this -- for instance, config_generic in the guc.c
> tables are almost always only 5 members long even though the struct is
> quite a bit longer than that.  Most entries do not even have "flags" set.

So you are implying that the trailing zero (which the patch adds for the new
field) in the reloption struct should be dropped?

Regards,

-- 
Fujii Masao
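
(That trailing members left out of a struct initializer are zero-filled is standard C behavior; a tiny standalone illustration, with made-up names:)

#include <stdio.h>

struct opt
{
	const char *name;
	int			def;
	int			min;
	int			max;
	int			flags;			/* often omitted from initializers */
};

/* only the first four members are given; 'flags' is implicitly 0 */
static struct opt fillfactor_opt = {"fillfactor", 100, 10, 100};

int
main(void)
{
	printf("%d\n", fillfactor_opt.flags);	/* prints 0 */
	return 0;
}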




Re: [HACKERS] Audit of logout

2014-08-27 Thread Fujii Masao
On Sat, Aug 23, 2014 at 3:44 PM, Amit Kapila  wrote:
> On Tue, Aug 5, 2014 at 8:04 PM, Fujii Masao  wrote:
>>
>> Yep, the attached patch introduces PGC_SU_BACKEND and
>> changes the contexts of log_connections and log_disconnections
>> to PGC_SU_BACKEND. Review?
>>

Thanks for reviewing the patch!

> 1.
> ! else if (context != PGC_POSTMASTER && context != PGC_SU_BACKEND &&
> ! context != PGC_SU_BACKEND && source != PGC_S_CLIENT)
>
> In the above, the check for PGC_SU_BACKEND is repeated;
> one of the checks should be PGC_SU_BACKEND and the other
> should be PGC_BACKEND.

Right. Fixed. Attached is the updated version of the patch.
BTW, I also added the following into the document of log_connections
and log_disconnections.

Only superusers can change this setting at session start.

> 2.
> + case PGC_SU_BACKEND:
> + if (context == PGC_BACKEND)
> + {
> ..
> ..
> + return 0;
> + }
>   case PGC_BACKEND:
>   if (context == PGC_SIGHUP)
>
> A change to a PGC_SU_BACKEND parameter (log_connections) is
> visible even to a non-superuser client due to the above code.
> Shouldn't it be visible only for superuser logins?
>
> Simple steps to reproduce the problem:
> a. start the server (default configuration)
> b. connect as a superuser
> c. change log_connections to on in postgresql.conf
> d. perform select pg_reload_conf();
> e. connect as a non-superuser
> f. show log_connections;  -- This step shows the value as on,
>                           -- whereas I think it should have been off

In this case, log_connections is changed in postgresql.conf and the
configuration is reloaded, so ISTM that it's natural that even a non-superuser
sees the changed value. No? Maybe I'm missing something.
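
(For reference, the context ordering implied by the GucContext_Names hunk in the attached patch below — a sketch of how the enum would look with the new value; the comments are paraphrases for illustration, not the actual guc.h text:)

typedef enum
{
	PGC_INTERNAL,		/* cannot be set by the user at all */
	PGC_POSTMASTER,		/* only at server start */
	PGC_SIGHUP,			/* postgresql.conf change + reload */
	PGC_SU_BACKEND,		/* at connection start, superuser clients only */
	PGC_BACKEND,		/* at connection start, any client */
	PGC_SUSET,			/* at run time, superusers only */
	PGC_USERSET			/* at run time, anyone */
} GucContext;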

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 4231,4236  local0.*/var/log/postgresql
--- 4231,4237 
 
  Causes each attempted connection to the server to be logged,
  as well as successful completion of client authentication.
+ Only superusers can change this setting at session start.
  This parameter cannot be changed after session start.
  The default is off.
 
***
*** 4258,4263  local0.*/var/log/postgresql
--- 4259,4265 
  log_connections but at session termination,
  and includes the duration of the session.  This is off by
  default.
+ Only superusers can change this setting at session start.
  This parameter cannot be changed after session start.
 

*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***
*** 3258,3264  get_stats_option_name(const char *arg)
   * argv[0] is ignored in either case (it's assumed to be the program name).
   *
   * ctx is PGC_POSTMASTER for secure options, PGC_BACKEND for insecure options
!  * coming from the client, or PGC_SUSET for insecure options coming from
   * a superuser client.
   *
   * If a database name is present in the command line arguments, it's
--- 3258,3264 
   * argv[0] is ignored in either case (it's assumed to be the program name).
   *
   * ctx is PGC_POSTMASTER for secure options, PGC_BACKEND for insecure options
!  * coming from the client, or PGC_SU_BACKEND for insecure options coming from
   * a superuser client.
   *
   * If a database name is present in the command line arguments, it's
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***
*** 957,963  process_startup_options(Port *port, bool am_superuser)
  	GucContext	gucctx;
  	ListCell   *gucopts;
  
! 	gucctx = am_superuser ? PGC_SUSET : PGC_BACKEND;
  
  	/*
  	 * First process any command-line switches that were included in the
--- 957,963 
  	GucContext	gucctx;
  	ListCell   *gucopts;
  
! 	gucctx = am_superuser ? PGC_SU_BACKEND : PGC_BACKEND;
  
  	/*
  	 * First process any command-line switches that were included in the
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 509,514  const char *const GucContext_Names[] =
--- 509,515 
  	 /* PGC_INTERNAL */ "internal",
  	 /* PGC_POSTMASTER */ "postmaster",
  	 /* PGC_SIGHUP */ "sighup",
+ 	 /* PGC_SU_BACKEND */ "superuser-backend",
  	 /* PGC_BACKEND */ "backend",
  	 /* PGC_SUSET */ "superuser",
  	 /* PGC_USERSET */ "user"
***
*** 907,913  static struct config_bool ConfigureNamesBool[] =
  		NULL, NULL, NULL
  	},
  	{
! 		{"log_connections", PGC_BACKEND, LOGGING_WHAT,
  			gettext_noop("Logs each successful connection."),
  			NULL
  		},
--- 908,914 
  		NULL, NULL, NULL
  	},
  	{
! 		{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
  			gettext_noop("Logs each successful connection."),
  			NULL
  		},
***
*** 916,922  static struct config_bool ConfigureNamesBool[] =
  		NULL, NULL, NULL
  	},
  	{
! 		{"log_disconnections", PGC_BACKEND, LOGGING_WHAT,
  			gettext_noop("Logs end of a session, inclu

Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Fabien COELHO


Hello Heikki,


[...]
With a latency limit on when the query should finish, as opposed to how 
late it can start, it's a lot easier to give a number. For example, your 
requirements might state that a user must always get a response to a click on 
a web page in 200 ms, so you set the limit to 200 ms.


Yep. See below for the details.


[...] Why is your patch more interesting than what I described?


It is more interesting because it exists, it is short and simple, it 
works, and it is useful right now to test pg responsiveness and also to 
model some timeout behavior on the client side?



I'm pretty sure we don't need both.


Why not? Testing performance is tricky enough, the tool must be flexible.

I'm pretty sure that I'm interested in testing pg responsiveness right
now, so I did the simpler one I need for that purpose. It somehow models
an application/pooler queue-management timeout that would anyway proceed
with what is already started.



[...]

I was thinking that if a query is already late when the connection becomes 
free to execute it, it would not be executed. It would be skipped, just as in 
your patch.


As for an actual "latency limit" under throttling, this is significantly 
more tricky and invasive to implement... ISTM that it would mean:


 - if the tx is not started and the latency is already consumed, SKIP++.

 - if the tx is after its scheduled start time but under the latency limit,
   then start it, and maybe inject a "SET TIMEOUT...".

 - if a tx is being processed but reaches its latency limit (after its
   scheduled start time), abort it coldly, ROLLBACK++ (well, if the tx is
   really started; there could also be shell commands and \set stuff in a
   pgbench script, which means started is not really started, so it would
   be INTERRUPT++ if no BEGIN was sent).

 - if a tx is finished but the final commit returned after the latency
   deadline, you cannot abort it anymore but it is late nevertheless,
   LATE++.

This is doable but far beyond my current needs. Moreover, I'm not sure that
such a patch would pass review, because of its invasiveness and complexity,
so it could be a total waste of time.
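
(A rough sketch of the decision those outcomes imply, with invented names — this is not pgbench code, just an illustration of the bookkeeping a latency limit would need:)

#include <stdbool.h>

typedef enum
{
	TX_PROCEED,			/* still within the latency budget, keep going */
	TX_SKIPPED,			/* deadline passed before the tx was started */
	TX_INTERRUPTED,		/* started, but no BEGIN sent yet: give up */
	TX_ROLLED_BACK,		/* BEGIN sent, limit reached: roll back */
	TX_LATE				/* commit returned after the deadline */
} TxAction;

static TxAction
latency_limit_check(double scheduled, double now, double latency_limit,
					bool started, bool begin_sent, bool finished)
{
	double		deadline = scheduled + latency_limit;

	if (now <= deadline)
		return TX_PROCEED;
	if (!started)
		return TX_SKIPPED;
	if (!finished)
		return begin_sent ? TX_ROLLED_BACK : TX_INTERRUPTED;
	return TX_LATE;		/* too late to abort, but count it as late */
}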



Ok, but *why* are you doing a "lag limit", and not a "latency limit"?


Because it is much simpler (see above), it is enough for testing the pg
responsiveness issue, which is my current objective, and it models some
client timeout behavior.


Under what circumstances is the lag limit a more useful setting?


It is not "more" useful" per se, it is what I'm using to test pg 
unresponsivness with a simple to define and interpret measure wrt 
throttling.


If I would do "latency limit" under throttling, it would be (1) more time 
to develop, more complex, more invasive in the code (see above, + also the 
implementation when not under throttling), (2) more complex to interpret, 
with at least 5 possible outcomes (skipped, interrupted, committed on 
time, committed but late, aborted), (3) this added information would not 
be useful to me.


I've submitted this "simple" lag-limit version because being able to
measure (un)responsiveness quickly and simply seems like a good idea,
especially given the current state of things.


--
Fabien.




Re: [HACKERS] Missing comment block at the top of streamutil.h and receivelog.h

2014-08-27 Thread Fujii Masao
On Wed, Aug 27, 2014 at 2:07 PM, Michael Paquier
 wrote:
> On Wed, Aug 27, 2014 at 12:31 PM, Fujii Masao  wrote:
>> On Wed, Aug 27, 2014 at 10:34 AM, Michael Paquier
>>  wrote:
>> I think that it's better to add an include guard like
>>
>> #ifndef RECEIVELOG_H
>> #define RECEIVELOG_H
>> ...
>> #endif
> Oh indeed, that was missing as well! Please see attached.

Thanks! Applied.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Sample LDIF for pg_service.conf no longer works

2014-08-27 Thread Magnus Hagander
On Sat, Aug 16, 2014 at 5:31 AM, Noah Misch  wrote:
> When using pg_service.conf with LDAP, we document[1] the following sample LDIF
> for populating the LDAP server:
>
> version:1
> dn:cn=mydatabase,dc=mycompany,dc=com
> changetype:add
> objectclass:top
> objectclass:groupOfUniqueNames
> cn:mydatabase
> uniqueMember:host=dbserver.mycompany.com
> uniqueMember:port=5439
> uniqueMember:dbname=mydb
> uniqueMember:user=mydb_user
> uniqueMember:sslmode=require
>
> That presumably worked at one point, but OpenLDAP 2.4.23 and OpenLDAP 2.4.39
> both reject it cryptically:
>
> ldap_add: Invalid syntax (21)
> additional info: uniqueMember: value #0 invalid per syntax
>
> uniqueMember is specified to bear a distinguished name.  While OpenLDAP does
> not verify that uniqueMember values correspond to known DNs, it does verify
> that the value syntactically could be a DN.  To give examples, "o=foobar" is
> always accepted, but "xyz=foobar" is always rejected: "xyz" is not an LDAP DN
> attribute type.  Amid the LDAP core schema, "device" is the best-fitting
> objectClass having the generality required.  Let's convert to that, as
> attached.  I have verified that this works end-to-end.

+1.

I've run into that problem as well, just not had time to prepare a
proper example in the core schema :)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] re-reading SSL certificates during server reload

2014-08-27 Thread Magnus Hagander
On Wed, Aug 27, 2014 at 11:56 AM, Alexey Klyukin  wrote:
> Greetings,
>
> Is there a strong reason to disallow reloading server key and cert files
> during the PostgreSQL reload?

Key and cert files are loaded in the postmaster. We'd need to change
that. I'm not saying that's not a good idea, but it's not as easy as
just allowing it :)


> Basically, once you run multiple databases in a cluster and use different
> DNS names to connect to different databases (in order for those databases to
> be moved somewhere without changing the client code), and enable SSL
> certificate checking, the problem becomes evident: in order to add a new
> database to the existing cluster you have to add its name to the SSL
> certificate for the server, and in order for this change to come into
> effect you have to restart the server.

That's certainly an issue. Potentially bigger ones are that you cannot
replace an expired certificate or CRL without a restart.


> In the documentation for the server cert and key file there is a notice that
> these parameters can only be changed at server start. It seems that
> the only place the backend certificates are loaded is inside
> secure_initialize(), which, in turn, calls initialize_SSL().
>
> From my point of view, I see nothing preventing separating the
> certificate loading code from the SSL library initialization and calling the
> former during a server reload.  It might happen that with the new
> certificate file some of the existing connections will be unable to
> reconnect, or, if the certificate is invalid, the server will be unable to
> restart, but these are the sorts of problems that also happen with a reload
> of pg_hba.conf as well, so this alone does not sound like a significant
> showstopper.

I agree that separating this out would probably be a useful idea. We
should probably treat a failed load of certificates the same way we
do with pg_hba if we can - which is to log an error and revert to
the currently loaded one.

Some of this is going to have to be at least partially reworked anyway
in the work that Heikki has been doing to support non-OpenSSL
libraries. Making a change like this at the same time is probably a
good idea.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
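
(One way to get the pg_hba-like behavior described above — keep the old state when the new files do not load — is to build a fresh SSL_CTX and only swap it in on success. A hedged sketch using plain OpenSSL calls; this is not actual PostgreSQL code:)

#include <stdbool.h>
#include <openssl/ssl.h>
#include <openssl/err.h>

static SSL_CTX *active_ctx = NULL;

static bool
reload_server_ssl(const char *cert_file, const char *key_file)
{
	SSL_CTX    *new_ctx = SSL_CTX_new(SSLv23_server_method());

	if (new_ctx == NULL)
		return false;

	if (SSL_CTX_use_certificate_chain_file(new_ctx, cert_file) != 1 ||
		SSL_CTX_use_PrivateKey_file(new_ctx, key_file, SSL_FILETYPE_PEM) != 1 ||
		SSL_CTX_check_private_key(new_ctx) != 1)
	{
		/* log the ERR_get_error() details here and keep the old context */
		SSL_CTX_free(new_ctx);
		return false;
	}

	if (active_ctx != NULL)
		SSL_CTX_free(active_ctx);
	active_ctx = new_ctx;		/* newly started backends pick this up */
	return true;
}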



