[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
Hi Stephen,

Can I help with something? Is there some open question?

Regards,
Pavel

2014-09-08 6:27 GMT+02:00 Stephen Frost sfr...@snowman.net:
> * Pavel Stehule (pavel.steh...@gmail.com) wrote:
> > ignore_nulls in array_to_json_pretty probably is not necessary. On the
> > other hand, the cost is zero, and we can have it for API consistency.
>
> I'm willing to be persuaded either way on this, really. I do think it
> would be nice for both array_to_json and row_to_json to be single
> functions which take default values, but as for whether array_to_json
> should have an ignore_nulls option, I'm on the fence and would defer to
> people who use that function regularly (I don't today). Beyond that,
> I'm pretty happy moving forward with this patch.
>
> Thanks,
> Stephen
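The ignore_nulls behavior under discussion can be modeled with a short sketch. This is an illustrative Python model of the proposed semantics, not the server's C implementation; the function name and keyword argument merely mirror the proposal:

```python
import json

def row_to_json(row, ignore_nulls=False):
    # Hypothetical Python model of the proposed behavior: with
    # ignore_nulls=True, columns whose value is NULL (None here)
    # are omitted from the resulting JSON object entirely.
    if ignore_nulls:
        row = {k: v for k, v in row.items() if v is not None}
    return json.dumps(row)

print(row_to_json({"a": 1, "b": None}))                     # {"a": 1, "b": null}
print(row_to_json({"a": 1, "b": None}, ignore_nulls=True))  # {"a": 1}
```

With a `DEFAULT false` on such a parameter, a single function could serve both behaviors, which is the "single functions which take default values" point above.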
Re: [HACKERS] add modulo (%) operator to pgbench
Hello Robert,

> I am not objecting to the functionality; I'm objecting to bolting on
> ad-hoc operators one at a time. I think an expression syntax would let
> us do this in a much more scalable way. If I had time, I'd go do that,
> but I don't. We could add abs(x) and hash(x) and it would all be grand.

Ok. I do agree that an expression syntax would be great! However, that would not diminish nor change much the amount and kind of code necessary to add an operator or a function: the parser would have to be updated, as would the expression structure and the executor. Currently the pgbench parser and expressions are very limited, but they are also very generic, so that nothing is needed to add a new operator there; only the execution code needs to be updated, and the update would be basically the same ("if this is this function or operator, actually do it...") as if the execution part of a real expression syntax had to be updated.

So although I agree that a real expression syntax would be nice to have, I do not think that it is a valid objection to block this patch.

Moreover, upgrading pgbench to handle an actual expression syntax is not so trivial, because for instance some parts of the text need to be parsed (\set syntax) while others need not be parsed (SQL commands), so some juggling would be needed in the lexer, or maybe only call the parser on some lines and not others... Everything is possible, but this one would require some careful thinking.

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
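The table-driven style Fabien alludes to can be sketched quickly. This toy Python evaluator is purely illustrative (pgbench's real implementation is C); it shows why adding `%`, `abs(x)`, or `hash(x)` to a generic expression evaluator touches only the operator and function tables, not the overall structure:

```python
import operator

# Operator and function tables: adding %, abs(x), or hash(x) later only
# means adding an entry here; the evaluator itself is unchanged.
OPS = {
    '+': operator.add, '-': operator.sub,
    '*': operator.mul, '/': operator.floordiv,
    '%': operator.mod,
}
FUNCS = {'abs': abs}

def eval_expr(node):
    # node is a nested tuple: ('const', v), ('op', name, left, right),
    # or ('func', name, arg) -- a stand-in for a parsed expression tree.
    kind = node[0]
    if kind == 'const':
        return node[1]
    if kind == 'op':
        return OPS[node[1]](eval_expr(node[2]), eval_expr(node[3]))
    if kind == 'func':
        return FUNCS[node[1]](eval_expr(node[2]))
    raise ValueError(f"unknown node kind: {kind}")

# (1 + 7) % 3
print(eval_expr(('op', '%', ('op', '+', ('const', 1), ('const', 7)), ('const', 3))))  # 2
```

The contested part in pgbench is not this evaluator but the parsing: deciding which lines (\set) go through the expression parser and which (SQL) do not.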
Re: [HACKERS] RLS Design
On Wed, September 10, 2014 23:50, Stephen Frost wrote:
> [rls_9-10-2014.patch]

I can't get this to apply; I attach the complaints of patch.

Erik Rijkers

-- git clone git://git.postgresql.org/git/postgresql.git master
Cloning into 'master'...
-- git branch
* master

patching file doc/src/sgml/catalogs.sgml
patching file doc/src/sgml/config.sgml
Hunk #1 succeeded at 5411 (offset 114 lines).
patching file doc/src/sgml/keywords.sgml
patching file doc/src/sgml/ref/allfiles.sgml
patching file doc/src/sgml/ref/alter_policy.sgml
patching file doc/src/sgml/ref/alter_role.sgml
patching file doc/src/sgml/ref/create_policy.sgml
patching file doc/src/sgml/ref/create_role.sgml
patching file doc/src/sgml/ref/drop_policy.sgml
patching file doc/src/sgml/reference.sgml
patching file src/backend/catalog/Makefile
patching file src/backend/catalog/aclchk.c
patching file src/backend/catalog/dependency.c
patching file src/backend/catalog/heap.c
patching file src/backend/catalog/objectaddress.c
patching file src/backend/catalog/system_views.sql
patching file src/backend/commands/Makefile
patching file src/backend/commands/alter.c
patching file src/backend/commands/copy.c
patching file src/backend/commands/event_trigger.c
patching file src/backend/commands/policy.c
patching file src/backend/commands/tablecmds.c
patching file src/backend/commands/user.c
patching file src/backend/nodes/copyfuncs.c
patching file src/backend/nodes/equalfuncs.c
patching file src/backend/nodes/outfuncs.c
patching file src/backend/nodes/readfuncs.c
patching file src/backend/optimizer/plan/planner.c
patching file src/backend/optimizer/plan/setrefs.c
patching file src/backend/optimizer/prep/prepsecurity.c
patching file src/backend/parser/gram.y
patching file src/backend/parser/parse_agg.c
patching file src/backend/parser/parse_expr.c
patching file src/backend/rewrite/Makefile
patching file src/backend/rewrite/rewriteHandler.c
patching file src/backend/rewrite/rowsecurity.c
patching file src/backend/tcop/utility.c
Hunk #2 succeeded at 833 (offset -4 lines).
patching file src/backend/utils/adt/acl.c
patching file src/backend/utils/adt/ri_triggers.c
patching file src/backend/utils/cache/plancache.c
patching file src/backend/utils/cache/relcache.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/misc/postgresql.conf.sample
patching file src/bin/pg_dump/common.c
patching file src/bin/pg_dump/pg_backup.h
patching file src/bin/pg_dump/pg_backup_archiver.c
patching file src/bin/pg_dump/pg_dump.c
patching file src/bin/pg_dump/pg_dump.h
patching file src/bin/pg_dump/pg_dump_sort.c
patching file src/bin/pg_dump/pg_dumpall.c
patching file src/bin/pg_dump/pg_restore.c
patching file src/bin/psql/describe.c
patching file src/include/catalog/catversion.h
Hunk #1 FAILED at 53.
1 out of 1 hunk FAILED -- saving rejects to file src/include/catalog/catversion.h.rej
patching file src/include/catalog/dependency.h
patching file src/include/catalog/indexing.h
patching file src/include/catalog/pg_authid.h
patching file src/include/catalog/pg_class.h
patching file src/include/catalog/pg_rowsecurity.h
patching file src/include/commands/policy.h
patching file src/include/miscadmin.h
patching file src/include/nodes/nodes.h
patching file src/include/nodes/parsenodes.h
patching file src/include/nodes/plannodes.h
patching file src/include/nodes/relation.h
patching file src/include/optimizer/planmain.h
patching file src/include/parser/kwlist.h
patching file src/include/parser/parse_node.h
patching file src/include/rewrite/rowsecurity.h
patching file src/include/utils/acl.h
patching file src/include/utils/plancache.h
patching file src/include/utils/rel.h
patching file src/test/regress/expected/dependency.out
patching file src/test/regress/expected/privileges.out
patching file src/test/regress/expected/rowsecurity.out
patching file src/test/regress/expected/rules.out
patching file src/test/regress/expected/sanity_check.out
patching file src/test/regress/parallel_schedule
patching file src/test/regress/serial_schedule
patching file src/test/regress/sql/rowsecurity.sql
Re: [HACKERS] about half processes are blocked by btree, btree is bottleneck?
On 09/11/2014 06:08 AM, Xiaoyulei wrote:
> I use benchmarksql with more than 200 clients on pg 9.3.3. When the
> test is going on, I collect all the process stacks. I found about 100
> processes are blocked by btree insert. Another 100 are blocked by
> xloginsert. Does btree have bad performance in concurrency scenarios?

Well, there's always a bottleneck. I'd suggest trying 9.4; there were changes to make WAL insertion more scalable, as well as small tweaks to the spinlock implementation.

I'm not too familiar with BenchmarkSQL, but if 200 clients means 200 concurrent active connections to the database, that's a lot. You'll likely see better performance if you dial it down closer to the number of CPU cores in the server. Also make sure you use a large enough scale factor. Otherwise all transactions try to access a small set of rows, which naturally becomes a bottleneck.

- Heikki
Re: [HACKERS] Support for N synchronous standby servers
On Thu, Sep 11, 2014 at 9:10 AM, Michael Paquier michael.paqu...@gmail.com wrote:
> On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> > On 08/28/2014 10:10 AM, Michael Paquier wrote:
> > > + #synchronous_standby_num = -1 # number of standbys servers using sync rep
> >
> > To be honest, that's a horrible name for the GUC. Back when
> > synchronous replication was implemented, we had looong discussions on
> > this feature. It was called "quorum commit" back then. I'd suggest
> > using the quorum term in this patch, too; that's a fairly well-known
> > term in distributed computing for this.
>
> I am open to any suggestions. Then what about the following parameter names?
> - synchronous_quorum_num
> - synchronous_standby_quorum
> - synchronous_standby_quorum_num
> - synchronous_quorum_commit
> or simply synchronous_standbys

> When synchronous replication was added, quorum was left out to keep
> things simple; the current feature set was the most we could all agree
> on to be useful. If you search the archives for "quorum commit" you'll
> see what I mean. There was a lot of confusion on what is possible and
> what is useful, but regarding this particular patch: people wanted to
> be able to describe more complicated scenarios. For example, imagine
> that you have a master and two standbys in the primary data center,
> and two more standbys in a different data center. It should be
> possible to specify that you must get acknowledgment from at least one
> standby in both data centers. Maybe you could hack that by giving the
> standbys in the same data center the same name, but it gets ugly, and
> it still won't scale to even more complex scenarios.

Won't this problem be handled if synchronous mode is supported in cascading replication? I am not sure about the feasibility of the same, but I think the basic problem mentioned above can be addressed, and maybe others as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
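The multi-datacenter requirement Heikki describes amounts to a per-group quorum check: a commit may only be acknowledged once every required group of standbys contains at least one acknowledging member. A minimal Python sketch of that predicate, with all names hypothetical:

```python
def quorum_satisfied(acks, required_groups):
    # acks: names of standbys that have acknowledged the commit;
    # required_groups: one set of standby names per data center.
    # The commit may proceed only if every group has at least one ack.
    return all(bool(group & acks) for group in required_groups)

dc1 = {"standby_a", "standby_b"}   # primary data center
dc2 = {"standby_c", "standby_d"}   # remote data center

print(quorum_satisfied({"standby_a", "standby_c"}, [dc1, dc2]))  # True
print(quorum_satisfied({"standby_a", "standby_b"}, [dc1, dc2]))  # False
```

A flat "N synchronous standbys" GUC cannot express this, which is the argument for a richer quorum specification rather than a single counter.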
Re: [HACKERS] pg_xlogdump --stats
At 2014-08-21 10:06:39 +0300, hlinnakan...@vmware.com wrote:
> Committed the patch to add INT64_MODIFIER, with minor fixes.

Thank you.

> The new rm_identify method needs to be documented. […] Perhaps add
> comments to the RmgrData struct, explaining all of the methods.

OK, I'll submit a patch to add these comments.

> I think the names that rm_identify returns should match those that the
> rm_desc functions print.

I had originally started off trying to keep the output in sync, but it doesn't work very well. There are rm_desc functions that print things like "truncate before" and "Create posting tree", and many decisions are quite arbitrary (freeze_page, cleanup info, multi-insert). I think it's better the (grep-friendly) way it is.

If anything, perhaps rm_desc should output "${rm_identify}[: optional explanation]". That would also make it very clear what should go in rm_identify and what should go in rm_desc. Thoughts?

The corresponding rm_identify output is: HOT_UPDATE+INIT

The +INIT is admittedly a special case, and I would have no objection to writing that as (INIT) or something instead.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
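The `${rm_identify}[: optional explanation]` convention proposed above is easy to model: the identify part is the stable, grep-friendly record name, and the desc part is an optional detail suffix. A tiny Python sketch; the record names and detail strings are illustrative only:

```python
def format_record(identify, desc):
    # identify: the stable, grep-friendly name from rm_identify();
    # desc: the optional detail part from rm_desc(). Joining them here
    # (instead of inside each rm_desc) keeps the two roles separate.
    return f"{identify}: {desc}" if desc else identify

print(format_record("HOT_UPDATE+INIT", "off 3 xmax 1234"))  # HOT_UPDATE+INIT: off 3 xmax 1234
print(format_record("CHECKPOINT_SHUTDOWN", ""))             # CHECKPOINT_SHUTDOWN
```

The payoff is that tools like pg_xlogdump --stats can group on the identify part alone while the human-readable dump still carries the details.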
Re: [HACKERS] pg_xlogdump --stats
On 09/11/2014 11:43 AM, Abhijit Menon-Sen wrote:
> At 2014-08-21 10:06:39 +0300, hlinnakan...@vmware.com wrote:
> > I think the names that rm_identify returns should match those that
> > the rm_desc functions print.
>
> I had originally started off trying to keep the output in sync, but it
> doesn't work very well. There are rm_desc functions that print things
> like "truncate before" and "Create posting tree", and many decisions
> are quite arbitrary (freeze_page, cleanup info, multi-insert).

It would be good to clean up those messages to be more sensible and consistent.

> I think it's better the (grep-friendly) way it is. If anything, perhaps
> rm_desc should output "${rm_identify}[: optional explanation]". That
> would also make it very clear what should go in rm_identify and what
> should go in rm_desc.

Yeah, that makes sense too.

> The corresponding rm_identify output is: HOT_UPDATE+INIT
>
> The +INIT is admittedly a special case, and I would have no objection
> to writing that as (INIT) or something instead.

I don't care much, as long as it's consistent with the rm_desc output.

It's quite arbitrary that the INIT cases are considered different record types, with their own identity string, instead of being just a flag in the same record type. It's an implementation detail that that flag is stored in the xl_info, and that implementation detail is now exposed in the pg_xlogdump --stats output. There are similar cases in B-tree splits, for example, where a split where the new tuple goes to the left half is considered a different record type from one where it goes to the right half. It might be interesting to count them separately in the stats, but then again it might just be confusing.

The xl_info flags and the rm_desc output were never intended to be a useful categorization for statistics purposes, so it's not surprising that they're not too well suited for that. Nevertheless, pg_xlogdump --stats is useful as it is, if you know the limitations.

- Heikki
Re: [HACKERS] inherit support for foreign tables
(2014/09/11 4:32), Heikki Linnakangas wrote:
> I had a cursory look at this patch and the discussions around this.

Thank you!

> ISTM there are actually two new features in this: 1) allow CHECK
> constraints on foreign tables, and 2) support inheritance for foreign
> tables. How about splitting it into two?

That's right. There are the two in this patch. I'm not sure I should split the patch into the two, because 1) won't make sense without 2). As described in the following note added to the docs on CREATE FOREIGN TABLE, CHECK constraints on foreign tables are intended to support constraint exclusion for partitioned foreign tables:

+ Constraints on foreign tables are not enforced on insert or update.
+ Those definitions simply declare the constraints hold for all rows
+ in the foreign tables. It is the user's responsibility to ensure
+ that those definitions match the remote side. Such constraints are
+ used for some kind of query optimization such as constraint exclusion
+ for partitioned tables.

Thanks,

Best regards,
Etsuro Fujita
Re: [HACKERS] pg_xlogdump --stats
On 2014-09-11 12:14:42 +0300, Heikki Linnakangas wrote:
> On 09/11/2014 11:43 AM, Abhijit Menon-Sen wrote:
> > I had originally started off trying to keep the output in sync, but
> > it doesn't work very well. There are rm_desc functions that print
> > things like "truncate before" and "Create posting tree", and many
> > decisions are quite arbitrary (freeze_page, cleanup info,
> > multi-insert).
>
> It would be good to clean up those messages to be more sensible and
> consistent.
>
> > I think it's better the (grep-friendly) way it is. If anything,
> > perhaps rm_desc should output "${rm_identify}[: optional
> > explanation]". That would also make it very clear what should go in
> > rm_identify and what should go in rm_desc.
>
> Yeah, that makes sense too.

I'm not sure I agree here. From a theoretical POV sure, but wouldn't that lead to even longer lines for xlogdump and other error messages for some records? We probably need to see how it plays out.

> > The corresponding rm_identify output is: HOT_UPDATE+INIT
> >
> > The +INIT is admittedly a special case, and I would have no objection
> > to writing that as (INIT) or something instead.
>
> I don't care much, as long as it's consistent with the rm_desc output.
>
> It's quite arbitrary that the INIT cases are considered different
> record types, with their own identity string, instead of being just a
> flag in the same record type. It's an implementation detail that that
> flag is stored in the xl_info, and that implementation detail is now
> exposed in the pg_xlogdump --stats output.

It's also actually quite good for display purposes. +INIT will have quite different WAL logging characteristics :)

> There are similar cases in B-tree splits, for example, where a split
> where the new tuple goes to the left half is considered a different
> record type from one where it goes to the right half. It might be
> interesting to count them separately in the stats, but then again it
> might just be confusing.
>
> The xl_info flags and the rm_desc output were never intended to be a
> useful categorization for statistics purposes, so it's not surprising
> that they're not too well suited for that. Nevertheless, pg_xlogdump
> --stats is useful as it is, if you know the limitations.

I agree it's not perfect, but I think it's a huge step forward. We very well might want to improve upon the stats granularity once it's in, but I think it already gives a fair amount of direction on where to look.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_xlogdump --stats
On 09/11/2014 12:22 PM, Andres Freund wrote:
> > > I think it's better the (grep-friendly) way it is. If anything,
> > > perhaps rm_desc should output "${rm_identify}[: optional
> > > explanation]". That would also make it very clear what should go
> > > in rm_identify and what should go in rm_desc.
> >
> > Yeah, that makes sense too.
>
> I'm not sure I agree here. From a theoretical POV sure, but wouldn't
> that lead to even longer lines for xlogdump and other error messages
> for some records?

No. All the rm_desc functions already follow that convention, and print "foo: details", where "foo" identifies the record type based on xl_info. This proposal would just make that convention more official, and stipulate that the "foo" part should match what the new rm_identify() function returns. (Or perhaps even better, define it so that rm_desc prints only the "details" part and rm_identify() the "foo" part, and have the callers put the two strings together if they want.)

> > The xl_info flags and the rm_desc output were never intended to be a
> > useful categorization for statistics purposes, so it's not
> > surprising that they're not too well suited for that. Nevertheless,
> > pg_xlogdump --stats is useful as it is, if you know the limitations.
>
> I agree it's not perfect, but I think it's a huge step forward. We very
> well might want to improve upon the stats granularity once it's in,
> but I think it already gives a fair amount of direction on where to
> look.

Agreed. That's what I was also trying to say.

- Heikki
[HACKERS] Fix MSVC isnan warning from e80252d4
The attached small patch fixes the following warning:

src\backend\utils\adt\arrayfuncs.c(5613): warning C4013: '_isnan' undefined; assuming extern returning int [D:\Postgres\a\postgres.vcxproj]

The fix is pretty much just a copy and paste from costsize.c.

Regards

David Rowley

isnan_msvc_fix.patch
Description: Binary data
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 05:47 PM, Mitsumasa KONDO wrote:
> Hi, I find a typo in your patch. Please confirm.
>
> @line 239
> - agg->sum2_lag = 0;
> + agg->sum_lag = 0;

Ah thanks, good catch!

> And back patch is welcome for me.

I've committed and backpatched this, as well as a patch to refactor the way the Poisson delay is computed. I kept the log file format unchanged when --rate is not used, so it now has a different number of fields depending on whether --rate is used or not.

Please review the changes I made one more time, to double-check that I didn't mess anything up.

- Heikki
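For reference, throttling to a target rate with Poisson-distributed arrivals means drawing exponentially distributed inter-transaction delays with mean 1/rate. This Python sketch models that computation via inverse-transform sampling; it illustrates the idea only and is not the pgbench C code:

```python
import math
import random

def poisson_delay(rate_per_sec, rng=random.random):
    # Inter-arrival times of a Poisson process are exponentially
    # distributed with mean 1/rate; inverse-transform sampling turns a
    # uniform u in [0, 1) into such a delay. Using 1.0 - u keeps the
    # argument of log() strictly positive.
    u = rng()
    return -math.log(1.0 - u) / rate_per_sec  # delay in seconds

random.seed(42)
delays = [poisson_delay(100.0) for _ in range(10000)]
print(sum(delays) / len(delays))  # mean delay, close to 1/100 s
```

Scheduling each transaction at the sum of such delays (rather than sleeping after each one) is what lets lateness relative to the schedule be measured, which matters for the latency-limit work in the other subthread.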
Re: [HACKERS] pgbench throttling latency limit
On 08/30/2014 07:16 PM, Fabien COELHO wrote:
> > + if (latency_limit)
> > + printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "\n",
> > + latency_limit / 1000.0, latency_late);
> >
> > Any reason not to report a percentage here?
>
> Yes: I did not think of it. Here is a v7, with a percent. I also added
> a paragraph in the documentation about how the latency is computed
> under throttling, and I tried to reorder the reported stuff so that it
> is more logical.

Now that I've finished the detour and committed and backpatched the changes to the way latency is calculated, we can get back to this patch. It needs to be rebased.

How should skipped transactions be taken into account in the log file output, with and without aggregation? I assume we'll want to have some trace of skipped transactions in the logs.

- Heikki
Re: [HACKERS] inherit support for foreign tables
On 09/11/2014 12:22 PM, Etsuro Fujita wrote:
> That's right. There are the two in this patch. I'm not sure I should
> split the patch into the two, because 1) won't make sense without 2).
> As described in the following note added to the docs on CREATE FOREIGN
> TABLE, CHECK constraints on foreign tables are intended to support
> constraint exclusion for partitioned foreign tables.

The planner can do constraint exclusion based on CHECK constraints even without inheritance. It's not enabled by default because it can increase planning time, but if you set constraint_exclusion=on, it will work. For example:

postgres=# create table foo (i int4 CHECK (i > 0));
CREATE TABLE
postgres=# explain select * from foo WHERE i < 0;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on foo  (cost=0.00..40.00 rows=800 width=4)
   Filter: (i < 0)
 Planning time: 0.335 ms
(3 rows)

postgres=# show constraint_exclusion;
 constraint_exclusion
----------------------
 partition
(1 row)

postgres=# set constraint_exclusion = 'on';
SET
postgres=# explain select * from foo WHERE i < 0;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
 Planning time: 0.254 ms
(3 rows)

postgres=#

- Heikki
Re: [HACKERS] Fix MSVC isnan warning from e80252d4
On 09/11/2014 12:52 PM, David Rowley wrote:
> The attached small patch fixes the following warning:
>
> src\backend\utils\adt\arrayfuncs.c(5613): warning C4013: '_isnan' undefined; assuming extern returning int [D:\Postgres\a\postgres.vcxproj]
>
> The fix is pretty much just a copy and paste from costsize.c

Thanks, committed.

- Heikki
Re: [HACKERS] Scaling shared buffer eviction
Hi,

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> include $(top_srcdir)/src/backend/common.mk
> diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
> new file mode 100644
> index 000..3df2337
> --- /dev/null
> +++ b/src/backend/postmaster/bgreclaimer.c

A fair number of comments in that file refer to bgwriter...

> + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
> + * attempts to keep regular backends from having to run clock sweep (which
> + * they would only do when they don't find a usable shared buffer from
> + * freelist to read in another page).

That's not really accurate. Freelist pages are often also needed to write new pages, without reading anything in. I'd phrase it as "which they only need to do if they don't find a victim buffer from the freelist".

> + * In the best scenario all requests
> + * for shared buffers will be fulfilled from freelist as the background
> + * reclaimer process always tries to maintain buffers on freelist.  However,
> + * regular backends are still empowered to run clock sweep to find a usable
> + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.

"empowered" sounds strange to me. "can still run the clock sweep"?

> + * The bgwriter is started by the postmaster as soon as the startup subprocess
> + * finishes, or as soon as recovery begins if we are doing archive recovery.

Why only archive recovery? I guess (only read this far...) it's not just during InArchiveRecovery but also StandbyMode? But I don't see why we shouldn't use it during normal crash recovery. That's also often painfully slow and the reclaimer could help? Less, but still.

> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> +static void ReqShutdownHandler(SIGNAL_ARGS);
> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

> + /*
> +  * If an exception is encountered, processing resumes here.
> +  *
> +  * See notes in postgres.c about the design of this coding.
> +  */
> + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> + {
> +     /* Since not using PG_TRY, must reset error stack by hand */
> +     error_context_stack = NULL;
> +
> +     /* Prevent interrupts while cleaning up */
> +     HOLD_INTERRUPTS();
> +
> +     /* Report the error to the server log */
> +     EmitErrorReport();
> +
> +     /*
> +      * These operations are really just a minimal subset of
> +      * AbortTransaction().  We don't have very many resources to worry
> +      * about in bgreclaim, but we do have buffers and file descriptors.
> +      */
> +     UnlockBuffers();
> +     AtEOXact_Buffers(false);
> +     AtEOXact_Files();

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet).

> +     /*
> +      * Now return to normal top-level context and clear ErrorContext for
> +      * next time.
> +      */
> +     MemoryContextSwitchTo(bgreclaim_context);
> +     FlushErrorState();
> +
> +     /* Flush any leaked data in the top-level context */
> +     MemoryContextResetAndDeleteChildren(bgreclaim_context);
> +
> +     /* Now we can allow interrupts again */
> +     RESUME_INTERRUPTS();

Other processes sleep for a second here; I think that's a good idea. E.g. that bit:

	/*
	 * Sleep at least 1 second after any error.  A write error is likely
	 * to be repeated, and we don't want to be filling the error logs as
	 * fast as we can.
	 */
	pg_usleep(1000000L);

> + /*
> +  * Loop forever
> +  */
> + for (;;)
> + {
> +     int rc;
> +
> +     /*
> +      * Backend will signal bgreclaimer when the number of buffers in
> +      * freelist falls below the low water mark of freelist.
> +      */
> +     rc = WaitLatch(&MyProc->procLatch,
> +                    WL_LATCH_SET | WL_POSTMASTER_DEATH,
> +                    -1);

That's probably not going to work well directly after a (re)start of bgreclaim (depending on how you handle the water mark, I'll see in a bit). Maybe it should rather be

	ResetLatch();
	BgMoveBuffersToFreelist();
	pgstat_send_bgwriter();
	rc = WaitLatch();
	if (rc & WL_POSTMASTER_DEATH)
		exit(1);

> +Background Reclaimer's Processing
> +---------------------------------
> +
> +The background reclaimer is designed to move buffers to freelist that are
> +likely to be recycled soon, thereby offloading the need to perform
> +clock sweep work from active
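The low/high water mark interaction between backends and the bgreclaimer can be modeled in a few lines. This is a toy single-threaded Python sketch; the class, the refill-to-high-mark policy, and all names are assumptions for illustration (the real patch uses latches and shared memory, and backends proceed without waiting for the refill):

```python
class Freelist:
    # Toy model: backends pop buffers and "wake" the reclaimer when the
    # freelist drops below the low water mark; the reclaimer refills it
    # up to the high water mark.
    def __init__(self, low, high):
        self.buffers = [object() for _ in range(high)]
        self.low, self.high = low, high
        self.reclaimer_wakeups = 0

    def get_buffer(self):          # what a backend does
        buf = self.buffers.pop()
        if len(self.buffers) < self.low:
            self.wake_reclaimer()  # stands in for SetLatch()
        return buf

    def wake_reclaimer(self):      # what the bgreclaimer does when woken
        self.reclaimer_wakeups += 1
        while len(self.buffers) < self.high:
            self.buffers.append(object())  # notionally, a clock-sweep victim

fl = Freelist(low=2, high=8)
for _ in range(20):
    fl.get_buffer()
print(fl.reclaimer_wakeups)  # 2 refills for 20 buffer requests
```

The point of the two marks is batching: one wakeup amortizes the clock-sweep work over many buffer requests, instead of signaling on every allocation.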
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On Thu, Aug 28, 2014 at 6:23 AM, Tom Lane t...@sss.pgh.pa.us wrote: If the majority of the added code is code that will be needed for less-bogus optimizations, it might be all right; but I'd kind of want to see the less-bogus optimizations working first. That seems fair. Likely there'd be not a great deal of value to semi and anti joins removal alone. I was more just trying to spread weight of an inner join removal patch... So In response to this, I've gone off and written an inner join removal support patch (attached), which turned out to be a bit less complex than I thought. Here's a quick demo, of the patch at work: test=# create table c (id int primary key); CREATE TABLE test=# create table b (id int primary key, c_id int not null references c(id)); CREATE TABLE test=# create table a (id int primary key, b_id int not null references b(id)); CREATE TABLE test=# test=# explain select a.* from a inner join b on a.b_id = b.id inner join c on b.c_id = c.id; QUERY PLAN - Seq Scan on a (cost=0.00..31.40 rows=2140 width=8) Planning time: 1.061 ms (2 rows) Perhaps not a greatly useful example, but if you can imagine the joins are hidden in a view and the user is just requesting a small subset of columns, then it does seem quite powerful. There's currently a few things with the patch that I'll list below, which may raise a few questions: 1. I don't think that I'm currently handling removing eclass members properly. So far the code just removes the Vars that belong to the relation being removed. I likely should also be doing bms_del_member(ec-ec_relids, relid); on the eclass, but perhaps I should just be marking the whole class as ec_broken = true and adding another eclass all everything that the broken one has minus the parts from the removed relation? 2. Currently the inner join removal is dis-allowed if the (would be) removal relation has *any* baserestrictinfo items. 
The reason for this is that we must ensure that the inner join gives us exactly 1 row match on the join condition, but a baserestrictinfo can void the proof that a foreign key would give us that a matching row does exist. However there is an exception to this that could allow that restriction to be relaxed. That is if the qual in baserestrictinfo use vars that are in an eclass, where the same eclass also has ec members vars that belong to the rel that we're using the foreign key for to prove the relation not needed umm.. that's probably better described by example: Assume there's a foreign key a (x) reference b(x) SELECT a.* FROM a INNER JOIN b ON a.x = b.x WHERE b.x = 1 relation b should be removable because an eclass will contain {a.x, b.x} and therefore s baserestrictinfo for a.x = 1 should also exist on relation a. Therefore removing relation b should produce equivalent results, i.e everything that gets filtered out on relation b will also be filtered out on relation a anyway. I think the patch without this is still worth it, but if someone feels strongly about it I'll take a bash at supporting it. 3. Currently the inner join support does not allow removals using foreign keys which contain duplicate columns on the referencing side. e.g (a,a) references (x,y), this is basically because of the point I made in item 2. In this case a baserestrictinfo would exist on the referenced relation to say WHERE x = y. I'd have to remove the restriction described in item 2 and do a small change to the code that extracts the join condition from the eclass for this to work. But it's likely a corner case that's not worth too much trouble to support. I think probably if I saw an FK like that in the field, I'd probably scratch my head for a while, while trying to understanding why they bothered. 4. The patch currently only allows removals for eclass join types. If the rel has any joininfo items, then the join removal is disallowed. 
From what I can see, equality-type inner join conditions get described in eclasses, and only non-equality join conditions make it into the joininfo list; since foreign keys only support equality operators, I thought this was a valid restriction. However, if someone can show me a flaw in my assumption, then I may need to improve this. 5. I've added a flag to pg_class called relhasfkey. Currently this gets set to true when a foreign key is added, though I've added nothing to set it back to false again. I notice that relhasindex gets set back to false during vacuum, if vacuum happens to find there not to be any indexes on the rel. I didn't put my logic here as I wasn't too sure if scanning pg_constraint during a vacuum seemed very correct, so I just left out the set-it-back-to-false logic, based on the fact that I noticed that relhaspkey gets away with quite lazy setting-back-to-false logic (only when there are no indexes of any kind left on the relation at all). The only thing else I can think of is perhaps optimising a little.
Re: [HACKERS] Support for N synchronous standby servers
On Thu, Aug 28, 2014 at 12:40 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi rajeev.rast...@huawei.com wrote: I have done some more review, below are my comments: Thanks! 1. There are currently two loops on *num_sync. Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop?

if (*num_sync == allowed_sync_nodes)
{
    for (j = 0; j < *num_sync; j++)
    {

Anyway we require priority only if the *num_sync == allowed_sync_nodes condition matches. So in this loop itself, we can calculate the priority as well as the assignment of new standbys with lower priority. Let me know if you see any issue with this. OK, I see, yes this can minimize processing a bit, so I refactored the code by integrating the second loop into the first. This has needed the removal of the break portion, as we need to find the highest priority value among the nodes already determined as synchronous. 2. Comment inside the function SyncRepReleaseWaiters:

/*
 * We should have found ourselves at least, except if it is not expected
 * to find any synchronous nodes.
 */
Assert(num_sync > 0);

I think "except if it is not expected to find any synchronous nodes" is not required, as if it has come till this point it means at least this node is synchronous. Yes, removed. 3. The documentation says that s_s_num should be less than max_wal_senders, but code-wise there is no protection for the same. IMHO, s_s_num should be less than or equal to max_wal_senders, otherwise COMMIT will never return back to the console without the user's knowledge. I see that some discussion has happened regarding this, but I think just adding documentation for this is not enough. I am not sure what issue is observed in adding a check during GUC initialization, but if there is an unavoidable issue during GUC initialization, then can't we try to add the check at later points?
The trick here is that you cannot really return a warning at GUC loading level to the user, as a warning could be easily triggered if for example s_s_num is present before max_wal_senders in postgresql.conf. I am open to any solutions if there are any (like an error when initializing WAL senders?!). Documentation seems enough for me to warn the user. How about making it a PGC_POSTMASTER parameter and then having a check similar to the below in PostmasterMain()?

/*
 * Check for invalid combinations of GUC settings.
 */
if (ReservedBackends >= MaxConnections)
{
    write_stderr("%s: superuser_reserved_connections must be less than max_connections\n", progname);
    ExitPostmaster(1);
}
if (max_wal_senders >= MaxConnections)
{
    write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
    ExitPostmaster(1);
}
if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
    ereport(ERROR,
            (errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
    ereport(ERROR,
            (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RLS Design
Erik, * Erik Rijkers (e...@xs4all.nl) wrote: On Wed, September 10, 2014 23:50, Stephen Frost wrote: [rls_9-10-2014.patch] I can't get this to apply; I attach the complaints of patch. Thanks for taking a look at this! [...] patching file src/include/catalog/catversion.h Hunk #1 FAILED at 53. 1 out of 1 hunk FAILED -- saving rejects to file src/include/catalog/catversion.h.rej That's just the catversion bump- you can simply ignore it and everything should be fine. Look forward to hearing how it works for you! Thanks again, Stephen signature.asc Description: Digital signature
Re: [HACKERS] inherit support for foreign tables
(2014/09/11 19:46), Heikki Linnakangas wrote: On 09/11/2014 12:22 PM, Etsuro Fujita wrote: (2014/09/11 4:32), Heikki Linnakangas wrote: I had a cursory look at this patch and the discussions around this. Thank you! ISTM there are actually two new features in this: 1) allow CHECK constraints on foreign tables, and 2) support inheritance for foreign tables. How about splitting it into two? That's right. There are the two in this patch. I'm not sure if I should split the patch into the two, because 1) won't make sense without 2). As described in the following note added to the docs on CREATE FOREIGN TABLE, CHECK constraints on foreign tables are intended to support constraint exclusion for partitioned foreign tables:

+ Constraints on foreign tables are not enforced on insert or update.
+ Those definitions simply declare the constraints hold for all rows
+ in the foreign tables. It is the user's responsibility to ensure
+ that those definitions match the remote side. Such constraints are
+ used for some kind of query optimization such as constraint exclusion
+ for partitioned tables

The planner can do constraint exclusion based on CHECK constraints even without inheritance. It's not enabled by default because it can increase planning time, but if you set constraint_exclusion=on, it will work. Exactly! For example:

postgres=# create table foo (i int4 CHECK (i > 0));
CREATE TABLE
postgres=# explain select * from foo WHERE i < 0;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on foo  (cost=0.00..40.00 rows=800 width=4)
   Filter: (i < 0)
 Planning time: 0.335 ms
(3 rows)

postgres=# show constraint_exclusion;
 constraint_exclusion
----------------------
 partition
(1 row)

postgres=# set constraint_exclusion = 'on';
SET
postgres=# explain select * from foo WHERE i < 0;
                 QUERY PLAN
--------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
 Planning time: 0.254 ms
(3 rows)

postgres=#

Actually, this patch allows the exact same thing to apply to foreign tables. My explanation was insufficient about that. Sorry for that.
So, should I split the patch into the two? Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Fri, Jul 25, 2014 at 11:41 PM, Robert Haas robertmh...@gmail.com wrote: Patch 4 adds infrastructure that allows one session to save all of its non-default GUC values and another session to reload those values. This was written by Amit Khandekar and Noah Misch. It allows pg_background to start up the background worker with the same GUC settings that the launching process is using. I intend this as a demonstration of how to synchronize any given piece of state between cooperating backends. For real parallelism, we'll need to synchronize snapshots, combo CIDs, transaction state, and so on, in addition to GUCs. But GUCs are ONE of the things that we'll need to synchronize in that context, and this patch shows the kind of API we're thinking about for these sorts of problems. Don't we need some way to prohibit the launching process from changing a GUC once it has shared the existing GUCs? Patch 6 is pg_background itself. I'm quite pleased with how easily this came together. The existing background worker, dsm, shm_toc, and shm_mq infrastructure handles most of the heavy lifting here - obviously with some exceptions addressed by the preceding patches. Again, this is the kind of set-up that I'm expecting will happen in a background worker used for actual parallelism - clearly, more state will need to be restored there than here, but nonetheless the general flow of the code here is about what I'm imagining, just with somewhat different kinds of state. Most of the work of writing this patch was actually figuring out how to execute the query itself; what I ended up with is mostly copied from exec_simple_query, but with some differences here and there. I'm not sure if it would be possible/advisable to try to refactor to reduce duplication. 1. This patch generates a warning on Windows:

1>pg_background.obj : error LNK2001: unresolved external symbol StatementTimeout

You need to add PGDLLIMPORT for StatementTimeout. 2.
CREATE FUNCTION pg_background_launch(sql pg_catalog.text, queue_size pg_catalog.int4 DEFAULT 65536) — here, shouldn't queue_size be pg_catalog.int8, as I could see some related functions in test_shm_mq use int8?

CREATE FUNCTION test_shm_mq(queue_size pg_catalog.int8,
CREATE FUNCTION test_shm_mq_pipelined(queue_size pg_catalog.int8,

Anyway, I think the corresponding C function doesn't use a matching macro to extract the function args:

pg_background_launch(PG_FUNCTION_ARGS)
{
    text   *sql = PG_GETARG_TEXT_PP(0);
    int32   queue_size = PG_GETARG_INT64(1);

Here it should use the _INT32 variant to match the current SQL definition, otherwise it leads to the below error:

postgres=# select pg_background_launch('vacuum verbose foo');
ERROR: queue size must be at least 64 bytes

3. Comparing execute_sql_string() and exec_simple_query(), I could see the below main differences: a. Allocates a new memory context different from the message context b. Start/commit control of the transaction is outside execute_sql_string c. enable/disable of statement timeout is done from outside in case of execute_sql_string() d. execute_sql_string() prohibits Start/Commit/Abort transaction statements. e. execute_sql_string() doesn't log statements f. execute_sql_string() uses binary format for results, whereas exec_simple_query() uses TEXT as the default format g. processed-rows stat related info is handled by the caller in case of execute_sql_string(). Can't we handle all these or other changes inside exec_simple_query() based on some parameter, or maybe use a hook similar to what we do in ProcessUtility? Basically it looks a bit odd to have a (mostly) duplicate copy of exec_simple_query(). 4. Won't it be better if pg_background_worker_main() can look more like PostgresMain() (in terms of handling different kinds of messages), so that it can be extended in future to handle parallel workers? 5. pg_background_result() { .. /* Read and processes messages from the shared memory queue.
*/ } Shouldn't the processing of messages be a separate function, as we do for pqParseInput3()? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] inherit support for foreign tables
On 09/11/2014 02:30 PM, Etsuro Fujita wrote: Actually, this patch allows the exact same thing to apply to foreign tables. My explanation was insufficient about that. Sorry for that. Great, that's what I thought. So, should I split the patch into the two? Yeah, please do. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
On Wed, Sep 10, 2014 at 4:54 AM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Finally I think that we need case-insensitive versions of get_role_id() and get_database_id() to accomplish this patch's objective. (This runs full scans on pg_database or pg_authid.) Any such thing is certainly grounds for rejecting the patch outright. It may be that pg_hba.conf should follow the same case-folding rules we use elsewhere, but it should not invent novel semantics, especially ones that make connecting to the database a far more expensive operation than it is today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
* Pavel Stehule (pavel.steh...@gmail.com) wrote: Can I help with something, it is there some open question? I had been hoping for a more definitive answer regarding this option for array_to_json, or even a comment about the change to row_to_json. Andrew- any thoughts on this? (that's what the ping on IRC is for :). Thanks, Stephen 2014-09-08 6:27 GMT+02:00 Stephen Frost sfr...@snowman.net: * Pavel Stehule (pavel.steh...@gmail.com) wrote: ignore_nulls in array_to_json_pretty probably is not necessary. On second hand, the cost is zero, and we can have it for API consistency. I'm willing to be persuaded either way on this, really. I do think it would be nice for both array_to_json and row_to_json to be single functions which take default values, but as for if array_to_json has a ignore_nulls option, I'm on the fence and would defer to people who use that function regularly (I don't today). Beyond that, I'm pretty happy moving forward with this patch. signature.asc Description: Digital signature
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki, Now that I've finished the detour and committed and backpatched the changes to the way latency is calculated, we can get back to this patch. It needs to be rebased. Before rebasing, I think that there are a few small problems with the modification applied to switch from an integer range to a double. (1) ISTM that the + 0.5 which remains in the PoissonRand computation comes from the previous integer approach and is not needed here. If I'm not mistaken, the formula should be plain: -log(uniform) * center (2) I'm not sure about the name center; I think that lambda or mean would be more appropriate. (3) I wish that the maximum implied multiplier could be explicitly documented in the source code. From the pg_rand48 source code, I think that it is 33.27106466687737 -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. There's some initial talk about implementing pluggable compression algorithms for TOAST, and I guess the same must be taken into consideration for the WAL. -- Arthur Silva On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed rahilasyed...@gmail.com wrote: I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows:

Compression   Multiple   Single
Off           81.1338    81.1267
LZ4           81.0998    81.1695
Snappy        80.9741    80.9703
Pglz          81.2353    81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png The numbers show CPU utilization of Snappy is the least. The CPU utilization in decreasing order is: pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, snappy seems to be best when it comes to lower utilization of CPU. As per the measurement results posted to date, LZ4 outperforms snappy and pglz in terms of compression ratio and performance. However, CPU utilization numbers show snappy utilizes the least amount of CPU. The difference is not much, though. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows a considerable performance benefit over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block. ISTM that the performance overhead introduced by multiple-block compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz.
Hence, this patch can be built using multiple blocks compression. Thoughts? -- View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
(3) I wish that the maximum implied multiplier could be explicitly documented in the source code. From the pg_rand48 source code, I think that it is 33.27106466687737. Small possibly buggy code attached, to show how I computed the above figure. -- Fabien.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* largest value pg_erand48() can return: three 16-bit words of 0xFFFF,
     * i.e. v = 1 - 2^-48, so -log(1 - v) = 48 * log(2) = 33.27106466687737 */
    double v = ldexp(0xFFFF, -48) + ldexp(0xFFFF, -32) + ldexp(0xFFFF, -16);
    double u = 1.0 - v;
    double m = -log(u);

    printf("v=%.16g u=%.16g m=%.16g\n", v, u, m);
}

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote: I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. There's some initial talk about implementing pluggable compression algorithms for TOAST, and I guess the same must be taken into consideration for the WAL. -- Arthur Silva On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed rahilasyed...@gmail.com wrote: I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows:

Compression   Multiple   Single
Off           81.1338    81.1267
LZ4           81.0998    81.1695
Snappy        80.9741    80.9703
Pglz          81.2353    81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png The numbers show CPU utilization of Snappy is the least. The CPU utilization in decreasing order is: pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, snappy seems to be best when it comes to lower utilization of CPU. As per the measurement results posted to date, LZ4 outperforms snappy and pglz in terms of compression ratio and performance. However, CPU utilization numbers show snappy utilizes the least amount of CPU. The difference is not much, though. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows a considerable performance benefit over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block.
ISTM that performance overhead introduced by multiple blocks compression is slightly higher than single block compression which can be tested again after modifying the patch to use pglz . Hence, this patch can be built using multiple blocks compression. Thoughts? Hi, The big (huge) win for lz4 (not the HC variant) is the enormous compression and decompression speed. It compresses quite a bit faster (33%) than snappy and decompresses twice as fast as snappy. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. +Two water mark indicators are used to maintain sufficient number of buffers +on freelist. Low water mark indicator is used by backends to wake bgreclaimer +when the number of buffers in freelist falls below it. High water mark +indicator is used by bgreclaimer to move buffers to freelist. For me the description of the high water as stated here doesn't seem to explain anything. Yeah, let me try to revise and expand on that a bit: Background Reclaimer's Processing - The background reclaimer runs the clock sweep to identify buffers that are good candidates for eviction and puts them on the freelist. This makes buffer allocation much faster, since removing a buffer from the head of a linked list is much cheaper than linearly scanning the whole buffer pool until a promising candidate is found. It's possible that a buffer we add to the freelist may be accessed or even pinned before it's evicted; if that happens, the backend that would have evicted it will simply disregard it and take the next buffer instead (or run the clock sweep itself, if necessary). 
However, to make sure that doesn't happen too often, we need to keep the freelist as short as possible, so that there won't be many other buffer accesses between the time a buffer is added to the freelist and the time when it's actually evicted. We use two water marks to control the activity of the bgreclaimer process. Each time bgreclaimer is awoken, it will move buffers to the freelist until the length of the free list reaches the high water mark. It will then sleep. When the number of buffers on the freelist reaches the low water mark, backends attempting to allocate new buffers will set the bgreclaimer's latch, waking it up again. While it's important for the high water mark to be small (for the reasons described above), we also need to ensure adequate separation between the low and high water marks, so that the bgreclaimer isn't constantly being awoken to find just a handful of additional candidate buffers, and we need to ensure that the low watermark is adequate to keep the freelist from becoming completely empty before bgreclaimer has time to wake up and begin filling it again. This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out? The bgwriter is cleaning ahead of the strategy point, and the bgreclaimer is advancing the strategy point. I think we should consider having the bgreclaimer wake the bgwriter if it comes across a dirty buffer, because while the bgwriter only estimates the rate of buffer allocation, bgreclaimer *knows* the rate of allocation, because its own activity is tied to the allocation rate. I think there's the potential for this kind of thing to make the background writer significantly more effective than it is today, but I'm heavily in favor of leaving it for a separate patch. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0?
That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU, even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. * Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes busily spinning on it... I think it's an acceptable risk, but we should think about it. As you and I have talked about before, we could reduce contention here by partitioning the freelist, or by using a CAS loop to pop items off of it. But I am not convinced either
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 6:32 PM, Robert Haas robertmh...@gmail.com wrote: Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to bgreclaimer logic itself doesn't take any LWLock, do you suspect the same might be required due to some Signal/Interrupt handling? From myside, I have thought about what to keep for error cleanup based on the working of bgreclaimer. However there is a chance that I have missed something. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-11 09:02:34 -0400, Robert Haas wrote: Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. So what? There's no code in common, so I see no reason to have one signal handler using underscores and the next one camelcase names. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. Agreed. There really should be three variants: * full abort including support for transactions * full abort without transactions being used (most background processes) * abort without shared memory interactions But for this patch, I think we should aim to mimic the existing style, as ugly as it is. Agreed. Background Reclaimer's Processing - The background reclaimer runs the clock sweep to identify buffers that are good candidates for eviction and puts them on the freelist. This makes buffer allocation much faster, since removing a buffer from the head of a linked list is much cheaper than linearly scanning the whole buffer pool until a promising candidate is found. It's possible that a buffer we add to the freelist may be accessed or even pinned before it's evicted; if that happens, the backend that would have evicted it will simply disregard it and take the next buffer instead (or run the clock sweep itself, if necessary). 
However, to make sure that doesn't happen too often, we need to keep the freelist as short as possible, so that there won't be many other buffer accesses between the time a buffer is added to the freelist and the time when it's actually evicted. We use two water marks to control the activity of the bgreclaimer process. Each time bgreclaimer is awoken, it will move buffers to the freelist until the length of the free list reaches the high water mark. It will then sleep. I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC), we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out? The bgwriter is cleaning ahead of the strategy point, and the bgreclaimer is advancing the strategy point. That sentence, in some form, should be in the above paragraph. I think we should consider having the bgreclaimer wake the bgwriter if it comes across a dirty buffer, because while the bgwriter only estimates the rate of buffer allocation, bgreclaimer *knows* the rate of allocation, because its own activity is tied to the allocation rate. I think there's the potential for this kind of thing to make the background writer significantly more effective than it is today, but I'm heavily in favor of leaving it for a separate patch. Yes, doing that sounds like a good plan. I'm happy with that being done in a separate patch. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment.
Right, it obviously needs to go both ways. I'm a bit sceptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU, even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... * Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes busily spinning on it... I think it's an acceptable risk, but we should think about it. As you and I have talked
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki, Now that I've finished the detour and committed and backpatched the changes to the way latency is calculated, we can get back to this patch. It needs to be rebased. Here is the rebase, which seems ok. See also the small issues raised about getPoissonRand in another email. -- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4001a98..d36fa2c 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@
 double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */ +int64 latency_limit = 0; + +/* * tablespace selection */ char *tablespace = NULL; @@ -238,6 +250,8 @@ typedef struct int64 throttle_trigger; /* previous/next throttling (us) */ int64 throttle_lag; /* total transaction lag behind throttling */ int64 throttle_lag_max; /* max transaction lag */ + int64 throttle_latency_skipped; /* lagging transactions skipped */ + int64 latency_late; /* late transactions */ } TState; #define INVALID_THREAD ((pthread_t) 0) @@ -250,6 +264,8 @@ typedef struct int64 sqlats; int64 throttle_lag; int64 throttle_lag_max; + int64 throttle_latency_skipped; + int64 latency_late; } TResult; /* @@ -372,6 +388,10 @@ usage(void) -f, --file=FILENAME read transaction script from FILENAME\n -j, --jobs=NUM number of threads (default: 1)\n -l, --logwrite transaction times to log file\n + -L, --limit=NUM count transactions lasting more than NUM ms.\n + under throttling (--rate), transactions behind schedule\n + more than NUM ms are skipped, and those finishing more\n + than NUM ms after their scheduled start are counted.\n -M, --protocol=simple|extended|prepared\n protocol for submitting queries (default: simple)\n -n, --no-vacuum do not run VACUUM before tests\n @@ -1038,6 +1058,23 @@ top: thread-throttle_trigger += wait; + if (latency_limit) + { + instr_time now; + int64 now_us; + INSTR_TIME_SET_CURRENT(now); + now_us = INSTR_TIME_GET_MICROSEC(now); + while (thread-throttle_trigger now_us - latency_limit) + { +/* if too far behind, this slot is skipped, and we + * iterate till the next nearly on time slot. 
+ */ +int64 wait = getPoissonRand(thread, throttle_delay); +thread-throttle_trigger += wait; +thread-throttle_latency_skipped ++; + } + } + st-txn_scheduled = thread-throttle_trigger; st-sleeping = 1; st-throttling = true; @@ -1110,8 +1147,11 @@ top: thread-exec_count[cnum]++; } - /* transaction finished: record latency under progress or throttling */ - if ((progress || throttle_delay) commands[st-state + 1] == NULL) + /* transaction finished: record latency under progress or throttling, + * or if latency limit is set + */ + if ((progress || throttle_delay || latency_limit) + commands[st-state + 1] == NULL) { int64 latency; @@ -1133,6 +1173,11 @@ top: * would take 256 hours. */ st-txn_sqlats += latency * latency; + + /* record over the limit transactions if needed. + */ + if (latency_limit latency latency_limit) +thread-latency_late++; } /* @@ -1360,7 +1405,7 @@ top: } /* Record transaction start time under logging, progress or throttling */ - if ((logfile || progress || throttle_delay) st-state == 0) + if ((logfile || progress || throttle_delay || latency_limit) st-state == 0) { INSTR_TIME_SET_CURRENT(st-txn_begin); @@ -2426,7 +2471,8 @@ printResults(int ttype, int64 normal_xacts, int nclients, TState *threads, int nthreads, instr_time total_time, instr_time conn_total_time, int64 total_latencies, int64 total_sqlats, - int64 throttle_lag, int64 throttle_lag_max) + int64 throttle_lag, int64 throttle_lag_max, + int64 throttle_latency_skipped, int64 latency_late) { double time_include, tps_include, @@ -2465,7 +2511,17 @@ printResults(int ttype, int64 normal_xacts, int nclients, normal_xacts); } - if (throttle_delay || progress) + if (throttle_delay latency_limit) + printf(number of transactions skipped: INT64_FORMAT (%.3f %%)\n, + throttle_latency_skipped, + 100.0 * throttle_latency_skipped / (throttle_latency_skipped + normal_xacts)); + + if (latency_limit) + printf(number of transactions above the %.1f ms latency limit:
Re: [HACKERS] Scaling shared buffer eviction
We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to the bgreclaimer logic itself doesn't take any LWLock; do you suspect the same might be required due to some signal/interrupt handling? I suspect it might creep in at some point at some unrelated place. Which will only ever break in production scenarios. Say, an lwlock in config file processing. I seem to recall somebody seeing a version of a patch adding an lwlock there... :). Or a logging hook. Or ... The savings from not doing LWLockReleaseAll() are nonexistent, so ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra t...@fuzzy.cz wrote: OK. So here's v13 of the patch, reflecting this change. With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory utilization, it also improves hash table performance, and the additional memory we use is less than what we saved as a result of commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. (2) It changes ExecChooseHashTableSize() to include the effect of the memory consumed by the bucket array. If we care about properly respecting work_mem, this is clearly right for any NTUP_PER_BUCKET setting, but more important for a small setting like 1 than for the current value of 10. I'm not sure the logic in this function is as clean and simple as it can be just yet. (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. I recommend separating this patch in two patches, the first covering items (1) and (2) and the second covering item (3), which really seems like a separate improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On Wed, Sep 10, 2014 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote: I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future. For now I'm intrigued about the extensive use of memory alignment. I'm sure there's some legacy reasoning behind it, and some architectures that require it. That aside, since it wastes space (a lot of space in some cases), there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres, especially in the page/rows area. Just for the sake of comparison, MySQL's COMPACT storage (default and recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others. Is it worth the extra space on newer architectures (especially Intel)? Do you guys think this is something worth looking at? Yes. At least in my opinion, though, it's not a good project for a beginner. If you get your changes to take effect, you'll find that a lot of things will break in places that are not easy to find or fix. You're getting into really low-level areas of the system that get touched infrequently and require a lot of expertise in how things work today to adjust. I thought all memory alignment was (or at least the bulk of it) handled using some codebase-wide macros/settings, otherwise how could different parts of the code interoperate? Poking this area might suffice for some initial testing to check if it's worth any more attention. Unaligned memory access received a lot of attention in the Intel post-Nehalem era, so it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ I'm a newbie in the codebase, so please let me know if I'm saying anything nonsensical.
The idea I've had before is to try to reduce the widest alignment we ever require from 8 bytes to 4 bytes. That is, look for types with typalign = 'd', and rewrite them to have typalign = 'i' by having them use two 4-byte loads to load an eight-byte value. In practice, I think this would probably save a high percentage of what can be saved, because 8-byte alignment implies a maximum of 7 bytes of wasted space, while 4-byte alignment implies a maximum of 3 bytes of wasted space. And it would probably be pretty cheap, too, because any type with less than 8-byte alignment wouldn't be affected at all, and even those types that were affected would only be slightly slowed down by doing two loads instead of one. In contrast, getting rid of alignment requirements completely would save a little more space, but probably at the cost of a lot more slowdown: any type with alignment requirements would have to fetch the value byte-by-byte instead of pulling the whole thing out at once. Does byte-by-byte access stand true nowadays? I thought modern processors would fetch memory at the very least in word-sized chunks, so 4/8 bytes, then merge/slice. But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. Very true. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings.
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
2014-09-11 22:01 GMT+09:00 k...@rice.edu k...@rice.edu: On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote: I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. Yes, it seems difficult to judge only the algorithm performance. We have to start to consider source code maintenance, quality, and other factors. The big (huge) win for lz4 (not the HC variant) is the enormous compression and decompression speed. It compresses quite a bit faster (33%) than snappy and decompresses twice as fast as snappy. Please show us the evidence. Postgres members showed the test results and their considerations; that is a very objective comparison. Best Regards, -- Mitsumasa KONDO
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 6:59 PM, Andres Freund and...@2ndquadrant.com wrote: We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to the bgreclaimer logic itself doesn't take any LWLock; do you suspect the same might be required due to some signal/interrupt handling? I suspect it might creep in at some point at some unrelated place. Which will only ever break in production scenarios. Say, an lwlock in config file processing. Yeah, I suspected the same and checked that path, but couldn't find one; maybe it is there in some path, as the code has many flows. I seem to recall somebody seeing a version of a patch adding an lwlock there... :). Or a logging hook. Or ... The savings from not doing LWLockReleaseAll() are nonexistent, so ... Okay, I shall add it in the next version of the patch and mention in the comments the reasons quoted by you. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: It's exactly the same as what bgwriter.c does. So what? There's no code in common, so I see no reason to have one signal handler using underscores and the next one camelcase names. /me shrugs. It's not always possible to have things be consistent with each other within a file and also with what gets done in other files. I'm not sure we should fault patch authors for choosing a different one than we would have chosen. FWIW, I probably would have done it the way Amit did it. I don't actually care, though. I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit skeptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Me, too, but I'm *even more* skeptical about making things complicated on the pure theory that a simple solution can't be correct. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. When, and if, we develop a theory about specifically how it falls short then, sure, let's adjust it. But I think it would be a serious error to try to engineer a perfect algorithm here based on the amount of testing that we can reasonably do pre-commit.
We have no chance of getting that right, and I'd rather have an algorithm that is simple and imperfect than an algorithm that is complex and still imperfect. No matter what, it's better than what we have now. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will be touched very shortly afterwards by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... It's not unreasonable, but this is a good place to apply Knuth's first law of optimization. There's no proof we need to do this, so let's not until there is. I also think it would be good to get some statistics on how often regular backends are running the clocksweep vs. how often bgreclaimer is satisfying their needs. I think that's necessary. The patch added buf_backend_clocksweep. Maybe we just also need buf_backend_from_freelist? That's just (or should be just) buf_alloc - buf_backend_clocksweep. I think buf_backend_clocksweep should really be called buf_alloc_clocksweep, and should be added (in all relevant places) right next to buf_alloc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 9:32 AM, Arthur Silva arthur...@gmail.com wrote: I thought all memory alignment was (or at least the bulk of it) handled using some codebase wide macros/settings, otherwise how could different parts of the code inter-op? Poking this area might suffice for some initial testing to check if it's worth any more attention. Well, sure, but the issues aren't too simple. For example, I think there are cases where we rely on the alignment bytes being zero to distinguish between an aligned value following and an unaligned toasted value. That stuff can make your head explode, or at least mine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] add modulo (%) operator to pgbench
2014-09-11 15:47 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr: Hello Robert, I am not objecting to the functionality; I'm objecting to bolting on ad-hoc operators one at a time. I think an expression syntax would let us do this in a much more scalable way. If I had time, I'd go do that, but I don't. We could add abs(x) and hash(x) and it would all be grand. Ok. I do agree that an expression syntax would be great! Yes, it's not bad. However, which method of modulo calculation do we need? I don't think it matters much, but I think the most intuitive modulo calculation method is floor modulo, because when negative values are involved it behaves as if both operands were positive, which makes the result easier to predict than the alternatives. This strong point is good for benchmark script users. But I don't have any strong opinion about this patch; not blocking :) Best Regards -- Mitsumasa KONDO
Re: [HACKERS] Aussie timezone database changes incoming
On Thu, Sep 11, 2014 at 12:20 AM, Craig Ringer cr...@2ndquadrant.com wrote: I shouldn't be surprised that Australia gets to change. While the cynic in me thinks this is the usual USA-is-the-center-of-the-universe-ism, in reality it makes sense given relative population and likely impact. Just because it makes sense doesn't mean it isn't USA-is-the-center-of-the-universe-ism. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
Robert Haas robertmh...@gmail.com writes: With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: ... (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-11 09:48:10 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. Hm, right. But then let's move the BgWriterStats.m_buf_alloc += ..., ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end up being continuously busy without being visible. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit skeptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Me, too, but I'm *even more* skeptical about making things complicated on the pure theory that a simple solution can't be correct. Fair enough. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. But, to be honest, the testing so far was pretty narrow in the kind of workloads that were run, if I cross-read things accurately. Don't get me wrong, I'm *really* happy about having this patch, that just doesn't mean every detail is right ;) Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will be touched very shortly afterwards by the pinning backend. Meh.
I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... It's not unreasonable, but this is a good place to apply Knuth's first law of optimization. There's no proof we need to do this, so let's not until there is. That's true for new (pieces of) software; less so when working with an installed base that you might regress... But whatever. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: ... (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: (3) It allows the number of batches to increase on the fly while the hash join is in process. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 10:03 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 09:48:10 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. Hm, right. But then let's move the BgWriterStats.m_buf_alloc += ..., ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end up being continuously busy without being visible. Good idea. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. But, to be honest, the testing so far was pretty narrow in the kind of workloads that were run, if I cross-read things accurately. Don't get me wrong, I'm *really* happy about having this patch, that just doesn't mean every detail is right ;) Oh, sure. Totally agreed. And, to the extent that we're improving things based on actual testing, I'm A-OK with that. I just don't want to start speculating, or we'll never get this thing off the ground. Some possibly-interesting test cases would be: (1) A read-only pgbench workload that is just a tiny bit larger than shared_buffers, say size of shared_buffers plus 0.01%. Such workloads tend to stress buffer eviction heavily. (2) A workload that maximizes the rate of concurrent buffer eviction relative to other tasks. Read-only pgbench is not bad for this, but maybe somebody's got a better idea.
As I sort of mentioned in what I was writing for the bufmgr README, there are, more or less, three ways this can fall down, at least that I can see: (1) if the high water mark is too high, then we'll start finding buffers in the freelist that have already been touched since we added them; (2) if the low water mark is too low, the freelist will run dry; and (3) if the low and high water marks are too close together, the bgreclaimer will be constantly getting woken up and going to sleep again. I can't personally think of a workload that will enable us to get a better handle on those cases than high-concurrency pgbench, but you're known to be ingenious at coming up with destruction workloads, so if you have an idea, by all means fire away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
How should skipped transactions be taken into account in the log file output, with and without aggregation? I assume we'll want to have some trace of skipped transactions in the logs. The problem with this point is that how to report something not done is unclear, especially as the logic of the log is one line per performed transaction. Obviously we can log something, but as the transactions are not performed, the format would be different, which breaks the expectation of a simple and homogeneous log file format that people like to process directly. So, barring any great idea, I would suggest not to log skipped transactions and to wait for someone who needs access to this detailed information and for whom the final report is not enough. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On 2014-09-11 10:32:24 -0300, Arthur Silva wrote: Unaligned memory access received a lot of attention in the Intel post-Nehalem era. So it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ FWIW, the reported results are imo pretty meaningless for postgres. It's sequential access over a larger amount of memory, i.e. a perfectly prefetchable workload where it doesn't matter if superfluous cachelines are fetched because they're going to be needed next round anyway. In many production workloads one of the most busy accesses to individual datums is the binary search on individual pages during index lookups. That's pretty much exactly the contrary of the above. Not saying that it's not going to be a benefit in many scenarios, but it's far from being as simple as saying that unaligned accesses on their own aren't penalized anymore. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] gist vacuum seaq access
After the discussion of GiST sequential access in vacuum, there are 2 issues: Heikki says: "Vacuum needs to memorize the current NSN when it begins." 1) How can I get the correct NSN? Also, I must set the F_DELETED flag on empty pages and clean the parent of links to deleted pages. 2) How can I get the parent of a page fast? Thanks.
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 11 Září 2014, 15:31, Robert Haas wrote: On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra t...@fuzzy.cz wrote: OK. So here's v13 of the patch, reflecting this change. With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: I believe I fixed the stylistic issues you pointed out, except maybe the bracketing (which seems fine to me, though). So if you could point the issues out, that'd be helpful. (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory utilization, it also improves hash table performance, and the additional memory we use is less than what we saved as a result of commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. (2) It changes ExecChooseHashTableSize() to include the effect of the memory consumed by the bucket array. If we care about properly respecting work_mem, this is clearly right for any NTUP_PER_BUCKET setting, but more important for a small setting like 1 than for the current value of 10. I'm not sure the logic in this function is as clean and simple as it can be just yet. I made that as clear as I was able to (based on your feedback), but I'm stuck, as I'm too familiar with the logic. That is not to say it can't be improved, but this needs to be reviewed by someone else. (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. I recommend separating this patch into two patches, the first covering items (1) and (2) and the second covering item (3), which really seems like a separate improvement. That probably makes sense.
regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
Pavel,

* Pavel Stehule (pavel.steh...@gmail.com) wrote: I removed dynamic allocation and reduced patch size.

This is certainly better, imv, though there are a couple of minor issues (extra semicolons, extraneous whitespace, get_line_style was still changed to non-const, even though it doesn't need to be now).

What I tested: the old unicode style is the same as the new unicode style. Nothing was changed .. some fields are specified in the refresh_utf8format function.

I don't particularly like this (having these fields set in refresh_utf8format to hard-coded strings in the function); why not have those handled the same as the rest, where the strings themselves are in the unicode_style structure?

The rest looks pretty good. Need to step out for a bit but I'll look at making the above changes when I get back if I don't hear anything.

Thanks, Stephen
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 11 Září 2014, 16:11, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: (3) It allows the number of batches to increase on the fly while the hash join is in process. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch. It's true that we can often get better performance with more batches, but the cases I was able to inspect were caused either by (a) underestimates resulting in inappropriate nbucket count, or (b) caching effects. This patch solves (a), not even trying to fix (b). regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf
On 10/09/14 13:13, Fujii Masao wrote: On Wed, Sep 10, 2014 at 1:45 AM, Petr Jelinek p...@2ndquadrant.com wrote: Hi, I recently wanted several times to have slave server prepared at certain point in time to reduce the time it takes for it to replay remaining WALs (say I have pg_basebackup -x on busy db for example). In your example, you're thinking to perform the recovery after taking the backup and stop it at the consistent point (i.e., end of backup) by using the proposed feature? Then you're expecting that the future recovery will start from that consistent point and which will reduce the recovery time? This is true if checkpoint is executed at the end of backup. But there might be no occurrence of checkpoint during backup. In this case, even future recovery would need to start from very start of backup. That is, we cannot reduce the recovery time. So, for your purpose, for example, you might also need to add new option to pg_basebackup so that checkpoint is executed at the end of backup if the option is set. For my use-case it does not matter much as I am talking here of huge volumes where it would normally take hours to replay so being behind one checkpoint is not too bad, but obviously I am not sure that it's good enough for project in general. Adding checkpoint for pg_basebackup might be useful addition, yes. Also I forgot to write another use-case which making sure that I actually do have all the WAL present to get to certain point in time (this one could be done via patch to pg_receivexlog I guess, but I see advantage in having the changes already applied compared to just having the wal files). So I wrote simple patch that adds option to shut down the cluster once recovery_target is reached. The server will still be able to continue WAL replay if needed later or can be configured to start standalone. What about adding something like action_at_recovery_target=pause|shutdown instead of increasing the number of parameters? 
That will also increase the number of parameters, as we can't remove the current pause one if we want to be backwards compatible. Also, there would have to be something like action_at_recovery_target=none or off or something, since the default is that pause is on and we need to be able to turn off pause without having to have shutdown on. What's more, I am not sure I see any other actions that could be added in the future, as the promote action already works and listen (for RO queries) also already works independently of this.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 8:32 AM, Arthur Silva arthur...@gmail.com wrote: On Wed, Sep 10, 2014 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote: I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future.

For now I'm intrigued by the extensive use of memory alignment. I'm sure there's some legacy and some architecture-that-requires-it reasoning behind it. That aside, since it wastes space (a lot of space in some cases) there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres, especially in the page/rows area. Just for the sake of comparison, MySQL COMPACT storage (default and recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others. Is it worth the extra space on newer architectures (especially Intel)? Do you guys think this is something worth looking at?

Yes. At least in my opinion, though, it's not a good project for a beginner. If you get your changes to take effect, you'll find that a lot of things will break in places that are not easy to find or fix. You're getting into really low-level areas of the system that get touched infrequently and require a lot of expertise in how things work today to adjust.

I thought all memory alignment was (or at least the bulk of it) handled using some codebase-wide macros/settings; otherwise how could different parts of the code interoperate? Poking this area might suffice for some initial testing to check if it's worth any more attention. Unaligned memory access received a lot of attention in the Intel post-Nehalem era, so it may very well pay off on Intel servers.
You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ I'm a newbie in the codebase, so please let me know if I'm talking nonsense.

Be advised of the difficulties you are going to face here. Assuming for a second there is no reason not to go unaligned on Intel and there are material benefits to justify the effort, that doesn't necessarily hold for other platforms like ARM/POWER. Even though Intel handles the vast majority of installations, it's not gonna fly to optimize for that platform at the expense of others, so there'd have to be some kind of compile-time setting to control alignment behavior. That being said, if you could pull this off cleanly, it'd be pretty neat.

merlin
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Thu, Sep 11, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: (3) It allows the number of batches to increase on the fly while the hash join is in process. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. Well, the patch takes the approach of increasing the number of buckets when (1) there's only one batch and (2) the load factor exceeds NTUP_PER_BUCKET. As Tomas points out, it's just increasing the number of buckets to the exact same number which we would have allocated had we known that this many tuples would show up. Now, there is a possibility that we could lose. Let's suppose that the tuples overflow work_mem, but just barely. If we've accurately estimate how many tuples there are, the patch changes nothing: either way we're going to need two batches. But let's say we've underestimated the number of tuples by 3x. Then it could be that without the patch we'd allocate fewer buckets, never increase the number of buckets, and avoid batching; whereas with the patch, we'll discover that our tuple estimate was wrong, increase the number of buckets on the fly, and then be forced by the slightly-increased memory consumption that results from increasing the number of buckets to do two batches. That would suck. But catering to that case is basically hoping that we'll fall into a septic tank and come out smelling like a rose - that is, we're hoping that our initial poor estimate will cause us to accidentally do the right thing later. 
I don't think that's the way to go. It's much more important to get the case where things are bigger than we expected but still fit within work_mem right; and we're currently taking a huge run-time penalty in that case, as Tomas' results show.

If we wanted to cater to the other case in the future, we could consider the opposite approach: if work_mem is exhausted, throw the bucket headers away and keep reading tuples into the dense-packed chunks added by yesterday's commit. If we again run out of work_mem, then we *definitely* need to increase the batch count. If we manage to make everything fit, then we know exactly how many bucket headers we can afford, and can use some heuristic to decide between that and using more batches.

I don't think that should be a requirement for this patch, though: I think the cumulative effect of Tomas's patches is going to be a win *much* more often than it's a loss. In 100% of cases, the dense-packing stuff will make a hash table containing the same tuples significantly smaller. Except when we've radically overestimated the tuple count, the change to NTUP_PER_BUCKET will mean less time spent chasing hash chains. It does use more memory, but that's more than paid for by the first change. Both the dense-packing stuff and the changes to include bucket headers in the work_mem calculation will have the effect of making the work_mem bound on memory utilization far more accurate than it's ever been previously. And the change to increase the number of buckets at runtime will mean that we're significantly more resilient against the planner underestimating the number of tuples. That's clearly good. The fact that we can construct borderline cases where it loses shouldn't deter us from pushing forward.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
Tomas Vondra t...@fuzzy.cz writes: On 11 Září 2014, 16:11, Tom Lane wrote: Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch.

On further thought, increasing nbuckets without changing the batch boundaries would not get us out of an out-of-work_mem situation; in fact it makes memory consumption worse, not better (assuming you count the bucket headers towards work_mem ;-)). So in principle, the rule seems like it ought to go "if load (defined as max bucket chain length, I imagine?) gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch". And perhaps we reset nbuckets again for the next batch, not sure. If we are dealing with an out-of-work_mem situation then only increasing nbatch would be a suitable response. Because of the risk that increasing nbuckets would itself lead to a work_mem violation, I don't think it's sane to ignore the interaction entirely, even in a first patch.

regards, tom lane
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
* Albe Laurenz (laurenz.a...@wien.gv.at) wrote: Etsuro Fujita wrote: I agree with you on that point. So, I've updated the patch to have the explicit flag, as you proposed. Attached is the updated version of the patch. In this version, I've also revised code and its comments a bit. Thank you, I have set the patch to Ready for Committer.

I had a few minutes, so I started looking at this patch and I definitely like where it's going. One concern which was brought up that I didn't see completely addressed was around UPDATE/DELETE with LIMIT, which seems to be making progress towards getting in. Presumably, we'd simply disallow this optimization in that case, but we'll need a way to identify that case..

I have to admit that, while I applaud the effort made to have this change only be to postgres_fdw, I'm not sure that having the update/delete happening during the Scan phase and then essentially no-op'ing the ExecForeignUpdate/ExecForeignDelete is entirely in line with the defined API. I would have thought we'd add a capability check function that says "can this Modify be completely pushed down" and then have a way for that to happen in ExecForeignUpdate/ExecForeignDelete. That means changes to the existing API, of course, and if people feel that's unnecessary then I'd suggest that we need to at least document that we're willing to support these bulk operations happening on the remote side during the pre-Modify Scan and not during the ExecForeignUpdate/ExecForeignDelete.

Thanks, Stephen
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On Thu, Sep 11, 2014 at 7:14 AM, David Rowley dgrowle...@gmail.com wrote: Here's a quick demo of the patch at work:

test=# create table c (id int primary key);
CREATE TABLE
test=# create table b (id int primary key, c_id int not null references c(id));
CREATE TABLE
test=# create table a (id int primary key, b_id int not null references b(id));
CREATE TABLE
test=# explain select a.* from a inner join b on a.b_id = b.id inner join c on b.c_id = c.id;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on a  (cost=0.00..31.40 rows=2140 width=8)
 Planning time: 1.061 ms
(2 rows)

That is just awesome. You are my new hero.

1. I don't think that I'm currently handling removing eclass members properly. So far the code just removes the Vars that belong to the relation being removed. I likely should also be doing bms_del_member(ec->ec_relids, relid); on the eclass, but perhaps I should just be marking the whole class as ec_broken = true and adding another eclass with everything that the broken one has minus the parts from the removed relation?

I haven't read the patch, but I think the question is why this needs to be different than what we do for left join removal.

Assume there's a foreign key a(x) references b(x):

SELECT a.* FROM a INNER JOIN b ON a.x = b.x WHERE b.x = 1

Relation b should be removable because an eclass will contain {a.x, b.x} and therefore a baserestrictinfo for a.x = 1 should also exist on relation a. Therefore removing relation b should produce equivalent results, i.e. everything that gets filtered out on relation b will also be filtered out on relation a anyway.

I think the patch without this is still worth it, but if someone feels strongly about it I'll take a bash at supporting it. That'd be nice to fix, but IMHO not essential.

3. Currently the inner join support does not allow removals using foreign keys which contain duplicate columns on the referencing side, e.g. (a,a) references (x,y); this is basically because of the point I made in item 2.
In this case a baserestrictinfo would exist on the referenced relation to say WHERE x = y. I think it's fine to not bother with this case. Who cares?

4. The patch currently only allows removals for eclass join types. If the rel has any joininfo items, then the join removal is disallowed. From what I can see, equality-type inner join conditions get described in eclasses, and only non-equality join conditions make it into the joininfo list, and since foreign keys only support equality operators, I thought this was a valid restriction; however, if someone can show me a flaw in my assumption then I may need to improve this. Seems OK.

5. I've added a flag to pg_class called relhasfkey. Currently this gets set to true when a foreign key is added, though I've added nothing to set it back to false again. I notice that relhasindex gets set back to false during vacuum, if vacuum happens to find there not to be any indexes on the rel. I didn't put my logic here as I wasn't too sure if scanning pg_constraint during a vacuum seemed very correct, so I just left out the setting-it-to-false logic, based on the fact that I noticed that relhaspkey gets away with quite lazy setting-back-to-false logic (only when there are no indexes of any kind left on the relation at all).

The alternative to resetting the flag somehow is not having it in the first place. Would that be terribly expensive?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Memory Alignment in Postgres
Merlin Moncure mmonc...@gmail.com writes: Be advised of the difficulties you are going to face here. Assuming for a second there is no reason not to go unaligned on Intel and there are material benefits to justify the effort, that doesn't necessarily hold for other platforms like arm/power. Note that on many (most?) non-Intel architectures, unaligned access is simply not an option. The chips themselves will throw SIGBUS or equivalent if you try it. Some kernels provide signal handlers that emulate the unaligned access in software rather than killing the process; but the performance consequences of hitting such traps more than very occasionally would be catastrophic. Even on Intel, I'd wonder what unaligned accesses do to atomicity guarantees and suchlike. This is not a big deal for row data storage, but we'd have to be careful about it if we were to back off alignment requirements for in-memory data structures such as latches and buffer headers. Another fun thing you'd need to deal with is ensuring that the C structs we overlay onto catalog data rows still match up with the data layout rules. On the whole, I'm pretty darn skeptical that such an effort would repay itself. There are lots of more promising things to hack on. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 10/09/14 22:53, Robert Haas wrote: Here's what the other approach looks like. I can't really see doing this way and then only providing hooks for those two functions, so this is with hooks for all the send-side stuff. Original version: 9 files changed, 295 insertions(+), 3 deletions(-) This version: 9 files changed, 428 insertions(+), 47 deletions(-) There is admittedly a certain elegance to providing a complete set of hooks, so maybe this is the way to go. The remaining patches in the patch series work with either the old version or this one; the changes here don't affect anything else. Anyone else have an opinion on which way is better here?

Ok so it is more than 20 lines :) I do like this approach more though, it looks cleaner to me.

On 10/09/14 22:53, Robert Haas wrote:

+extern int (*pq_putmessage_hook)(char msgtype, const char *s, size_t len);
+extern int (*pq_flush_hook)(void);

Btw you forgot to remove the above.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Memory Alignment in Postgres
On 2014-09-11 11:39:12 -0400, Tom Lane wrote: Even on Intel, I'd wonder what unaligned accesses do to atomicity guarantees and suchlike. They pretty much kill atomicity guarantees. Atomicity is guaranteed while you're inside a cacheline, but not once you span them. This is not a big deal for row data storage, but we'd have to be careful about it if we were to back off alignment requirements for in-memory data structures such as latches and buffer headers. Right. I don't think that's an option. Another fun thing you'd need to deal with is ensuring that the C structs we overlay onto catalog data rows still match up with the data layout rules. Yea, this would require some nastiness in the bki generation, but it'd probably be doable to have different alignment for system catalogs. On the whole, I'm pretty darn skeptical that such an effort would repay itself. There are lots of more promising things to hack on. I have no desire to hack on it, but I can understand the desire to reduce the space overhead...

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
On 09/11/2014 08:29 AM, Stephen Frost wrote: * Pavel Stehule (pavel.steh...@gmail.com) wrote: Can I help with something, it is there some open question? I had been hoping for a more definitive answer regarding this option for array_to_json, or even a comment about the change to row_to_json. Andrew- any thoughts on this? (that's what the ping on IRC is for :). The change in row_to_json looks OK, I think. we're replacing an overloading with use of default params, yes? That seems quite reasonable, and users shouldn't notice the difference. There might be a case for optionally suppressing nulls in array_to_json, and it might work reasonably since unlike SQL arrays JSON arrays don't have to be regular (if nested they are arrays of arrays, not multi-dimensional single arrays). OTOH I'm not sure if it's worth doing just for the sake of orthogonality. If someone wants it, then go for it. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to support SEMI and ANTI join removal
Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 7:14 AM, David Rowley dgrowle...@gmail.com wrote: 5. I've added a flag to pg_class called relhasfkey. Currently this gets set to true when a foreign key is added, though I've added nothing to set it back to false again. I notice that relhasindex gets set back to false during vacuum, if vacuum happens to find there to not be any indexes on the rel. I didn't put my logic here as I wasn't too sure if scanning pg_constraint during a vacuum seemed very correct, so I just left out the setting it to false logic based on the the fact that I noticed that relhaspkey gets away with quite lazy setting back to false logic (only when there's no indexes of any kind left on the relation at all). The alternative to resetting the flag somehow is not having it in the first place. Would that be terribly expensive? The behavior of relhaspkey is a legacy thing that we've tolerated only because nothing whatsoever in the backend depends on it at all. I'm not eager to add more equally-ill-defined pg_class columns. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers
On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier michael.paqu...@gmail.com wrote: Currently two nodes can only have the same priority if they have the same application_name, so we could for example add a new connstring parameter called, let's say, application_group, to define groups of nodes that will have the same priority (if a node does not define application_group, it defaults to application_name; if app_name is NULL, well, we don't care much, it cannot be a sync candidate). That's a first idea that we could use to control groups of nodes. And we could switch syncrep.c to use application_group in s_s_names instead of application_name. That would be backward-compatible, and could open the door for more improvements for quorum commits as we could control groups of nodes. Well, this is a super-set of what application_name can already do, but there is no problem identifying single nodes of the same data center and how much they could be late in replication, so I think that this would be really user-friendly. An idea similar to that would be a base work for the next thing... See below.

In general, I think the user's requirement for which synchronous standbys need to acknowledge a commit could be an arbitrary Boolean expression - well, probably no NOT, but any amount of AND and OR that you want to use. Can someone want A OR (((B AND C) OR (D AND E)) AND F)? Maybe! Based on previous discussions, it seems not unlikely that as soon as we decide we don't want to support that, someone will tell us they can't live without it. In general, though, I'd expect the two common patterns to be more or less what you've set forth above: any K servers from set X plus any L servers from set Y plus any M servers from set Z, etc. However, I'm not confident it's right to control this by adding more configuration on the client side. I think it would be better to stick with the idea that each client specifies an application_name, and then the master specifies the policy in some way.
One advantage of that is that you can change the rules in ONE place - the master - rather than potentially having to update every client. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
Hi 2014-09-11 16:42 GMT+02:00 Stephen Frost sfr...@snowman.net: Pavel, * Pavel Stehule (pavel.steh...@gmail.com) wrote: I removed dynamic allocation and reduced patch size. This is certainly better, imv, though there are a couple of minor issues (extra semi-colons, extraneous whitespace, get_line_style was still changed to non-const, even though it doesn't need to be now). fixed non-const -- other, I am sorry, I am blind What I tested a old unicode style is same as new unicode style. There nothing was changed .. some fields are specified in refresh_utf8format function I don't particularly like this (having these fields set in refresh_utf8format to hard-coded strings in the function), why not have those handled the same as the rest, where the strings themselves are in the unicode_style structure? I am not sure if I understand well. With refresh_utf8format I can do shortly 6 possible combinations - or more (when it will be requested) I have no idea how to write as rest without repeating all 6 combinations - what was one noticed issue of some older variant, where I designed unicode1, unicode2, ... Any idea, tip how to it? Regards Pavel The rest looks pretty good. Need to step out for a bit but I'll look at making the above changes when I get back if I don't hear anything. 
Thanks, Stephen

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
new file mode 100644
index aa71674..1d59dce
*** a/doc/src/sgml/ref/psql-ref.sgml
--- b/doc/src/sgml/ref/psql-ref.sgml
*** lo_import 152801
*** 2306,2311 ****
--- 2306,2347 ----
        </para>
        </listitem>
      </varlistentry>
+
+     <varlistentry>
+       <term><literal>unicode_border_style</literal></term>
+       <listitem>
+       <para>
+       Sets the border line drawing style to one
+       of <literal>single</literal> or <literal>double</literal>
+       This option only affects the <literal>unicode</>
+       linestyle
+       </para>
+       </listitem>
+     </varlistentry>
+
+     <varlistentry>
+       <term><literal>unicode_column_style</literal></term>
+       <listitem>
+       <para>
+       Sets the column line drawing style to one
+       of <literal>single</literal> or <literal>double</literal>
+       This option only affects the <literal>unicode</>
+       linestyle
+       </para>
+       </listitem>
+     </varlistentry>
+
+     <varlistentry>
+       <term><literal>unicode_header_style</literal></term>
+       <listitem>
+       <para>
+       Sets the header line drawing style to one
+       of <literal>single</literal> or <literal>double</literal>
+       This option only affects the <literal>unicode</>
+       linestyle
+       </para>
+       </listitem>
+     </varlistentry>
      </variablelist>
    </para>
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
new file mode 100644
index 5d90ca2..94d8f45
*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
*** exec_command(const char *cmd,
*** 1054,1059 ****
--- 1054,1062 ----
            "footer", "format", "linestyle", "null",
            "numericlocale", "pager", "recordsep",
            "tableattr", "title", "tuples_only",
+           "unicode_border_linestyle",
+           "unicode_column_linestyle",
+           "unicode_header_linestyle",
            NULL
        };
*** _align2string(enum printFormat in)
*** 2248,2253 ****
--- 2251,2290 ----
        return "unknown";
  }

+ /*
+  * Parse entered unicode linestyle. Returns true, when entered string is
+  * known linestyle: single, double else returns false.
+  */
+ static bool
+ set_unicode_line_style(printQueryOpt *popt, const char *value, size_t vallen,
+                        unicode_linestyle *linestyle)
+ {
+     if (pg_strncasecmp("single", value, vallen) == 0)
+         *linestyle = UNICODE_LINESTYLE_SINGLE;
+     else if (pg_strncasecmp("double", value, vallen) == 0)
+         *linestyle = UNICODE_LINESTYLE_DOUBLE;
+     else
+         return false;
+
+     /* input is ok, generate new unicode style */
+     refresh_utf8format(&(popt->topt));
+
+     return true;
+ }
+
+ static const char *
+ _unicode_linestyle2string(int linestyle)
+ {
+     switch (linestyle)
+     {
+         case UNICODE_LINESTYLE_SINGLE:
+             return "single";
+             break;
+         case UNICODE_LINESTYLE_DOUBLE:
+             return "double";
+             break;
+     }
+     return "unknown";
+ }

  bool
  do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
*** do_pset(const char *param, const char *v
*** 2305,2310 ****
--- 2342,2383 ----
  }

+ /* set unicode border line style */
+ else if (strcmp(param, "unicode_border_linestyle") == 0)
+ {
+     if (!value)
+         ;
+     else if (!set_unicode_line_style(popt, value, vallen,
+                                      &popt->topt.unicode_border_linestyle))
+     {
+         psql_error("\\pset: allowed unicode border linestyle are single, double\n");
+         return false;
+     }
+ }
+
+ /* set unicode column line style */
+ else if (strcmp(param, "unicode_column_linestyle") == 0)
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
Stephen Frost sfr...@snowman.net writes: I have to admit that, while I applaud the effort made to have this change only be to postgres_fdw, I'm not sure that having the update/delete happening during the Scan phase and then essentially no-op'ing the ExecForeignUpdate/ExecForeignDelete is entirely in-line with the defined API. Yeah, I've started looking at this patch and that seemed like not necessarily such a wise choice. I think it'd be better if the generated plan tree had a different structure in this case. The existing approach is an impressive hack but it's hard to call it anything but a hack. I'm not sure offhand what the new plan tree ought to look like. We could just generate a ForeignScan node, but that seems like rather a misnomer. Is it worth inventing a new ForeignUpdate node type? Or maybe it'd still be ForeignScan but with a flag field saying "hey, this is really an update (or a delete)". The main benefit I can think of right now is that the EXPLAIN output would be less strange-looking --- but EXPLAIN is hardly the only thing that ever looks at plan trees, so having an outright misleading plan structure is likely to burn us down the line. One advantage of getting the core code involved in the decision about whether an update/delete can be pushed down is that FDW-independent checks like whether there are relevant triggers could be implemented in the core code, rather than having to duplicate them (and later maintain them) in every FDW that wants to do this. OTOH, maybe the trigger issue is really the only thing that could be shared, not sure. Stuff like "is there a LIMIT" probably has to be in the FDW since some FDWs could support pushing down LIMIT and others not. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 11 Září 2014, 17:28, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: On 11 Září 2014, 16:11, Tom Lane wrote: Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch. On further thought, increasing nbuckets without changing the batch boundaries would not get us out of an out-of-work_mem situation; in fact it makes memory consumption worse, not better (assuming you count the bucket headers towards work_mem ;-)). Yes, the patch counts the bucket headers towards work_mem, while the original code didn't. The reasoning is that due to changing NTUP_PER_BUCKET from 10 to 1, the memory occupied by buckets is not negligible and should be accounted for. So in principle, the rule seems like it ought to go "if load (defined as max bucket chain length, I imagine?) gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch". And perhaps we reset nbuckets again for the next batch, not sure. If we are dealing with an out-of-work_mem situation then only increasing nbatch would be a suitable response. Almost. If we expect batching at the very beginning, we size nbuckets for full work_mem (see how many tuples we can get into work_mem, while not breaking the NTUP_PER_BUCKET threshold). If we expect to be fine without batching, we start with the 'right' nbuckets and track the optimal nbuckets as we go (without actually resizing the hash table). Once we hit work_mem (considering the optimal nbuckets value), we keep the value. Only at the end, we check whether (nbuckets != nbuckets_optimal) and resize the hash table if needed.
Also, we keep this value for all batches (it's OK because it assumes full work_mem, and it makes the batchno evaluation trivial). So the resize happens only once and only for the first batch. Because of the risk that increasing nbuckets would itself lead to a work_mem violation, I don't think it's sane to ignore the interaction entirely, even in a first patch. This possibility of increasing the number of batches is the consequence of counting the buckets towards work_mem. I believe that's the right thing to do here, to make the work_mem guarantee stronger, but it's not something this patch depends on. If this is the only concern, we can leave this out. It's also worth mentioning that the dense allocation of tuples makes a huge difference. palloc easily results in 100% overhead for narrow tuples (that is, allocating 2x the amount of needed memory). For example over here [1] is an example of a hashjoin consuming 1.4GB with work_mem=800MB. And the dense allocation patch pretty much makes this go away (as demonstrated in the [1]). For wider tuples, the difference is smaller, but that when the buckets are less important. What I'm trying to say is that it's only a matter of work_mem values. Currently, because of the weak guarantee, people tend to set the value lower than necessary. The dense allocation makes this unnecessary, allowing higher work_mem values, and thus eliminating the issue. If this is not enough, we can stop counting the buckets towards work_mem. It'll still be more memory efficient than the old approach (as a pointer is smaller than palloc overhead), and it won't cause additional batches. But I don't like this - I think we should make work_mem a stronger guarantee. [1] http://www.postgresql.org/message-id/53beea9e.2080...@fuzzy.cz regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
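The grow-buckets-or-grow-batches rule Tom sketches ("if load gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch") can be illustrated with a small sketch. This is not PostgreSQL source: the constants, the doubling policy, and the assumption that half the tuples spill when nbatch doubles are all simplifications for illustration; note it counts bucket headers towards work_mem, as the patch does.

```python
# Illustrative sketch of the heuristic under discussion (not PostgreSQL code).
NTUP_PER_BUCKET = 1   # the patch's new target bucket load
POINTER_SIZE = 8      # bytes per bucket header (one pointer), an assumption

def adjust_hash_table(ntuples, tuple_width, nbuckets, nbatch, work_mem_bytes):
    """Return (nbuckets, nbatch) after applying the grow-buckets-or-batch rule."""
    while ntuples / nbuckets > NTUP_PER_BUCKET:
        # Projected memory if we double nbuckets: tuples plus bucket headers.
        projected = ntuples * tuple_width + (nbuckets * 2) * POINTER_SIZE
        if projected <= work_mem_bytes:
            nbuckets *= 2      # still room: relieve long bucket chains
        else:
            nbatch *= 2        # out of memory: spill to more batches
            ntuples //= 2      # roughly half the tuples stay in this batch
    return nbuckets, nbatch

# A bad initial estimate (1024 buckets for 1M tuples) gets corrected:
print(adjust_hash_table(1_000_000, 64, 1024, 1, 256 * 1024 * 1024))
```

With plenty of work_mem only nbuckets grows; shrink work_mem and the same call starts doubling nbatch instead, which is the interaction the thread is worried about.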
Re: [HACKERS] RLS Design
On Sat, Sep 6, 2014 at 2:54 AM, Brightwell, Adam adam.brightw...@crunchydatasolutions.com wrote: * Add ALTER TABLE name { ENABLE | DISABLE } ROW LEVEL SECURITY - set flag on table to allow for a default-deny capability. If RLS is enabled on a table and has no policies, then a default-deny policy is automatically applied. If RLS is disabled on a table and the table still has policies on it, then an error is raised. Though if DISABLE is accompanied with CASCADE, then all policies will be removed and no error is raised. This text doesn't make it clear that all of the cases have been covered; in particular, you didn't specify whether an error is thrown if you try to add a policy to a table with DISABLE ROW LEVEL SECURITY in effect. Backing up a bit, I think there are two sensible designs here: 1. Row level security policies can't exist for a table with DISABLE ROW LEVEL SECURITY in effect. It sounds like this is what you have implemented, modulo any hypothetical bugs. You can't add policies without enabling RLS, and you can't disable RLS without dropping them all. 2. Row level security policies can exist for a table with DISABLE ROW LEVEL SECURITY in effect, but they don't do anything until RLS is enabled. A possible advantage of this approach is that you could *temporarily* shut off RLS for a table without having to drop all of your policies and put them back. I kind of like this approach; we have something similar for triggers, and I think it could be useful to people. If you stick with approach #1, make sure pg_dump is guaranteed to enable RLS before applying the policies. And either way, you should check that pg_dump behaves sanely in the case where there are circular dependencies, like you have two tables A and B, and each has an RLS policy that manages to use the other table's row-type. (You probably also want to check that DROP and DROP .. CASCADE on either policy or either table does the right thing in that situation, but that's probably easier to get right.)
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
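The dump-ordering constraint Robert raises for approach #1 can be sketched briefly. This is a hypothetical illustration, not pg_dump code, and the table/policy names and statement shapes are made up: if policies cannot exist while RLS is disabled, a dump must emit ENABLE ROW LEVEL SECURITY before any CREATE POLICY on that table, or the restore fails.

```python
# Hypothetical sketch of the statement ordering pg_dump must guarantee
# under design #1 (policies forbidden while RLS is disabled).
def dump_table(table, policies):
    stmts = [f"CREATE TABLE {table} (...);"]
    if policies:
        # Must precede every CREATE POLICY, or restore would error out.
        stmts.append(f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;")
        for pol in policies:
            stmts.append(f"CREATE POLICY {pol} ON {table} ...;")
    return stmts

print("\n".join(dump_table("accounts", ["manager_only"])))
```

Under design #2 the ordering would not matter, which is one more argument for that approach.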
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 1:46 AM, Rahila Syed rahilasyed...@gmail.com wrote: I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows.

Compression    Multiple    Single
Off            81.1338     81.1267
LZ4            81.0998     81.1695
Snappy         80.9741     80.9703
pglz           81.2353     81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png

The numbers show the CPU utilization of Snappy is the least. The CPU utilization in decreasing order is pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, Snappy seems to be best when it comes to lesser utilization of CPU. As per the measurement results posted to date, LZ4 outperforms Snappy and pglz in terms of compression ratio and performance. However, the CPU utilization numbers show Snappy utilizes the least amount of CPU. The difference is not much, though. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression, pglz, as it shows considerable performance over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block. ISTM that the performance overhead introduced by multiple-block compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz. Hence, this patch can be built using multiple-block compression. I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. So, if we gain lz4 support - which I think is definitely where we should go - we should drop pglz support for the WAL. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 12:55:21PM -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. +1 -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Commitfest status
On 10.9.2014 22:39, Heikki Linnakangas wrote: The bad news is that the rest don't seem to be graduating from the Needs Review state. ISTM that many patches (a) in 'needs review' actually have a review, or are being thoroughly discussed; (b) in 'waiting on author' are not really waiting, because the author already responded / posted a new version of the patch. Except that the patch status was not updated, which makes it really difficult to spot patches that currently need a review :-( regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? I don't see a compelling need for an option if we get a new algorithm that strictly dominates what we've already got in all parameters, and it may well be that, as respects pglz, that's achievable. But ISTM that it need not be true in general. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] add modulo (%) operator to pgbench
On Thu, Sep 11, 2014 at 2:47 AM, Fabien COELHO coe...@cri.ensmp.fr wrote: Ok. I do agree that an expression syntax would be great! However, that would not diminish nor change much the amount and kind of code necessary to add an operator or a function That's not really true. You can't really add abs(x) or hash(x) right now because the current code only supports this syntax: \set varname operand1 [ operator operand2 ] There's no way to add support for a unary operator with that syntax. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
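Robert's point about the fixed `\set varname operand1 [ operator operand2 ]` grammar can be made concrete with a small sketch. This is illustrative Python, not pgbench code; the tokenization and variable syntax are assumptions. A grammar hard-wired to "operand, optional operator, operand" simply has no production where a unary function like abs(x) could attach.

```python
# Illustrative sketch (not pgbench source) of the current \set grammar's limits.
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul,
       '/': operator.floordiv, '%': operator.mod}

def eval_fixed(tokens, variables):
    """The current pgbench shape: operand [ operator operand ] -- nothing else."""
    def val(t):
        return variables[t[1:]] if t.startswith(':') else int(t)
    if len(tokens) == 1:
        return val(tokens[0])
    if len(tokens) == 3 and tokens[1] in OPS:
        return OPS[tokens[1]](val(tokens[0]), val(tokens[2]))
    raise SyntaxError("not of the form: operand [ operator operand ]")

print(eval_fixed([':x', '%', '7'], {'x': 23}))  # the proposed modulo: 2
try:
    eval_fixed(['abs', '(', ':x'], {'x': -5})   # unary syntax has no slot
except SyntaxError as err:
    print(err)
```

Supporting abs(x) or hash(x) means replacing this shape with a recursive expression grammar, which is exactly the bigger change the thread is weighing against the one-operator patch.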
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 11:27 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 10:32:24 -0300, Arthur Silva wrote: Unaligned memory access received a lot of attention in Intel's post-Nehalem era, so it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ FWIW, the reported results are imo pretty meaningless for postgres. It's sequential access over a larger amount of memory, i.e. a perfectly prefetchable workload where it doesn't matter if superfluous cachelines are fetched because they're going to be needed next round anyway. In many production workloads one of the busiest accesses to individual datums is the binary search on individual pages during index lookups. That's pretty much exactly the contrary of the above. Not saying that it's not going to be a benefit in many scenarios, but it's far from being as simple as saying that unaligned accesses on their own aren't penalized anymore. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services I modified the test code to use a completely random scan pattern to test something that completely trashes the cache. Not realistic, but it still confirms the hypothesis that the overhead is minimal on modern Intel.
-- test results compiling for 32bit --

processing word of size 2
offset = 0  average time for offset 0 is 422.7
offset = 1  average time for offset 1 is 422.85

processing word of size 4
offset = 0  average time for offset 0 is 436.6
offset = 1  average time for offset 1 is 451
offset = 2  average time for offset 2 is 444.3
offset = 3  average time for offset 3 is 441.9

processing word of size 8
offset = 0  average time for offset 0 is 630.15
offset = 1  average time for offset 1 is 653
offset = 2  average time for offset 2 is 655.5
offset = 3  average time for offset 3 is 660.85
offset = 4  average time for offset 4 is 650.1
offset = 5  average time for offset 5 is 656.9
offset = 6  average time for offset 6 is 656.6
offset = 7  average time for offset 7 is 656.9

-- test results compiling for 64bit --

processing word of size 2
offset = 0  average time for offset 0 is 402.55
offset = 1  average time for offset 1 is 406.9

processing word of size 4
offset = 0  average time for offset 0 is 424.05
offset = 1  average time for offset 1 is 436.55
offset = 2  average time for offset 2 is 435.1
offset = 3  average time for offset 3 is 435.3

processing word of size 8
offset = 0  average time for offset 0 is 444.9
offset = 1  average time for offset 1 is 470.25
offset = 2  average time for offset 2 is 468.95
offset = 3  average time for offset 3 is 476.75
offset = 4  average time for offset 4 is 474.9
offset = 5  average time for offset 5 is 468.25
offset = 6  average time for offset 6 is 469.8
offset = 7  average time for offset 7 is 469.1

// g++ -O2 -o test test.cpp && ./test
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <iostream>
#include <cassert>
#include <vector>
#include <inttypes.h>

using namespace std;

class WallClockTimer
{
public:
    struct timeval t1, t2;

    WallClockTimer() : t1(), t2()
    {
        gettimeofday(&t1, 0);
        t2 = t1;
    }

    void reset()
    {
        gettimeofday(&t1, 0);
        t2 = t1;
    }

    int elapsed()
    {
        return (t2.tv_sec * 1000 + t2.tv_usec / 1000) - (t1.tv_sec * 1000 + t1.tv_usec / 1000);
    }

    int split()
    {
        gettimeofday(&t2, 0);
        return elapsed();
    }
};

// xor shift
uint32_t xor128(void)
{
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123;
    uint32_t t;

    t = x ^ (x << 11);
    x = y;
    y = z;
    z = w;
    return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}

template <class T>
void runtest()
{
    size_t N = 10 * 1000 * 1000;
    int repeat = 20;
    WallClockTimer timer;
    const bool paranoid = false;

    cout << "processing word of size " << sizeof(T) << endl;
    for (unsigned int offset = 0; offset < sizeof(T); ++offset)
    {
        vector<T> bigarray(N + 2);
        cout << "offset = " << offset << endl;
        T * const begin = reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(&bigarray[0]) + offset);
        assert(offset + reinterpret_cast<uintptr_t>(&bigarray[0]) == reinterpret_cast<uintptr_t>(begin));
        T * const end = begin + N;
        if (paranoid)
            assert(reinterpret_cast<uintptr_t>(end) < reinterpret_cast<uintptr_t>(&bigarray.back()));
        int sumt = 0;
        //cout << "ignore this: ";
        for (int k = 0; k < repeat; ++k)
        {
            timer.reset();
            for (size_t i = 0; i < N; ++i)
            {
                int ri = xor128() % N;
                begin[ri] = static_cast<T>(i);
            }
            volatile T val = 1;
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 2014-09-11 13:04:43 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? Then we make a choice for our users. A configuration option about an aspect of postgres that darned few people will understand, for the marginal differences between snappy and lz4, doesn't make sense. I don't see a compelling need for an option if we get a new algorithm that strictly dominates what we've already got in all parameters, and it may well be that, as respects pglz, that's achievable. But ISTM that it need not be true in general. If you look at the results, lz4 is pretty much there. Sure, there are algorithms which have much better compression -- but the time overhead is so large it just doesn't make sense for full page compression. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
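The ratio-versus-CPU trade-off Robert describes is easy to reproduce. None of pglz, lz4, or snappy ship with Python, so this sketch uses zlib's compression levels as a stand-in (an assumption, not a claim about those algorithms): one setting wins on CPU, another on ratio, and neither strictly dominates, which is exactly the situation where a choice has to be made for users.

```python
# Stand-in demonstration using zlib levels; not a benchmark of pglz/lz4/snappy.
import time
import zlib

# An 8192-byte synthetic "page" with repetitive, compressible content.
page = (b"tuple header" + bytes(52)) * 128

def measure(level):
    start = time.perf_counter()
    out = zlib.compress(page, level)
    return len(out), time.perf_counter() - start

fast_size, fast_time = measure(1)  # cheap on CPU, larger output
best_size, best_time = measure(9)  # more CPU, smaller output
print("level 1:", fast_size, "bytes;", "level 9:", best_size, "bytes")
```

On real full-page images the numbers differ, but the shape of the trade-off is the same one the thread is debating.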
Re: [HACKERS] implement subject alternative names support for SSL connections
On Mon, Sep 8, 2014 at 8:04 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/05/2014 07:30 PM, Alexey Klyukin wrote: The error does not state whether the names comes from the CN or from the SAN section. I'd reword that slightly, to: psql: server certificate for example.com (and 2 other names) does not match host name example-foo.com I never liked the current wording, server name foo very much. I think it's important to use the word server certificate in the error message, to make it clear where the problem is. +1 For translations, that message should be pluralized, but there is no libpq_ngettext macro equivalent to ngettext(), like libpq_gettext. I wonder if that was left out on purpose, or if we just haven't needed that in libpq before. Anyway, I think we need to add that for this. Well, the translation guidelines in the documentation suggest avoiding plurals if possible, but I don't like rephrasing the sentence with 'names total: %d', so attached is my attempt to add the function. I have also moved the one-time library binding code to the function of its own. This version also checks for the availability of the subject name, it looks like RFC 5280 does not require it at all. 4.1.2.6. Subject The subject field identifies the entity associated with the public key stored in the subject public key field. The subject name MAY be carried in the subject field and/or the subjectAltName extension. Ok. The pattern of allocating the name in the subroutine and then reporting it (and releasing when necessary) in the main function is a little bit ugly, but it looks to me the least ugly of anything else I could come up (i.e. extracting those names at the time the error message is shown). I reworked that a bit, see attached. What do you think? Thank you, I like your approach of unconditionally allocating and freeing memory, it makes the code easier to read. 
2 minor changes I've made in v7 (in addition to adding the libpq_ngettext):

- the verify_peer_name_matches_certificate does not try to return -1 (since it returns bool, it would be misinterpreted as true)
- moved the !got_error && !found_match check from the for loop header to the loop body, per a style comment from Tom at the beginning of this thread.

I think this is ready for commit, except for two things: 1. The pluralization of the message for translation 2. I still wonder if we should follow RFC 6125 and not check the Common Name at all, if SANs are present. I don't really understand the point of that rule, and it seems unlikely to pose a security issue in practice, because surely a CA won't sign a certificate with a bogus/wrong CN, because an older client that doesn't look at the SANs at all would use the CN anyway. But still... what does a Typical Web Browser do? At least Firefox and Safari seem to follow the RFC strictly, according to some anecdotal evidence (I haven't got enough time to dig into the source code yet): http://superuser.com/questions/230136/primary-common-name-in-subjectaltname https://bugzilla.mozilla.org/show_bug.cgi?id=238142 Chrome seems to follow them too, so the major open-source browsers are ignoring the Common Name if the SubjectAltName is present. Still, I see no win in ignoring the CN (except for the shorter code); it just annoys some users who forgot to put the CN name also in the SAN section, so my 5 cents is that we should check both.
Regards, -- Alexey Klyukin diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c new file mode 100644 index fc930bd..a082268 *** a/src/interfaces/libpq/fe-misc.c --- b/src/interfaces/libpq/fe-misc.c *** static int pqSendSome(PGconn *conn, int *** 64,69 --- 64,71 static int pqSocketCheck(PGconn *conn, int forRead, int forWrite, time_t end_time); static intpqSocketPoll(int sock, int forRead, int forWrite, time_t end_time); + static void libpq_bindomain(); + /* * PQlibVersion: return the libpq version number *** PQenv2encoding(void) *** 1210,1223 #ifdef ENABLE_NLS ! char * ! libpq_gettext(const char *msgid) { static bool already_bound = false; if (!already_bound) { ! /* dgettext() preserves errno, but bindtextdomain() doesn't */ #ifdef WIN32 int save_errno = GetLastError(); #else --- 1212,1225 #ifdef ENABLE_NLS ! static void ! libpq_bindomain() { static bool already_bound = false; if (!already_bound) { ! /* bindtextdomain() does not preserve errno */ #ifdef WIN32 int save_errno = GetLastError(); #else *** libpq_gettext(const char *msgid) *** 1237,1244 --- 1239,1258 errno = save_errno; #endif } + } + char * + libpq_gettext(const char *msgid) + { + libpq_bindomain();
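The CN-versus-SAN policy question above can be sketched compactly. This is a hedged illustration, not libpq's code: the function names and the simplified wildcard rule are made up for the example, and real certificate matching has more cases (IP addresses, internationalized names, multi-label wildcards).

```python
# Hypothetical sketch of the two verification policies under discussion.
def match_name(pattern, host):
    """Simplified matcher: exact match, or a single leading '*.' wildcard."""
    if pattern.startswith("*."):
        return "." in host and host.split(".", 1)[1] == pattern[2:]
    return pattern == host

def verify_peer_name(host, cn, sans, rfc6125=False):
    if any(match_name(s, host) for s in sans):
        return True
    if rfc6125 and sans:   # strict mode: any SAN present -> CN is ignored
        return False
    return cn is not None and match_name(cn, host)

# The case that annoys users: CN correct, but forgotten in the SAN section.
print(verify_peer_name("example.com", "example.com", ["www.example.com"]))
print(verify_peer_name("example.com", "example.com", ["www.example.com"], rfc6125=True))
```

The first call reflects the "check both" position Alexey argues for; the second reflects the strict RFC 6125 / browser behavior, which rejects the connection.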
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: Be advised of the difficulties you are going to face here. Assuming for a second there is no reason not to go unaligned on Intel and there are material benefits to justify the effort, that doesn't necessarily hold for other platforms like arm/power. Note that on many (most?) non-Intel architectures, unaligned access is simply not an option. The chips themselves will throw SIGBUS or equivalent if you try it. Some kernels provide signal handlers that emulate the unaligned access in software rather than killing the process; but the performance consequences of hitting such traps more than very occasionally would be catastrophic. Even on Intel, I'd wonder what unaligned accesses do to atomicity guarantees and suchlike. This is not a big deal for row data storage, but we'd have to be careful about it if we were to back off alignment requirements for in-memory data structures such as latches and buffer headers. Another fun thing you'd need to deal with is ensuring that the C structs we overlay onto catalog data rows still match up with the data layout rules. On the whole, I'm pretty darn skeptical that such an effort would repay itself. There are lots of more promising things to hack on. regards, tom lane Indeed, I don't know any other architectures where this would be an option. So if this ever moves forward it must be turned on at compile time for x86-64 only. I wonder how MySQL handles its rows even on those architectures, as their storage format is completely packed. If we just reduced the alignment requirements when laying out columns in the rows and indexes by reducing/removing padding -- typalign, it'd be enough gain in my (humble) opinion. If you think alignment is not an issue you can see savings everywhere, which is kinda insane... I'm unsure how this equates to patch complexity, but judging by the reactions so far I'm assuming a lot.
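The typalign-style padding being discussed can be observed directly with Python's struct module, which follows the platform C ABI in native ('@') mode: a 1-byte char followed by an 8-byte double gets 7 bytes of padding on x86-64, while the packed ('=') layout, roughly what a MySQL-style packed row format uses, needs only 9 bytes.

```python
# Demonstration of alignment padding versus packed layout via the C ABI.
import struct

aligned = struct.calcsize("@cd")  # native mode: char + padding + double
packed = struct.calcsize("=cd")   # standard (packed) mode: no padding
print(aligned, packed)            # 16 9 on typical x86-64 builds
```

The exact aligned size is platform-dependent, but wherever doubles require alignment the padded layout is strictly larger, which is the space saving the thread is weighing against unaligned-access costs.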
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
* Pavel Stehule (pavel.steh...@gmail.com) wrote: 2014-09-11 16:42 GMT+02:00 Stephen Frost sfr...@snowman.net: I don't particularly like this (having these fields set in refresh_utf8format to hard-coded strings in the function), why not have those handled the same as the rest, where the strings themselves are in the unicode_style structure? I am not sure if I understand well. With refresh_utf8format I can do shortly 6 possible combinations - or more (when it will be requested) I have no idea how to write as rest without repeating all 6 combinations - what was one noticed issue of some older variant, where I designed unicode1, unicode2, ... Any idea, tip how to it? All I was suggesting was pulling these strings out of the function:

+   /* ↵ */
+   popt->header_nl_right = "\342\206\265";
+
+   popt->nl_left = " ";
+
+   /* ↵ */
+   popt->nl_right = "\342\206\265";
+
+   /* … */
+   popt->wrap_left = "\342\200\246";
+   popt->wrap_right = "\342\200\246";

and adding them to unicode_style and then referencing them there, similar to how the rest of printTextFormat popt (by the way- don't really like that variable name, particularly as it's used elsewhere with a very different meaning.. why not 'format' or 'ptf'?) is set up in refresh_utf8format, that's all. Thanks, Stephen signature.asc Description: Digital signature
[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
Andrew, * Andrew Dunstan (and...@dunslane.net) wrote: On 09/11/2014 08:29 AM, Stephen Frost wrote: * Pavel Stehule (pavel.steh...@gmail.com) wrote: Can I help with something, it is there some open question? I had been hoping for a more definitive answer regarding this option for array_to_json, or even a comment about the change to row_to_json. Andrew- any thoughts on this? (that's what the ping on IRC is for :). The change in row_to_json looks OK, I think. we're replacing an overloading with use of default params, yes? That seems quite reasonable, and users shouldn't notice the difference. Right. Great, thanks. There might be a case for optionally suppressing nulls in array_to_json, and it might work reasonably since unlike SQL arrays JSON arrays don't have to be regular (if nested they are arrays of arrays, not multi-dimensional single arrays). OTOH I'm not sure if it's worth doing just for the sake of orthogonality. If someone wants it, then go for it. Ok. I'll handle updating both of these to remove the overloading and use default params instead, but I'll only add the 'ignore_null' option to row_to_json. Thanks again! Stephen signature.asc Description: Digital signature
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 1:17 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 13:04:43 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? Then we make a choice for our users. A configuration option about an aspect of postgres that darned few people will understand, for the marginal differences between snappy and lz4, doesn't make sense. Maybe. Let's get the basic patch done first; then we can argue about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 06:58:06PM +0200, Andres Freund wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. So, if we gain lz4 support - which I think is definitely where we should go - we should drop pglz support for the WAL. Greetings, Andres Freund +1 Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 07:17:42PM +0200, Andres Freund wrote: On 2014-09-11 13:04:43 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? Then we make a choice for our users. A configuration option about an aspect of postgres that darned few people will understand, for the marginal differences between snappy and lz4, doesn't make sense. I don't see a compelling need for an option if we get a new algorithm that strictly dominates what we've already got in all parameters, and it may well be that, as respects pglz, that's achievable. But ISTM that it need not be true in general. If you look at the results, lz4 is pretty much there. Sure, there are algorithms which have a much better compression - but the time overhead is so large it just doesn't make sense for full page compression. Greetings, Andres Freund In addition, you can leverage the presence of a higher-compression variant of lz4 (lz4hc) that uses the same decompression engine. It could be applied to static tables as a REINDEX option, or even to slowly growing tables that would benefit from the better compression as well as the increased decompression speed. Regards, Ken
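The store-it-compressed-only-if-smaller decision at the heart of FPW compression can be sketched independently of which compressor wins the argument above. The following is a self-contained toy, not the patch's code: a run-length encoder stands in for pglz/lz4, and `store_page_image`/`fpw_hdr` are invented names. The point is the policy - keep the compressed image only when it is actually smaller, and record which form was stored in a flag the redo side can test.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy run-length encoder standing in for pglz/lz4: emits (count, byte) pairs. */
static size_t toy_compress(const uint8_t *src, size_t len, uint8_t *dst, size_t dstcap)
{
    size_t out = 0;
    for (size_t i = 0; i < len; )
    {
        uint8_t b = src[i];
        size_t run = 1;
        while (i + run < len && src[i + run] == b && run < 255)
            run++;
        if (out + 2 > dstcap)
            return 0;           /* doesn't fit: caller stores the page raw */
        dst[out++] = (uint8_t) run;
        dst[out++] = b;
        i += run;
    }
    return out;
}

static size_t toy_decompress(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
    {
        memset(dst + out, src[i + 1], src[i]);
        out += src[i];
    }
    return out;
}

/* FPW-style policy: keep the compressed image only when it is smaller, and
 * flag which form was stored so redo knows whether to decompress. */
typedef struct { int is_compressed; size_t stored_len; } fpw_hdr;

static fpw_hdr store_page_image(const uint8_t *page, size_t pagesz,
                                uint8_t *out /* at least pagesz bytes */)
{
    uint8_t tmp[8192];          /* assumes pagesz <= 8192, like BLCKSZ */
    size_t  clen = toy_compress(page, pagesz, tmp, pagesz);
    fpw_hdr h;

    if (clen > 0 && clen < pagesz)
    {
        memcpy(out, tmp, clen);
        h.is_compressed = 1;
        h.stored_len = clen;
    }
    else
    {
        memcpy(out, page, pagesz);  /* incompressible: store raw */
        h.is_compressed = 0;
        h.stored_len = pagesz;
    }
    return h;
}
```

A mostly-empty page compresses to a few bytes and round-trips exactly; a page the toy encoder would expand falls back to the raw copy, which is also the behavior one would want from the real compressor hook.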
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Thu, Sep 11, 2014 at 7:34 AM, Amit Kapila amit.kapil...@gmail.com wrote: Don't we need some way to prohibit changing GUC by launching process, once it has shared the existing GUC? Nope. I mean, eventually, for true parallelism ... we absolutely will need that. But pg_background itself doesn't need that; it's perfectly fine for configuration settings to get changed in the background worker. So it's a different piece of infrastructure from this patch set. 1. This patch generates a warning on Windows: 1pg_background.obj : error LNK2001: unresolved external symbol StatementTimeout You need to add PGDLLIMPORT for StatementTimeout OK. I still think we should go back and PGDLLIMPORT-ize all the GUC variables. 2. CREATE FUNCTION pg_background_launch(sql pg_catalog.text, queue_size pg_catalog.int4 DEFAULT 65536) Here shouldn't queue_size be pg_catalog.int8, as I could see some related functions in test_shm_mq use int8? I think if you think you want a queue that's more than 2GB in size, you should re-think. pg_background_launch(PG_FUNCTION_ARGS) { text *sql = PG_GETARG_TEXT_PP(0); int32 queue_size = PG_GETARG_INT64(1); Here it should be the _INT32 variant to match the current sql definition, otherwise it leads to the below error. postgres=# select pg_background_launch('vacuum verbose foo'); ERROR: queue size must be at least 64 bytes Oops. 3. Comparing execute_sql_string() and exec_simple_query(), I could see the below main differences: a. Allocate a new memory context different from message context b. Start/commit control of transaction is outside execute_sql_string c. enable/disable statement timeout is done from outside in the case of execute_sql_string() d. execute_sql_string() prohibits Start/Commit/Abort transaction statements. e. execute_sql_string() doesn't log statements f. execute_sql_string() uses binary format for the result whereas exec_simple_query() uses TEXT as the default format g. processed-statement stat info is handled by the caller in the case of execute_sql_string(). 
Can't we handle all these or other changes inside exec_simple_query() based on some parameter, or maybe use a hook similar to what we do in ProcessUtility? Basically it looks a bit odd to have a duplicate (mostly) copy of exec_simple_query(). It is. But I didn't think hacking up exec_simple_query() was a better option. We could do that if most people like that approach, but to me it seemed there were enough differences to make it unappealing. 4. Won't it be better if pg_background_worker_main() can look more like PostgresMain() (in terms of handling different kinds of messages), so that it can be extended in future to handle parallel workers. I don't think that a parallel worker will look like pg_background in much more than broad outline. Some of the same constructs will get reused, but overall I think it's a different problem that I'd rather not conflate with this patch. 5. pg_background_result() { .. /* Read and processes messages from the shared memory queue. */ } Shouldn't the processing of messages be a separate function as we do for pqParseInput3(). I guess we could. It's not all that much code, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
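The PG_GETARG_INT64-on-an-int4-argument mismatch Amit spotted (point 2 above) can be illustrated with a toy model of a by-value Datum on a 64-bit build. These are hypothetical stand-ins, not the real fmgr macros - in actual Postgres the situation is worse still, because int8 is pass-by-reference on some platforms, so the wrong getter would dereference the value as a pointer. The by-value sketch already shows why a mismatched getter yields garbage:

```c
#include <assert.h>
#include <stdint.h>

/* Toy Datum: int4 arguments are stored zero-extended in a pointer-sized slot. */
typedef uintptr_t Datum;

static Datum   Int32GetDatum(int32_t x) { return (Datum) (uint32_t) x; }
static int32_t DatumGetInt32(Datum d)   { return (int32_t) d; }  /* right getter */
static int64_t DatumGetInt64(Datum d)   { return (int64_t) d; }  /* wrong getter */
```

For a positive value like the 65536 default the mismatch happens to go unnoticed, which is exactly what makes this kind of bug easy to ship; a negative int4 read through the 64-bit getter comes back as a huge positive number because the slot was only zero-extended to 32 bits.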
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 02:54:36PM -0300, Arthur Silva wrote: Indeed, I don't know of any other architectures where this would be an option. So if this ever moves forward it must be turned on at compile time for x86-64 only. I wonder how MySQL handles its rows even on those architectures, as its storage format is completely packed. If we just reduced the alignment requirements when laying out columns in the rows and indexes by reducing/removing padding -- typalign, it'd be enough gain in my (humble) opinion. If you think alignment is not an issue you can see savings everywhere, which is kinda insane... I'm unsure how this translates into patch complexity, but judging by the reactions so far I'm assuming a lot. If the column order in the table was independent of the physical layout, it would be possible to order columns to reduce the padding needed. Not my suggestion, just repeating a valid comment from earlier in the thread. Regards, Ken
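The padding cost that column reordering would avoid is the same one C programmers see in struct layout, since heap tuples are laid out with essentially struct-like alignment rules (typalign). A minimal sketch - the struct names are illustrative and the exact sizes are ABI-dependent, which is why only the ordering effect is asserted:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* An int8 ("bigint") attribute aligned to 8 bytes forces padding when
 * narrow attributes precede or follow it in an unfavorable order. */
typedef struct { char a; int64_t b; int16_t c; } BadOrder;   /* pad after a and c */
typedef struct { int64_t b; int16_t c; char a; } GoodOrder;  /* wide-to-narrow */
```

On a typical x86-64 ABI `BadOrder` is 24 bytes and `GoodOrder` is 16 - a 33% saving from reordering alone, with no change to alignment rules at all, which is the "order columns to reduce padding" idea repeated above.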
Re: [HACKERS] about half processes are blocked by btree, btree is bottleneck?
On Wed, Sep 10, 2014 at 8:08 PM, Xiaoyulei xiaoyu...@huawei.com wrote:

Sum:66
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x004987f7 in _bt_relandgetbuf ()
#4 0x0049c116 in _bt_search ()
#5 0x00497e13 in _bt_doinsert ()
#6 0x0049af52 in btinsert ()
#7 0x0072dce4 in FunctionCall6Coll ()
#8 0x0049592e in index_insert ()
#9 0x00590ac5 in ExecInsertIndexTuples ()

Sum:36
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x00497e31 in _bt_doinsert ()
#4 0x0049af52 in btinsert ()
#5 0x0072dce4 in FunctionCall6Coll ()
#6 0x0049592e in index_insert ()
#7 0x00590ac5 in ExecInsertIndexTuples ()

I don't know what Sum indicates here, but if the ratio of total calls to LWLockAcquire() between each LWLockAcquire() caller listed here is roughly in line with the Sum ratio, this must be a pretty small B-Tree (or the average size of each B-Tree must be very small). I suppose you could still see a lot of contention without the B-Tree ever getting much bigger if there are a lot of non-HOT updates. -- Peter Geoghegan
Re: [HACKERS] add modulo (%) operator to pgbench
However, that would not diminish nor change much the amount and kind of code necessary to add an operator or a function That's not really true. You can't really add abs(x) or hash(x) right now because the current code only supports this syntax: \set varname operand1 [ operator operand2 ] There's no way to add support for a unary operator with that syntax. Hmmm. If you accept a postfix syntax, there is:-) But this is not convincing. Adding a unary function with a clean syntax indeed requires doing something with the parser. -- Fabien.
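For what it's worth, the jump from the current `\set varname operand1 [ operator operand2 ]` form to a real expression grammar is not huge in code terms. Here is a hedged sketch of a recursive-descent evaluator (hypothetical, not pgbench's parser) where adding a unary function like abs(x) touches only the primary rule - which is the scalability argument for an expression syntax:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

static const char *p;           /* cursor into the expression string */

static long parse_expr(void);   /* forward declaration: primary recurses */

static void skip_ws(void) { while (isspace((unsigned char) *p)) p++; }

/* primary := '(' expr ')' | 'abs(' expr ')' | '-' primary | integer
 * New unary functions (hash(x), ...) would be added right here. */
static long parse_primary(void)
{
    skip_ws();
    if (*p == '(') { p++; long v = parse_expr(); skip_ws(); if (*p == ')') p++; return v; }
    if (strncmp(p, "abs(", 4) == 0)
        { p += 4; long v = parse_expr(); skip_ws(); if (*p == ')') p++; return labs(v); }
    if (*p == '-') { p++; return -parse_primary(); }
    {
        char *end;
        long v = strtol(p, &end, 10);
        p = end;
        return v;
    }
}

static long parse_term(void)    /* handles * / % with higher precedence */
{
    long v = parse_primary();
    for (;;)
    {
        skip_ws();
        if (*p == '*')      { p++; v *= parse_primary(); }
        else if (*p == '/') { p++; v /= parse_primary(); }
        else if (*p == '%') { p++; v %= parse_primary(); }
        else return v;
    }
}

static long parse_expr(void)    /* handles + - with lower precedence */
{
    long v = parse_term();
    for (;;)
    {
        skip_ws();
        if (*p == '+')      { p++; v += parse_term(); }
        else if (*p == '-') { p++; v -= parse_term(); }
        else return v;
    }
}

static long eval(const char *s) { p = s; return parse_expr(); }
```

It deliberately ignores error handling and the lexer-juggling issue Fabien raises (deciding which parts of a script line get parsed at all), which is the genuinely tricky part.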
Re: [HACKERS] RLS Design
Robert, * Robert Haas (robertmh...@gmail.com) wrote: On Sat, Sep 6, 2014 at 2:54 AM, Brightwell, Adam adam.brightw...@crunchydatasolutions.com wrote: * Add ALTER TABLE name { ENABLE | DISABLE } ROW LEVEL SECURITY - set flag on table to allow for a default-deny capability. If RLS is enabled on a table and has no policies, then a default-deny policy is automatically applied. If RLS is disabled on a table and the table still has policies on it, then an error is raised. Though if DISABLE is accompanied by CASCADE, then all policies will be removed and no error is raised. This text doesn't make it clear that all of the cases have been covered; in particular, you didn't specify whether an error is thrown if you try to add a policy to a table with DISABLE ROW LEVEL SECURITY in effect. Backing up a bit, I think there are two sensible designs here: Ah, yeah, the text could certainly be clearer. 1. Row level security policies can't exist for a table with DISABLE ROW LEVEL SECURITY in effect. It sounds like this is what you have implemented, modulo any hypothetical bugs. You can't add policies without enabling RLS, and you can't disable RLS without dropping them all. Right, this was the approach we were taking. Specifically, adding policies would implicitly enable RLS for the relation. 2. Row level security policies can exist for a table with DISABLE ROW LEVEL SECURITY in effect, but they don't do anything until RLS is enabled. A possible advantage of this approach is that you could *temporarily* shut off RLS for a table without having to drop all of your policies and put them back. I kind of like this approach; we have something similar for triggers, and I think it could be useful to people. I like the idea of being able to turn them off without dropping them. We have that with row_security = off, but that would only work for the owner or a superuser (or a user with bypassrls). This would allow disabling RLS temporarily for everything accessing the table. 
The one thing I'm wondering about with this design is- what happens when a policy is initially added? Currently, we automatically turn on RLS for the table when that happens. I'm not thrilled with the idea that you have to add policies AND turn on RLS explicitly- someone might add policies but then forget to turn RLS on.. If you stick with approach #1, make sure pg_dump is guaranteed to enable RLS before applying the policies. Currently, adding a policy automatically turns on RLS, so we don't have any issue with pg_dump from that perspective. Handling cases where RLS is disabled but policies exist would get more complicated for pg_dump if we keep the current idea that adding policies implicitly turns on RLS- it'd essentially have to go back and turn it off after the policies are added. Not a big fan of that either. And either way, you should check that pg_dump behaves sanely in the case where there are circular dependencies, like you have two tables A and B, and each has a RLS policy that manages to use the other table's row-type. (You probably also want to check that DROP and DROP .. CASCADE on either policy or either table does the right thing in that situation, but that's probably easier to get right.) Agreed, we'll double-check that this is working. As these are attributes of the table which get added later on by pg_dump, similar to permissions, I'd think it'd all work fine, but good to make sure (and ditto with DROP/DROP CASCADE.. We have some checks for that, but good to make sure it works in a circular-dependency case too). If we want to be able to disable RLS w/o dropping the policies, then I think we have to completely de-couple the two and users would then have to both add policies AND turn on RLS to have RLS actually be enabled for a given table. I'm on the fence about that. Thoughts? Thanks! Stephen
Re: [HACKERS] Commitfest status
On 11/09/14 18:59, Tomas Vondra wrote: On 10.9.2014 22:39, Heikki Linnakangas wrote: The bad news is that the rest don't seem to be graduating from the Needs Review state. ISTM that many patches (b) in 'waiting on author' are not really waiting, because the author already responded / posted a new version of the patch Except that the patch status was not updated, which makes it really difficult to spot patches that currently need a review :-( I think that still means the patch is 'waiting on author', as the author is responsible for changing this. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 11/09/14 20:37, Robert Haas wrote: 1. This patch generates warning on windows 1pg_background.obj : error LNK2001: unresolved external symbol StatementTimeout You need to add PGDLLIMPORT for StatementTimeout OK. I still think we should go back and PGDLLIMPORT-ize all the GUC variables. +1 4. Won't it be better if pg_background_worker_main() can look more like PostgresMain() (in terms of handling different kind of messages), so that it can be extended in future to handle parallel worker. I don't think that a parallel worker will look like pg_background in much more than broad outline. Some of the same constructs will get reused, but overall I think it's a different problem that I'd rather not conflate with this patch. Yeah, I agree here. While the patch provides a lot of necessary plumbing that any kind of parallel processing needs, pg_background itself does not seem to be the base for that. Actually, when I first saw pg_background, I was thinking that it looks like a good base for some kind of background job scheduling mechanism that was requested a few times over the years (the actual scheduling is the only part missing now IMHO, but that's a separate discussion). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 2014-09-11 21:27:15 +0200, Petr Jelinek wrote: On 11/09/14 20:37, Robert Haas wrote: OK. I still think we should go back and PGDLLIMPORT-ize all the GUC variables. +1 Same here. I think Tom was the only one against it, but pretty much everyone else was for it. We should fix all the GUCs declared as externs in multiple .c files at the same time imo. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
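For readers unfamiliar with the Windows issue being +1'd here: a global variable exported from the server binary must be declared dllimport for code built as a separate DLL (like pg_background) to link against it, while on other platforms no annotation is needed. A hedged sketch of the conventional idiom - this mirrors the shape of Postgres's macro but is not copied from the headers:

```c
#include <assert.h>

/* On Windows the annotation marks the symbol as imported from the server
 * executable; elsewhere the macro expands to nothing, so it costs nothing. */
#ifdef _WIN32
#define PGDLLIMPORT __declspec(dllimport)
#else
#define PGDLLIMPORT
#endif

/* With the annotation, an extension's reference to the GUC links on MSVC;
 * without it, you get "unresolved external symbol StatementTimeout".
 * (Declaration only - the variable is defined in the server.) */
extern PGDLLIMPORT int StatementTimeout;
```

This is why "PGDLLIMPORT-izing all the GUC variables" is purely additive: on non-Windows builds the declarations are textually unchanged after preprocessing.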
Re: [HACKERS] Aussie timezone database changes incoming
On 12/09/14 01:57, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:20 AM, Craig Ringer cr...@2ndquadrant.com wrote: I shouldn't be surprised that Australia gets to change. While the cynic in me thinks this is the usual USA-is-the-center-of-the-universe-ism, in reality it makes sense given relative population and likely impact. Just because it makes sense doesn't mean it isn't USA-is-the-center-of-the-universe-ism. ...Robert In the same way, Americans tend to act like they are not on a planet, saying something will happen in Summer - completely ignoring that for people who live in the Southern hemisphere it will be Winter! Cheers, Gavin
Re: [HACKERS] [v9.5] Custom Plan API
On Thu, Sep 11, 2014 at 11:24 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Don't the changes to src/backend/optimizer/plan/createplan.c belong in patch #2? The borderline between #1 and #2 is a little bit bogus. So, I moved most of that portion into #1; however, the invocation of InitCustomScan (that is a callback in CustomPlanMethod) in create_custom_plan() is still in #2. Eh, create_custom_scan() certainly looks like it is in #1 from here, or at least part of it is. It calculates tlist and clauses and then does nothing with them. That clearly can't be the right division. I think it would make sense to have create_custom_scan() compute tlist and clauses first, and then pass those to CreateCustomPlan(). Then you don't need a separate InitCustomScan() - which is misnamed anyway, since it has nothing to do with ExecInitCustomScan(). OK, I revised. Now custom-scan assumes it has a particular valid relation to be scanned, so no code path with scanrelid == 0 at this moment. Let us revisit this scenario when custom-scan replaces relation-joins. In this case, custom-scan will not be associated with a particular base-relation, thus it needs to admit a custom-scan node with scanrelid == 0. Yeah, I guess the question there is whether we'll want to let CustomScan have scanrelid == 0 or require that CustomJoin be used there instead. Why can't the Custom(GpuHashJoin) node build the hash table internally instead of using a separate node? It's possible; however, it prevents checking sub-plans using EXPLAIN if we manage inner-plans internally. So, I'd like to have a separate node connected to the inner-plan. Isn't that just a matter of letting the EXPLAIN code print more stuff? Why can't it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
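To make the suggested division of labor concrete: the core planner computes what it can (tlist, clauses) and hands the results to a provider callback through a method table, so no separate init entry point is needed. This is a hedged sketch of that callback-table pattern only - the struct and function names are illustrative, not the proposed Custom Plan API:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for planner state; real nodes carry far more. */
typedef struct CustomScanState
{
    int initialized;
    int scanrelid;      /* 0 would mean "no single base relation" (a join) */
} CustomScanState;

/* The method-table idiom: the provider registers function pointers that
 * the core code invokes at well-defined points. */
typedef struct CustomPlanMethods
{
    const char *name;
    void (*CreateCustomPlan) (CustomScanState *node, int scanrelid);
} CustomPlanMethods;

/* A provider's implementation: receives what the core already computed. */
static void gpu_create_plan(CustomScanState *node, int scanrelid)
{
    node->scanrelid = scanrelid;
    node->initialized = 1;
}

static CustomPlanMethods gpuscan_methods = { "GpuScan", gpu_create_plan };
```

The design point under discussion is simply where the boundary sits: if the core computes tlist/clauses and passes them into `CreateCustomPlan`, the provider never needs a second `InitCustomScan`-style hook.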
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote: No, not really. All you have to do is write a little test program to gather the information. I don't think a little test program is useful - IMV it's too much of a simplification to suppose that a strcoll() has a fixed cost, and a memcmp() has a fixed cost, and that we can determine algebraically that we should (say) proceed or not proceed with the additional opportunistic memcmp() == 0 optimization based solely on that. I'm not sure if that's what you meant, but it might have been. Temporal locality is surely a huge factor here, for example. Are we talking about a memcmp() that always immediately precedes a similar strcoll() call on the same memory? Are we factoring in the cost of NUL-termination in order to make each strcoll() possible? And that's just for starters. However, I think it's perfectly fair to consider a case where the opportunistic memcmp() == 0 thing never works out (as opposed to mostly not helping, which is what I considered earlier [1]), as long as we're sorting real tuples. You mentioned Heikki's test case; it seems fair to consider that, but for the non-abbreviated case where the additional, *totally* opportunistic memcmp == 0 optimization only applies (so no abbreviated keys), while still having the additional optimization be 100% useless. Clearly that test case is also just about perfectly pessimal for this case too. (Recall that Heikki's test case shows performance on pre-sorted input, so there are far fewer comparisons than would typically be required anyway - n comparisons, or a bubble sort best case. If I wanted to cheat, I could reduce work_mem so that an external tape sort is used, since as it happens tapesort doesn't opportunistically check for pre-sorted input, but I won't do that. 
Heikki's case both emphasizes the amortized cost of a strxfrm() where we abbreviate, and in this instance de-emphasizes the importance of memory latency by having access be sequential/predictable.) The only variation I'm adding here to Heikki's original test case is to have a leading int4 attribute that always has a value of 1 -- that conveniently removes abbreviation (including strxfrm() overhead) as a factor that can influence the outcome, since right now that isn't under consideration. So: create table sorttest (dummy int4, t text); insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75) from generate_series(1, 3) g; Benchmark: pg@hamster:~/tests$ cat heikki-sort.sql select * from (select * from sorttest order by dummy, t offset 100) f; pgbench -f heikki-sort.sql -T 100 -n With optimization enabled tps = 77.861353 (including connections establishing) tps = 77.862260 (excluding connections establishing) tps = 78.211053 (including connections establishing) tps = 78.212016 (excluding connections establishing) tps = 77.996117 (including connections establishing) tps = 77.997069 (excluding connections establishing) With optimization disabled (len1 == len2 thing is commented out) = tps = 78.719389 (including connections establishing) tps = 78.720366 (excluding connections establishing) tps = 78.764819 (including connections establishing) tps = 78.765712 (excluding connections establishing) tps = 78.472902 (including connections establishing) tps = 78.473844 (excluding connections establishing) So, yes, it looks like I might have just about regressed this case - it's hard to be completely sure. However, this is still a very unrealistic case, since invariably len1 == len2 without the optimization ever working out, whereas the case that benefits [2] is quite representative. As I'm sure you were expecting, I still favor pursuing this additional optimization. If you think I've been unfair or not thorough, I'm happy to look at other cases. 
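For readers following along, the optimization being benchmarked above can be sketched as follows. This is a hedged stand-in for the text comparator, not varstr_cmp itself: strcmp plays the role of the expensive strcoll() path (plus the NUL-termination copies it requires), so the sketch is locale-independent.

```c
#include <assert.h>
#include <string.h>

/* Stands in for the strcoll() path: locale-aware comparison plus the cost
 * of NUL-terminating copies of both inputs in the real code. */
static int expensive_collate_cmp(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* The opportunistic "len1 == len2 thing": equal content implies equal
 * length, so the fast path can only ever fire when lengths match, and when
 * it misses the only wasted work is one memcmp over memory we were about
 * to read anyway. */
static int text_cmp_with_fastpath(const char *a, size_t alen,
                                  const char *b, size_t blen)
{
    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;
    return expensive_collate_cmp(a, b);
}
```

Heikki's pessimal case is precisely the input where `alen == blen` always holds but the memcmp never returns 0, so every comparison pays for both paths - hence the small regression measured above.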
Also, I'm not sure that you accept that I'm justified in considering this a separate question from the more important question of what to do in the tie-breaker abbreviation case (where we may be almost certain that equality will be indicated by a memcmp()). If you don't accept that I'm right about that more important case, I guess that means that you don't have confidence in my ad-hoc cost model (the HyperLogLog/cardinality stuff). [1] http://www.postgresql.org/message-id/CAM3SWZR9dtGO+zX4VEn7GTW2=+umSNq=c57sjgxg8oqhjar...@mail.gmail.com [2] http://www.postgresql.org/message-id/cam3swzqtyv3kp+cakzjzv3rwb1ojjahwpcz9coyjxpkhbtc...@mail.gmail.com -- Peter Geoghegan
Re: [HACKERS] RLS Design
On Thu, Sep 11, 2014 at 3:08 PM, Stephen Frost sfr...@snowman.net wrote: 2. Row level security policies can exist for a table with DISABLE ROW LEVEL SECURITY in effect, but they don't do anything until RLS is enabled. A possible advantage of this approach is that you could *temporarily* shut off RLS for a table without having to drop all of your policies and put them back. I kind of like this approach; we have something similar for triggers, and I think it could be useful to people. I like the idea of being able to turn them off without dropping them. We have that with row_security = off, but that would only work for the owner or a superuser (or a user with bypassrls). This would allow disabling RLS temporairly for everything accessing the table. The one thing I'm wondering about with this design is- what happens when a policy is initially added? Currently, we automatically turn on RLS for the table when that happens. I'm not thrilled with the idea that you have to add policies AND turn on RLS explicitly- someone might add policies but then forget to turn RLS on.. Whoa. I think that's a bad idea. I think the default value for RLS should be disabled, and users should have to turn it on explicitly if they want to get it. It's arguable whether the behavior if you try to create a policy beforehand should be (1) outright failure or (2) command accepted but no effect, but I think (3) automagically enable the feature is a POLA violation. When somebody adds a policy and then drops it again, they will expect to be back in the same state they started out in, and for good reason. If we want to be able to disable RLS w/o dropping the policies, then I think we have to completely de-couple the two and users would then have both add policies AND turn on RLS to have RLS actually be enabled for a given table. I'm on the fence about that. Thoughts? A strong +1 for doing just that. 
Look, anybody who is going to use row-level security but isn't careful enough to verify that it's actually working as desired after configuring it is a lost cause anyway. That is the moral equivalent of a locksmith who comes out and replaces a lock for you and at no point while he's there does he ever close the door and verify that it latches and won't reopen. I'm sure somebody has done that, but if a security breach results, surely everybody would agree that the locksmith is at fault, not the lock manufacturer. Personally, I have to test every GRANT and REVOKE I issue, because there's no error for granting a privilege that the target already has or revoking one they don't, and with group membership and PUBLIC it's quite easy to have not done what you thought you did. Fixing that might be worthwhile but it doesn't take away from the fact that, like any other configuration change you make, security-relevant changes need to be tested. There is another possible advantage of the explicit-enable approach as well, which is that you might want to create several policies and then turn them all on at once. With what you have now, creating the first policy will enable RLS on the table and then everyone who wasn't the beneficiary of that initial policy is locked out. Now, granted, you can probably get around that by doing all of the operations in one transaction, so it's a minor point. But it's still nice to think about being able to add several policies and then flip them on. If it doesn't work out, flip them off, adjust, and flip them back on again. Now, again, the core design issue, IMHO, is that the switch from default-allow to default-deny should be explicit and unmistakable, so the rest of this is just tinkering around the edges. But we might as well make those edges as nice as possible, and the usability of this approach feels good to me. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] RLS Design
* Robert Haas (robertmh...@gmail.com) wrote: On Thu, Sep 11, 2014 at 3:08 PM, Stephen Frost sfr...@snowman.net wrote: The one thing I'm wondering about with this design is- what happens when a policy is initially added? Currently, we automatically turn on RLS for the table when that happens. I'm not thrilled with the idea that you have to add policies AND turn on RLS explicitly- someone might add policies but then forget to turn RLS on.. Whoa. I think that's a bad idea. I think the default value for RLS should be disabled, and users should have to turn it on explicitly if they want to get it. It's arguable whether the behavior if you try to create a policy beforehand should be (1) outright failure or (2) command accepted but no effect, but I think (3) automagically enable the feature is a POLA violation. When somebody adds a policy and then drops it again, they will expect to be back in the same state they started out in, and for good reason. Yeah, that I can agree with. Prior to adding the ability to explicitly enable RLS, that's what they got, but that's changed now that we've made the ability to turn on/off RLS half-way independent of policies. Also.. If we want to be able to disable RLS w/o dropping the policies, then I think we have to completely de-couple the two and users would then have both add policies AND turn on RLS to have RLS actually be enabled for a given table. I'm on the fence about that. A strong +1 for doing just that. Look, anybody who is going to use row-level security but isn't careful enough to verify that it's actually working as desired after configuring it is a lost cause anyway. I had been thinking the same, which is why I was on the fence about if it was really an issue or not. This all amounts to actually making the patch smaller also, which isn't a bad thing. 
Personally, I have to test every GRANT and REVOKE I issue, because there's no error for granting a privilege that the target already has or revoking one they don't, and with group membership and PUBLIC it's quite easy to have not done what you thought you did. Fixing that might be worthwhile but it doesn't take away from the fact that, like any other configuration change you make, security-relevant changes need to be tested. Hmm, pretty sure that'd end up going against the spec too, but that's a whole different discussion anyway. There is another possible advantage of the explicit-enable approach as well, which is that you might want to create several policies and then turn them all on at once. With what you have now, creating the first policy will enable RLS on the table and then everyone who wasn't the beneficiary of that initial policy is locked out. Now, granted, you can probably get around that by doing all of the operations in one transaction, so it's a minor point. But it's still nice to think about being able to add several policies and then flip them on. If it doesn't work out, flip them off, adjust, and flip them back on again. Now, again, the core design issue, IMHO, is that the switch from default-allow to default-deny should be explicit and unmistakable, so the rest of this is just tinkering around the edges. But we might as well make those edges as nice as possible, and the usability of this approach feels good to me. Fair enough. Thanks! Stephen
Re: [HACKERS] Stating the significance of Lehman Yao in the nbtree README
On Tue, Sep 9, 2014 at 12:01 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: + Lehman and Yao don't require read locks, but assume that in-memory + copies of tree pages are unshared. This is the existing paragraph, just moved to different place in the README. That's right - it seemed to make just as much sense to talk about that here, while doing so establishes the context of talking about how what we do differs from the canonical algorithm (which I think is true of real-world LY implementations generally). Which makes the next paragraph easier to understand: + Although it could be argued that Lehman and Yao isn't followed to the + letter because single pages are read locked as the tree is descended, + this is at least necessary to support deletion, a common requirement + which LY hardly acknowledge. Read locks also ensure that B-tree + pages are self-consistent (LY appear to assume atomic page reads and + writes). This is just duplicating the existing paragraph. I don't see the point of this. The point is to make clear the reason for the differences - evidently, Amit felt it was unclear why we don't follow LY. I am suggesting that LY's talk of having no read locks is unrealistic (it might be realistic in the 21st century, but that's a whole other story). Even with these read locks, following LY obviates the need + of earlier schemes to hold multiple locks concurrently when descending + the tree as part of servicing index scans (pessimistic lock coupling). + The addition of right-links at all levels, as well as the addition of + a page high key allows detection and dynamic recovery from + concurrent page splits (that is, splits between unlocking an internal + page, and subsequently locking its child page during a descent). This explains in a few sentences what a LY B-tree looks like. The current README assumes that you're already familiar with what a LY tree looks like, or that you go read the paper mentioned at the top of the README. 
It might be a good idea to expand on that, and add an introduction like this so that an unfamiliar reader doesn't need to read the L&Y paper first. Is that the purpose of this patch? Please make it more explicit. Yes, although even L&Y don't get around to telling the reader why they should care, the mistake that many good papers make. We now know its significance, both in general and to Postgres. Framing the discussion like this aids understanding more than you'd think. And please make the sentences simpler - the above reads like a Shakespeare play. Out, damned lock! Something like: The basic Lehman & Yao Algorithm Compared to a classic B-tree, L&Y adds a right-link pointer to each page, to the page's right sibling. It also adds a high key to each page, which is an upper bound on the keys that are allowed on that page. These two additions make it possible to detect a concurrent page split, which allows the tree to be searched without holding any read locks (except to keep a single page from being modified while reading it). When a search follows a downlink to a child page, it compares the page's high key with the search key. If the search key is greater than the high key, the page must've been split concurrently, and you must follow the right-link to find the new page containing the key range you're looking for. This might need to be repeated, if the page has been split more than once. Differences to the Lehman & Yao algorithm = (current Lehman and Yao Algorithm and Insertions section) I think that's pretty much the same information you added, but it's in a separate section, with the clear purpose that it explains what an L&Y tree looks like. You can skip over it, if you have read the paper or are otherwise already familiar with it. It still assumes that you're familiar with B-trees in general. That seems fair enough - I'd just expand on why we don't completely avoid read locks, which L&Y suppose we can get away with. That is clearly a point of confusion.
Anyway, I see that you had resurrected this in the commitfest app after three weeks of inactivity. I'm going to mark this back to Returned with Feedback. Please don't resurrect it again, this patch has received more than its fair share of attention. I didn't mean to suggest that it deserves much attention. I didn't know how to interpret the fact that you changed the status, since in fact not much additional work was required. I was busy throughout, for reasons that are perhaps obvious. I am not fussed about when this happens, but I really think we should get around to it. Instead, please help by signing up to review a patch. The commitfest progress is glacial at the moment, and we really need experienced reviewers like you to bring closure to people's patches. I'm back from holidays now. I plan to look at Tomas Vondra's patch. If you think I should look at some particular
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 11, 2014 at 4:13 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote: No, not really. All you have to do is write a little test program to gather the information. I don't think a little test program is useful - IMV it's too much of a simplification to suppose that a strcoll() has a fixed cost, and a memcmp() has a fixed cost, and that we can determine algebraically that we should (say) proceed or not proceed with the additional opportunistic memcmp() == 0 optimization based solely on that. I'm not sure if that's what you meant, but it might have been. I think I said pretty clearly that it was. However, I think it's perfectly fair to consider a case where the opportunistic memcmp() == 0 thing never works out (as opposed to mostly not helping, which is what I considered earlier [1]), as long as we're sorting real tuples. You mentioned Heikki's test case; it seems fair to consider that, but for the non-abbreviated case where the additional, *totally* opportunistic memcmp == 0 optimization only applies (so no abbreviated keys), while still having the additional optimization be 100% useless. Clearly that test case is also just about perfectly pessimal for this case too. (Recall that Heikki's test case shows performance on pre-sorted input, so there are far fewer comparisons than would typically be required anyway - n comparisons, or a bubble sort best case. If I wanted to cheat, I could reduce work_mem so that an external tape sort is used, since as it happens tapesort doesn't opportunistically check for pre-sorted input, but I won't do that. Heikki's case both emphasizes the amortized cost of a strxfrm() where we abbreviate, and in this instance de-emphasizes the importance of memory latency by having access be sequential/predictable.)
The only variation I'm adding here to Heikki's original test case is to have a leading int4 attribute that always has a value of 1 -- that conveniently removes abbreviation (including strxfrm() overhead) as a factor that can influence the outcome, since right now that isn't under consideration. So:

create table sorttest (dummy int4, t text);
insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75) from generate_series(1, 3) g;

Benchmark:

pg@hamster:~/tests$ cat heikki-sort.sql
select * from (select * from sorttest order by dummy, t offset 100) f;

pgbench -f heikki-sort.sql -T 100 -n

With optimization enabled:
tps = 77.861353 (including connections establishing)
tps = 77.862260 (excluding connections establishing)
tps = 78.211053 (including connections establishing)
tps = 78.212016 (excluding connections establishing)
tps = 77.996117 (including connections establishing)
tps = 77.997069 (excluding connections establishing)

With optimization disabled (len1 == len2 thing is commented out):
tps = 78.719389 (including connections establishing)
tps = 78.720366 (excluding connections establishing)
tps = 78.764819 (including connections establishing)
tps = 78.765712 (excluding connections establishing)
tps = 78.472902 (including connections establishing)
tps = 78.473844 (excluding connections establishing)

So, yes, it looks like I might have just about regressed this case - it's hard to be completely sure. However, this is still a very unrealistic case, since invariably len1 == len2 without the optimization ever working out, whereas the case that benefits [2] is quite representative. As I'm sure you were expecting, I still favor pursuing this additional optimization. Well, I have to agree that doesn't look too bad, but your reluctance to actually do the microbenchmark worries me. Granted, macrobenchmarks are what actually matters, but they can be noisy and there can be other confounding factors.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_upgrade and epoch
On Wed, Sep 10, 2014 at 02:24:17AM +0100, Greg Stark wrote: On Tue, Sep 9, 2014 at 4:05 PM, Bruce Momjian br...@momjian.us wrote: Yes, I did think about that, but it seems like a behavior change. However, it is tempting to avoid future bug reports about this. When this came up in March, Tom and I agreed that this wasn't something we wanted to slip into 9.4. Given that, it is hard to argue we should now slip this into 9.5, 9.4, and 9.3, so unless someone else votes for inclusion, I think I will leave this as 9.5-only. With no one replying, I will consider this issue closed and not backpatch this. I think the reason nobody's responding is because nobody has anything significant to add. It's a behavior change from not-working to working. Why wouldn't it be backpatched? OK, Greg seems to be passionate about this. Does anyone _object_ to my back-patching the epoch preservation fix through 9.3? Tom? The patch is commit a74a4aa23bb95b590ff01ee564219d2eacea3706. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] pg_upgrade and epoch
Bruce Momjian br...@momjian.us writes: On Wed, Sep 10, 2014 at 02:24:17AM +0100, Greg Stark wrote: I think the reason nobody's responding is because nobody has anything significant to add. It's a behavior change from not-working to working. Why wouldn't it be backpatched? OK, Greg seems to be passionate about this. Does anyone _object_ to my back-patching the epoch preservation fix through 9.3? Tom? Not I. This is a data-loss bug fix, no? Why would we not back-patch it? regards, tom lane