Re: [HACKERS] Removing INNER JOINs

2014-12-07 Thread Simon Riggs
On 4 December 2014 at 12:24, Simon Riggs si...@2ndquadrant.com wrote:
 On 3 December 2014 at 12:18, Atri Sharma atri.j...@gmail.com wrote:

 So does the planner keep all possibly-satisfying plans, or does it look at the
 possible conditions (like the presence of a foreign key in this case, for example)
 and then let the executor choose between them?

 I'm suggesting the planner keeps 2 plans: One with removable joins,
 one without the removable joins.

I only just noticed the thread moved on while I was flying.

So it looks like Tom and I said the same thing, or close enough for me to +1 Tom.


Another idea would be to only skip Hash and Merge Joins, since the
tests for those are fairly easy to put into the Init call. That sounds
slightly easier than the proposal with the Option/Choice/Switch node.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Misunderstanding on the FSM README file

2014-12-07 Thread Guillaume Lelarge
Hi,

I've been reading the FSM README file lately
(src/backend/storage/freespace/README), and I'm puzzled by one of the graphs
(the binary tree structure of an FSM file). Here it is:

    4
 4     2
3 4   0 2    <- This level represents heap pages

Shouldn't the last line be:
4 3   2 0

(ie, highest number of free space on the left node, lowest on the right one)

Probably just nitpicking, but still, I'm wondering if I missed something
out.

Thanks.


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] Misunderstanding on the FSM README file

2014-12-07 Thread Heikki Linnakangas

On 12/07/2014 02:03 PM, Guillaume Lelarge wrote:

Hi,

I've been reading the FSM README file lately
(src/backend/storage/freespace/README), and I'm puzzled by one of the graphs
(the binary tree structure of an FSM file). Here it is:

    4
 4     2
3 4   0 2    <- This level represents heap pages

Shouldn't the last line be:
4 3   2 0

(ie, highest number of free space on the left node, lowest on the right one)

Probably just nitpicking, but still, I'm wondering if I missed something
out.


No, that's not how it works. Each number at the bottom level corresponds 
to a particular heap page. The first number would be heap page #0 (which 
has 3 units of free space), the second heap page #1 (with 4 units of 
free space) and so forth. Each node on the upper levels stores the 
maximum of its two children.
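
To make the max-aggregation concrete, here is a small standalone C sketch
(purely illustrative, not the actual freespace-map code) that rebuilds the
upper levels of the README example from the per-page leaf values:

#include <stdio.h>

#define LEAVES 4

/* Free-space categories of heap pages #0..#3, as in the README example. */
static int leaf[LEAVES] = {3, 4, 0, 2};

static int max2(int a, int b) { return a > b ? a : b; }

int
main(void)
{
    int level1[LEAVES / 2];
    int root;

    /* Each upper-level node stores the maximum of its two children. */
    for (int i = 0; i < LEAVES / 2; i++)
        level1[i] = max2(leaf[2 * i], leaf[2 * i + 1]);
    root = max2(level1[0], level1[1]);

    printf("    %d\n", root);                        /* 4         */
    printf(" %d     %d\n", level1[0], level1[1]);    /* 4     2   */
    printf("%d %d   %d %d   <- heap pages #0..#3\n",
           leaf[0], leaf[1], leaf[2], leaf[3]);      /* 3 4   0 2 */
    return 0;
}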


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] alter user set local_preload_libraries.

2014-12-07 Thread Peter Eisentraut
On 11/12/14 1:01 PM, Fujii Masao wrote:
 On Wed, Nov 5, 2014 at 1:22 AM, Peter Eisentraut pete...@gmx.net wrote:
 On 10/9/14 1:58 PM, Fujii Masao wrote:
 Also I think that it's useful to allow ALTER ROLE/DATABASE SET to
 set PGC_BACKEND and PGC_SU_BACKEND parameters. So, what
 about applying the attached patch? This patch allows that and
 changes the context of session_preload_libraries to PGC_SU_BACKEND.

 After looking through this again, I wonder whether there is any reason
 why ignore_system_indexes cannot be plain PGC_USERSET.  With this
 change, we'd allow setting it via ALTER ROLE, but the access to
 pg_db_role_setting happens before it.
 
 Even without the patch, we can set ignore_system_indexes
 at the startup of the connection because it's defined with
 PGC_BACKEND context, but the access to system tables
 can happen before that. Am I missing something?

Let's think about what would happen if we allowed PGC_BACKEND settings
to be changed by ALTER ROLE.

Here is the current set of PGC_BACKEND and PGC_SU_BACKEND settings:

- ignore_system_indexes

Reason: interesting consequences if changed later

- post_auth_delay

Reason: changing later would have no effect

- local_preload_libraries

Reason: changing later would have no effect

- log_connections

Reason: changing later would have no effect

- log_disconnections

Reason: dubious / consistency with log_connections?

Only ignore_system_indexes is really in the spirit of the original
PGC_BACKEND setting, namely for settings that unprivileged users can set
at the beginning of a session but should not change later.  We used to
have fsync in that category, for example, because changing that was
not considered safe at some time.  We used to have a lot more of these,
but not many stood the test of time.

The other settings are really just things that take effect during
session start but don't hurt when changed later.  The problem is that
the order of these relative to ALTER ROLE processing is really an
implementation detail or a best effort type of thing.  For example, it
looks as though we are making an effort to process post_auth_delay
really late, after ALTER ROLE processing (even though we currently don't
allow it to be set that way; strange), but there is no reason why that
has to be so.  One could reasonably think that post auth is really
early, before catalog access starts.  On the other hand, log_connections
is processed really early, so ALTER ROLE would have no effect.

This is going to end up inconsistent one way or the other.

My radical proposal therefore would have been to embrace this
inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND altogether,
relying on users interpreting the parameter names to indicate that
changing them later may or may not have an effect.  Unfortunately, the
concerns about ignore_system_indexes prevent that.

We could think of another way to address those concerns, e.g., with an
ad hoc way in an assign hook.

The other proposal would be to keep PGC_BACKEND and PGC_SU_BACKEND as
special-case warts, perhaps document them as such, but change everything
to use something else as much as possible, namely

post_auth_delay -> PGC_USERSET
local_preload_libraries -> PGC_USERSET
log_disconnections -> PGC_SUSET



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] alter user set local_preload_libraries.

2014-12-07 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 My radical proposal therefore would have been to embrace this
 inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND altogether,
 relying on users interpreting the parameter names to indicate that
 changing them later may or may not have an effect.  Unfortunately, the
 concerns about ignore_system_indexes prevent that.

Yeah, I think making ignore_system_indexes USERSET would be a pretty bad
idea.  People would expect that they can frob it back and forth with no
impact other than performance, and I'm doubtful that that's true.

If we wanted to make a push to get rid of PGC_BACKEND, I could see
changing ignore_system_indexes to SUSET category, and assuming that
superusers are adults who won't push a button just to see what it does.

But having said that, I don't really think that PGC_BACKEND is a useless
category.  It provides a uniform way of documenting that changing a
particular setting post-session-start is useless.  Therefore I'm not
on board with getting rid of it.  To the extent that we can make ALTER
ROLE/DATABASE control these settings, that would be a good thing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fractions in GUC variables

2014-12-07 Thread John Gorman
This patch implements the first wiki/Todo Configuration Files item,
"Consider normalizing fractions in postgresql.conf, perhaps using '%'".

The "Fractions in GUC variables" discussion is here.

http://www.postgresql.org/message-id/467132cf.9020...@enterprisedb.com

This patch implements expressing GUC variables as percents in
postgresql.conf.

autovacuum_vacuum_scale_factor = 20%   # percent of table size before vacuum
autovacuum_analyze_scale_factor = 10%  # percent of table size before
analyze

As you can see the postgresql.conf file and the documentation read more
naturally.  I added a regression test to guc.sql. The sql interface also
accepts both numeric and percent forms although the percent form must be
quoted because '%' is an operator.

show cursor_tuple_fraction; -- 10%
set cursor_tuple_fraction = .15; -- 15%
set cursor_tuple_fraction = '33%'; -- 33%
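
For what it's worth, the kind of parsing this needs can be sketched in a few
lines of standalone C (illustrative only; the function name is made up and
this is not the code from guc_percent-v1.patch): accept either a plain
fraction or a '%'-suffixed value and normalize both to a fraction.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Parse "0.15" or "15%" (optionally with spaces before '%') into a
 * fraction in [0, 1].  Returns false on syntax or range errors. */
static bool
parse_fraction(const char *value, double *result)
{
    char   *end;
    double  val = strtod(value, &end);

    while (*end == ' ')
        end++;
    if (*end == '%')
    {
        val /= 100.0;
        end++;
    }
    if (end == value || *end != '\0' || val < 0.0 || val > 1.0)
        return false;
    *result = val;
    return true;
}

int
main(void)
{
    const char *inputs[] = {".15", "33%", "20 %", "150%"};

    for (int i = 0; i < 4; i++)
    {
        double frac = 0.0;
        bool   ok = parse_fraction(inputs[i], &frac);

        printf("%-5s -> %s %.2f\n", inputs[i], ok ? "ok   " : "error", frac);
    }
    return 0;
}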

I tagged four configuration variables to display as percents.

The attached patch applies cleanly against master and passes all regression
tests including two new tests in guc.sql.


guc_percent-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fractions in GUC variables

2014-12-07 Thread Josh Berkus
On 12/07/2014 11:48 AM, John Gorman wrote:
 
 This patch implements the first wiki/Todo Configuration Files item
 Consider normalizing fractions in postgresql.conf, perhaps using '%'.
 
 The Fractions in GUC variables discussion is here.
 

Oh, this is nice!  Thanks for working on it.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking commit timestamps

2014-12-07 Thread Noah Misch
On Wed, Dec 03, 2014 at 11:54:38AM -0300, Alvaro Herrera wrote:
 Pushed with some extra cosmetic tweaks.

The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003.  I have not checked other Windows
configurations; the suite does pass on GNU/Linux.
*** Z:/nm/postgresql/src/test/modules/commit_ts/expected/commit_timestamp.out   
2014-12-05 05:43:01.07442 +
--- Z:/nm/postgresql/src/test/modules/commit_ts/results/commit_timestamp.out
2014-12-05 08:24:13.094705200 +
***
*** 19,27 
  ORDER BY id;
   id | ?column? | ?column? | ?column? 
  +--+--+--
!   1 | t| t| t
!   2 | t| t| t
!   3 | t| t| t
  (3 rows)
  
  DROP TABLE committs_test;
--- 19,27 
  ORDER BY id;
   id | ?column? | ?column? | ?column? 
  +--+--+--
!   1 | t| f| t
!   2 | t| f| t
!   3 | t| f| t
  (3 rows)
  
  DROP TABLE committs_test;
***
*** 34,39 
  SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, 
x.timestamp < now() FROM pg_last_committed_xact() x;
   ?column? | ?column? | ?column? 
  --+--+--
!  t| t| t
  (1 row)
  
--- 34,39 
  SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, 
x.timestamp < now() FROM pg_last_committed_xact() x;
   ?column? | ?column? | ?column? 
  --+--+--
!  t| t| f
  (1 row)
  

==


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-12-07 Thread Simon Riggs
On 20 October 2014 at 10:57, Jim Nasby jim.na...@bluetreble.com wrote:

 Currently, a non-freeze vacuum will punt on any page it can't get a cleanup
 lock on, with no retry. Presumably this should be a rare occurrence, but I
 think it's bad that we just assume that and won't warn the user if something
 bad is going on.

(I'm having email problems, so I can't see later mails on this thread,
so replying here.)

Logging patch looks fine, but I would rather not add a line of text
for each VACUUM, just in case this is non-zero. I think we should add
that log line only if the blocks skipped > 0.

What I'm more interested in is what you plan to do with the
information once we get it?

The assumption that skipping blocks is something bad is strange. I
added it because VACUUM could and did regularly hang on busy tables,
which resulted in bloat because other blocks that needed cleaning
didn't get any attention.

Which is better, spend time obsessively trying to vacuum particular
blocks, or to spend the time on other blocks that are in need of
cleaning and are available to be cleaned?

Which is better, have autovacuum or system wide vacuum progress on to
other tables that need cleaning, or spend lots of effort retrying?

How do we know what is the best next action?

I'd really want to see some analysis of those things before we spend
even more cycles on this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2

2014-12-07 Thread Michael Paquier
On Wed, Dec 3, 2014 at 4:03 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Ping? This patch is in a stale state for a couple of weeks and still
 marked as waiting on author for this CF.
Marked as returned with feedback.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gist vacuum gist access

2014-12-07 Thread Michael Paquier
On Tue, Sep 9, 2014 at 12:47 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 09/08/2014 03:26 PM, Alexander Korotkov wrote:

 On Mon, Sep 8, 2014 at 12:51 PM, Heikki Linnakangas
 hlinnakan...@vmware.com

 wrote:


 On 09/08/2014 11:19 AM, Alexander Korotkov wrote:

 On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov
 aekorot...@gmail.com


 wrote:

   On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas 

 hlinnakan...@vmware.com wrote:

   In the b-tree code, we solved that problem back in 2006, so it can be
 done but requires a bit more code. In b-tree, we solved it with a vacuum
 cycle ID number that's set on the page halves when a page is split. That
 allows VACUUM to identify pages that have been split concurrently when it
 sees them, and jump back to vacuum them too. See commit
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80.
 It should be possible to do something similar in GiST, and in fact you
 might be able to reuse the NSN field that's already set on the page halves
 on split, instead of adding a new vacuum cycle ID.

 ...


  Another note. Assuming we have NSN which can play the role of vacuum cycle
  ID, can we implement sequential (with possible jump back) index scan for
  GiST?


 Yeah, I think it would work. It's pretty straightforward, the page split
 code already sets the NSN, just when we need it. Vacuum needs to memorize
 the current NSN when it begins, and whenever it sees a page with a higher
 NSN (or the FOLLOW_RIGHT flag is set), follow the right-link if it points
 to a lower-numbered page.
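
 For what it's worth, the jump-back idea can be sketched with a toy,
 self-contained model (plain C arrays standing in for GiST pages; none of
 this is the real GiST API): remember the NSN at the start of the scan, and
 when a page turns out to have been split after that point (newer NSN or the
 follow-right flag set) with its right-link pointing at an already-passed
 block, detour to that block before continuing.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NPAGES 6

/* Toy page: an NSN, a follow-right flag and a right-link (-1 if none). */
typedef struct
{
    uint64_t nsn;
    bool     follow_right;
    int      right_link;
} ToyPage;

static ToyPage pages[NPAGES] = {
    {10, false, -1},
    {12, false, -1},
    {11, false, -1},            /* revisited via page 4's right-link */
    {13, false, -1},
    {25, true,   2},            /* split after the scan started      */
    {14, false, -1},
};

int
main(void)
{
    uint64_t start_nsn = 20;    /* "memorized" when the scan begins */

    for (int blkno = 0; blkno < NPAGES; blkno++)
    {
        int cur = blkno;

        printf("vacuuming page %d\n", cur);

        /* Follow right-links of concurrently split pages, but only when
         * they point at a lower-numbered (already passed) block. */
        while ((pages[cur].nsn > start_nsn || pages[cur].follow_right) &&
               pages[cur].right_link >= 0 &&
               pages[cur].right_link < blkno)
        {
            cur = pages[cur].right_link;
            printf("  jump back: vacuuming page %d again\n", cur);
        }
    }
    return 0;
}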


  I mean the full index scan feature for SELECT queries might be implemented
  as well as sequential VACUUM.


 Oh, sorry, I missed that. If you implement a full-index scan like that, you
 might visit some tuples twice, so you'd have to somehow deal with the
 duplicates. For a bitmap index scan it would be fine.
This patch has been in a Wait on Author state for quite a long time
and Heikki has provided comments on it. Switching it to Returned with
feedback.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-12-07 Thread Michael Paquier
On Thu, Nov 27, 2014 at 10:16 AM, Bruce Momjian br...@momjian.us wrote:
 On Sat, Nov  8, 2014 at 05:56:00PM +0100, Andres Freund wrote:
 On 2014-11-08 11:52:43 -0500, Tom Lane wrote:
  Adding a similar
  level of burden to support a feature with a narrow use-case seems like
  a nonstarter from here.

 I don't understand this statement. In my experience the lack of a usable
 replication solution that allows temporary tables and major version
 differences is one of the most, if not *the* most, frequent criticisms
 of postgres I hear. How is this a narrow use case?

 How would replicating DDL handle cases where the master and slave
 servers have different major versions and the DDL is only supported by
 the Postgres version on the master server?  If it would fail, does this
 limit the idea that logical replication allows major version-different
 replication?
Marking this patch as Returned with feedback. Even leaving aside the
more fundamental argument of whether such a replication solution should
be in core or not (I am skeptical as well, btw), on the code base alone
I don't think that this patch is acceptable as-is without more jsonb
infrastructure for in-memory manipulation of containers (i.e. removing
the ObjTree stuff).
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-12-07 Thread Michael Paquier
On Wed, Dec 3, 2014 at 6:14 AM, Emre Hasegeli e...@hasegeli.com wrote:
 Actually, there's a second large problem with this patch: blindly
 iterating through all combinations of MCV and histogram entries makes the
 runtime O(N^2) in the statistics target.  I made up some test data (by
 scanning my mail logs) and observed the following planning times, as
 reported by EXPLAIN ANALYZE:

 explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
 explain analyze select * from relays r1, relays r2 where r1.ip && r2.ip;

 stats target    eqjoinsel    networkjoinsel

 100             0.27 ms      1.85 ms
 1000            2.56 ms      167.2 ms
 10000           56.6 ms      13987.1 ms

 I don't think it's necessary for network selectivity to be quite as
 fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
 time for a query that might need just milliseconds to execute :-(

 It seemed to me that it might be possible to reduce the runtime by
 exploiting knowledge about the ordering of the histograms, but
 I don't have time to pursue that right now.

 I made it break the loop when we have passed the last possible match. Patch
 attached.  It reduces the runtime almost 50% with large histograms.

 We can also make it use only some elements of the MCV and histogram
 for join estimation.  I have experimented with reducing the right and
 the left hand side of the lists on the previous versions.  I remember
 it was better to reduce only the left hand side.  I think it would be
 enough to use log(n) elements of the right hand side MCV and histogram.
 I can make the change, if you think selectivity estimation function
 is the right place for this optimization.
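
Just to illustrate the "stop once past the last possible match" idea with
something self-contained (plain sorted integers rather than inet MCVs and
histograms; this is not the patch's code): because both sides are sorted, the
inner loop can bail out as soon as the inner value exceeds the current outer
value, which is the kind of saving behind the roughly 50% reduction mentioned
above.

#include <stdio.h>

/* Count equal pairs between two sorted arrays, breaking out of the inner
 * loop once no further match is possible for the current outer value. */
static long
count_matches(const int *a, int na, const int *b, int nb)
{
    long matches = 0;

    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
        {
            if (b[j] > a[i])
                break;              /* passed the last possible match */
            if (b[j] == a[i])
                matches++;
        }
    return matches;
}

int
main(void)
{
    int a[] = {1, 3, 3, 7, 9};
    int b[] = {2, 3, 5, 7, 7, 11};

    printf("matching pairs: %ld\n", count_matches(a, 5, b, 6));
    return 0;
}
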
Marking as Returned with feedback as more work needs to be done.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support UPDATE table SET(*)=...

2014-12-07 Thread Michael Paquier
On Wed, Nov 26, 2014 at 4:26 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think what's likely missing here is a clear design for the raw parse
 tree representation (what's returned by the bison grammar).  The patch
 seems to be trying to skate by without creating any new parse node types
 or fields, but that may well be a bad idea.  At the very least there
 needs to be some commentary added to parsenodes.h explaining what the
 representation actually is (cf commentary there for MultiAssignRef).

 Also, I think it's a mistake not to be following the MultiAssignRef
 model for the case of (*) = ctext_row.  We optimize the ROW-source
 case at the grammar stage when there's a fixed number of target columns,
 because that's a very easy transformation --- but you can't do it like
 that when there's not.  It's possible that that optimization should be
 delayed till later even in the existing case; in general, optimizing
 in gram.y is a bad habit that's better avoided ...
Marking as returned with feedback based on those comments.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-07 Thread Michael Paquier
On Tue, Dec 2, 2014 at 2:14 PM, Jeff Davis pg...@j-davis.com wrote:
 On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
 On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote:
  I can also just move isReset there, and keep mem_allocated as a uint64.
  That way, if I find later that I want to track the aggregated value for
  the child contexts as well, I can split it into two uint32s. I'll hold
  off any such optimizations until I see some numbers from HashAgg
  though.

 I took a quick look at memory-accounting-v8.patch.

 Is there some reason why mem_allocated is a uint64? All other things
 being equal, I'd follow the example of tuplesort.c's
 MemoryContextAllocHuge() API, which (following bugfix commit
 79e0f87a1) uses int64 variables to track available memory and so on.

 No reason. New version attached; that's the only change.
Note that I am marking this patch back to the Needs Review state. It
doesn't seem that this patch has been reviewed completely.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-07 Thread Michael Paquier
On Sun, Dec 7, 2014 at 1:50 AM, Stephen Frost sfr...@snowman.net wrote:
 * Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
  I don't see any changes to the regression test files, were they
  forgotten in the patch?  I would think that at least the view definition
  changes would require updates to the regression tests, though perhaps
  nothing else.

 Hmmm... :-/ The regression tests that changed were in
 'src/test/regress/expected/rules.out' and should be near the bottom of the
 patch.

 Hah, looked just like changes to the system_views, sorry for the
 confusion. :)

  Overall, I'm pretty happy with the patch and would suggest moving on to
  writing up the documentation changes to go along with the code changes.
  I'll continue to play around with it but it all seems pretty clean to
  me and will allow us to easily add the additional role attributes being
  discussed.

 Sounds good.  I'll start on those changes next.
This patch (https://commitfest.postgresql.org/action/patch_view?id=1616)
has not been updated in the commitfest app for two months, making its
progress hard to track. However, even though some progress has been made,
things are still not complete (documentation, etc.). Opinions about
marking it as returned with feedback for the current commit fest and
creating a new entry for the next CF if this work is going to be
pursued?
I guess that it would be fine to simply move it to the next CF, but it
seems I cannot do that myself in the CF app.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-12-07 Thread Michael Paquier
On Sat, Dec 6, 2014 at 9:01 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 If you agree, then we should try to avoid this change in new behaviour.
Still seeing many concerns about this patch, so marking it as returned
with feedback. If possible, switching it to the next CF would be fine
I guess as this work is still being continued.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] detect custom-format dumps in psql and emit a useful error

2014-12-07 Thread Michael Paquier
On Fri, Oct 24, 2014 at 7:23 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-10-24 07:18:55 -0300, Alvaro Herrera wrote:
 Yeah, this patch is a lot more debatable than the other one.  I have
 pushed the first one without changing the error message.

 We could just test for toc.dat and then emit the warning...
One patch has been committed, and the second is debatable. Hence
marking this entry as returned with feedback in the CF app.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-12-07 Thread Petr Jelinek

On 05/12/14 16:49, Robert Haas wrote:

On Thu, Dec 4, 2014 at 8:45 AM, Michael Paquier
michael.paqu...@gmail.com wrote:

Here is patch which renames action_at_recovery_target to
recovery_target_action everywhere.

Thanks, Looks good to me.

A couple of things that would be good to document as well in
recovery-config.sgml:
- the fact that pause_at_recovery_target is deprecated.


Why not just remove it altogether?  I don't think the
backward-compatibility break is enough to get excited about, or to
justify the annoyance of carrying an extra parameter that does (part
of) the same thing.



Ok this patch does that, along with the rename to recovery_target_action 
and addition to the recovery.conf.sample.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index b4959ac..519a0ce 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -280,31 +280,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
   </listitem>
  </varlistentry>
 
- <varlistentry id="pause-at-recovery-target"
-   xreflabel="pause_at_recovery_target">
-  <term><varname>pause_at_recovery_target</varname> (<type>boolean</type>)
+ <varlistentry id="recovery-target-action"
+   xreflabel="recovery_target_action">
+  <term><varname>recovery_target_action</varname> (<type>enum</type>)
   <indexterm>
-    <primary><varname>pause_at_recovery_target</> recovery parameter</primary>
-  </indexterm>
-  </term>
-  <listitem>
-   <para>
-    Alias for action_at_recovery_target, <literal>true</> is same as
-    action_at_recovery_target = <literal>pause</> and <literal>false</>
-    is same as action_at_recovery_target = <literal>promote</>.
-   </para>
-   <para>
-    This setting has no effect if <xref linkend="guc-hot-standby"> is not
-    enabled, or if no recovery target is set.
-   </para>
-  </listitem>
- </varlistentry>
-
- <varlistentry id="action-at-recovery-target"
-   xreflabel="action_at_recovery_target">
-  <term><varname>action_at_recovery_target</varname> (<type>enum</type>)
-  <indexterm>
-    <primary><varname>action_at_recovery_target</> recovery parameter</primary>
+    <primary><varname>recovery_target_action</> recovery parameter</primary>
   </indexterm>
   </term>
   <listitem>
@@ -336,7 +316,7 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
    </para>
    <para>
     Note that because <filename>recovery.conf</> will not be renamed when
-    <varname>action_at_recovery_target</> is set to <literal>shutdown</>,
+    <varname>recovery_target_action</> is set to <literal>shutdown</>,
     any subsequent start will end with immediate shutdown unless the
     configuration is changed or the <filename>recovery.conf</> is removed
     manually.
diff --git a/doc/src/sgml/release-9.1.sgml b/doc/src/sgml/release-9.1.sgml
index 79a8b07..4fcc0b3 100644
--- a/doc/src/sgml/release-9.1.sgml
+++ b/doc/src/sgml/release-9.1.sgml
@@ -6301,8 +6301,7 @@
 
   <listitem>
    <para>
-    Add <filename>recovery.conf</> setting <link
-    linkend="pause-at-recovery-target"><varname>pause_at_recovery_target</></link>
+    Add <filename>recovery.conf</> setting <varname>pause_at_recovery_target</>
     to pause recovery at target (Simon Riggs)
    </para>

diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index 7657df3..42da62b 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -94,12 +94,13 @@
 #recovery_target_timeline = 'latest'
 #
 #
-# If pause_at_recovery_target is enabled, recovery will pause when
-# the recovery target is reached. The pause state will continue until
-# pg_xlog_replay_resume() is called. This setting has no effect if
-# hot standby is not enabled, or if no recovery target is set.
+# The recovery_target_action option can be set to either promote, paused
+# or shutdown and decides the behaviour of the server once the recovery_target
+# is reached. Promote will promote the server to standalone one. Pause will
+# pause the recovery, which can be then resumed by pg_xlog_replay_resume().
+# And shutdown will stop the server.
 #
-#pause_at_recovery_target = true
+#recovery_target_action = 'pause'
 #
 #---
 # STANDBY SERVER PARAMETERS
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index da28de9..469a83b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -229,7 +229,7 @@ static char *recoveryEndCommand = NULL;
 static char *archiveCleanupCommand = NULL;
 static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
 static bool recoveryTargetInclusive = true;
-static RecoveryTargetAction 

Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-12-07 Thread Michael Paquier
On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra t...@fuzzy.cz wrote:
 Thanks for the link. I've been looking for a good dataset with such
 data, and this one is by far the best one.

 The current version of the patch supports only data types passed by
 value (i.e. no varlena types - text, ), which means it's impossible to
 build multivariate stats on some of the interesting columns (state,
 city, ...).

 I guess it's time to start working on removing this limitation.
Tomas, what's your status on this patch? Are you planning to make it
more complicated than it is? For now I have switched it to a Needs
Review state because even your first version did not get advanced
review (that's quite big btw). I guess that we should switch it to the
next CF.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-12-07 Thread Michael Paquier
On Mon, Dec 8, 2014 at 9:59 AM, Petr Jelinek p...@2ndquadrant.com wrote:
 Ok this patch does that, along with the rename to recovery_target_action and
 addition to the recovery.conf.sample.
This needs a rebase as at least da71632 and b8e33a8 are conflicting.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-12-07 Thread Petr Jelinek

On 08/12/14 02:03, Michael Paquier wrote:

On Mon, Dec 8, 2014 at 9:59 AM, Petr Jelinek p...@2ndquadrant.com wrote:

Ok this patch does that, along with the rename to recovery_target_action and
addition to the recovery.conf.sample.

This needs a rebase as at least da71632 and b8e33a8 are conflicting.



Simon actually already committed something similar, so no need.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-12-07 Thread Petr Jelinek

On 08/12/14 02:06, Petr Jelinek wrote:

On 08/12/14 02:03, Michael Paquier wrote:

On Mon, Dec 8, 2014 at 9:59 AM, Petr Jelinek p...@2ndquadrant.com
wrote:

Ok this patch does that, along with the rename to
recovery_target_action and
addition to the recovery.conf.sample.

This needs a rebase as at least da71632 and b8e33a8 are conflicting.



Simon actually already committed something similar, so no need.



...except for the removal of pause_at_recovery_target it seems, so I 
attached just that


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 31473cd..07b4856 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -280,26 +280,6 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
   </listitem>
  </varlistentry>
 
- <varlistentry id="pause-at-recovery-target"
-   xreflabel="pause_at_recovery_target">
-  <term><varname>pause_at_recovery_target</varname> (<type>boolean</type>)
-  <indexterm>
-    <primary><varname>pause_at_recovery_target</> recovery parameter</primary>
-  </indexterm>
-  </term>
-  <listitem>
-   <para>
-    Alias for recovery_target_action, <literal>true</> is same as
-    recovery_target_action = <literal>pause</> and <literal>false</>
-    is same as recovery_target_action = <literal>promote</>.
-   </para>
-   <para>
-    This setting has no effect if <xref linkend="guc-hot-standby"> is not
-    enabled, or if no recovery target is set.
-   </para>
-  </listitem>
- </varlistentry>
-
  <varlistentry id="action-at-recovery-target"
    xreflabel="recovery_target_action">
   <term><varname>recovery_target_action</varname> (<type>enum</type>)
diff --git a/doc/src/sgml/release-9.1.sgml b/doc/src/sgml/release-9.1.sgml
index 79a8b07..5cbfc4a 100644
--- a/doc/src/sgml/release-9.1.sgml
+++ b/doc/src/sgml/release-9.1.sgml
@@ -6301,8 +6301,8 @@
 
   <listitem>
    <para>
-    Add <filename>recovery.conf</> setting <link
-    linkend="pause-at-recovery-target"><varname>pause_at_recovery_target</></link>
+    Add <filename>recovery.conf</> setting
+    <varname>pause_at_recovery_target</>
     to pause recovery at target (Simon Riggs)
    </para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0f09add..6370aed 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4653,7 +4653,6 @@ readRecoveryCommandFile(void)
 	ConfigVariable *item,
 			   *head = NULL,
 			   *tail = NULL;
-	bool		recoveryPauseAtTargetSet = false;
 	bool		recoveryTargetActionSet = false;
 
 
@@ -4699,25 +4698,6 @@ readRecoveryCommandFile(void)
 					(errmsg_internal("archive_cleanup_command = '%s'",
 									 archiveCleanupCommand)));
 		}
-		else if (strcmp(item->name, "pause_at_recovery_target") == 0)
-		{
-			bool		recoveryPauseAtTarget;
-
-			if (!parse_bool(item->value, &recoveryPauseAtTarget))
-				ereport(ERROR,
-						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-						 errmsg("parameter \"%s\" requires a Boolean value", "pause_at_recovery_target")));
-
-			ereport(DEBUG2,
-					(errmsg_internal("pause_at_recovery_target = '%s'",
-									 item->value)));
-
-			recoveryTargetAction = recoveryPauseAtTarget ?
-									 RECOVERY_TARGET_ACTION_PAUSE :
-									 RECOVERY_TARGET_ACTION_PROMOTE;
-
-			recoveryPauseAtTargetSet = true;
-		}
 		else if (strcmp(item->name, "recovery_target_action") == 0)
 		{
 			if (strcmp(item->value, "pause") == 0)
@@ -4902,17 +4882,6 @@ readRecoveryCommandFile(void)
 							RECOVERY_COMMAND_FILE)));
 	}
 
-	/*
-	 * Check for mutually exclusive parameters
-	 */
-	if (recoveryPauseAtTargetSet && recoveryTargetActionSet)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("cannot set both \"%s\" and \"%s\" recovery parameters",
-						"pause_at_recovery_target",
-						"recovery_target_action"),
-				 errhint("The \"pause_at_recovery_target\" is deprecated.")));
-
 
 	/*
 	 * Override any inconsistent requests. Not that this is a change

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Status of Commit fest 2014-10

2014-12-07 Thread Michael Paquier
Hi all,

I have been through a certain number of patches and marked the ones
that needed some love from their authors as returned with feedback (at
least the ones marked as such in the CF app), but you may have noticed
that :)

This email is a call to move on and close soon the CF currently open
and to move to the next one, which normally begins in one week. Hence,
to committers, there are actually 11 patches waiting for some
attention:
- Use faster, higher precision timer API GetSystemTimeAsFileTime on windows
- Compression of Full Page Writes
- REINDEX SCHEMA
- Add IF NOT EXISTS to CREATE TABLE AS and CREATE MATERIALIZED VIEW
- pgcrypto: support PGP signatures
- Point to polygon distance operator
- Jsonb generator functions
- json strip nulls functions
- pg_basebackup vs. Windows and tablespaces
- New operators for pgbench expressions, with modulo
- Refactor of functions related to quoting from builtins.h to utils/quote.h

To patch authors: please ping your reviewers (IF NOT NULL) if you
think that you did not get enough feedback.
To patch reviewers: please update the status of the patch on the CF
app if you have done your part of the work.

At this point I think that it may be preferable to move the
work-in-progress items to the next CF.

Thanks,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-12-07 Thread Michael Paquier
On Mon, Dec 8, 2014 at 10:18 AM, Petr Jelinek p...@2ndquadrant.com wrote:
 On 08/12/14 02:06, Petr Jelinek wrote:
 Simon actually already committed something similar, so no need.


 ...except for the removal of pause_at_recovery_target it seems, so I
 attached just that
Thanks! Removal of this parameter is getting at least two votes, and
nobody has spoken against it, so IMO we should remove it now rather than
later, or else I'm sure we would simply forget it. My 2c.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking commit timestamps

2014-12-07 Thread Petr Jelinek

On 08/12/14 00:56, Noah Misch wrote:

On Wed, Dec 03, 2014 at 11:54:38AM -0300, Alvaro Herrera wrote:

Pushed with some extra cosmetic tweaks.


The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003.  I have not checked other Windows
configurations; the suite does pass on GNU/Linux.



Hmm, I wonder if < now() needs to be changed to <= now() in those
queries to make them work correctly on that platform. I don't have a
machine with that environment handy right now, so I would appreciate it if
you could try that; in case you don't have time for it, I will try to
set up something later...
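
For what it's worth, the effect is easy to reproduce with a standalone sketch
(C, with one-second granularity standing in for the coarse Windows timer, and
obviously not the regression test itself): when the clock is coarse enough,
the commit timestamp and now() can compare equal, so < fails where <= would
not.

#include <stdio.h>
#include <time.h>

int
main(void)
{
    /* time() has one-second resolution, so two back-to-back reads are
     * usually equal -- the same effect a coarse platform timer has on
     * a commit timestamp compared against now(). */
    time_t committed = time(NULL);
    time_t now = time(NULL);

    printf("committed <  now : %s\n", committed < now ? "t" : "f");
    printf("committed <= now : %s\n", committed <= now ? "t" : "f");
    return 0;
}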


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fractions in GUC variables

2014-12-07 Thread Michael Paquier
On Mon, Dec 8, 2014 at 4:48 AM, John Gorman johngorm...@gmail.com wrote:
 The attached patch applies cleanly against master and passes all regression
 tests including two new tests in guc.sql.
Please be sure to register your patch to the upcoming commit fest,
this way it will not fall into oblivion and will get some feedback:
https://commitfest.postgresql.org/action/commitfest_view?id=25
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-12-07 Thread Dilip kumar
On 06 December 2014 20:01 Amit Kapila Wrote

I wanted to understand what exactly the above loop is doing.

a.
first of all the comment on top of it says Some of the slot
are free, ..., if some slot is free, then why do you want
to process the results? (Do you mean to say that *None* of
the slot is free?)

This comment is wrong, I will remove this.

b.
IIUC, you have called function select_loop(maxFd, slotset)
to check if socket descriptor is readable, if yes then why
in do..while loop the same maxFd is checked always, don't
you want to check different socket descriptors?  I am not sure
if I am missing something here

select_loop(maxFd, slotset)

maxFd is the highest descriptor among all the sets, and slotset contains all
the descriptors, so if any descriptor gets a message select_loop will return;
once it returns, we need to check how many descriptors have received a message
from the server, so we loop over them and process the results.

So it's not only for maxFd, it's for all the descriptors. And it's in a
do..while loop because it is possible that select_loop returns due to some
intermediate message on one of the sockets while the query is still not
complete; if none of the sockets is free yet (which we check in the for loop
below), we go back to select_loop again.


c.
After checking the socket descriptor for maxFd why you want
to run the below for loop for all slots?
for (i = 0; i < max_slot; i++)
After the select loop returns, it's possible that we have got results on
multiple connections, so consume the input and check each one: if it is still
busy there is nothing to do, but if it has finished, process the result and
mark the connection free. And if any of the connections is free, we will break
out of the do..while loop.
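
The pattern being described can be sketched in a self-contained way with
select(2) (pipes stand in for the libpq sockets; none of the names below come
from the vacuumdb patch). One select() call watches every descriptor at once,
maxFd is only the numerically highest descriptor, and after it returns each
slot is checked individually; a negative return is also inspected so a genuine
error is not confused with a cancel.

#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/select.h>

#define NSLOTS 3

int
main(void)
{
    int     fds[NSLOTS];
    int     maxFd = -1;
    fd_set  slotset;

    /* Stand-ins for per-connection sockets: read ends of pipes, two of
     * which already have a "result" waiting. */
    for (int i = 0; i < NSLOTS; i++)
    {
        int p[2];

        if (pipe(p) < 0)
            return 1;
        fds[i] = p[0];
        if (i != 1)                      /* leave slot 1 "busy" */
            (void) write(p[1], "x", 1);
        if (fds[i] > maxFd)
            maxFd = fds[i];
    }

    for (;;)
    {
        int rc;

        /* Rebuild the set every time: select() modifies it in place. */
        FD_ZERO(&slotset);
        for (int i = 0; i < NSLOTS; i++)
            FD_SET(fds[i], &slotset);

        rc = select(maxFd + 1, &slotset, NULL, NULL, NULL);
        if (rc < 0 && errno == EINTR)
            continue;                    /* interrupted, retry */
        if (rc < 0)
        {
            perror("select");            /* real error, not a cancel */
            return 1;
        }
        break;
    }

    /* Several descriptors may be ready after a single select(). */
    for (int i = 0; i < NSLOTS; i++)
        printf("slot %d: %s\n", i,
               FD_ISSET(fds[i], &slotset) ? "result ready" : "still busy");
    return 0;
}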







From: Amit Kapila [mailto:amit.kapil...@gmail.com]
Sent: 06 December 2014 20:01
To: Dilip kumar
Cc: Magnus Hagander; Alvaro Herrera; Jan Lentfer; Tom Lane; 
PostgreSQL-development; Sawada Masahiko; Euler Taveira
Subject: Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP 
]


On Mon, Dec 1, 2014 at 12:18 PM, Dilip kumar 
dilip.ku...@huawei.commailto:dilip.ku...@huawei.com wrote:

 On 24 November 2014 11:29, Amit Kapila Wrote,


I have verified that all previous comments are addressed and
the new version is much better than previous version.


 here we are setting each target once and doing for all the tables..

Hmm, theoretically I think new behaviour could lead to more I/O in
certain cases as compare to existing behaviour.  The reason for more I/O
is that in the new behaviour, while doing Analyze for a particular table at
different targets, in-between it has Analyze of different table as well,
so the pages in shared buffers or OS cache for a particular table need to
be reloaded again for a new target, whereas currently it will do all stages
of Analyze for a particular table in one-go which means that each stage
of Analyze could get benefit from the pages of a table loaded by previous
stage.  If you agree, then we should try to avoid this change in new
behaviour.


 Please provide you opinion.

I have a few questions regarding function GetIdleSlot()

+ static int
+ GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
+             const char *progname, bool completedb)
{
..
+    /*
+     * Some of the slot are free, Process the results for slots whichever
+     * are free
+     */
+
+    do
+    {
+        SetCancelConn(pSlot[0].connection);
+
+        i = select_loop(maxFd, &slotset);
+
+        ResetCancelConn();
+
+        if (i < 0)
+        {
+            /*
+             * This can only happen if user has sent the cancel request using
+             * Ctrl+C, Cancel is handled by 0th slot, so fetch the error result.
+             */
+
+            GetQueryResult(pSlot[0].connection, dbname, progname,
+                           completedb);
+            return NO_SLOT;
+        }
+
+        Assert(i != 0);
+
+        for (i = 0; i < max_slot; i++)
+        {
+            if (!FD_ISSET(pSlot[i].sock, &slotset))
+                continue;
+
+            PQconsumeInput(pSlot[i].connection);
+            if (PQisBusy(pSlot[i].connection))
+                continue;
+
+            pSlot[i].isFree = true;
+
+            if (!GetQueryResult(pSlot[i].connection, dbname, progname,
+                                completedb))
+                return NO_SLOT;
+
+            if (firstFree < 0)
+                firstFree = i;
+        }
+    } while (firstFree < 0);
}

I wanted to understand what exactly the above loop is doing.

a.
first of all the comment on top of it 

Re: [HACKERS] inherit support for foreign tables

2014-12-07 Thread Etsuro Fujita

(2014/12/07 2:02), David Fetter wrote:

On Thu, Dec 04, 2014 at 12:35:54PM +0900, Etsuro Fujita wrote:

But I think
there would be another idea.  An example will be shown below.  We show the
update commands below the ModifyTable node, not above the corresponding
ForeignScan nodes, so maybe less confusing.  If there are no objections of
you and others, I'll update the patch this way.

postgres=# explain verbose update parent set a = a * 2 where a = 5;
  QUERY PLAN
------------------------------------------------------------------
  Update on public.parent  (cost=0.00..280.77 rows=25 width=10)
On public.ft1
  Remote SQL: UPDATE public.mytable_1 SET a = $2 WHERE ctid = $1

^^
It occurs to me that the command generated by the FDW might well not
be SQL at all, as is the case with file_fdw and anything else that
talks to a NoSQL engine.

Would it be reasonable to call this Remote command or something
similarly generic?


Yeah, but I'd like to propose that this line be shown by the FDW API 
(ie, ExplainForeignModify) as in non-inherited update cases, so that the 
FDW developer can choose the right name.  As for "On public.ft1", I'd 
like to propose that the FDW API also show that, by calling a function 
introduced into the PG core for that purpose (would it be better to use 
"For" rather than "On"?).


Sorry, my explanation was not enough.

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2014-12-07 Thread Simon Riggs
 On Thu, Dec 4, 2014 at 8:37 PM, Michael Paquier wrote
 I pondered something that Andres mentioned upthread: we may not do the
compression in WAL record only for blocks, but also at record level. Hence
joining the two ideas together I think that we should definitely have
a different
GUC to control the feature, consistently for all the images. Let's call it
wal_compression, with the following possible values:
- on, meaning that a maximum of compression is done, for this feature
basically full_page_writes = on.
- full_page_writes, meaning that full page writes are compressed
- off, default value, to disable completely the feature.
This would let room for another mode: 'record', to completely compress
a record. For now though, I think that a simple on/off switch would be
fine for this patch. Let's keep things simple.

+1 for a separate parameter for compression

Some changed thoughts to the above

* parameter should be SUSET - it doesn't *need* to be set only at
server start since all records are independent of each other

* ideally we'd like to be able to differentiate the types of usage.
which then allows the user to control the level of compression
depending upon the type of action. My first cut at what those settings
should be are ALL > LOGICAL > PHYSICAL > VACUUM.

VACUUM - only compress while running vacuum commands
PHYSICAL - only compress while running physical DDL commands (ALTER
TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
typically be used for logical decoding
LOGICAL - compress FPIs for record types that change tables
ALL - all user commands
(each level includes all prior levels)
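
A sketch of how such an ordered setting could be tested, under the assumption
that the levels are a simple ordered enum (the names and helper below are made
up for illustration; they are not from any posted patch):

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical ordered levels: each one includes everything below it. */
typedef enum
{
    COMPRESS_NONE = 0,
    COMPRESS_VACUUM,            /* vacuum commands only                  */
    COMPRESS_PHYSICAL,          /* + physical DDL (CREATE INDEX, ...)    */
    COMPRESS_LOGICAL,           /* + record types that change tables     */
    COMPRESS_ALL                /* all user commands                     */
} WalCompressionLevel;

static bool
should_compress_fpi(WalCompressionLevel setting, WalCompressionLevel action)
{
    /* Because the enum is ordered, "includes all prior levels" is just >=. */
    return setting != COMPRESS_NONE && setting >= action;
}

int
main(void)
{
    WalCompressionLevel setting = COMPRESS_PHYSICAL;

    printf("vacuum FPI compressed:       %d\n",
           should_compress_fpi(setting, COMPRESS_VACUUM));
    printf("CREATE INDEX FPI compressed: %d\n",
           should_compress_fpi(setting, COMPRESS_PHYSICAL));
    printf("plain UPDATE FPI compressed: %d\n",
           should_compress_fpi(setting, COMPRESS_LOGICAL));
    return 0;
}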

* name should not be wal_compression - we're not compressing all wal
records, just fpis. There is no evidence that we even want to compress
other record types, nor that our compression mechanism is effective at
doing so. Simple = keep name as compress_full_page_writes
Though perhaps we should have it called wal_compression_level

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing DDL deparsing support

2014-12-07 Thread Ian Barwick
On 14/12/07 12:43, Bruce Momjian wrote:
 On Tue, Dec  2, 2014 at 03:13:07PM -0300, Alvaro Herrera wrote:
 Robert Haas wrote:
 On Thu, Nov 27, 2014 at 11:43 PM, Ian Barwick i...@2ndquadrant.com wrote:

 A simple schedule to demonstrate this is available; execute from the
 src/test/regress/ directory like this:

 ./pg_regress \
   --temp-install=./tmp_check \
   --top-builddir=../../.. \
   --dlpath=. \
   --schedule=./schedule_ddl_deparse_demo

 I haven't read the code, but this concept seems good to me.

 Excellent, thanks.

 It has the unfortunate weakness that a difference could exist during
 the *middle* of the regression test run that is gone by the *end* of
 the run, but our existing pg_upgrade testing has the same weakness, so
 I guess we can view this as one more reason not to be too aggressive
 about having regression tests drop the unshared objects they create.

 Agreed.  Not dropping objects also helps test pg_dump itself; the normal
 procedure there is run the regression tests, then pg_dump the regression
 database.  Objects that are dropped never exercise their corresponding
 pg_dump support code, which I think is a bad thing.  I think we should
 institute a policy that regression tests must keep the objects they
 create; maybe not all of them, but at least a sample large enough to
 cover all interesting possibilities.
 
 This causes creation DDL to be checked if it is used in the regression
 database, but what about ALTER and DROP?  pg_dump doesn't issue those,
 except in special cases like inheritance.

Sure, pg_dump won't contain ALTER/DROP DDL; we are using pg_dump
after replaying the DDL commands to compare the actual state of the
database with the expected state.

As I'm in the middle of writing these tests, before I go any further
do you accept the tests need to be included?


Regards

Ian Barwick

-- 
 Ian Barwick   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2014-12-07 Thread Michael Paquier
On Mon, Dec 8, 2014 at 11:30 AM, Simon Riggs si...@2ndquadrant.com wrote:
 * parameter should be SUSET - it doesn't *need* to be set only at
 server start since all records are independent of each other
Check.

 * ideally we'd like to be able to differentiate the types of usage.
 which then allows the user to control the level of compression
 depending upon the type of action. My first cut at what those settings
 should be are ALL  LOGICAL  PHYSICAL  VACUUM.
 VACUUM - only compress while running vacuum commands
 PHYSICAL - only compress while running physical DDL commands (ALTER
 TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
 typically be used for logical decoding
 LOGICAL - compress FPIs for record types that change tables
 ALL - all user commands
 (each level includes all prior levels)

Well, that's clearly an optimization so I don't think this should be
done for a first shot but those are interesting fresh ideas.
Technically speaking, note that we would need to support such things
with a new API to switch a new context flag in registered_buffers of
xloginsert.c for each block, and decide if the block is compressed
based on this context flag, and the compression level wanted.

 * name should not be wal_compression - we're not compressing all wal
 records, just fpis. There is no evidence that we even want to compress
 other record types, nor that our compression mechanism is effective at
 doing so. Simple = keep name as compress_full_page_writes
 Though perhaps we should have it called wal_compression_level

I don't really like those new names, but I'd prefer
wal_compression_level if we go down that road with 'none' as default
value. We may still decide in the future to support compression at the
record level instead of context level, particularly if we have an API
able to do palloc_return_null_at_oom, so the idea of WAL compression
is not related only to FPIs IMHO.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-07 Thread Adam Brightwell
Michael,

This patch (https://commitfest.postgresql.org/action/patch_view?id=1616)
 has not been updated in the commitfest app for two months, making its
 progress hard to track.


I believe that the mentioned patch should be considered 'on hold' or
'dependent' upon the acceptance of the work that is being done in this
thread.

Also, the changes proposed by this thread have already been added to the
next commitfest (https://commitfest.postgresql.org/action/patch_view?id=1651
).

However, even if some progress has been made
 things are still not complete (documentation, etc.). Opinions about
 marking that as returned with feedback for the current commit fest and
 create a new entry for the next CF if this work is going to be
 pursued?

I guess that it would be fine simply move it to the next CF, but it
 seems I cannot do that myself in the CF app.


This work will certainly continue to be pursued.  If a simple move is
possible/acceptable, then I think that would be the best option.  I can
handle that as it would appear that I am capable of moving it, if that is
acceptable.

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-12-07 Thread Amit Kapila
On Mon, Dec 8, 2014 at 7:33 AM, Dilip kumar dilip.ku...@huawei.com wrote:
 On 06 December 2014 20:01 Amit Kapila Wrote

 I wanted to understand what exactly the above loop is doing.



 a.

 first of all the comment on top of it says Some of the slot

 are free, ..., if some slot is free, then why do you want

 to process the results? (Do you mean to say that *None* of

 the slot is free?)



 This comment is wrong, I will remove this.


I suggest that rather than removing it, you edit the comment to indicate
the idea behind the code at that place.

 b.

 IIUC, you have called function select_loop(maxFd, slotset)

 to check if socket descriptor is readable, if yes then why

 in do..while loop the same maxFd is checked always, don't

 you want to check different socket descriptors?  I am not sure

 if I am missing something here



 select_loop(maxFd, slotset)


 So it’s not only for a maxFd, it’s for all the descriptors. And it’s in
do..while loop, because it possible that select_loop come out because of
some intermediate message on any of the socket but still query is not
complete,


Okay, I think this part of code is somewhat similar to what
is done in pg_dump/parallel.c with some differences related
to handling of inAbort.  One thing I have noticed here which
could lead to a problem is that caller of select_loop() function
assumes that return value is less than zero only if there is a
cancel request which I think is wrong, because select system
call could also return -1 in case of error.  I am referring below
code in above context:

+ if (i < 0)
+ {
+ /*
+  * This can only happen if user has sent the cancel request using
+  * Ctrl+C, Cancel is handled by 0th slot, so fetch the error result.
+  */
+
+ GetQueryResult(pSlot[0].connection, dbname, progname,
+completedb);



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2014-12-07 Thread Amit Kapila
On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost sfr...@snowman.net wrote:

 * Amit Kapila (amit.kapil...@gmail.com) wrote:
  1. As the patch currently stands, it just shares the relevant
  data (like relid, target list, block range each worker should
  perform on etc.) to the worker and then worker receives that
  data and form the planned statement which it will execute and
  send the results back to master backend.  So the question
  here is do you think it is reasonable or should we try to form
  the complete plan for each worker and then share the same
  and may be other information as well like range table entries
  which are required.   My personal gut feeling in this matter
  is that for long term it might be better to form the complete
  plan of each worker in master and share the same, however
  I think the current way as done in patch (okay that needs
  some improvement) is also not bad and quite easier to implement.

 For my 2c, I'd like to see it support exactly what the SeqScan node
 supports and then also what Foreign Scan supports.  That would mean we'd
 then be able to push filtering down to the workers which would be great.
 Even better would be figuring out how to parallelize an Append node
 (perhaps only possible when the nodes underneath are all SeqScan or
 ForeignScan nodes) since that would allow us to then parallelize the
 work across multiple tables and remote servers.

 One of the big reasons why I was asking about performance data is that,
 today, we can't easily split a single relation across multiple i/o
 channels.  Sure, we can use RAID and get the i/o channel that the table
 sits on faster than a single disk and possibly fast enough that a single
 CPU can't keep up, but that's not quite the same.  The historical
 recommendations for Hadoop nodes is around one CPU per drive (of course,
 it'll depend on workload, etc, etc, but still) and while there's still a
 lot of testing, etc, to be done before we can be sure about the 'right'
 answer for PG (and it'll also vary based on workload, etc), that strikes
 me as a pretty reasonable rule-of-thumb to go on.

 Of course, I'm aware that this won't be as easy to implement..

  2. Next question related to above is what should be the
  output of ExplainPlan, as currently worker is responsible
  for forming its own plan, Explain Plan is not able to show
  the detailed plan for each worker, is that okay?

 I'm not entirely following this.  How can the worker be responsible for
 its own plan when the information passed to it (per the above
 paragraph..) is pretty minimal?

Because for a simple sequence scan that much information is sufficient:
basically, if we have the scanrelid, target list, qual and the RTE (primarily
the relOid), then the worker can form and perform the sequence scan.
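
In other words, the bare minimum shared with each worker could be something
like the struct below (purely illustrative; these are not the names used in
the patch, and the usual backend types from postgres.h and nodes/pg_list.h
are assumed):

/* Illustrative only -- not the structure actually used by the patch. */
typedef struct ParallelScanTask
{
    Oid         relid;          /* relation to scan, taken from the RTE */
    Index       scanrelid;      /* range table index of the scan */
    List       *targetlist;     /* expressions to compute for each tuple */
    List       *qual;           /* filter clauses, if any are pushed down */
    BlockNumber startblock;     /* first block this worker should scan */
    BlockNumber endblock;       /* block at which this worker should stop */
} ParallelScanTask;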

 In general, I don't think we need to
 have specifics like this worker is going to do exactly X because we
 will eventually need some communication to happen between the worker and
 the master process where the worker can ask for more work because it's
 finished what it was tasked with and the master will need to give it
 another chunk of work to do.  I don't think we want exactly what each
 worker process will do to be fully formed at the outset because, even
 with the best information available, given concurrent load on the
 system, it's not going to be perfect and we'll end up starving workers.
 The plan, as formed by the master, should be more along the lines of
 this is what I'm gonna have my workers do along w/ how many workers,
 etc, and then it goes and does it.

I think what you are saying here is that work allocation for workers should be
dynamic rather than fixed, which I think makes sense; however, we can try
such an optimization after gathering some initial performance data.
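
As a sketch of what dynamic allocation could look like, each worker could
pull its next chunk of blocks from a shared counter (all names below are
made up, the chunk size is arbitrary, and the usual backend headers such as
storage/spin.h are assumed):

#define SCAN_CHUNK_SIZE 64      /* arbitrary number of blocks per request */

typedef struct ParallelScanShared
{
    slock_t     mutex;          /* protects nextblock */
    BlockNumber nextblock;      /* next block not yet handed out */
    BlockNumber nblocks;        /* total number of blocks in the relation */
} ParallelScanShared;

/* Called by a worker whenever it finishes its current chunk. */
static bool
get_next_chunk(ParallelScanShared *shared, BlockNumber *start, BlockNumber *end)
{
    bool        found = false;

    SpinLockAcquire(&shared->mutex);
    if (shared->nextblock < shared->nblocks)
    {
        *start = shared->nextblock;
        *end = Min(shared->nextblock + SCAN_CHUNK_SIZE, shared->nblocks);
        shared->nextblock = *end;
        found = true;
    }
    SpinLockRelease(&shared->mutex);

    return found;
}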

 Perhaps for an 'explain analyze' we
 return information about what workers actually *did* what, but that's a
 whole different discussion.


Agreed.

  3. Some places where optimizations are possible:
  - Currently after getting the tuple from heap, it is deformed by
  worker and sent via message queue to master backend, master
  backend then forms the tuple and send it to upper layer which
  before sending it to frontend again deforms it via
slot_getallattrs(slot).

 If this is done as I was proposing above, we might be able to avoid
 this, but I don't know that it's a huge issue either way..  The bigger
 issue is getting the filtering pushed down.

  - Master backend currently receives the data from multiple workers
  serially.  We can optimize in a way that it can check other queues,
  if there is no data in current queue.

 Yes, this is pretty critical.  In fact, it's one of the recommendations
 I made previously about how to change the Append node to parallelize
 Foreign Scan node work.

  - Master backend is just responsible for coordination among workers
  It shares the required information to workers and then fetch the
  data processed by each worker, by using some more logic, we might
  be able to make 

Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote
Hi Robert,

 From: Robert Haas [mailto:robertmh...@gmail.com]
 On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
 langote_amit...@lab.ntt.co.jp wrote:
  So, we're going to support exactly two levels of partitioning?
  partitions with partissub=false and subpartitions with partissub=true?
   Why not support only one level of partitioning here but then let the
  children have their own pg_partitioned_rel entries if they are
  subpartitioned?  That seems like a cleaner design and lets us support
  an arbitrary number of partitioning levels if we ever need them.
 
  Yeah, that's what I thought at some point in favour of dropping partissub
 altogether. However, not that this design solves it, there is one question -
 would we want to support defining both the partition key and the sub-partition
 key for a table in advance? That is, without having defined a first-level
 partition yet; in that case, what level do we associate the sub-(sub-)
 partitioning key with, or, more to the point, where do we keep it?
 
 Do we really need to allow that?  I think you let people partition a
 toplevel table, and then partition its partitions once they've been
 created.  I'm not sure there's a good reason to associate the
 subpartitioning scheme with the toplevel table.  For one thing, that
 forces all subpartitions to be partitioned the same way - do we want
 to insist on that?  If we do, then I agree that we need to think a
 little harder here.
 

To me, it sounds better if we insist on a uniform subpartitioning scheme across 
all partitions. It seems that's how it's done elsewhere. It would be 
interesting to hear what others think though.

  That would be a default partition. That is, where the tuples that don't
 belong elsewhere (other defined partitions) go. VALUES clause of the
 definition for such a partition would look like:
 
  (a range partition) ... VALUES LESS THAN MAXVALUE
  (a list partition) ... VALUES DEFAULT
 
  There has been discussion about whether there shouldn't be such a place
 for tuples to go. That is, it should generate an error if a tuple can't go
 anywhere (or support auto-creating a new one like in interval partitioning?)
 
 I think Alvaro's response further down the thread is right on target.
 But to go into a bit more detail, let's consider the three possible
 cases:
 
 - Hash partitioning.  Every key value gets hashed to some partition.
 The concept of an overflow or default partition doesn't even make
 sense.
 
 - List partitioning.  Each key for which the user has defined a
 mapping gets sent to the corresponding partition.  The keys that
 aren't mapped anywhere can either (a) cause an error or (b) get mapped
 to some default partition.  It's probably useful to offer both
 behaviors.  But I don't think it requires a partitionisoverflow
 column, because you can represent it some other way, such as by making
 partitionvalues NULL, which is otherwise meaningless.
 
 - Range partitioning.  In this case, what you've basically got is a
 list of partition bounds and a list of target partitions.   Suppose
 there are N partition bounds; then there will be N+1 targets.  Some of
 those targets can be undefined, meaning an attempt to insert a key
 with that value will error out.  For example, suppose the user defines
 a partition for values 1-3 and 10-13.  Then your list of partition
 bounds looks like this:
 
 1,3,10,13
 
 And your list of destinations looks like this:
 
 undefined,firstpartition,undefined,secondpartition,undefined
 
 More commonly, the ranges will be contiguous, so that there are no
 gaps.  If you have everything < 10 in the first partition, everything
 10-20 in the second partition, and everything else in a third
 partition, then you have bounds 10,20 and destinations
 firstpartition,secondpartition,thirdpartition.  If you want values
 greater than 20 to error out, then you have bounds 10,20 and
 destinations firstpartition,secondpartition,undefined.
 
 In none of this do you really have an overflow partition.  Rather,
 the first and last destinations, if defined, catch everything that has
 a key lower than the lowest key or higher than the highest key.  If
 not defined, you error out.

So just to clarify, first and last destinations are considered defined if you 
have something like:

...
PARTITION p1 VALUES LESS THAN 10
PARTITION p2 VALUES BETWEEN 10 AND 20
PARTITION p3 VALUES GREATER THAN 20
...

And not defined if:

...
PARTITION p1 VALUES BETWEEN 10 AND 20
...

In the second case, because no explicit definitions for values less than 10 or
greater than 20 are in place, rows with such values error out? If so, that makes
sense.
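
Put differently, I imagine the tuple routing for range partitioning would
then boil down to a search over the bounds array, roughly like the sketch
below (not actual code; an int4 key and "less than" bound semantics are
assumed purely for illustration):

/*
 * Illustrative sketch: nbounds partition bounds give nbounds + 1
 * destinations; a destination of InvalidOid means "undefined", so a key
 * landing there raises an error.
 */
static Oid
find_range_partition(int32 key, const int32 *bounds, int nbounds,
                     const Oid *destinations)
{
    int         lo = 0;
    int         hi = nbounds;

    /* binary search for the first bound strictly greater than the key */
    while (lo < hi)
    {
        int         mid = (lo + hi) / 2;

        if (key < bounds[mid])
            hi = mid;
        else
            lo = mid + 1;
    }

    return destinations[lo];    /* caller errors out if this is InvalidOid */
}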

 
  I wonder if your suggestion of pg_node_tree plays well here. This then
 could be a list of CONSTs or some such... And I am thinking it's a concern 
 only
 for range partitions, no? (that is, a multicolumn partition key)
 
 I guess you could list or hash partition on multiple columns, too.
 And yes, this is why I though of pg_node_tree.
 
   * DDL syntax (no 

Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote


From: Amit Kapila [mailto:amit.kapil...@gmail.com] 
Sent: Saturday, December 06, 2014 5:00 PM
To: Robert Haas
Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
Subject: Re: [HACKERS] On partitioning

On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
 langote_amit...@lab.ntt.co.jp wrote:

  I wonder if your suggestion of pg_node_tree plays well here. This then 
  could be a list of CONSTs or some such... And I am thinking it's a concern 
  only for range partitions, no? (that is, a multicolumn partition key)

 I guess you could list or hash partition on multiple columns, too.

 How would you distinguish values in list partition for multiple
 columns? I mean for range partition, we are sure there will
 be either one value for each column, but for list it could
 be multiple and not fixed for each partition, so I think it will not
 be easy to support the multicolumn partition key for list
 partitions.

Irrespective of difficulties of representing it using pg_node_tree, it seems to 
me that multicolumn list partitioning is not widely used. It is used in 
combination with range or hash partitioning as composite partitioning. So, 
perhaps we need not worry about that.

Regards,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote


From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
Sent: Saturday, December 06, 2014 5:06 PM
To: Robert Haas
Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
Subject: Re: [HACKERS] On partitioning

On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila amit.kapil...@gmail.com wrote:
  Do we really need to support dml or pg_dump for individual partitions?

 I think we do.  It's quite reasonable for a DBA (or developer or
 whatever) to want to dump all the data that's in a single partition;
 for example, maybe they have the table partitioned, but also spread
 across several servers.  When the data on one machine grows too big,
 they want to dump that partition, move it to a new machine, and drop
 the partition from the old machine.  That needs to be easy and
 efficient.

 More generally, with inheritance, I've seen the ability to reference
 individual inheritance children be a real life-saver on any number of
 occasions.  Now, a new partitioning system that is not as clunky as
 constraint exclusion will hopefully be fast enough that people don't
 need to do it very often any more.  But I would be really cautious
 about removing the option.  That is the equivalent of installing a new
 fire suppression system and then boarding up the emergency exit.
 Yeah, you *hope* the new fire suppression system is good enough that
 nobody will ever need to go out that way any more.  But if you're
 wrong, people will die, so getting rid of it isn't prudent.  The
 stakes are not quite so high here, but the principle is the same.

 
 Sure, I don't feel we should not provide any way to take a dump
 of an individual partition, just not at the level of an independent table.
 Maybe something like --table table_name
 --partition partition_name.
 

This does sound cleaner.

 In general, I think we should try to avoid exposing that partitions are
 individual tables, as that might hinder any future enhancement in that
 area (for example, if someone finds a different and better way to
 arrange the partition data, then due to the currently exposed syntax,
 we might feel blocked). 

Sounds like a valid concern. I guess you are referring to whether we allow a 
partition relation to be included in the range table, among some other cases. 
In the former case we could allow referring to individual partitions by some 
additional syntax, if it doesn’t end up looking too ugly or invite a bunch of 
other issues.

This seems to have been discussed a little bit upthread (for example, see Open 
Questions in Alvaro's original proposal and Hannu Krosing's reply). 

Regards,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2014-12-07 Thread Ashutosh Bapat
On Sat, Dec 6, 2014 at 9:21 PM, Noah Misch n...@leadboat.com wrote:

 On Thu, Dec 04, 2014 at 10:00:14AM +0530, Ashutosh Bapat wrote:
  On Thu, Dec 4, 2014 at 9:05 AM, Etsuro Fujita 
 fujita.ets...@lab.ntt.co.jp wrote:
   (2014/12/03 19:35), Ashutosh Bapat wrote:
   On Tue, Dec 2, 2014 at 8:29 AM, Etsuro Fujita
    fujita.ets...@lab.ntt.co.jp
 wrote:
IIUC, even the transactions over the local and the *single* remote
   server are not guaranteed to be executed atomically in the current
   form.  It is possible that the remote transaction succeeds and the
   local one fails, for example, resulting in data inconsistency
   between the local and the remote.
  
  
IIUC, while committing transactions involving a single remote server,
   the steps taken are as follows
   1. the local changes are brought to PRE-COMMIT stage, which means that
   the transaction *will* succeed locally after successful completion of
   this phase,
   2. COMMIT message is sent to the foreign server
   3. If step two succeeds, local changes are committed and successful
   commit is conveyed to the client
   4. if step two fails, local changes are rolled back and abort status
 is
   conveyed to the client
   5. If step 1 itself fails, the remote changes are rolled back.
   This is as per one phase commit protocol which guarantees ACID for
   single foreign data source. So, the changes involving local and a
 single
   foreign server seem to be atomic and consistent.
  
  
   Really?  Maybe I'm missing something, but I don't think the current
   implementation for committing transactions has such a mechanism stated
 in
   step 1.  So, I think it's possible that the local transaction fails in
   step3 while the remote transaction succeeds, as mentioned above.
  
  
  PFA a script attached which shows this. You may want to check the code in
  pgfdw_xact_callback() for actions taken by postgres_fdw on various
 events.
  CommitTransaction() for how those events are generated. The code there
  complies with the sequence above.

 While postgres_fdw delays remote commit to eliminate many causes for having
 one server commit while another aborts, other causes remain.  The local
 transaction could still fail due to WAL I/O problems or a system crash.
 2PC
 is the reliable answer, but that was explicitly out of scope for the
 initial
 postgres_fdw write support.  Does this inheritance patch add any atomicity
 problem that goes away when one breaks up the inheritance hierarchy and
 UPDATEs each table separately?  If not, this limitation is okay.


If the UPDATEs crafted after breaking up the inheritance hierarchy need
to be run within the same transaction (as the UPDATE on the inheritance
hierarchy would do), then yes, there is an atomicity problem.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-07 Thread Michael Paquier
On Mon, Dec 8, 2014 at 12:34 PM, Adam Brightwell
adam.brightw...@crunchydatasolutions.com wrote:
 Michael,

 This patch (https://commitfest.postgresql.org/action/patch_view?id=1616)
 has not been updated in the commitfest app for two months, making its
 progress hard to track.


 I believe that the mentioned patch should be considered 'on hold' or
 'dependent' upon the acceptance of the work that is being done in this
 thread.

 Also, the changes proposed by this thread have already been added to the
 next commitfest
 (https://commitfest.postgresql.org/action/patch_view?id=1651).

 However, even if some progress has been made
 things are still not complete (documentation, etc.). Opinions about
 marking that as returned with feedback for the current commit fest and
 create a new entry for the next CF if this work is going to be
 pursued?

 I guess that it would be fine to simply move it to the next CF, but it
 seems I cannot do that myself in the CF app.

 This work will certainly continue to be pursued.  If a simple move is
 possible/acceptable, then I think that would be the best option.  I can
 handle that as it would appear that I am capable of moving it, if that is
 acceptable.
Yes please. Actually I could have done it, just found the option to do
so. Let's see what shows up with your work.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Kapila
On Mon, Dec 8, 2014 at 11:01 AM, Amit Langote langote_amit...@lab.ntt.co.jp
wrote:
 From: Amit Kapila [mailto:amit.kapil...@gmail.com]
 Sent: Saturday, December 06, 2014 5:00 PM
 To: Robert Haas
 Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
 Subject: Re: [HACKERS] On partitioning

 On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas robertmh...@gmail.com
wrote:
  On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
  langote_amit...@lab.ntt.co.jp wrote:
 
   I wonder if your suggestion of pg_node_tree plays well here. This
then could be a list of CONSTs or some such... And I am thinking it's a
concern only for range partitions, no? (that is, a multicolumn partition
key)
 
  I guess you could list or hash partition on multiple columns, too.
 
  How would you distinguish values in list partition for multiple
  columns? I mean for range partition, we are sure there will
  be either one value for each column, but for list it could
  be multiple and not fixed for each partition, so I think it will not
  be easy to support the multicolumn partition key for list
  partitions.

 Irrespective of difficulties of representing it using pg_node_tree, it
seems to me that multicolumn list partitioning is not widely used.

So I think it is better to be clear about why we are not planning to
support it: is it because it is not required by users, because the
code seems to be tricky, or both?  It might help us if anyone raises this
during the development of this patch, or in general if someone
requests such a feature.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-07 Thread Rahila Syed
The important point to consider for this patch is the use of the additional
2-bytes as uint16 in the block information structure to save the length of a
compressed block, which may be compressed without its hole to achieve a
double level of compression (image compressed without its hole). We may use
a simple flag on one or two bits using for example a bit from hole_length,
but in this case we would need to always compress images with their hole
included, something more expensive as the compression would take more time.
Regarding the idea you have mentioned here of using bits from existing fields
rather than adding an additional 2 bytes to the header:
FWIW, elaborating slightly on the way it was done in the initial patches,
we can use the following struct

 unsigned    hole_offset:15,
             compress_flag:2,
             hole_length:15;

Here compress_flag can be 0 or 1 depending on the status of compression. We
can reduce compress_flag to just a 1-bit flag.

IIUC, the purpose of adding the compress_len field in the latest patch is
to store the length of compressed blocks, which is used at the time of decoding
the blocks.

With this approach, the length of a compressed block can be stored in hole_length
as,

 hole_length = BLCKSZ - compress_len.

Thus, hole_length can serve the purpose of storing the length of a compressed
block without the need for an additional 2 bytes.  In DecodeXLogRecord,
hole_length can then be used to track the length of the data received for
both compressed and uncompressed blocks.

As you already mentioned, this will require compressing images with the hole
included, but we can MemSet the hole to 0 in order to make compressing the
hole less expensive and more effective.
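
To illustrate, the encoding side could then fill the bit fields shown above
roughly as below (BkpBlockHdr and the helper are just placeholders, not the
actual record-assembly code):

/*
 * Sketch only: store the compressed length in hole_length instead of a
 * separate compress_len field.  BkpBlockHdr stands for the struct with the
 * three bit fields shown above.
 */
static void
fill_backup_block_header(BkpBlockHdr *bkpb, bool compressed,
                         unsigned hole_offset, unsigned hole_length,
                         unsigned compress_len)
{
    bkpb->hole_offset = hole_offset;
    if (compressed)
    {
        bkpb->compress_flag = 1;
        /* decode side recovers compress_len as BLCKSZ - hole_length */
        bkpb->hole_length = BLCKSZ - compress_len;
    }
    else
    {
        bkpb->compress_flag = 0;
        bkpb->hole_length = hole_length;    /* usual meaning of the field */
    }
}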



Thank you,

Rahila Syed

On Sat, Dec 6, 2014 at 7:37 PM, Michael Paquier michael.paqu...@gmail.com
wrote:


 On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund and...@2ndquadrant.com
 wrote:

 On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
  On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier 
 michael.paqu...@gmail.com
  wrote:
 
  
  
  
   On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed rahilasyed...@gmail.com
 
   wrote:
  
   I attempted quick review and could not come up with much except this
  
    +   /*
    +    * Calculate the amount of FPI data in the record. Each backup block
    +    * takes up BLCKSZ bytes, minus the hole length.
    +    *
    +    * XXX: We peek into xlogreader's private decoded backup blocks for the
    +    * hole_length. It doesn't seem worth it to add an accessor macro for
    +    * this.
    +    */
    +   fpi_len = 0;
    +   for (block_id = 0; block_id <= record->max_block_id; block_id++)
    +   {
    +       if (XLogRecHasCompressedBlockImage(record, block_id))
    +           fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
  
  
   IIUC, fpi_len in case of compressed block image should be
  
    fpi_len = record->blocks[block_id].compress_len;
  
   Yep, true. Patches need a rebase btw as Heikki fixed a commit related
 to
   the stats of pg_xlogdump.
  
 
  In any case, any opinions to switch this patch as Ready for committer?

 Needing a rebase is an obvious conflict to that... But I guess some wider
 looks afterwards won't hurt.


 Here are rebased versions, which are patches 1 and 2. And I am switching
 as well the patch to Ready for Committer. The important point to consider
 for this patch is the use of the additional 2-bytes as uint16 in the block
 information structure to save the length of a compressed block, which may
 be compressed without its hole to achieve a double level of compression
 (image compressed without its hole). We may use a simple flag on one or two
 bits using for example a bit from hole_length, but in this case we would
 need to always compress images with their hole included, something more
 expensive as the compression would take more time.

 Robert wrote:
  What I would suggest is instrument the backend with getrusage() at
  startup and shutdown and have it print the difference in user time and
  system time.  Then, run tests for a fixed number of transactions and
  see how the total CPU usage for the run differs.
 That's a nice idea, which is done with patch 3 as a simple hack calling
 getrusage twice, at the beginning of PostgresMain and before proc_exit,
 calculating the time difference and logging it for each process
 (log_line_prefix with %p was used as well).

 Then I just did a small test with a load of a pgbench-scale-100 database
 on fresh instances:
 1) Compression = on:
 Stop LSN: 0/487E49B8
 getrusage: proc 11163: LOG:  user diff: 63.071127, system diff: 10.898386
 pg_xlogdump: FPI size: 122296653 [90.52%]
 2) Compression = off
 Stop LSN: 0/4E54EB88
 Result: proc 11648: LOG:  user diff: 43.855212, system diff: 7.857965
 pg_xlogdump: FPI size: 204359192 [94.10%]
 And the CPU consumption is showing quite some difference... I'd expect as
 well pglz_compress to show up high in a perf profile for this case (don't
 have the time to do that now, but a perf record -a -g would be 

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-07 Thread Michael Paquier
On Mon, Dec 8, 2014 at 3:42 PM, Rahila Syed rahilasye...@gmail.com wrote:

 The important point to consider for this patch is the use of the additional
 2-bytes as uint16 in the block information structure to save the length of a
 compressed block, which may be compressed without its hole to achieve a
 double level of compression (image compressed without its hole). We may use
 a simple flag on one or two bits using for example a bit from hole_length,
 but in this case we would need to always compress images with their hole
 included, something more expensive as the compression would take more time.
 As you have mentioned here the idea to use bits from existing fields rather
 than adding additional 2 bytes in header,
 FWIW elaborating slightly on the way it was done in the initial patches,
 We can use the following struct

  unsigned    hole_offset:15,
              compress_flag:2,
              hole_length:15;

 Here  compress_flag can be 0 or 1 depending on status of compression. We can
 reduce the compress_flag to just 1 bit flag.
Just adding that this is fine as the largest page size that can be set is 32k.

 IIUC, the purpose of adding compress_len field in the latest patch is to
 store length of compressed blocks which is used at the time of decoding the
 blocks.

 With this approach, length of compressed block can be stored in hole_length
 as,

  hole_length = BLCKSZ - compress_len.

 Thus, hole_length can serve the purpose of storing length of a compressed
 block without the need of additional 2-bytes.  In DecodeXLogRecord,
 hole_length can be used for tracking the length of data received in cases of
 both compressed as well as uncompressed blocks.

 As you already mentioned, this will need compressing images with hole but
 we can MemSet hole to 0 in order to make compression of hole less expensive
 and effective.

Thanks for coming back to this point in more detail, this is very
important. The additional 2 bytes make compression less expensive
by allowing the hole to be ignored, at the cost of a bit more data in each
record. Using uint16 is also a cleaner code style, more in line with the
other fields, but that's a personal opinion ;)

Doing a switch from one approach to the other is easy enough though,
so let's see what others think.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote

From: Amit Kapila [mailto:amit.kapil...@gmail.com] 
   How would you distinguish values in list partition for multiple
   columns? I mean for range partition, we are sure there will
   be either one value for each column, but for list it could
   be multiple and not fixed for each partition, so I think it will not
   be easy to support the multicolumn partition key for list
   partitions.

 Irrespective of difficulties of representing it using pg_node_tree, it seems 
 to me that multicolumn list partitioning is not widely used.
 
 So I think it is better to be clear why we are not planning to
 support it, is it that because it is not required by users or
 is it due to the reason that code seems to be tricky or is it due
 to both of the reasons.  It might help us if anyone raises this
 during the development of this patch or in general if someone
 requests such a feature.

Coming back to how the pg_node_tree representation for list partitions might look - 

For each column in a multicolumn list partition key, a value would look like a 
dumped Node for a List of Consts (all allowed values in a given list partition). 
The whole key would then be a List of such Nodes (a dump thereof). That's 
perhaps pretty verbose, but I guess it's meant to be only a catalog 
representation. During relcache building, we turn this back into a collection 
of structs so we can efficiently locate the partition of interest, whatever the 
method of doing that ends up being (based on partition type). The relcache step 
ensures that the concern of quickly locating an interesting partition is 
decoupled from its catalog representation.
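
To sketch what such a relcache-level structure might look like (names
entirely made up, simplified to a single key column, and assuming the usual
backend types):

/*
 * Illustrative only -- a possible in-memory form of list partitioning
 * info, decoupled from the pg_node_tree text stored in the catalog.
 */
typedef struct ListPartitionInfo
{
    Oid         partrelid;      /* OID of this partition */
    int         nvalues;        /* number of allowed key values */
    Datum      *values;         /* allowed key values, sorted for bsearch */
} ListPartitionInfo;

typedef struct ListPartitionMap
{
    int                 npartitions;
    ListPartitionInfo  *partitions;
    Oid                 defaultpart;    /* InvalidOid if no default partition */
} ListPartitionMap;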

Of course, there may be flaws in this picture that would only reveal themselves 
when actually trying to implement it, unless they can be pointed out in advance.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers