Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-11 Thread Amit Kapila
On Fri, Feb 10, 2017 at 6:45 AM, Peter Geoghegan  wrote:
> On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan  wrote:
>>> which isn't an issue here, but reinforces my point about the (badly
>>> documented) assumption that we don't release locks on user relations
>>> early.
>>

I think even if we don't need to retain locks till transaction end in
the present case, Andres's point has merit: it is good to follow the
undocumented rule of releasing locks on user relations only at
transaction end, unless there is some concrete benefit to releasing
them early in a particular situation.  Following the rule avoids
unnecessary worry about whether early release is safe in a given case.

>> You are right about the substantive issue, I assume, but I have a hard
>> time agreeing with the idea that it's even badly documented when there
>> are at least 10 counter-examples of blithely doing this. In any case,
>> I find it very confusing when you talk about things as established
>> practice/coding standards when they're very clearly not consistently
>> adhered to at all. You should at least say "let's not make a bad
>> situation any worse", or something, so that I don't need to spend 10
>> minutes pulling my hair out.
>
> BTW, aren't there cases where a PARALLEL SAFE function that needs to
> acquire locks on some arbitrary relation not referenced in the query
> can get locks on its own, which may only last as long as the parallel
> worker's lifetime? This could happen *despite* the fact that the
> author of the function may have imagined that callers could not
> release relation level locks early (i.e., by not releasing them
> directly, and so correctly following this "badly documented
> assumption").
>
> It seems like the existence of PARALLEL RESTRICTED is primarily
> motivated by making stuff that definitely needs the locks to last
> until xact end being sure that that happens -- the docs say so.
>

I don't think so.  I think one of the primary reasons for releasing
locks in workers is that we don't have any way to transfer them to the
leader, which could release them at transaction end.

> This
> seems wholly inconsistent with the idea that you're not supposed to
> let that happen under any circumstances. I find all this highly
> confusing. Have I missed some further subtlety that applies with
> PARALLEL RESTRICTED?
>

The basic idea of parallel restricted is that an expression is
considered parallel restricted if it can't be executed in a worker --
for example, a reference to an initplan in a query.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Index Scans

2017-02-11 Thread Amit Kapila
On Fri, Feb 10, 2017 at 11:27 PM, Robert Haas  wrote:
> On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila  wrote:
>>> The hunk in indexam.c looks like a leftover that can be omitted.
>>
>> It is not a leftover hunk.  Earlier, the patch had the same check in
>> btparallelrescan, but based on your comment upthread [1] (Why does
>> btparallelrescan cater to the case where scan->parallel_scan == NULL?),
>> it has been moved to indexam.c.
>
> That's not my point.  The point is, if it's not a parallel scan,
> index_parallelrescan() should never get called in the first place.  So
> therefore it shouldn't need to worry about scan->parallel_scan being
> NULL.  It seems possibly reasonable to put something like
> Assert(scan->parallel_scan != NULL) in there, but arranging to return
> without doing anything in that case seems like it just masks possible
> bugs in the calling code.  And really I'm not sure we even need the
> Assert.
>

I have removed the check from index_parallelrescan() and ensured in
the caller that it is called only when a parallel scan descriptor is
present.
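In code form, the caller-side guard is just (a sketch, not the exact
hunk):

	/*
	 * Only forward the rescan to the parallel machinery when a
	 * parallel scan descriptor actually exists.
	 */
	if (scan->parallel_scan != NULL)
		index_parallelrescan(scan);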

Comments from another e-mail:
> This design presupposes that every AM that might ever want to do
> parallel scans is happy with genericcostestimate()'s method of
> computing the number of index pages that will be fetched.  However,
> that might not be true for every possible AM.  In fact, it's already
> not true for BRIN, which always reads the whole index.  Now, since
> we're only concerned with btree at the moment, nothing would be
> immediately broken by this approach but it seems we're just setting
> ourselves up for future pain.
>

I think that to make it future-proof, we could add a generic page
estimation function.  However, I have tried an implementation based on
your suggestion below; see if that looks better to you, otherwise we
can introduce a generic page estimation API.

> I have what I think might be a better idea: let's make
> amcostestimate() responsible for returning a suggested parallel degree
> for the path via an additional out parameter.  cost_index() can then
> reduce that value if it seems like not enough heap pages will be
> fetched to justify the return value, or it can override it completely
> if parallel_degree is set for the relation.
>

I see the value of your idea, but I think it might be better to return
the number of index pages to be scanned from amcostestimate() and then
use the already-computed value of heap_pages in cost_index() to derive
the number of workers.  That keeps the calculation simple and avoids
overriding the return value.  The returned count of index pages will
only be used when a partial path is being constructed, but I think
that is okay, because we are not doing any extra work to compute the
number of index pages fetched.  One could argue that the number of
index pages used for parallelism can be different from the number of
pages used in the cost computation, but that is also easy to change
later, when we create partial paths for indexes other than btree.  I
have implemented the above idea in the attached patch
(parallel_index_opt_exec_support_v9.patch).
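To sketch what I mean (a hypothetical helper, not the actual patch
code; the name and the scaling constants are illustrative only):

#include "postgres.h"
#include "storage/block.h"

static int
compute_index_parallel_workers(BlockNumber heap_pages,
							   BlockNumber index_pages,
							   int max_workers)
{
	int			workers = 0;
	BlockNumber pages = heap_pages + index_pages;
	BlockNumber threshold = 1000;	/* assumed minimum to parallelize */

	while (pages >= threshold && workers < max_workers)
	{
		workers++;
		/* each additional worker demands roughly 3x more pages */
		if (threshold > pages / 3)
			break;
		threshold *= 3;
	}

	return workers;
}

cost_index() would call something like this with the index-page count
returned by amcostestimate() and its already-computed heap_pages.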

>  Then we don't need to run
> this logic twice to compute the same value, and we don't violate the
> AM abstraction layer.
>

We can avoid computing the same value twice; however, with your
suggested approach we would do all the additional work even in cases
where employing parallel workers is not beneficial, so I am not sure
there is a net gain.

> BTW, the changes to add_partial_path() aren't needed, because an
> IndexPath only gets reused if you stick a Bitmap{Heap,And,Or}Path on
> top of it, and that won't be the case with this or any other pending
> patch.  If we get the ability to have a Parallel Bitmap Heap Scan that
> takes a parallel index scan rather than a standard index scan as
> input, then we'll need something like this.  But for now it's probably
> best to just update the comments and remove the Assert().
>

Right, changed as per suggestion.

> I think you can also leave out the documentation changes from these
> patches.  I'll do some general rewriting of the parallel query section
> once we know exactly what capabilities we'll have in this release; I
> think that will work better than trying to update them a bit at a time
> for each patch.
>

Okay, removed the documentation part.

Patches to be used: guc_parallel_index_scan_v1.patch [1],
parallel_index_scan_v8.patch, parallel_index_opt_exec_support_v9.patch


[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BTnM4pXQbvn7OXqam%2Bk_HZqb0ROZUMxOiL6DWJYCyYow%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


parallel_index_scan_v8.patch
Description: Binary data


parallel_index_opt_exec_support_v9.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-11 Thread Tom Lane
Jim Nasby  writes:
> On 2/8/17 5:54 PM, Tom Lane wrote:
>> Maybe it'd be better to imagine this as something closer to planagg.c,
>> that is it knows how to apply a specific high-level optimization that
>> might or might not be a win, so it builds a path describing that and sees
>> if it looks cheaper than a path done the normal way.  The fact that
>> we could even build a path describing a union is something that wasn't
>> there before 9.6, but maybe there's enough infrastructure for that now.

> It's encouraging that there's at least a template to follow there... If 
> there is missing infrastructure, is it likely to be major? My uninformed 
> guess would be that the pathification was the major hurdle.

I wrote a POC patch for this on a long airplane ride.  It's not complete,
and I'm sure there are bugs as well, but it makes your example case
better.  What I did about the de-duplication issue is to de-dup using
the CTIDs of all baserels in the query.  This means the optimization
is only applicable to cases where all the rels have CTIDs ... but other
methods such as inspecting unique keys ain't gonna work for non-table
rels either, so I think this is about the best we can hope for.
However, I did not understand your point about:

> BTW, there's an important caveat here: users generally do NOT want 
> duplicate rows from the fact table if the dimension table results aren't 
> unique.

so maybe we should be thinking about some other way entirely?

Anyway, what's attached below produces this on your example query:

 Aggregate  (cost=38.12..38.13 rows=1 width=8) (actual time=0.158..0.158 rows=1 loops=1)
   ->  Result  (cost=38.03..38.11 rows=4 width=0) (actual time=0.151..0.155 rows=3 loops=1)
         ->  Unique  (cost=38.03..38.07 rows=4 width=18) (actual time=0.150..0.154 rows=3 loops=1)
               ->  Sort  (cost=38.03..38.04 rows=4 width=18) (actual time=0.150..0.151 rows=4 loops=1)
                     Sort Key: f.ctid, d1.ctid, d2.ctid
                     Sort Method: quicksort  Memory: 25kB
                     ->  Append  (cost=4.85..37.99 rows=4 width=18) (actual time=0.070..0.106 rows=4 loops=1)
                           ->  Nested Loop Left Join  (cost=4.85..19.00 rows=2 width=18) (actual time=0.069..0.075 rows=2 loops=1)
                                 ->  Nested Loop  (cost=4.57..18.37 rows=2 width=16) (actual time=0.035..0.038 rows=2 loops=1)
                                       ->  Index Scan using dim_t_key on dim d1  (cost=0.28..8.29 rows=1 width=10) (actual time=0.009..0.009 rows=1 loops=1)
                                             Index Cond: ('1'::text = t)
                                       ->  Bitmap Heap Scan on fact f  (cost=4.30..10.05 rows=2 width=14) (actual time=0.023..0.025 rows=2 loops=1)
                                             Recheck Cond: (f1 = d1.s)
                                             Heap Blocks: exact=2
                                             ->  Bitmap Index Scan on f_f1  (cost=0.00..4.29 rows=2 width=0) (actual time=0.016..0.016 rows=2 loops=1)
                                                   Index Cond: (f1 = d1.s)
                                 ->  Index Scan using dim_pkey on dim d2  (cost=0.28..0.31 rows=1 width=10) (actual time=0.016..0.017 rows=0 loops=2)
                                       Index Cond: (f.f2 = s)
                           ->  Nested Loop Left Join  (cost=4.85..19.00 rows=2 width=18) (actual time=0.025..0.029 rows=2 loops=1)
                                 ->  Nested Loop  (cost=4.57..18.37 rows=2 width=16) (actual time=0.022..0.025 rows=2 loops=1)
                                       ->  Index Scan using dim_t_key on dim d2  (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
                                             Index Cond: ('1'::text = t)
                                       ->  Bitmap Heap Scan on fact f  (cost=4.30..10.05 rows=2 width=14) (actual time=0.017..0.019 rows=2 loops=1)
                                             Recheck Cond: (f2 = d2.s)
                                             Heap Blocks: exact=2
                                             ->  Bitmap Index Scan on f_f2  (cost=0.00..4.29 rows=2 width=0) (actual time=0.014..0.014 rows=2 loops=1)
                                                   Index Cond: (f2 = d2.s)
                                 ->  Index Scan using dim_pkey on dim d1  (cost=0.28..0.31 rows=1 width=10) (actual time=0.001..0.001 rows=0 loops=2)
                                       Index Cond: (f.f1 = s)

The main remaining piece of work here is that, as you can see from the
above, it fails to eliminate joins to tables that we don't actually need
in a particular UNION arm.  This is because the references to those
tables' ctid columns prevent analyzejoins.c from removing the joins.
I've thought about ways to deal with that but haven't come up with
anything that wasn't pretty ugly ...

Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)

2017-02-11 Thread Greg Stark
On 11 February 2017 at 23:45, Corey Huinker  wrote:
> So you'd just want to know nesting depth, with no indicator of true/false?

Even that's more than bash does, for example:

$ if true ; then
> if false ; then
> :
> fi
> fi

I'm a bit confused about how the true/false is actually valuable. It
doesn't tell you how the expression actually evaluated, just where you
are in the code you're typing, which you can tell equally well by
looking at the code itself. The reason a nesting level is handy is
just to remind you in case you forget.

For debugging scripts it would be handy to have some way to tell
whether the \if expression actually evaluated to true or false, but I
don't think that belongs in the prompt.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)

2017-02-11 Thread Corey Huinker
On Sat, Feb 11, 2017 at 5:57 PM, Greg Stark  wrote:

> On 10 February 2017 at 21:36, Fabien COELHO  wrote:
> >> command   prompt is now
> >> ---   ---
> >> \echo bob ''   = initial state, no branch going on at all
> >> \if yes   't' = inside a true branch
> >> \if no    'tf' = false inside a true
> >> \endif    't' = back to just the true branch
> >> \if yes   'tt'
> >> \if yes   'ttt'
> >> \if yes   '...ttt' = only show the last 3, but let it be known that
> >> there's at least one more'
> >> \else '...ttz' = past the point of a true bit of this branch
> >
> >
> > I like the "tfz" idea. I'm not sure whether up to 6 characters is a
> > good idea, though.
>
>
> I haven't been following this thread but just skimming through it for
> the first time I thought this was more baroque than I was expecting. I
> was expecting something like a { for each level of nested \if you're
> in, so you can see how deep you are. I didn't expect to see anything
> more complex than that.
>

So you'd just want to know nesting depth, with no indicator of true/false?


Re: [HACKERS] Parallel Index Scans

2017-02-11 Thread Amit Kapila
On Sat, Feb 11, 2017 at 9:41 PM, Robert Haas  wrote:
> On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila  wrote:
>>> Why can't we rely on _bt_walk_left?
>>
>> The reason is mentioned in comments, but let me try to explain with
>> some example.  When you reach that point of code, it means that either
>> the current page (assume page number is 10) doesn't contain any
>> matching items or it is a half-dead page, both of which indicate that
>> we have to move to the previous page.   Now, before checking if the
>> current page contains matching items, we signal parallel machinery
>> (via _bt_parallel_release) to allow workers to read the previous page
>> (assume previous page number is 9).  So it is quite possible that
>> after deciding that current page (page number 10) doesn't contain any
>> matching tuples if we directly move to the previous page (in this case
>> it will be 9) by using _bt_walk_left, some other worker would have
>> read page 9.  In short, if we directly use _bt_walk_left(), then we
>> are prone to returning some of the values twice as multiple workers
>> can read the same page.
>
> But ... the entire point of the seize-and-release stuff is to avoid
> this problem.  You're supposed to seize the scan, read the current
> page, walk left, store the page you find in the scan, and then release
> the scan.
>

Exactly, and that is what is done in the patch.  Basically, if we find
that the current page is half-dead or doesn't contain any matching
items, then we release the current buffer, seize the scan, read the
current page, walk left, and so on.  I am slightly confused here,
because it seems both of us agree on what the right thing to do is,
and as far as I can see that is how it is implemented.  Are you just
confirming that I have implemented it as discussed, or do you see a
problem with the way it is implemented?
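To spell the protocol out in code form, a hedged sketch (hypothetical
function names, argument lists simplified; not the literal patch):

extern bool parallel_scan_seize(IndexScanDesc scan, BlockNumber *pageno);
extern BlockNumber walk_left(IndexScanDesc scan, BlockNumber blkno);
extern void parallel_scan_release(IndexScanDesc scan, BlockNumber pageno);
extern bool read_page_backward(IndexScanDesc scan, BlockNumber pageno);

/*
 * Step the parallel scan one page to the left.  Exactly one participant
 * advances the shared position at a time; everyone else blocks in the
 * seize call until the new page has been published.
 */
static bool
step_left_parallel(IndexScanDesc scan)
{
	BlockNumber current;
	BlockNumber prev;

	/* Seize: wait until we are the one allowed to advance the scan. */
	if (!parallel_scan_seize(scan, &current))
		return false;			/* whole scan already finished */

	/* Walk left from the seized position to find the previous page. */
	prev = walk_left(scan, current);

	/* Publish the new position; other workers may now seize it. */
	parallel_scan_release(scan, prev);

	/* Read matching items from 'prev' locally, outside the critical section. */
	return read_page_backward(scan, prev);
}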


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)

2017-02-11 Thread Greg Stark
On 10 February 2017 at 21:36, Fabien COELHO  wrote:
>> command   prompt is now
>> ---   ---
>> \echo bob ''   = initial state, no branch going on at all
>> \if yes   't' = inside a true branch
>> \if no    'tf' = false inside a true
>> \endif    't' = back to just the true branch
>> \if yes   'tt'
>> \if yes   'ttt'
>> \if yes   '...ttt' = only show the last 3, but let it be known that
>> there's at least one more'
>> \else '...ttz' = past the point of a true bit of this branch
>
>
> I like the "tfz" idea. I'm not sure whether up to 6 characters is a
> good idea, though.


I haven't been following this thread but just skimming through it for
the first time I thought this was more baroque than I was expecting. I
was expecting something like a { for each level of nested \if you're
in, so you can see how deep you are. I didn't expect to see anything
more complex than that.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)

2017-02-11 Thread Corey Huinker
On Sat, Feb 11, 2017 at 3:48 PM, Fabien COELHO  wrote:

>
> Just realized that '?' means "unknown transactional status" in %x. That
>> might cause confusion if a person had a prompt of %x%R. Is that enough
>> reason to pick a different cue?
>>
>
> Argh.
>
> "\?\.?[tfz]" seems distinctive enough. Note that %R uses "'=-*^!$( and %x
> uses *!?, which means that they already share 2 characters, so adding ?
> does not seem like a big issue if it was not one before.
>
> Otherwise, maybe "&" or "%", but it is less about a condition.


Fair enough, it shouldn't be too confusing then.

The get_prompt() function can see the global pset, obviously, but can't
see the scan_state, where the if-stack currently resides. I could give
up on the notion of a per-file if-stack and just have one in pset, but
that might make life difficult for whoever is brave enough to tackle
\while loops. Or I could give pset a pointer to the current if-stack
inside the scan_state, or I could have pset hold a stack of stacks.
Unsure which way would be best.
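For what it's worth, a rough sketch of the "pointer to the current
if-stack" option (hypothetical names; none of this is in the patch
yet):

typedef struct IfStack
{
	char	   *states;			/* one char per level: 't', 'f' or 'z' */
	int			depth;
} IfStack;

/*
 * Each scan_state would own one IfStack; pset merely points at whichever
 * stack is active, so get_prompt() can reach it without committing psql
 * to a single global stack.
 */
typedef struct PsqlSettingsSketch
{
	IfStack    *active_ifstack;	/* points into the current scan_state */
	/* ... the existing pset fields ... */
} PsqlSettingsSketch;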


Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2017-02-11 Thread Andreas Karlsson

On 02/02/2015 03:10 PM, Andres Freund wrote:

I think we should instead just use the new index, repoint the
dependencies onto the new oid, and then afterwards, when dropping,
rename the new index onto the old one. That means the oid of the
index will change and requires some less-than-pretty grovelling around
dependencies, but it still seems preferable to what we're otherwise
discussing here.


I think that sounds like a good plan. The oid change does not seem like
too big a deal to me, especially since that is what users get now too.
Do you still think this is the right way to solve this?


I have attached my work-in-progress patch, which implements this
approach and is very heavily based on Michael's previous work. There
are some things left to do, but I think I should have a patch ready for
the next commitfest if people still like this type of solution.


I also changed index_set_state_flags() to be transactional, since I
wanted the old index to become invalid at exactly the same time as the
new one becomes valid. From reviewing the code, that seems like a safe
change.


A couple of bike shedding questions:

- Is the syntax "REINDEX <type> CONCURRENTLY <name>" ok?

- What should we do with REINDEX DATABASE CONCURRENTLY and the system
catalog? I do not think we can reindex the system catalog concurrently
safely, so what should REINDEX DATABASE do with the catalog indexes?
Skip them, reindex them while taking locks, or just error out?


- What level of information should be output in VERBOSE mode?

What remains to be implemented:

- Support for exclusion constraints
- Look more into how I handle constraints (currently the temporary
index also seems to get the PRIMARY KEY flag)

- Support for the VERBOSE flag
- More testing to catch bugs

Andreas

diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 306def4a15..ca1aeca65f 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -923,7 +923,8 @@ ERROR:  could not serialize access due to read/write dependencies among transact
 
 
  Acquired by VACUUM (without FULL),
- ANALYZE, CREATE INDEX CONCURRENTLY, and
+ ANALYZE, CREATE INDEX CONCURRENTLY,
+ REINDEX CONCURRENTLY,
  ALTER TABLE VALIDATE and other
  ALTER TABLE variants (for full details see
  ).
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 3908ade37b..24464020cd 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } name
+REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
 
  
 
@@ -68,9 +68,12 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } CONCURRENTLY option failed, leaving
   an invalid index. Such indexes are useless but it can be
   convenient to use REINDEX to rebuild them. Note that
-  REINDEX will not perform a concurrent build. To build the
-  index without interfering with production you should drop the index and
-  reissue the CREATE INDEX CONCURRENTLY command.
+  REINDEX will perform a concurrent build if 
+  CONCURRENTLY is specified. To build the index without interfering
+  with production you should drop the index and reissue either the
+  CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
+  command. Indexes of toast relations can be rebuilt with REINDEX
+  CONCURRENTLY.
  
 
 
@@ -152,6 +155,21 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } 
 

+CONCURRENTLY
+
+ 
+  When this option is used, PostgreSQL will rebuild the
+  index without taking any locks that prevent concurrent inserts,
+  updates, or deletes on the table; whereas a standard reindex build
+  locks out writes (but not reads) on the table until it's done.
+  There are several caveats to be aware of when using this option
+   see .
+ 
+
+   
+
+   
 VERBOSE
 
  
@@ -231,6 +249,172 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } 
 
+  
+   Rebuilding Indexes Concurrently
+
+   
+   index
+   rebuilding concurrently
+   
+
+   
+Rebuilding an index can interfere with regular operation of a database.
+Normally PostgreSQL locks the table whose index is rebuilt
+against writes and performs the entire index build with a single scan of the
+table. Other transactions can still read the table, but if they try to
+insert, update, or delete rows in the table they will block until the
+index rebuild is finished. This could have a severe effect if the system is
+a live production database. Very large tables can take many hours to be
+indexed, and even for smaller tables, an index rebuild can lock out writers
+for periods that are unacceptably long for a production system.
+   
+
+   
+PostgreSQL supports rebuilding 

Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)

2017-02-11 Thread Fabien COELHO



Just realized that '?' means "unknown transactional status" in %x. That
might cause confusion if a person had a prompt of %x%R. Is that enough
reason to pick a different cue?


Argh.

"\?\.?[tfz]" seems distinctive enough. Note that %R uses "'=-*^!$( and %x 
uses *!?, which means that they already share 2 characters, so adding ? 
does not seem like a big issue if it was not one before.


Otherwise, maybe "&" or "%", but it is less about a condition.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)

2017-02-11 Thread Corey Huinker
On Sat, Feb 11, 2017 at 2:43 AM, Fabien COELHO  wrote:

>
> Ok, so that's not just PROMPT_READY, that's every prompt...which might be
>> ok. ? is a great optional cue, and you're thinking on 2 levels deep, 2nd
>> level always being '.'?
>>
>
> Yep. The idea is to keep it short, but to still have something to say
> "there are more levels" in the stack, hence the one dot. Basically I just
> compressed your 4 level proposal, and added a separator to deal with the
> preceding stuff and suggest the conditional.
>
> --
> Fabien.
>

Just realized that '?' means "unknown transactional status" in %x. That
might cause confusion if a person had a prompt of %x%R. Is that enough
reason to pick a different cue?


Re: [HACKERS] Access inside pg_node_tree from query?

2017-02-11 Thread Ryan Murphy
>
> There are no in-core operators or functions to manipulate pg_node_tree.
>

Thanks Michael, just checking!


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-11 Thread Alexander Korotkov
On Sat, Feb 11, 2017 at 4:17 PM, Tomas Vondra 
wrote:

> On 02/11/2017 01:21 PM, Alexander Korotkov wrote:
>
>> Hi, Tomas!
>>
>> On Sat, Feb 11, 2017 at 2:28 AM, Tomas Vondra
>> >
>> wrote:
>>
>> As discussed at the Developer meeting ~ a week ago, I've ran a
>> number of benchmarks on the commit, on a small/medium-size x86
>> machines. I currently don't have access to a machine as big as used
>> by Alexander (with 72 physical cores), but it seems useful to verify
>> the patch does not have negative impact on smaller machines.
>>
>> In particular I've ran these tests:
>>
>> * r/o pgbench
>> * r/w pgbench
>> * 90% reads, 10% writes
>> * pgbench with skewed distribution
>> * pgbench with skewed distribution and skipping
>>
>>
>> Thank you very much for your efforts!
>> I took a look at these tests.  One thing caught my eye.  You warm up
>> the database using a pgbench run.  Did you consider using pg_prewarm
>> instead?
>>
>> SELECT sum(x.x) FROM (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE
>> relkind IN ('i', 'r') ORDER BY oid) x;
>>
>> In my experience pg_prewarm both takes less time and leaves less
>> variation afterwards.
>>
>>
> I've considered it, but the problem I see in using pg_prewarm for
> benchmarking purposes is that it only loads the data into memory, but it
> does not modify the tuples (so all tuples have the same xmin/xmax, no dead
> tuples, ...), it does not set usage counters on the buffers and also does
> not generate any clog records.
>

Yes, but please note that pgbench runs VACUUM first, so all the tuples
would be hinted.  For the xmin/xmax and clog effects to show up, you
should run the subsequent pgbench with --no-vacuum.  Also, usage
counters on buffers only matter when eviction happens, i.e. when the
data doesn't fit into shared_buffers.

Also, you use pgbench -S to warm up before the read-only test.  I think
pg_prewarm would be much better there, unless your data is bigger than
shared_buffers.

> I don't think there's a lot of variability in the results I measured. If
> you look at (max-min) for each combination of parameters, the delta is
> generally within 2% of average, with a very few exceptions, usually caused
> by the first run (so perhaps the warmup should be a bit longer).


You could also run pg_prewarm before the warmup pgbench for the
read-write test.  My intuition is that you would get more stable
results with a shorter warmup.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [ patch ] pg_dump: new --custom-fetch-table and --custom-fetch-value parameters

2017-02-11 Thread Andrea Urbani
I'm a beginner here... anyway I'll try to share my ideas.

My situation has changed for the worse: I'm no longer able to make a
pg_dump either with my custom fetch value (I have tried "1" as the
value = one row at a time) or without "--column-inserts":

pg_dump: Dumping the contents of table "tDocumentsFiles" failed: PQgetResult() 
failed.
pg_dump: Error message from server: ERROR:  out of memory
DETAIL:  Failed on request of size 1073741823.
pg_dump: The command was: COPY public."tDocumentsFiles" ("ID_Document", 
"ID_File", "Name", "FileName", "Link", "Note", "Picture", "Content", 
"FileSize", "FileDateTime", "DrugBox", "DrugPicture", "DrugInstructions") TO 
stdout;

I don't know if the Kyotaro Horiguchi patch will solve this, because,
again, I'm not able to fetch even a single row.
I have a similar problem trying to read and write the blob fields from
my program.  Currently I work in pieces:
Read
  r1) I get the length of the blob field
  r2) I check the available free memory (on the client PC)
  r3) I read pieces of the blob field, according to the free memory,
appending them to a physical file
Write
  w1) I check the length of the file to save inside the blob
  w2) I check the available free memory (on the client PC)
  w3) I create a temporary table on the server
  w4) I add lines to this temporary table, writing pieces of the file
according to the free memory
  w5) I ask the server to write, inside the final blob field, the
concatenation of the rows of the temporary table
The read and write are working now.
Probably the free memory check should be done on both sides (client and
server [does a function/view with the available free memory exist?]),
taking the smallest one.
What do you think about using a similar approach in pg_dump?
a) go through the table getting the size of each row / field
b) when the size of the row or of the field is bigger than the value
(provided or stored somewhere), read pieces of the field till the end
(a rough sketch of the piecewise read follows below)
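A rough client-side sketch of the piecewise read, using the table and
column names from the COPY command above; the libpq calls are real, but
error handling and proper parameter escaping are omitted:

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Fetch a large bytea column 'chunk' bytes at a time with substring(),
 * appending each piece to a local file, so the whole value never has
 * to fit in memory on either side at once.
 */
static void
fetch_blob_in_pieces(PGconn *conn, const char *id, FILE *out, long chunk)
{
	long		offset = 1;		/* substring() is 1-based */

	for (;;)
	{
		char		sql[256];
		PGresult   *res;
		int			len;

		snprintf(sql, sizeof(sql),
				 "SELECT substring(\"Content\" from %ld for %ld) "
				 "FROM public.\"tDocumentsFiles\" WHERE \"ID_File\" = %s",
				 offset, chunk, id);

		res = PQexecParams(conn, sql, 0, NULL, NULL, NULL, NULL, 1);
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			PQclear(res);
			break;
		}

		len = PQgetlength(res, 0, 0);	/* binary result: raw bytes */
		if (len > 0)
			fwrite(PQgetvalue(res, 0, 0), 1, len, out);
		PQclear(res);

		if (len < chunk)
			break;				/* reached the end of the value */
		offset += chunk;
	}
}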

PS: I have seen there are "large objects" that can work via streams.
My files are actually not bigger than 1 GB, but, ok, maybe in the
future I will use them instead of blobs.

Thank you 
Andrea


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Index Scans

2017-02-11 Thread Robert Haas
On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila  wrote:
>> Why can't we rely on _bt_walk_left?
>
> The reason is mentioned in comments, but let me try to explain with
> some example.  When you reach that point of code, it means that either
> the current page (assume page number is 10) doesn't contain any
> matching items or it is a half-dead page, both of which indicate that
> we have to move to the previous page.   Now, before checking if the
> current page contains matching items, we signal parallel machinery
> (via _bt_parallel_release) to allow workers to read the previous page
> (assume previous page number is 9).  So it is quite possible that
> after deciding that current page (page number 10) doesn't contain any
> matching tuples if we directly move to the previous page (in this case
> it will be 9) by using _bt_walk_left, some other worker would have
> read page 9.  In short, if we directly use _bt_walk_left(), then we
> are prone to returning some of the values twice as multiple workers
> can read the same page.

But ... the entire point of the seize-and-release stuff is to avoid
this problem.  You're supposed to seize the scan, read the current
page, walk left, store the page you find in the scan, and then release
the scan.  The entire point of that stuff is that when somebody's
advancing the scan to the next page, everybody else waits for them to
get done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums by default?

2017-02-11 Thread Robert Haas
On Fri, Feb 10, 2017 at 7:38 PM, Tomas Vondra
 wrote:
> Incidentally, I've been dealing with a checksum failure reported by a
> customer last week, and based on the experience I tend to agree that we
> don't have the tools needed to deal with checksum failures. I think such
> tooling should be a 'must have' for enabling checksums by default.
>
> In this particular case the checksum failure is particularly annoying
> because it happens during recovery (on a standby, after a restart), during
> startup, so FATAL means shutdown.
>
> I've managed to inspect the page in different way (dd and pageinspect from
> another instance), and it looks fine - no obvious data corruption, the only
> thing that seems borked is the checksum itself, and only three consecutive
> bits are flipped in the checksum. So this doesn't seem like a "stale
> checksum" - hardware issue is a possibility (the machine has ECC RAM
> though), but it might just as easily be a bug in PostgreSQL, when something
> scribbles over the checksum due to a buffer overflow, just before we write
> the buffer to the OS. So 'false failures' are not entirely impossible thing.
>
> And no, backups may not be a suitable solution - the failure happens on a
> standby, and the page (luckily) is not corrupted on the master. Which means
> that perhaps the standby got corrupted by a WAL, which would affect the
> backups too. I can't verify this, though, because the WAL got removed from
> the archive, already. But it's a possibility.
>
> So I think we're not ready to enable checksums by default for everyone, not
> until we can provide tools to deal with failures like this (I don't think
> users will be amused if we tell them to use 'dd' and inspect the pages in a
> hex editor).
>
> ISTM the way forward is to keep the current default (disabled), but to allow
> enabling checksums on the fly. That will mostly fix the issue for people who
> actually want checksums but don't realize they need to enable them at initdb
> time (and starting from scratch is not an option for them), are running on
> good hardware and are capable of dealing with checksum errors if needed,
> even without more built-in tooling.
>
> Being able to disable checksums on the fly is nice, but it only really
> solves the issue of extra overhead - it does really help with the failures
> (particularly when you can't even start the database, because of a checksum
> failure in the startup phase).
>
> So, shall we discuss what tooling would be useful / desirable?

FWIW, I appreciate this analysis and I think it's exactly the kind of
thing we need to set a strategy for moving forward.
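As a strawman for that tooling discussion, an offline page checker can
be quite small.  A sketch, assuming the PostgreSQL headers are on the
include path (checksum_impl.h is explicitly meant to be includable by
external programs) and with error handling omitted:

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/checksum_impl.h"

#include <stdio.h>
#include <stdlib.h>

/* usage: pg_checkpage <segment file> <block number within relation> */
int
main(int argc, char **argv)
{
	char		page[BLCKSZ];
	FILE	   *f = fopen(argv[1], "rb");
	BlockNumber blkno = (BlockNumber) atoi(argv[2]);
	uint16		stored,
				computed;

	/*
	 * The checksum mixes in the block number, hence the modulo for
	 * files past the first segment.
	 */
	fseek(f, (long) (blkno % RELSEG_SIZE) * BLCKSZ, SEEK_SET);
	if (fread(page, 1, BLCKSZ, f) != BLCKSZ)
		return 2;

	stored = ((PageHeader) page)->pd_checksum;
	computed = pg_checksum_page(page, blkno);

	printf("block %u: stored %u, computed %u -> %s\n",
		   blkno, stored, computed,
		   stored == computed ? "OK" : "MISMATCH");
	return stored == computed ? 0 : 1;
}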

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-11 Thread Ashutosh Sharma
> FWIW it might be interesting to have comparable results from the same
> benchmarks I did. The scripts available in the git repositories should not
> be that hard to tweak. Let me know if you're interested and need help with
> that.
>

Sure, I will have a look into those scripts once I am done with the
simple pgbench testing. Thanks and sorry for the delayed response.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: two slab-like memory allocators

2017-02-11 Thread Tomas Vondra

On 02/11/2017 02:33 AM, Andres Freund wrote:

On 2017-02-11 02:13:59 +0100, Tomas Vondra wrote:

On 02/09/2017 10:37 PM, Andres Freund wrote:

Hi,

On 2016-12-13 01:45:13 +0100, Tomas Vondra wrote:

 src/backend/utils/mmgr/Makefile   |   2 +-
 src/backend/utils/mmgr/aset.c | 115 +
 src/backend/utils/mmgr/memdebug.c | 131 ++
 src/include/utils/memdebug.h  |  22 +++
 4 files changed, 156 insertions(+), 114 deletions(-)
 create mode 100644 src/backend/utils/mmgr/memdebug.c


I'm a bit loathe to move these to a .c file - won't this likely make
these debugging tools even slower?  Seems better to put some of them
in a header as static inlines (not randomize, but the rest).



Do you have any numbers to support that? AFAICS compilers have got
really good at inlining stuff on their own.


Unless you use LTO, they can't inline across translation units. And
using LTO is slow enough for linking that it's not that much fun to use,
as it makes compile-edit-compile cycles essentially take as long as a
full rebuild.



From 43aaabf70b979b172fd659ef4d0ef129fd78d72d Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Wed, 30 Nov 2016 15:36:23 +0100
Subject: [PATCH 2/3] slab allocator

---
 src/backend/replication/logical/reorderbuffer.c |  82 +--
 src/backend/utils/mmgr/Makefile |   2 +-
 src/backend/utils/mmgr/slab.c   | 803 
 src/include/nodes/memnodes.h|   2 +-
 src/include/nodes/nodes.h   |   1 +
 src/include/replication/reorderbuffer.h |  15 +-
 src/include/utils/memutils.h|   9 +


I'd like to see the reorderbuffer changes split into a separate commit
from the slab allocator introduction.



I rather dislike patches that only add a bunch of code, without actually
using it anywhere.



But if needed, this is trivial to do at commit time - just don't
commit the reorderbuffer bits.


Meh.



+ * Each block includes a simple bitmap tracking which chunks are used/free.
+ * This makes it trivial to check if all chunks on the block are free, and
+ * eventually free the whole block (which is almost impossible with a global
+ * freelist of chunks, storing chunks from all blocks).


Why is checking a potentially somewhat long-ish bitmap better than a
simple counter, or a "linked list" of "next free chunk-number" or such
(where free chunks simply contain the id of the subsequent chunk)?
Using a list instead of a bitmap would also make it possible to get
'lifo' behaviour, which is good for cache efficiency.  A simple
chunk-number based singly linked list would only imply a minimum
allocation size of 4 - that seems perfectly reasonable?



A block-level counter would be enough to decide if all chunks on the block
are free, but it's not sufficient to identify which chunks are free /
available for reuse.

The bitmap only has a single bit per chunk, so I find "potentially
long-ish" a bit misleading. Any linked list implementation will require
much more per-chunk overhead - as the chunks are fixed-length, it's
possible to use a chunk index (instead of 64-bit pointers) to save some
space. But with large blocks / small chunks that's still at least 2 or
4 bytes per index, and you'd need two indexes per chunk (to implement a
doubly-linked list, to make add/remove efficient).



For example assume 8kB block and 64B chunks, i.e. 128 chunks. With bitmap
that's 16B to track all free space on the block. Doubly linked list would
require 1B per chunk index, 2 indexes per chunk. That's 128*2 = 256B.



I have a hard time believing the cache efficiency of linked lists
(which may or may not be real in this case) outweighs this, but if you
want to try, be my guest.


I'm not following - why would there be overhead in anything for
allocations bigger than 4 (or maybe 8) bytes? You can store the list
(via chunk ids, not pointers) inside the chunks itself, where data
otherwise would be.  And I don't see why you'd need a doubly linked
list, as the only operations that are needed are to push to the front of
the list, and to pop from the front of the list - and both operations
are simple to do with a singly linked list?



Oh! I had not considered storing the chunk indexes (for the linked
lists) in the chunks themselves, which obviously eliminates the
overhead concerns, and you're right that a singly-linked list should be
enough.


There will be some minimum-chunk-size requirement, depending on the
block size / chunk size. I wonder whether it makes sense to try to be
smart and make it dynamic, so that we only require 1B or 2B for cases
when only that many chunks fit into a block, or just say that it's 4B
and be done with it.


I mean 2^32 chunks ought to be enough for anyone, right?
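For concreteness, a sketch of that chunk-index freelist (hypothetical
layout, not the actual patch; it implies the 4-byte minimum chunk size
discussed above, since free chunks store an int32 in their payload):

#include "postgres.h"

typedef struct SlabBlockSketch
{
	int32		firstFreeChunk;	/* index of first free chunk, or -1 */
	int32		nfree;			/* free chunks on this block */
	char		chunks[FLEXIBLE_ARRAY_MEMBER];
} SlabBlockSketch;

static void *
slab_alloc_from_block(SlabBlockSketch *block, Size chunkSize)
{
	char	   *chunk;

	if (block->firstFreeChunk < 0)
		return NULL;			/* block is full */

	chunk = block->chunks + (Size) block->firstFreeChunk * chunkSize;

	/* pop: a free chunk's payload holds the index of the next free one */
	block->firstFreeChunk = *(int32 *) chunk;
	block->nfree--;

	return chunk;
}

static void
slab_free_to_block(SlabBlockSketch *block, void *ptr, Size chunkSize)
{
	int32		idx = (int32) (((char *) ptr - block->chunks) / chunkSize);

	/* push onto the LIFO freelist */
	*(int32 *) ptr = block->firstFreeChunk;
	block->firstFreeChunk = idx;
	block->nfree++;
}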

I'm still not buying the cache efficiency argument, though. One of the 
reasons is that the implementation prefers blocks with fewer free chunks 
when ...

Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-11 Thread Tomas Vondra

On 02/11/2017 01:42 PM, Alexander Korotkov wrote:


I think it would make sense to run more kinds of tests.  Could you try
the set of tests provided by Tomas Vondra?
Even if we don't see a win in some of the tests, it would still be
valuable to see that there is no regression there.



FWIW it shouldn't be difficult to tweak my scripts and run them on 
another machine. You'd have to customize the parameters (scales, client 
counts, ...) and there are a few hard-coded paths, but that's about it.


regards
Tomas

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-11 Thread Tomas Vondra

On 02/11/2017 01:21 PM, Alexander Korotkov wrote:

Hi, Tomas!

On Sat, Feb 11, 2017 at 2:28 AM, Tomas Vondra
> wrote:

As discussed at the Developer meeting ~ a week ago, I've ran a
number of benchmarks on the commit, on a small/medium-size x86
machines. I currently don't have access to a machine as big as used
by Alexander (with 72 physical cores), but it seems useful to verify
the patch does not have negative impact on smaller machines.

In particular I've ran these tests:

* r/o pgbench
* r/w pgbench
* 90% reads, 10% writes
* pgbench with skewed distribution
* pgbench with skewed distribution and skipping


Thank you very much for your efforts!
I took a look at these tests.  One thing caught my eye.  You warm up the
database using a pgbench run.  Did you consider using pg_prewarm instead?

SELECT sum(x.x) FROM (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE
relkind IN ('i', 'r') ORDER BY oid) x;

In my experience pg_prewarm both takes less time and leaves less
variation afterwards.



I've considered it, but the problem I see in using pg_prewarm for 
benchmarking purposes is that it only loads the data into memory, but it 
does not modify the tuples (so all tuples have the same xmin/xmax, no 
dead tuples, ...), it does not set usage counters on the buffers and 
also does not generate any clog records.


I don't think there's a lot of variability in the results I measured. If 
you look at (max-min) for each combination of parameters, the delta is 
generally within 2% of average, with a very few exceptions, usually 
caused by the first run (so perhaps the warmup should be a bit longer).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-11 Thread Alexander Korotkov
On Wed, Feb 8, 2017 at 5:00 PM, Bernd Helmle  wrote:

> Am Dienstag, den 07.02.2017, 16:48 +0300 schrieb Alexander Korotkov:
> > But win isn't
> > as high as I observed earlier.  And I wonder why absolute numbers are
> > lower
> > than in our earlier experiments.  We used IBM E880 which is actually
> > two
>
> Did you run your tests on bare metal or were they also virtualized?
>

I run tests on bare metal.

> nodes with interconnect.  However interconnect is not fast enough to
> > make
> > one PostgreSQL instance work on both nodes.  Thus, used half of IBM
> > E880
> > which has 4 sockets and 32 physical cores.  While you use IBM E850
> > which is
> > really single node with 4 sockets and 48 physical cores.  Thus, it
> > seems
> > that you have lower absolute numbers on more powerful hardware.  That
> > makes
> > me uneasy and I think we probably don't get the best from this
> > hardware.
> > Just in case, do you use SMT=8?
>
> Yes, SMT=8 was used.
>
> The machine has 4 sockets, 8 Core each, 3.7 GHz clock frequency. The
> Ubuntu LPAR running on PowerVM isn't using all physical cores,
> currently 28 cores are assigned (=224 SMT Threads). The other cores are
> dedicated to the PowerVM hypervisor and a (very) small AIX LPAR.
>

Thank you very much for the explanation.

Thus, I see reasons why the absolute results in your tests are lower
than in my previous tests.
1.  You use 28 physical cores while I was using 32 physical cores.
2.  You run tests in PowerVM while I was running tests on bare metal.
PowerVM could have some overhead.
3.  I guess you run pgbench on the same machine, while in my tests
pgbench was running on another node of the IBM E880.

Therefore, with lower absolute numbers in your tests, the win from the
LWLock optimization is also lower.  That is understandable.  But the
win from the LWLock optimization is clearly visible and definitely
exceeds the variation.

I think it would make sense to run more kinds of tests.  Could you try
the set of tests provided by Tomas Vondra?
Even if we don't see a win in some of the tests, it would still be
valuable to see that there is no regression there.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Access inside pg_node_tree from query?

2017-02-11 Thread Michael Paquier
On Sat, Feb 11, 2017 at 7:19 PM, Ryan Murphy  wrote:
> Quick question, just curious - is there a way to access the members of a
> `pg_node_tree` value within a Postgres query?  By pg_node_tree I mean for
> example, the `ev_qual` field in the `pg_rewrite` table.  By "access the
> members" I mean in the same way that you can access the members of a json or
> jsonb value (using -> or ->>).  Is there anything analogous for
> pg_node_tree?

There are no in-core operators or functions to manipulate pg_node_tree.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-11 Thread Alexander Korotkov
Hi, Tomas!

On Sat, Feb 11, 2017 at 2:28 AM, Tomas Vondra 
wrote:

> As discussed at the Developer meeting ~ a week ago, I've ran a number of
> benchmarks on the commit, on a small/medium-size x86 machines. I currently
> don't have access to a machine as big as used by Alexander (with 72
> physical cores), but it seems useful to verify the patch does not have
> negative impact on smaller machines.
>
> In particular I've ran these tests:
>
> * r/o pgbench
> * r/w pgbench
> * 90% reads, 10% writes
> * pgbench with skewed distribution
> * pgbench with skewed distribution and skipping
>

Thank you very much for your efforts!
I took a look at these tests.  One thing caught my eye.  You warm up the
database using a pgbench run.  Did you consider using pg_prewarm instead?

SELECT sum(x.x) FROM (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE
relkind IN ('i', 'r') ORDER BY oid) x;

In my experience pg_prewarm both takes less time and leaves less variation
afterwards.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] DROP SUBSCRIPTION and ROLLBACK

2017-02-11 Thread Petr Jelinek
On 10/02/17 19:55, Masahiko Sawada wrote:
> On Thu, Feb 9, 2017 at 12:44 AM, Petr Jelinek
>  wrote:
>> On 08/02/17 07:40, Masahiko Sawada wrote:
>>> On Wed, Feb 8, 2017 at 9:01 AM, Michael Paquier
>>>  wrote:
 On Wed, Feb 8, 2017 at 1:30 AM, Fujii Masao  wrote:
> On Wed, Feb 8, 2017 at 12:26 AM, Petr Jelinek
>  wrote:
>> For example what happens if apply crashes during the DROP
>> SUBSCRIPTION/COMMIT and is not started because the delete from catalog
>> is now visible so the subscription is no longer there?
>
> Another idea is to treat DROP SUBSCRIPTION in the same way as VACUUM, 
> i.e.,
> make it emit an error if it's executed within user's transaction block.

 It seems to me that this is exactly Petr's point: using
 PreventTransactionChain() to prevent things to happen.
>>>
>>> Agreed. It's better to prevent to be executed inside user transaction
>>> block. And I understood there is too many failure scenarios we need to
>>> handle.
>>>
> Also DROP SUBSCRIPTION should call CommitTransactionCommand() just
> after removing the entry from pg_subscription, then connect to the 
> publisher
> and remove the replication slot.

 For consistency that may be important.
>>>
>>> Agreed.
>>>
>>> Attached patch, please give me feedback.
>>>
>>
>> This looks good (and similar to what initial patch had btw). Works fine
>> for me as well.
>>
>> Remaining issue is, what to do about CREATE SUBSCRIPTION then, there are
>> similar failure scenarios there, should we prevent it from running
>> inside transaction as well?
>>
> 
> Hmm, after further thought I have doubts about the approach currently
> being discussed. For example, please imagine the case where CREATE
> SUBSCRIPTION creates the subscription successfully but fails to create
> the replication slot for whatever reason, and then DROP SUBSCRIPTION
> drops the subscription but dropping the replication slot fails. In such
> cases, CREATE SUBSCRIPTION and DROP SUBSCRIPTION return an ERROR but the
> subscription is in fact created and dropped successfully. I think this
> behaviour would confuse the user.
>
> I think we should just prevent calling DROP SUBSCRIPTION in a user's
> transaction block. Or I guess that it could be better to separate the
> starting/stopping of logical replication from subscription management.
> 

We need to stop the replication worker(s) in order to be able to drop
the slot. There is no such issue with startup of the worker, as that one
is launched by the launcher after the transaction has committed.

IMO the best option is to just disallow DROP/CREATE SUBSCRIPTION inside
a transaction block and not do any commits inside those commands (so
that there are no rollbacks, which solves your initial issue, I
believe). That way a failure to create/drop the slot will result in the
subscription not being created/dropped, which is what we want.
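In code form, roughly (a sketch, not a committed design; the guard is
the same one VACUUM and CREATE INDEX CONCURRENTLY already use):

void
DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
{
	/* refuse to run inside a user transaction block */
	PreventTransactionChain(isTopLevel, "DROP SUBSCRIPTION");

	/*
	 * ... remove the pg_subscription entry, stop the workers, and drop
	 * the remote slot, with no internal commit: either everything
	 * happens or the whole command errors out ...
	 */
}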

Note that we do give users options to not create and not drop slots if
they wish, so we should really treat slot-related failures as command
errors.

I don't want subscription setup to be a 20-step process where you have
to deal with various things on multiple servers, when in many cases it
can be just one plain simple command on each side.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

2017-02-11 Thread Michael Banck
Hi,

Am Samstag, den 11.02.2017, 11:25 +0100 schrieb Michael Banck:
> Am Samstag, den 11.02.2017, 11:07 +0100 schrieb Magnus Hagander:
> > As for the code, while I haven't tested it, isn't the "checkpoint
> > completed" message in the wrong place? Doesn't PQsendQuery() complete
> > immediately, and the check needs to be put *after* the PQgetResult()
> > call?
> 
> I guess you're right, I've moved it further down. There is in fact a
> message about the xlog location (unless you switch off wal entirely),
> but having another one right before that mentioning the completed
> checkpoint sounds ok to me.
> 
> There are also some inconsistencies around which messages are prepended
> with "pg_basebackup: " and which are translatable; I guess all messages
> printed on --verbose should be translatable? Also, as almost all
> messages have a "pg_basebackup: " prefix, I've added it to the rest.

Sorry, there were two typos in the last patch; I've attached a fixed
one.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c9dd62c..a298e5c 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -660,6 +660,14 @@ PostgreSQL documentation
   Notes
 
   
+   At the beginning of the backup, a checkpoint needs to be written on the
+   server the backup is taken from.  Especially if the option
+   --checkpoint=fast is not used, this can take some time
+   during which pg_basebackup will be idle on the
+   server it is running on.
+  
+
+  
The backup will include all files in the data directory and tablespaces,
including the configuration files and any additional files placed in the
directory by third parties, except certain temporary files managed by
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index b6463fa..60200a9 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1754,6 +1754,11 @@ BaseBackup(void)
 	if (maxrate > 0)
 		maxrate_clause = psprintf("MAX_RATE %u", maxrate);
 
+	if (verbose)
+		fprintf(stderr,
+_("%s: initiating base backup, waiting for checkpoint to complete\n"),
+progname);
+
 	basebkp =
 		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
  escaped_label,
@@ -1791,6 +1796,9 @@ BaseBackup(void)
 
 	strlcpy(xlogstart, PQgetvalue(res, 0, 0), sizeof(xlogstart));
 
+	if (verbose)
+		fprintf(stderr, _("%s: checkpoint completed\n"), progname);
+
 	/*
 	 * 9.3 and later sends the TLI of the starting point. With older servers,
 	 * assume it's the same as the latest timeline reported by
@@ -1804,8 +1812,8 @@ BaseBackup(void)
 	MemSet(xlogend, 0, sizeof(xlogend));
 
 	if (verbose && includewal != NO_WAL)
-		fprintf(stderr, _("transaction log start point: %s on timeline %u\n"),
-xlogstart, starttli);
+		fprintf(stderr, _("%s: transaction log start point: %s on timeline %u\n"),
+progname, xlogstart, starttli);
 
 	/*
 	 * Get the header
@@ -1907,7 +1915,7 @@ BaseBackup(void)
 	}
 	strlcpy(xlogend, PQgetvalue(res, 0, 0), sizeof(xlogend));
 	if (verbose && includewal != NO_WAL)
-		fprintf(stderr, "transaction log end point: %s\n", xlogend);
+		fprintf(stderr, _("%s: transaction log end point: %s\n"), progname, xlogend);
 	PQclear(res);
 
 	res = PQgetResult(conn);
@@ -2048,7 +2056,7 @@ BaseBackup(void)
 	}
 
 	if (verbose)
-		fprintf(stderr, "%s: base backup completed\n", progname);
+		fprintf(stderr, _("%s: base backup completed\n"), progname);
 }
 
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

2017-02-11 Thread Michael Banck
Hi,


Am Samstag, den 11.02.2017, 11:07 +0100 schrieb Magnus Hagander:


> As for the code, while I haven't tested it, isn't the "checkpoint
> completed" message in the wrong place? Doesn't PQsendQuery() complete
> immediately, and the check needs to be put *after* the PQgetResult()
> call?

I guess you're right, I've moved it further down. There is in fact a
message about the xlog location (unless you switch off wal entirely),
but having another one right before that mentioning the completed
checkpoint sounds ok to me.
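In other words (a simplified sketch of the intended libpq flow, not the
exact patch hunks):

	if (verbose)
		fprintf(stderr, _("%s: initiating base backup, waiting for checkpoint to complete\n"),
				progname);

	/* PQsendQuery() only queues the command and returns immediately ... */
	PQsendQuery(conn, basebkp);

	/*
	 * ... it is PQgetResult() that blocks until the server has finished
	 * the checkpoint and sent back the starting position.
	 */
	res = PQgetResult(conn);

	if (verbose)
		fprintf(stderr, _("%s: checkpoint completed\n"), progname);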

There are also some inconsistencies around which messages are prepended
with "pg_basebackup: " and which are translatable; I guess all messages
printed on --verbose should be translatable? Also, as almost all
messages have a "pg_basebackup: " prefix, I've added it to the rest.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c9dd62c..a298e5c 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -660,6 +660,14 @@ PostgreSQL documentation
   Notes
 
   
+   At the beginning of the backup, a checkpoint needs to be written on the
+   server the backup is taken from.  Especially if the option
+   --checkpoint=fast is not used, this can take some time
+   during which pg_basebackup will be idle on the
+   server it is running on.
+  
+
+  
The backup will include all files in the data directory and tablespaces,
including the configuration files and any additional files placed in the
directory by third parties, except certain temporary files managed by
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index b6463fa..874b6d6 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1754,6 +1754,11 @@ BaseBackup(void)
 	if (maxrate > 0)
 		maxrate_clause = psprintf("MAX_RATE %u", maxrate);
 
+	if (verbose)
+		fprintf(stderr,
+_("%s: initiating base backup, waiting for checkpoint to complete\n"),
+progname);
+
 	basebkp =
 		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
  escaped_label,
@@ -1791,6 +1796,9 @@ BaseBackup(void)
 
 	strlcpy(xlogstart, PQgetvalue(res, 0, 0), sizeof(xlogstart));
 
+	if (verbose)
+		fprintf(stderr, _("%s: checkpoint completed\n"), progname);
+
 	/*
 	 * 9.3 and later sends the TLI of the starting point. With older servers,
 	 * assume it's the same as the latest timeline reported by
@@ -1804,8 +1812,8 @@ BaseBackup(void)
 	MemSet(xlogend, 0, sizeof(xlogend));
 
 	if (verbose && includewal != NO_WAL)
-		fprintf(stderr, _("transaction log start point: %s on timeline %u\n"),
-xlogstart, starttli);
+		fprintf(stderr, _("%s: transaction log start point: %s on timeline %u\n"),
+progname, xlogstart, starttli);
 
 	/*
 	 * Get the header
@@ -1907,7 +1915,7 @@ BaseBackup(void)
 	}
 	strlcpy(xlogend, PQgetvalue(res, 0, 0), sizeof(xlogend));
 	if (verbose && includewal != NO_WAL)
-		fprintf(stderr, "transaction log end point: %s\n", xlogend);
+		fprintf(stderr, _("%s: transaction log end point: %s\n"), progname, xlogend);
 	PQclear(res);
 
 	res = PQgetResult(conn);
@@ -2048,7 +2056,7 @@ BaseBackup(void)
 	}
 
 	if (verbose)
-		fprintf(stderr, "%s: base backup completed\n", progname);
+		fprintf(stderr, _("%s: base backup completed\n"), progname);
 }
 
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Access inside pg_node_tree from query?

2017-02-11 Thread Ryan Murphy
Hi Postgressers,

Quick question, just curious - is there a way to access the members of a
`pg_node_tree` value within a Postgres query?  By pg_node_tree I mean,
for example, the `ev_qual` field in the `pg_rewrite` table.  By "access
the members" I mean in the same way that you can access the members of a
json or jsonb value (using -> or ->>).  Is there anything analogous for
pg_node_tree?

Thanks!
Ryan


Re: [HACKERS] Logical replication existing data copy

2017-02-11 Thread Erik Rijkers

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


Apart from the one failing 'make check' test (test 'object_address'),
which I reported earlier, I find it easy to 'confuse' the replication.


I attach a script that is intended to test the default COPY DATA
behaviour.  There are two instances, initially without any replication.
The script initializes pgbench on the master, adds a serial column to
pgbench_history, and dump-restores the 4 pgbench tables to the future
replica.  It then empties the 4 pgbench tables on the 'replica'.  The
idea is that when logical replication is initiated, data will be
replicated from the master, the end result being 4 identical tables on
master and replica.


This often works, but it also fails far too often (in my hands).  I test
whether the tables are identical by comparing an md5 of an ordered
resultset from both replica and master.  I estimate that 1 in 5 tries
fails; 'fail' meaning a somewhat different table on the replica
(compared to the master), most often pgbench_accounts (typically there
are 10-30 differing rows).  There are no errors or warnings in either
logfile.  I'm not sure, but testing on faster machines seems to do
somewhat better ('better' meaning fewer replication errors).


Another, probably unrelated, problem occurs (much more rarely) when
executing 'DROP SUBSCRIPTION sub1' on the replica (see the beginning of
the script).  Sometimes that command hangs and then blocks shutdown of
the server.  I don't know how to recover from this -- I just have to
kill the replica server (the master server still obeys a normal
shutdown) and restart the instances.


The script accepts two parameters, scale and clients (used for pgbench
-s and -c, respectively).


I don't think I've managed to successfully run the script with more than 
1 client yet.


Can you have a look whether this is reproducible elsewhere?

thanks,

Erik Rijkers






#!/bin/bash

#  assumes both instances are running, on port 6972 and 6973

logfile1=$HOME/pg_stuff/pg_installations/pgsql.logical_replication/logfile.logical_replication
logfile2=$HOME/pg_stuff/pg_installations/pgsql.logical_replication2/logfile.logical_replication2

scale=1
if [[ ! "$1" == "" ]]
then
   scale=$1
fi

clients=1
if [[ ! "$2" == "" ]]
then
   clients=$2
fi

unset PGSERVICEFILE PGSERVICE PGPORT PGDATA PGHOST
export PGDATABASE=testdb

# (this script also uses a custom pgpassfile)

## just for info:
# env | grep PG
# psql -qtAXc "select current_setting('server_version')"

port1=6972
port2=6973

function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history
  md5_total[$port1]='-1'
  md5_total[$port2]='-2'
  for port in $port1 $port2
  do
md5_a=$(echo "select * from pgbench_accounts order by aid"|psql -qtAXp$port|md5sum|cut -b 1-9)
md5_b=$(echo "select * from pgbench_branches order by bid"|psql -qtAXp$port|md5sum|cut -b 1-9)
md5_t=$(echo "select * from pgbench_tellers  order by tid"|psql -qtAXp$port|md5sum|cut -b 1-9)
md5_h=$(echo "select * from pgbench_history  order by hid"|psql -qtAXp$port|md5sum|cut -b 1-9)
cnt_a=$(echo "select count(*) from pgbench_accounts"|psql -qtAXp $port)
cnt_b=$(echo "select count(*) from pgbench_branches"|psql -qtAXp $port)
cnt_t=$(echo "select count(*) from pgbench_tellers" |psql -qtAXp $port)
cnt_h=$(echo "select count(*) from pgbench_history" |psql -qtAXp $port)
md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
printf "$port a,b,t,h: %6d %6d %6d %6d" $cnt_a  $cnt_b  $cnt_t  $cnt_h
echo -n "   $md5_a  $md5_b  $md5_t  $md5_h"
if   [[ $port -eq $port1 ]]; then echo "   master"
elif [[ $port -eq $port2 ]]; then echo -n "   replica"
else  echo " ERROR  "
fi
  done
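  #  the two accumulated md5 totals match only when all four tables are identical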
  if [[ "${md5_total[6972]}" == "${md5_total[6973]}" ]]
  then
echo " ok"
  else
echo " NOK"
  fi
}

bail=0

pub_count=$( echo "select count(*) from pg_publication" | psql -qtAXp 6972 )
if  [[ $pub_count -ne 0 ]]
then
  echo "pub_count -ne 0 - deleting pub1 & bailing out"
  echo "drop publication if exists pub1" | psql -Xp 6972
  bail=1
fi
sub_count=$( echo "select count(*) from pg_subscription" | psql -qtAXp 6973 )
if  [[ $sub_count -ne 0 ]]
then
  echo "sub_count -ne 0 - deleting sub1 & bailing out"
  echo "drop subscription if exists sub1" | psql -Xp 6973
  rc=$?
  echo "(drop subscr. ) )  rc=$rc"
  bail=1
fi

pub_count=$( echo "select count(*) from pg_publication"  | psql -qtAXp 6972 )
sub_count=$( echo "select count(*) from pg_subscription" | psql -qtAXp 6973 )

#if [[ $bail -eq 1 ]]
#then
#if  [[ $pub_count -eq 0 ]] && [[ 

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

2017-02-11 Thread Magnus Hagander
On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck 
wrote:

> Hi,
>
> one take-away from the Gitlab Post-Mortem[1] appears to be that after
> their secondary lost replication, they were confused about what
> pg_basebackup was doing when they tried to rebuild it. It just sat there
> and did nothing (even with --verbose), so they assumed something was
> wrong with either the primary or the connection, and restarted it
> several times.
>
> AFAICT, it turns out the checkpoint was written on the master (they
> probably did not use -c fast), but this wasn't obvious to them:
>


Yeah, I've seen this happen to a number of people, and that sounds like
what happened here as well. I've considered things along the lines of the
patch you posted, but never got around to actually doing anything about it.



> ISTM that even with WAL streaming, nothing would be written on the
> client side until the checkpoint is complete, as do_pg_start_backup()
> runs the checkpoint and only returns the starting WAL location
> afterwards.
>
> The attached (untested) patch is meant to kick off a discussion on how to
> improve the situation; it mentions the checkpoint when --verbose is used
> and adds a paragraph about the checkpoint being run to the Notes section
> of the documentation.
>
>
Docs look good to me, other than claiming that pg_basebackup runs on a
server (it can run anywhere). I would just say "during which pg_basebackup
will appear idle". How does that sound to you?

As for the code, while I haven't tested it, isn't the "checkpoint
completed" message in the wrong place? Doesn't PQsendQuery() complete
immediately, so that the message needs to be put *after* the PQgetResult() call?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


[HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

2017-02-11 Thread Michael Banck
Hi,

one take-away from the Gitlab Post-Mortem[1] appears to be that after
their secondary lost replication, they were confused about what
pg_basebackup was doing when they tried to rebuild it. It just sat there
and did nothing (even with --verbose), so they assumed something was
wrong with either the primary or the connection, and restarted it
several times.

AFAICT, it turns out the checkpoint was written on the master (they
probably did not use -c fast), but this wasn't obvious to them:

"One of the engineers went to the secondary and wiped the data
directory, then ran pg_basebackup. Unfortunately pg_basebackup would
hang, producing no meaningful output, despite the --verbose option being
set."

[...]

"Unfortunately this did not resolve the problem of pg_basebackup not
starting replication immediately. One of the engineers decided to run it
with strace to see what it was blocking on. strace showed that
pg_basebackup was hanging in a poll call, but that did not provide any
other meaningful information that might have explained why."

[...]

"It would later be revealed by another engineer (who wasn't around at
the time) that this is normal behavior: pg_basebackup will wait for the
primary to start sending over replication data and it will sit and wait
silently until that time. Unfortunately this was not clearly documented
in our engineering runbooks nor in the official pg_basebackup document."

ISTM that even with WAL streaming, nothing would be written on the
client side until the checkpoint is complete, as do_pg_start_backup()
runs the checkpoint and only returns the starting WAL location
afterwards.

The attached (untested) patch is meant to kick off a discussion on how to
improve the situation; it mentions the checkpoint when --verbose is used
and adds a paragraph about the checkpoint being run to the Notes section
of the documentation.


Michael

[1]https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c9dd62c..a298e5c 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -660,6 +660,14 @@ PostgreSQL documentation
   Notes
 
   
+   At the beginning of the backup, a checkpoint needs to be written on the
+   server the backup is taken from.  Especially if the option
+   --checkpoint=fast is not used, this can take some time
+   during which pg_basebackup will be idle on the
+   server it is running on.
+  
+
+  
The backup will include all files in the data directory and tablespaces,
including the configuration files and any additional files placed in the
directory by third parties, except certain temporary files managed by
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index b6463fa..ae18c16 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1754,6 +1754,9 @@ BaseBackup(void)
 	if (maxrate > 0)
 		maxrate_clause = psprintf("MAX_RATE %u", maxrate);
 
+	if (verbose)
+		fprintf(stderr, "%s: initiating base backup, waiting for checkpoint to complete\n", progname);
+
 	basebkp =
 		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
  escaped_label,
@@ -1771,6 +1774,9 @@ BaseBackup(void)
 		disconnect_and_exit(1);
 	}
 
+	if (verbose)
+		fprintf(stderr, "%s: checkpoint completed\n", progname);
+
 	/*
 	 * Get the starting xlog position
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers