Re: [HACKERS] narwhal and PGDLLIMPORT
On Tue, Feb 4, 2014 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Amit Kapila amit.kapil...@gmail.com writes:
>> In the function where it is used, it seems to me that it is setting
>> DateStyle to ISO if it is not ISO, and configure_remote_session() will
>> set it to ISO initially. So basically even if the value of DateStyle
>> is junk, it will just do what we wanted, i.e. set it to ISO. Is the
>> failure because the code is not expecting DateStyle to be ISO, and the
>> junk value of this parameter gets it set to ISO?
>
> Meh. It might be that the DateStyle usage in postgres_fdw would
> accidentally fail to malfunction if it saw a bogus value of the
> variable. But it's hard to believe that this would be true of
> MainLWLockArray.

That's true, but for me it's failing because MainLWLockArray contains a
junk value. The point to look at now is why it passes on some of the
buildfarm machines. I will study this more; if you have anything
specific in mind, do let me know. Can I get details of the machine
environments on which it is passing, i.e. which Windows version and
which MSVC version? It might give some clue.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
Hi,

On 04/02/14 12:38, Fujii Masao wrote:
> ISTM that the phrase "Request queue" is not used much around the lock.
> Using the phrase "wait queue" or Simon's suggestion sounds better to at
> least me. Thought?

Sounds reasonable to me. Attached patch changes messages to the
following:

Process holding the lock: A. Wait queue: B.
Processes holding the lock: A, B. Wait queue: C.

Best regards,

--
 Christian Kruse               http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training Services

diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml
index 881b0c3..aa20807 100644
--- a/doc/src/sgml/sources.sgml
+++ b/doc/src/sgml/sources.sgml
@@ -251,6 +251,15 @@ ereport(ERROR,
    </listitem>
    <listitem>
     <para>
+     <function>errdetail_log_plural(const char *fmt_singular, const char
+     *fmt_plural, unsigned long n, ...)</function> is like
+     <function>errdetail_log</>, but with support for various plural forms of
+     the message.
+     For more information see <xref linkend="nls-guidelines">.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
      <function>errhint(const char *msg, ...)</function> supplies an optional
      <quote>hint</> message; this is to be used when offering suggestions
      about how to fix the problem, as opposed to factual details about
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fb449a8..9620f6e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1209,13 +1209,23 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		 */
 		if (log_lock_waits && deadlock_state != DS_NOT_YET_CHECKED)
 		{
-			StringInfoData buf;
+			StringInfoData buf,
+						lock_waiters_sbuf,
+						lock_holders_sbuf;
 			const char *modename;
 			long		secs;
 			int			usecs;
 			long		msecs;
+			SHM_QUEUE  *procLocks;
+			PROCLOCK   *proclock;
+			bool		first_holder = true,
+						first_waiter = true;
+			int			lockHoldersNum = 0;
 
 			initStringInfo(&buf);
+			initStringInfo(&lock_waiters_sbuf);
+			initStringInfo(&lock_holders_sbuf);
+
 			DescribeLockTag(&buf, &locallock->tag.lock);
 			modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
									   lockmode);
@@ -1225,10 +1235,67 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			msecs = secs * 1000 + usecs / 1000;
 			usecs = usecs % 1000;
 
+			/*
+			 * we loop over the lock's procLocks to gather a list of all
+			 * holders and waiters. Thus we will be able to provide more
+			 * detailed information for lock debugging purposes.
+			 *
+			 * lock->procLocks contains all processes which hold or wait for
+			 * this lock.
+			 */
+
+			LWLockAcquire(partitionLock, LW_SHARED);
+
+			procLocks = &(lock->procLocks);
+			proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
											offsetof(PROCLOCK, lockLink));
+
+			while (proclock)
+			{
+				/*
+				 * we are a waiter if myProc->waitProcLock == proclock; we are
+				 * a holder if it is NULL or something different
+				 */
+				if (proclock->tag.myProc->waitProcLock == proclock)
+				{
+					if (first_waiter)
+					{
+						appendStringInfo(&lock_waiters_sbuf, "%d",
+										 proclock->tag.myProc->pid);
+						first_waiter = false;
+					}
+					else
+						appendStringInfo(&lock_waiters_sbuf, ", %d",
+										 proclock->tag.myProc->pid);
+				}
+				else
+				{
+					if (first_holder)
+					{
+						appendStringInfo(&lock_holders_sbuf, "%d",
+										 proclock->tag.myProc->pid);
+						first_holder = false;
+					}
+					else
+						appendStringInfo(&lock_holders_sbuf, ", %d",
+										 proclock->tag.myProc->pid);
+
+					lockHoldersNum++;
+				}
+
+				proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
											offsetof(PROCLOCK, lockLink));
+			}
+
+			LWLockRelease(partitionLock);
+
 			if (deadlock_state == DS_SOFT_DEADLOCK)
 				ereport(LOG,
 						(errmsg("process %d avoided deadlock for %s on %s by rearranging queue order after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_log_plural("Process holding the lock: %s. Wait queue: %s.",
											   "Processes holding the lock: %s. Wait queue: %s.",
											   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 			else if (deadlock_state == DS_HARD_DEADLOCK)
 			{
 				/*
@@ -1240,13 +1307,19 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 				 */
 				ereport(LOG,
 						(errmsg("process %d detected deadlock while waiting for %s on %s after %ld.%03d ms",
-								MyProcPid, modename, buf.data, msecs, usecs)));
+								MyProcPid, modename, buf.data, msecs, usecs),
+						 (errdetail_log_plural("Process holding the lock: %s. Wait queue: %s.",
											   "Processes holding the lock: %s. Wait queue: %s.",
											   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data))));
 			}
 
 			if (myWaitStatus == STATUS_WAITING)
 				ereport(LOG,
 						(errmsg("process %d still waiting for
Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2
On 2014-02-03 17:51:20 -0800, Peter Geoghegan wrote:
> On Sun, Feb 2, 2014 at 6:00 AM, Andres Freund and...@2ndquadrant.com wrote:
>> On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote:
>>> Here are the results of a benchmark on Nathan Boley's 64-core, 4
>>> socket server:
>>> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/
>>
>> That's interesting. The maximum number of what you see here (~293125)
>> is markedly lower than what I can get.
>>
>> ... poke around ...
>>
>> Hm, that's partially because you're using pgbench without -M prepared
>> if I see that correctly. The bottleneck in that case is primarily
>> memory allocation. But even after that I am getting higher numbers:
>> ~342497.
>>
>> Trying to nail down the difference, it oddly seems to be your
>> max_connections=80 vs my 100. The profile in both cases is markedly
>> different: way more spinlock contention with 80, all in
>> Pin/UnpinBuffer().
>
> I updated this benchmark, with your BufferDescriptors alignment patch
> [1] applied on top of master (while still not using -M prepared in
> order to keep the numbers comparable). So once again, that's:
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/
>
> It made a bigger, fairly noticeable difference, but not so big a
> difference as you describe here. Are you sure that you saw this kind
> of difference with only 64 clients, as you mentioned elsewhere [1]
> (perhaps you fat-fingered [1] -- -cj is ambiguous)? Obviously
> max_connections is still 80 in the above. Should I have gone past 64
> clients to see the problem?
>
> The best numbers I see with the [1] patch applied on master is only
> ~327809 for -S 10 64 clients. Perhaps I've misunderstood.

That's likely due to -M prepared. It was with -c 64 -j 64...
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Patch: Show process IDs of processes holding a lock; show relation and tuple infos of a lock to acquire
On 4th February 2014, Christian Kruse wrote: On 04/02/14 12:38, Fujii Masao wrote: ISTM that the phrase "Request queue" is not used much around the lock. Using the phrase "wait queue" or Simon's suggestion sounds better to at least me. Thought? Sounds reasonable to me. Attached patch changes messages to the following: Process holding the lock: A. Wait queue: B. Processes holding the lock: A, B. Wait queue: C. This looks good to me also. Thanks and Regards, Kumar Rajeev Rastogi
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
Hi,

On 2014-02-04 10:23:14 +0900, Michael Paquier wrote:
> On Tue, Feb 4, 2014 at 10:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Michael Paquier michael.paqu...@gmail.com writes:
>>> Please find attached a patch implementing lsn as a datatype, based on
>>> the one Robert wrote a couple of years ago. Patch contains regression
>>> tests as well as a bit of documentation. Perhaps this is too late for
>>> 9.4, so if there are no objections I'll simply add this patch to the
>>> next commit fest in June for 9.5.
>>
>> I may have lost count, but aren't a bunch of the affected functions
>> new in 9.4? If so, there's a good argument to be made that we should
>> get this in now, rather than waiting and having an API change for
>> those functions in 9.5.
> Yes, that sounds sensible.

+ /*--------------------------------------------------------------
+  * Relational operators for LSNs
+  *--------------------------------------------------------------*/

Isn't it just "operators"? They aren't really relational...

*** 302,307 **** extern struct varlena *pg_detoast_datum_packed(struct varlena * datum);
--- 303,309 ----
  #define PG_RETURN_CHAR(x)    return CharGetDatum(x)
  #define PG_RETURN_BOOL(x)    return BoolGetDatum(x)
  #define PG_RETURN_OID(x)     return ObjectIdGetDatum(x)
+ #define PG_RETURN_LSN(x)     return LogSeqNumGetDatum(x)
  #define PG_RETURN_POINTER(x) return PointerGetDatum(x)
  #define PG_RETURN_CSTRING(x) return CStringGetDatum(x)
  #define PG_RETURN_NAME(x)    return NameGetDatum(x)

*** a/src/include/postgres.h
--- b/src/include/postgres.h
*** 484,489 **** typedef Datum *DatumPtr;
--- 484,503 ----
  #define ObjectIdGetDatum(X) ((Datum) SET_4_BYTES(X))

  /*
+  * DatumGetLogSeqNum
+  *    Returns log sequence number of a datum.
+  */
+ #define DatumGetLogSeqNum(X) ((XLogRecPtr) GET_8_BYTES(X))

I am not a fan of LogSeqNum. I think at this point fewer people
understand that than LSN. There's also no reason to invent a third term
for LSNs. We'd have LSN, XLogRecPtr, and LogSeqNum.

*** a/src/backend/replication/slotfuncs.c
--- b/src/backend/replication/slotfuncs.c
*** 141,148 **** pg_get_replication_slots(PG_FUNCTION_ARGS)
  	bool		active;
  	Oid			database;
  	const char *slot_name;
- 
- 	char		restart_lsn_s[MAXFNAMELEN];
  	int			i;
 
  	SpinLockAcquire(&slot->mutex);
--- 141,146 ----

Unrelated change.

Looks reasonable on a first look. Thanks!

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] narwhal and PGDLLIMPORT
On 2014-02-04 02:10:47 -0500, Tom Lane wrote:
> Amit Kapila amit.kapil...@gmail.com writes:
>> In the function where it is used, it seems to me that it is setting
>> DateStyle to ISO if it is not ISO, and configure_remote_session() will
>> set it to ISO initially. So basically even if the value of DateStyle
>> is junk, it will just do what we wanted, i.e. set it to ISO. Is the
>> failure because the code is not expecting DateStyle to be ISO, and the
>> junk value of this parameter gets it set to ISO?
>
> Meh. It might be that the DateStyle usage in postgres_fdw would
> accidentally fail to malfunction if it saw a bogus value of the
> variable. But it's hard to believe that this would be true of
> MainLWLockArray.

There's not that much lwlock usage in contrib. It's just
pg_stat_statements and pg_buffercache. Neither has tests... So it very
well could be that breakage simply hasn't been observed.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On Tue, Feb 4, 2014 at 6:15 PM, Andres Freund and...@2ndquadrant.com wrote:
>> + /*--------------------------------------------------------------
>> +  * Relational operators for LSNs
>> +  *--------------------------------------------------------------*/
> Isn't it just "operators"? They aren't really relational...
"Operators for LSNs"?

>> *** 302,307 **** extern struct varlena *pg_detoast_datum_packed(struct varlena * datum);
>> --- 303,309 ----
>>   #define PG_RETURN_CHAR(x)    return CharGetDatum(x)
>>   #define PG_RETURN_BOOL(x)    return BoolGetDatum(x)
>>   #define PG_RETURN_OID(x)     return ObjectIdGetDatum(x)
>> + #define PG_RETURN_LSN(x)     return LogSeqNumGetDatum(x)
>>   #define PG_RETURN_POINTER(x) return PointerGetDatum(x)
>>   #define PG_RETURN_CSTRING(x) return CStringGetDatum(x)
>>   #define PG_RETURN_NAME(x)    return NameGetDatum(x)
>>
>> *** a/src/include/postgres.h
>> --- b/src/include/postgres.h
>> *** 484,489 **** typedef Datum *DatumPtr;
>> --- 484,503 ----
>>   #define ObjectIdGetDatum(X) ((Datum) SET_4_BYTES(X))
>>
>>   /*
>> +  * DatumGetLogSeqNum
>> +  *    Returns log sequence number of a datum.
>> +  */
>> + #define DatumGetLogSeqNum(X) ((XLogRecPtr) GET_8_BYTES(X))
> I am not a fan of LogSeqNum. I think at this point fewer people
> understand that than LSN. There's also no reason to invent a third
> term for LSNs. We'd have LSN, XLogRecPtr, and LogSeqNum.
So let's go with DatumGetLSN and LSNGetDatum instead...

>> *** a/src/backend/replication/slotfuncs.c
>> --- b/src/backend/replication/slotfuncs.c
>> *** 141,148 **** pg_get_replication_slots(PG_FUNCTION_ARGS)
>>   	bool		active;
>>   	Oid			database;
>>   	const char *slot_name;
>> - 
>> - 	char		restart_lsn_s[MAXFNAMELEN];
>>   	int			i;
>>  
>>   	SpinLockAcquire(&slot->mutex);
>> --- 141,146 ----
> Unrelated change.
Funnily, the patch attached in my previous mail did not include all the
diffs; it is an error with filterdiff, which I use to generate context
diff patches... My original branch includes the following diffs as well
in slotfuncs.c for the second patch:

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 98a860e..68ecdcd 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -141,8 +141,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 	bool		active;
 	Oid			database;
 	const char *slot_name;
-
-	char		restart_lsn_s[MAXFNAMELEN];
 	int			i;
 
 	SpinLockAcquire(&slot->mutex);
@@ -164,9 +162,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 
 	memset(nulls, 0, sizeof(nulls));
 
-	snprintf(restart_lsn_s, sizeof(restart_lsn_s), "%X/%X",
-			 (uint32) (restart_lsn >> 32), (uint32) restart_lsn);
-
 	i = 0;
 	values[i++] = CStringGetTextDatum(slot_name);
 	if (database == InvalidOid)
@@ -180,7 +175,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 	else
 		nulls[i++] = true;
 	if (restart_lsn != InvalidTransactionId)
-		values[i++] = CStringGetTextDatum(restart_lsn_s);
+		values[i++] = restart_lsn;
 	else
 		nulls[i++] = true;

Anything else?
--
Michael
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On 2014-02-04 19:17:51 +0900, Michael Paquier wrote:
> On Tue, Feb 4, 2014 at 6:15 PM, Andres Freund and...@2ndquadrant.com wrote:
>>> + /*--------------------------------------------------------------
>>> +  * Relational operators for LSNs
>>> +  *--------------------------------------------------------------*/
>> Isn't it just "operators"? They aren't really relational...
> "Operators for LSNs"?

Fine with me.

>>> + #define DatumGetLogSeqNum(X) ((XLogRecPtr) GET_8_BYTES(X))
>> I am not a fan of LogSeqNum. I think at this point fewer people
>> understand that than LSN. There's also no reason to invent a third
>> term for LSNs. We'd have LSN, XLogRecPtr, and LogSeqNum.
> So let's go with DatumGetLSN and LSNGetDatum instead...

Sup.

>>> *** a/src/backend/replication/slotfuncs.c
>>> --- b/src/backend/replication/slotfuncs.c
>>> *** 141,148 **** pg_get_replication_slots(PG_FUNCTION_ARGS)
>>>   	bool		active;
>>>   	Oid			database;
>>>   	const char *slot_name;
>>> - 
>>> - 	char		restart_lsn_s[MAXFNAMELEN];
>>>   	int			i;
>>>  
>>>   	SpinLockAcquire(&slot->mutex);
>>> --- 141,146 ----
>> Unrelated change.
> Funnily, the patch attached in my previous mail did not include all
> the diffs, it is an error with filterdiff that I use to generate
> context diff patches... My original branch includes the following

Ah, then it makes more sense.

> diffs as well in slotfuncs.c for the second patch:
>
> diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
> index 98a860e..68ecdcd 100644
> --- a/src/backend/replication/slotfuncs.c
> +++ b/src/backend/replication/slotfuncs.c
> @@ -141,8 +141,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>  	bool		active;
>  	Oid			database;
>  	const char *slot_name;
> -
> -	char		restart_lsn_s[MAXFNAMELEN];
>  	int			i;
>  
>  	SpinLockAcquire(&slot->mutex);
> @@ -164,9 +162,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>  
>  	memset(nulls, 0, sizeof(nulls));
>  
> -	snprintf(restart_lsn_s, sizeof(restart_lsn_s), "%X/%X",
> -			 (uint32) (restart_lsn >> 32), (uint32) restart_lsn);
> -
>  	i = 0;
>  	values[i++] = CStringGetTextDatum(slot_name);
>  	if (database == InvalidOid)
> @@ -180,7 +175,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>  	else
>  		nulls[i++] = true;
>  	if (restart_lsn != InvalidTransactionId)
> -		values[i++] = CStringGetTextDatum(restart_lsn_s);
> +		values[i++] = restart_lsn;
>  	else
>  		nulls[i++] = true;

Isn't that missing a LSNGetDatum()?

Also, isn't it lacking the corresponding pg_proc change?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
I'm sorry, I'm replying to an older mail, because I lost your latest mail by mistake. Ah. Sorry, I missed that part. As NTFS junctions and symbolic links are different (although they behave similarly), there seems only a minor inconvenience related to misleading error message i.e. You are right. Fixed. Regards MauMau remove_tblspc_symlink_v4.patch Description: Binary data
Re: [HACKERS] [bug fix] postgres.exe fails to start on Windows Server 2012 due to ASLR
From: Craig Ringer cr...@2ndquadrant.com I completely agree; just saying that any installer can set the key. I'm convinced that setting this flag is appropriate, at least while Pg relies on having the shared memory segment mapped in the same zone in every executable. Just pointing out that there's a workaround in the mean time. Please don't mind, I didn't misunderstand your intent. I think we should apply this in the next minor release to avoid unnecessary confusion -- more new users would use PostgreSQL on Windows 8/2012 and hit this problem. I added this patch to the CommitFest. Regards MauMau
Re: [HACKERS] inherit support for foreign tables
On Sun, Feb 2, 2014 at 10:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: Allowing ALTER COLUMN SET STORAGE on foreign tables would make sense if for example, SELECT * INTO local_table FROM foreign_table did create a new local table of columns having the storage types associated with those of a foreign table? Seems like a pretty weak argument. It's not that we can't find strange corner cases where applying SET STORAGE to a foreign table does something; it's that they *are* strange corner cases. The options as we normally understand them just aren't sensible in this context, and a good deal of work has been put into an alternative options framework, which is what authors of FDWs ought to be using. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [bug fix] postgres.exe fails to start on Windows Server 2012 due to ASLR
On 02/04/2014 07:28 PM, MauMau wrote: Please don't mind, I didn't misunderstand your intent. I think we should apply this in the next minor release to avoid unnecessary confusion -- more new users would use PostgreSQL on Windows 8/2012 and hit this problem. I added this patch to the CommitFest. It's really a bugfix suitable for backpatching IMO. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On Tue, Feb 4, 2014 at 7:22 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-02-04 19:17:51 +0900, Michael Paquier wrote:
>> @@ -180,7 +175,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>>  	else
>>  		nulls[i++] = true;
>>  	if (restart_lsn != InvalidTransactionId)
>> -		values[i++] = CStringGetTextDatum(restart_lsn_s);
>> +		values[i++] = restart_lsn;
>>  	else
>>  		nulls[i++] = true;
> Isn't that missing a LSNGetDatum()?

Oops yes. Will fix.

> Also, isn't it lacking the corresponding pg_proc change?

restart_lsn is the 6th argument of pg_get_replication_slots, and the
list of arguments of this function is already changed like that in my
patch: {25,25,26,16,28,25} => {25,25,26,16,28,3220}

Regards,
--
Michael
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On 2014-02-04 21:04:13 +0900, Michael Paquier wrote: On Tue, Feb 4, 2014 at 7:22 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-02-04 19:17:51 +0900, Michael Paquier wrote: @@ -180,7 +175,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; if (restart_lsn != InvalidTransactionId) - values[i++] = CStringGetTextDatum(restart_lsn_s); + values[i++] = restart_lsn; else nulls[i++] = true; Isn't that missing a LSNGetDatum()? Oops yes. Will fix. Also, isn't it lacking the corresponding pg_proc change? restart_lsn is the 6th argument of pg_get_replication_slots, and the list of arguments of this function is already changed like that in my patch: {25,25,26,16,28,25} = {25,25,26,16,28,3220} Regards, Ok. I think the patch should also adapt pageinspect... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On 2014-02-04 00:38:19 +0100, Andres Freund wrote: A quick hack (attached) making BufferDescriptor 64byte aligned indeed restored performance across all max_connections settings. It's not surprising that a misaligned buffer descriptor causes problems - there'll be plenty of false sharing of the spinlocks otherwise. Curious that the intel machine isn't hurt much by this. I think that is explained here: http://www.agner.org/optimize/blog/read.php?i=142&v=t With Sandy Bridge, "Misaligned memory operands [are] handled efficiently". No, I don't think so. Those improvements afair refer to unaligned accesses as in accessing a 4 byte variable at address % 4 != 0. So, Christian did some benchmarking on the intel machine, and his results were also lower than mine, and I've since confirmed that it's also possible to reproduce the alignment problems on the intel machine. Which imo means fixing this got more important... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] specifying repeatable read in PGOPTIONS
Hi,

I recently had the need to bury the used isolation level in the connection string, but it turns out that doesn't work that well...

PGOPTIONS='-c default_transaction_isolation=serializable' \
    psql ... -c "SHOW default_transaction_isolation"

works well enough, but

PGOPTIONS='-c default_transaction_isolation=repeatable read' \
    psql ... -c "SHOW default_transaction_isolation"

doesn't, because of the whitespace. I couldn't come up with any adequate quoting.

I'd like to propose adding aliases with dashes instead of spaces to the isolation_level_options array. I'd even like to backport it, because it makes benchmarking across versions unnecessarily hard.

Additionally we might want to think about a bit better quoting support for such options?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] could not create IPv6 socket (AI_ADDRCONFIG)
Tom Lane wrote: Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Hello, I have often seen inquiries about a log message from the PostgreSQL server. LOG: could not create IPv6 socket: Address family not supported by protocol That's merely a harmless log message. If we're concerned about users worrying about log messages from this, I'd rather see us downgrade those log messages to DEBUG level than risk breaking the code with behaviors that were proven to be a bad idea a decade ago. But TBH I see no strong need to do anything here. How about just adding a HINT? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] could not create IPv6 socket (AI_ADDRCONFIG)
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Hello, I have often seen inquiries about a log message from the PostgreSQL server. LOG: could not create IPv6 socket: Address family not supported by protocol That's merely a harmless log message. How about just adding a HINT? Hmm ... maybe, but how would you phrase the hint exactly? regards, tom lane
Re: [HACKERS] narwhal and PGDLLIMPORT
Andres Freund and...@2ndquadrant.com writes: On 2014-02-04 02:10:47 -0500, Tom Lane wrote: Meh. It might be that the DateStyle usage in postgres_fdw would accidentally fail to malfunction if it saw a bogus value of the variable. But it's hard to believe that this would be true of MainLWLockArray. There's not that much lwlock usage in contrib. It's just pg_stat_statements and pg_buffercache. Neither has tests... So it very well could be that breakage simply hasn't been observed. Hm, you're right --- I'd have thought there were more of those. Ugh. This problem was bad enough when I thought that it would only lead to link-time errors detectable in the buildfarm. If it can lead to errors only observable at runtime --- and maybe not obvious even then --- then I think we *have to* do something about it. By that I mean that we must get rid of the need to manually plaster PGDLLIMPORT on global variables. Anybody with a Windows build environment want to test the #define extern trick? regards, tom lane
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 10:43 AM, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-02-04 02:10:47 -0500, Tom Lane wrote: Meh. It might be that the DateStyle usage in postgres_fdw would accidentally fail to malfunction if it saw a bogus value of the variable. But it's hard to believe that this would be true of MainLWLockArray. There's not that much lwlock usage in contrib. It's just pg_stat_statements and pg_buffercache. Neither has tests... So it very well could be that breakage simply hasn't been observed. Hm, you're right --- I'd have thought there were more of those. Ugh. This problem was bad enough when I thought that it would only lead to link-time errors detectable in the buildfarm. If it can lead to errors only observable at runtime --- and maybe not obvious even then --- then I think we *have to* do something about it. By that I mean that we must get rid of the need to manually plaster PGDLLIMPORT on global variables. Anybody with a Windows build environment want to test the #define extern trick? We have details on how to build with Mingw/Msys on Windows on an Amazon VM http://wiki.postgresql.org/wiki/Building_With_MinGW which is either free or very cheap. Do I need to give instructions on how to do this for MSVC builds too? It's really not terribly hard. cheers andrew
Re: [HACKERS] [doc patch] extra_float_digits and casting from real to numeric
Re: To Tom Lane 2014-01-08 <20140108094017.ga20...@msgid.df7cb.de>
> What about this patch to mention this gotcha more explicitly in the
> documentation?
>
> diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
> new file mode 100644
> index 0386330..968f4a7
> *** a/doc/src/sgml/datatype.sgml
> --- b/doc/src/sgml/datatype.sgml
> *** NUMERIC
> *** 689,694 ****
> --- 689,697 ----
>        <literal>0</literal>, the output is the same on every platform
>        supported by PostgreSQL. Increasing it will produce output that
>        more accurately represents the stored value, but may be unportable.
> +      Casts to other numeric data types and the <literal>to_char</literal>
> +      function are not affected by this setting; it affects only the text
> +      representation.
>       </para>
>      </note>

Anyone for that patch?

Christoph
--
c...@df7cb.de | http://www.df7cb.de/
Re: [HACKERS] specifying repeatable read in PGOPTIONS
Andres Freund and...@2ndquadrant.com writes: PGOPTIONS='-c default_transaction_isolation=serializable' \ psql ... -c SHOW default_transaction_isolation works well enough, but PGOPTIONS='-c default_transaction_isolation=repeatable read' \ psql ... -c SHOW default_transaction_isolation doesn't, because of the whitespace. I couldn't come up with any adequate quoting. I'd like to propose adding aliases with dashes instead of spaces to the isolation_level_options array? I'd even like to backport it, because it makes benchmarking across versions unneccessarily hard. -1. This is not a general solution to the problem. There are other GUCs for which people might want spaces in the value. Additionally we might want to think about a bit better quoting support for such options? Yeah. See pg_split_opts(), which explicitly acknowledges that it'll fall down for space-containing options. Not sure what the most appropriate quoting convention would be there, but I'm sure we can think of something. regards, tom lane
Re: [HACKERS] narwhal and PGDLLIMPORT
On February 4, 2014 5:06:52 PM CET, Andrew Dunstan and...@dunslane.net wrote: On 02/04/2014 10:43 AM, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-02-04 02:10:47 -0500, Tom Lane wrote: Meh. It might be that the DateStyle usage in postgres_fdw would accidentally fail to malfunction if it saw a bogus value of the variable. But it's hard to believe that this would be true of MainLWLockArray. There's not that much lwlock usage in contrib. It's just pg_stat_statements and pg_buffercache. Neither has tests... So it very well could be that breakage simply hasn't been observed. Hm, you're right --- I'd have thought there were more of those. Ugh. This problem was bad enough when I thought that it would only lead to link-time errors detectable in the buildfarm. If it can lead to errors only observable at runtime --- and maybe not obvious even then --- then I think we *have to* do something about it. By that I mean that we must get rid of the need to manually plaster PGDLLIMPORT on global variables. Anybody with a Windows build environment want to test the #define extern trick? We have details on how to build with Mingw/Msys on Windows on an Amazon VM http://wiki.postgresql.org/wiki/Building_With_MinGW which is either free or very cheap. Do I need to give instructions on how to do this for MSVC builds too? It's really not terribly hard. Err. It might not be very hard but it certainly is time consuming. And that for people not caring about Windows. If there were usable, regularly refreshed instances out there, it'd be slightly less bad. But this is still by far the most annoying and intrusive platform to care about. Andres -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] nested hstore - large insert crashes server
CentOS Release 6.5 (final)
AMD FX(tm)-8120 Eight-Core
2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
memory: 8GB

I am testing nested hstore, on a server built with both these patches:
  jsonb-9.patch.gz
  nested-hstore-9.patch.gz

One of the first tries brings down the server (gobbles up too much memory, I think). When I run:

  select version();
  drop table if exists t_1000;
  create table t_1000 (
    id  serial,
    hs  hstore primary key,
    hs2 hstore
  );
  insert into t_1000 (hs,hs2)
  select
    ( '[' || i || ',' || i || ']' )::hstore,
    ( '{' || i || '=>' || i || '}' )::hstore
  from generate_series(1, 1000) as f(i);

I get:

  $ time psql -af nestedhs.sql
  \timing on
  Timing is on.
  select version();
                                  version
  ------------------------------------------------------------------------
   PostgreSQL 9.4devel_nested_hstore_20140204_0814_00d4f2a on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2, 64-bit
  (1 row)

  Time: 1.288 ms
  drop table if exists t_1000;
  DROP TABLE
  Time: 0.808 ms
  create table t_1000 ( id serial, hs hstore primary key, hs2 hstore );
  CREATE TABLE
  Time: 111.397 ms
  insert into t_1000 (hs,hs2) select ( '[' || i || ',' || i || ']' )::hstore, ( '{' || i || '=>' || i || '}' )::hstore from generate_series(1, 1000) as f(i);
  psql:nestedhs.sql:14: connection to server was lost

  real  5m4.780s
  user  0m0.005s
  sys   0m0.009s

logging:

  2014-02-04 10:34:25.376 CET 29133 LOG:  server process (PID 29459) was terminated by signal 9: Killed
  2014-02-04 10:34:25.854 CET 29133 DETAIL:  Failed process was running: insert into t_1000 (hs,hs2) select ( '[' || i || ',' || i || ']' )::hstore, ( '{' || i || '=>' || i || '}' )::hstore from generate_series(1, 1000) as f(i);
  2014-02-04 10:34:25.884 CET 29133 LOG:  terminating any other active server processes
  2014-02-04 10:34:28.541 CET 29133 LOG:  all server processes terminated; reinitializing
  2014-02-04 10:34:30.002 CET 29534 LOG:  database system was interrupted; last known up at 2014-02-04 10:30:42 CET
  2014-02-04 10:34:30.933 CET 29535 FATAL:  the database system is in recovery mode
  2014-02-04 10:34:31.150 CET 29534 LOG:  database system was not properly shut down; automatic recovery in progress
  2014-02-04 10:34:31.344 CET 29534 LOG:  redo starts at 1/B1CC92F8
  2014-02-04 10:34:46.681 CET 29534 LOG:  unexpected pageaddr 1/86F4A000 in log segment 0000000100000001000000CC, offset 16031744
  2014-02-04 10:34:46.681 CET 29534 LOG:  redo done at 1/CCF49F50
  2014-02-04 10:34:52.039 CET 29133 LOG:  database system is ready to accept connections

(and btw, I end up with a large but unusable table:

  testdb=# \dt+ t_1000
                       List of relations
   Schema |  Name  | Type  |  Owner   |  Size  | Description
  --------+--------+-------+----------+--------+-------------
   public | t_1000 | table | aardvark | 291 MB |
  (1 row)

  testdb=# select count(*) from t_1000;
   count
  -------
       0
  (1 row)
)

Main .conf settings:

           setting          |               current_setting
  --------------------------+-----------------------------------------------
   autovacuum               | off
   port                     | 6541
   shared_buffers           | 512MB
   effective_cache_size     | 2GB
   work_mem                 | 50MB
   maintenance_work_mem     | 1GB
   checkpoint_segments      | 20
   server_version           | 9.4devel_nested_hstore_20140204_0814_00d4f2a
   pg_postmaster_start_time | 2014-02-04 10:12 (uptime: 0d 0h 33m 42s)
   data_checksums           | off

   feature_id |    feature_name    | is_supported | is_verified_by | comments
  ------------+--------------------+--------------+----------------+----------
   PKG100     | project name       | YES          | ej             | nested_hstore
   PKG101     | patched            | YES          | ej             | YES
   PKG102     | patch file         | YES          | ej             | /home/aardvark/download/pgpatches/0094/nested_hstore/20140130/jsonb-9.patch +
              |                    |              |                | /home/aardvark/download/pgpatches/0094/nested_hstore/20140130/nested-hstore-9.patch
   PKG103     | build time         | YES          | ej             | 2014-02-04 08:19:13.600371+01
   PKG104     | server_version     | YES          | ej             | 9.4devel_nested_hstore_20140204_0814_00d4f2a
   PKG105     | server_version_num | YES          | ej             | 90400
   PKG106     | port               | YES          | ej             | 6541
   PKG110     | commit hash        | YES          | ej
Re: [HACKERS] specifying repeatable read in PGOPTIONS
On 2014-02-04 11:36:22 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes:

  PGOPTIONS='-c default_transaction_isolation=serializable' \
    psql ... -c "SHOW default_transaction_isolation"

works well enough, but

  PGOPTIONS='-c default_transaction_isolation=repeatable read' \
    psql ... -c "SHOW default_transaction_isolation"

doesn't, because of the whitespace. I couldn't come up with any adequate quoting. I'd like to propose adding aliases with dashes instead of spaces to the isolation_level_options array? I'd even like to backport it, because it makes benchmarking across versions unnecessarily hard. -1. This is not a general solution to the problem. There are other GUCs for which people might want spaces in the value. Sure, I didn't say it was. But I don't see any other values that are likely being passed via PGOPTIONS that frequently contain spaces. Sure, you can generate a search_path that does so, but that's just asking for problems. Most other GUCs that can contain spaces are PGC_SIGHUP/POSTMASTER. And having to use quoting just makes it awkward to use from shell. Since all the other option values take care not to force using spaces, I see little reason not to do so here as well. Additionally we might want to think about a bit better quoting support for such options? Yeah. See pg_split_opts(), which explicitly acknowledges that it'll fall down for space-containing options. Not sure what the most appropriate quoting convention would be there, but I'm sure we can think of something. No argument against introducing it. What about simply allowing escaping of the next character using \? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extension_control_path
On Jan 30, 2014, at 10:06 AM, Sergey Muraviov sergey.k.murav...@gmail.com wrote: Now it looks fine for me. Just as another data point, I recently submitted pgTAP to the Homebrew project. This is the build-from-source system for OS X, used by a lot of web developers. In my build script, I originally had

  depends_on :postgresql

Which means, “require any version of PostgreSQL.” But then tests failed on OS X Server, which includes a system-distributed PostgreSQL. Homebrew installs everything in /usr/local, and not only does it disallow installing anything outside of that directory, it doesn’t have any permissions to do so. The install failed, of course, because extensions want to install in $PGROOT/share/extensions. For now, I had to change it to

  depends_on 'postgresql'

A subtle difference that means, “require the latest version of the Homebrew-built PostgreSQL in /usr/local.” However, if extension_control_path were supported, I could change it back to requiring any Postgres and install pgTAP somewhere under /usr/local, as required for Homebrew. Then all the user would have to do to use it with their preferred Postgres would be to set extension_control_path. In other words, I am strongly in favor of this patch, as it gives distribution systems a lot more flexibility (for better and for worse) in determining where extensions should be installed. My $0.02. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
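For illustration, with such a patch applied the Homebrew scenario might reduce to a single postgresql.conf entry along these lines. The GUC name comes from the patch under review, but the value syntax and search semantics shown here are assumptions, not the patch's confirmed behavior:

```
# hypothetical postgresql.conf entry, assuming the extension_control_path
# patch is applied; the path shown and the lookup semantics are assumptions
extension_control_path = '/usr/local/share/postgresql/extension'
```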
Re: [HACKERS] specifying repeatable read in PGOPTIONS
Andres Freund and...@2ndquadrant.com writes: On 2014-02-04 11:36:22 -0500, Tom Lane wrote: -1. This is not a general solution to the problem. There are other GUCs for which people might want spaces in the value. Sure, I didn't say it was. But I don't see any oother values that are likely being passed via PGOPTIONS that frequently contain spaces. application_name --- weren't we just reading about people passing entire command lines there? (They must be using some other way of setting it currently, but PGOPTIONS doesn't seem like an implausible source.) Yeah. See pg_split_opts(), which explicitly acknowledges that it'll fall down for space-containing options. Not sure what the most appropriate quoting convention would be there, but I'm sure we can think of something. No argument against introducing it. What about simply allowing escaping of the next character using \? The same thought had occurred to me. Since it'll typically already be inside some levels of quoting, any quoted-string convention seems like it'd be a pain to use. But a straight backslash-escapes-the-next-char thing wouldn't be too awful, I think. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 11:30 AM, Andres Freund wrote: We have details on how to build with Mingw/Msys on Windows on an Amazon VM http://wiki.postgresql.org/wiki/Building_With_MinGW which is either free or very cheap. Do I need to give instructions on how to do this for MSVC builds too? It's really not terribly hard. Err. It might not be very hard but it certainly is time consuming. And that for people not caring about Windows. If there were usable, regularly refreshed instances out there, it'd be slightly less bad. But this is still by far the most annoying and intrusive platform to care about. If someone volunteered to pay for the storage, I'd be prepared to make some time to create an AMI to reduce the startup time dramatically. Basically it would be: boot the AMI and start testing your patches. I'd even make it as friendly as possible for people who don't like to get too far from unix-ish environments. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
Andrew Dunstan and...@dunslane.net writes: If someone volunteered to pay for the storage, I'd be prepared to make some time to create an AMI to reduce the startup time dramatically. Basically it would be boot the AMI and start testing your patches. I'd even make it as friendly as possible for people who don't like to get too far from unix-ish environments. My own opinion is that I've already wasted untold man-hours thanks to the random porting problems induced by Windows, a platform that I never have and never will care about personally. I will *not* spend my own time doing tests that someone else could do. If we can't get some effort contributed by someone who does use that platform, I'm personally prepared to declare the entire damn thing no longer supported. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 09:34 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: If someone volunteered to pay for the storage, I'd be prepared to make some time to create an AMI to reduce the startup time dramatically. Basically it would be boot the AMI and start testing your patches. I'd even make it as friendly as possible for people who don't like to get too far from unix-ish environments. My own opinion is that I've already wasted untold man-hours thanks to the random porting problems induced by Windows, a platform that I never have and never will care about personally. I will *not* spend my own time doing tests that someone else could do. If we can't get some effort contributed by someone who does use that platform, I'm personally prepared to declare the entire damn thing no longer supported. Although that is obviously your prerogative, it is important to remember that Windows is easily the second most used version of PostgreSQL out there (behind Linux). I know many people that run it on Windows, especially in the medical field. I also know many people that embed it (Apple notwithstanding). Yes it is an obnoxious platform but it is THE platform, no matter how much we like to tout our Linux(Hiker) creds. Sincerely, JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On Tue, Feb 4, 2014 at 8:06 AM, Andrew Dunstan and...@dunslane.net wrote: On 02/04/2014 10:43 AM, Tom Lane wrote: Ugh. This problem was bad enough when I thought that it would only lead to link-time errors detectable in the buildfarm. If it can lead to errors only observable at runtime --- and maybe not obvious even then --- then I think we *have to* do something about it. By that I mean that we must get rid of the need to manually plaster PGDLLIMPORT on global variables. Anybody with a Windows build environment want to test the #define extern trick? We have details on how to build with Mingw/Msys on Windows on an Amazon VM http://wiki.postgresql.org/wiki/Building_With_MinGW which is either free or very cheap. Do I need to give instructions on how to do this for MSVC builds too? It's really not terribly hard. If you gave step-by-step instructions like the ones for MinGW, I would at least give it a try. Last time I looked into it, I gave up after I couldn't figure out which of the umpteen similarly-named products was the one I needed to buy/download-for-free/register-and-download and then install, and I tried a few of them at random without much success. Cheers, Jeff
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for "hundred tiny fields, half nulled" case I don't want to advocate too strongly for this patch because, number one, Amit is a colleague and, more importantly, number two, I can't claim to be an expert on compression. But that having been said, I think these numbers are starting to look awfully good. The only remaining regressions are in the cases where a large fraction of the tuple turns over, and they're not that big even then. The two *worst* tests now seem to be "hundred tiny fields, all changed" and "hundred tiny fields, half changed". For the all-changed case, the median unpatched time is 16.3172590732574 and the median patched time is 16.9294109344482, a 4% loss; for the half-changed case, the median unpatched time is 16.5795118808746 and the median patched time is 17.0454230308533, a 3% loss. Both cases show minimal change in WAL volume. Meanwhile, in friendlier cases, like "one short and one long field, no change", we're seeing big improvements. That particular case shows a speedup of 21% and a WAL reduction of 36%. That's a pretty big deal, and I think not unrepresentative of many real-world workloads. Some might well do better, having either more or longer unchanged fields. Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] nested hstore - large insert crashes server
Hi, On 04/02/14 17:41, Erik Rijkers wrote: 2014-02-04 10:34:25.376 CET 29133 LOG: server process (PID 29459) was terminated by signal 9: Killed Did you check if this was the OOM killer? Should be logged in dmesg. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] nested hstore - large insert crashes server
On Tue, February 4, 2014 18:56, Christian Kruse wrote: Hi, On 04/02/14 17:41, Erik Rijkers wrote: 2014-02-04 10:34:25.376 CET 29133 LOG: server process (PID 29459) was terminated by signal 9: Killed Did you check if this was the OOM killer? Should be logged in dmesg. I would be surprised if it wasn't. (no access to that machine at the moment) How do we regard such crashes? It seems to me this was rather easily 'provoked' (for want of a better word). I am inclined to blame the patch... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
Joshua D. Drake j...@commandprompt.com writes: On 02/04/2014 09:34 AM, Tom Lane wrote: My own opinion is that I've already wasted untold man-hours thanks to the random porting problems induced by Windows, a platform that I never have and never will care about personally. I will *not* spend my own time doing tests that someone else could do. If we can't get some effort contributed by someone who does use that platform, I'm personally prepared to declare the entire damn thing no longer supported. Although that is obviously your prerogative it is important to remember that Windows is easily the second most used version of PostgreSQL out there (behind Linux). [ shrug... ] If it's so widely used, why is it so hard to find somebody who's willing to put in some development effort for it? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 01:28:38PM -0500, Robert Haas wrote: Meanwhile, in friendlier cases, like one short and one long field, no change, we're seeing big improvements. That particular case shows a speedup of 21% and a WAL reduction of 36%. That's a pretty big deal, and I think not unrepresentative of many real-world workloads. Some might well do better, having either more or longer unchanged fields. Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. Does this feature relate to compression of WAL page images at all? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2014-02-04 14:09:57 -0500, Bruce Momjian wrote: On Tue, Feb 4, 2014 at 01:28:38PM -0500, Robert Haas wrote: Meanwhile, in friendlier cases, like one short and one long field, no change, we're seeing big improvements. That particular case shows a speedup of 21% and a WAL reduction of 36%. That's a pretty big deal, and I think not unrepresentative of many real-world workloads. Some might well do better, having either more or longer unchanged fields. Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. Does this feature relate to compression of WAL page images at all? No. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PQputCopyData dont signal error
On Thu, 2011-04-14 at 10:50 +0300, Heikki Linnakangas wrote: On 14.04.2011 10:15, Pavel Stehule wrote: Hello I have a problem with PQputCopyData function. It doesn't signal some error.

    while ((row = mysql_fetch_row(res)) != NULL)
    {
        snprintf(buffer, sizeof(buffer), "%s%s\n", row[0], row[1]);
        copy_result = PQputCopyData(pconn, buffer, strlen(buffer));
        printf("%s\n", PQerrorMessage(pconn));
        printf("%d\n", copy_result);
        if (copy_result != 1)
        {
            fprintf(stderr, "Copy to target table failed: %s", PQerrorMessage(pconn));
            EXIT;
        }
    }

it returns 1 for broken values too :( Is some special check necessary? The way COPY works is that PQputCopyData just sends the data to the server, and the server will buffer it in its internal buffer and processes it when it feels like it. The PQputCopyData() calls don't even need to match line boundaries. I think you'll need to send all the data and finish the COPY until you get an error. If you have a lot of data to send, you might want to slice it into multiple COPY statements of, say, 50MB each, so that you can catch errors in between. [ replying to old thread ] According to the protocol docs[1]: In the event of a backend-detected error during copy-in mode (including receipt of a CopyFail message), the backend will issue an ErrorResponse message. If the COPY command was issued via an extended-query message, the backend will now discard frontend messages until a Sync message is received, then it will issue ReadyForQuery and return to normal processing. If the COPY command was issued in a simple Query message, the rest of that message is discarded and ReadyForQuery is issued. In either case, any subsequent CopyData, CopyDone, or CopyFail messages issued by the frontend will simply be dropped. If the remaining CopyData messages are dropped, I don't see why PQputCopyData can't return some kind of error indicating that further CopyData messages are useless so that it can stop sending them.
Asking the client to break the copy into multiple COPY commands is bad, because then the client needs to figure out the line breaks, which is a burden in many cases. Certainly we don't want to *guarantee* that the backend will issue an error at any particular point, because of the buffering on the server side. But from a practical standpoint, the server will let the client know fairly quickly and it will avoid a lot of client-side work and network traffic. Would a change to PQputCopyData be welcome? Regards, Jeff Davis [1] http://www.postgresql.org/docs/9.3/static/protocol-flow.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
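For reference, the pattern Heikki describes — push all the data, then finish the COPY and inspect the final result — looks roughly like this with today's libpq. This is a sketch only: it needs a live connection and -lpq to run, and the table name and generated rows are placeholders, not anything from the thread.

```c
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Send generated rows with COPY FROM STDIN; server-side data errors only
 * become visible after PQputCopyEnd(), via the final command result. */
static int
copy_rows(PGconn *conn)
{
    PGresult *res;
    int ok = 1;

    res = PQexec(conn, "COPY target FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "COPY failed to start: %s", PQerrorMessage(conn));
        PQclear(res);
        return 0;
    }
    PQclear(res);

    for (int i = 0; i < 1000; i++)
    {
        char buf[64];
        int len = snprintf(buf, sizeof(buf), "%d\t%d\n", i, i * 2);

        /* Returns -1 only on connection-level trouble, not on bad data. */
        if (PQputCopyData(conn, buf, len) != 1)
        {
            ok = 0;
            break;
        }
    }

    /* Finish the COPY; a backend-detected error surfaces here. */
    if (PQputCopyEnd(conn, NULL) != 1)
        ok = 0;

    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
            ok = 0;
        }
        PQclear(res);
    }
    return ok;
}
```

The proposal in this thread is about letting the loop learn of that failure before PQputCopyEnd(), so the client can stop generating and sending data early.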
Re: [HACKERS] nested hstore - large insert crashes server
Erik Rijkers e...@xs4all.nl writes: On Tue, February 4, 2014 18:56, Christian Kruse wrote: Did you check if this was the OOM killer? Should be logged in dmesg. I would be surprised if it wasn't. (no access to that machine at the moment) How do we regard such crashes? It seems to me this was rather eaasily 'provoked' (for want of a better word). Well, it suggests that there may be a memory leak, which would be a bug even though we'd assign it lower priority than a true crash. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 10:53 AM, Tom Lane wrote: Joshua D. Drake j...@commandprompt.com writes: On 02/04/2014 09:34 AM, Tom Lane wrote: My own opinion is that I've already wasted untold man-hours thanks to the random porting problems induced by Windows, a platform that I never have and never will care about personally. I will *not* spend my own time doing tests that someone else could do. If we can't get some effort contributed by someone who does use that platform, I'm personally prepared to declare the entire damn thing no longer supported. Although that is obviously your prerogative it is important to remember that Windows is easily the second most used version of PostgreSQL out there (behind Linux). [ shrug... ] If it's so widely used, why is it so hard to find somebody who's willing to put in some development effort for it? Well, from what I have seen of the Windows world there is a fairly sharp demarcation between users and developers. So use does not necessarily mean developer knowledge. regards, tom lane -- Adrian Klaver adrian.kla...@gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 01:53 PM, Tom Lane wrote: Joshua D. Drake j...@commandprompt.com writes: On 02/04/2014 09:34 AM, Tom Lane wrote: My own opinion is that I've already wasted untold man-hours thanks to the random porting problems induced by Windows, a platform that I never have and never will care about personally. I will *not* spend my own time doing tests that someone else could do. If we can't get some effort contributed by someone who does use that platform, I'm personally prepared to declare the entire damn thing no longer supported. Although that is obviously your prerogative it is important to remember that Windows is easily the second most used version of PostgreSQL out there (behind Linux). [ shrug... ] If it's so widely used, why is it so hard to find somebody who's willing to put in some development effort for it? I suspect that the open hostility towards it doesn't help much. I do put in quite a lot of effort supporting it one way and another, committing patches and running buildfarm members among other things. And we have several others who put in significant amounts of effort. And no, it's not my platform of choice either. Anyway, I will explore my suggestion upthread of a pre-prepared Windows development AMI, when I have time. Right now between work for actual money and things like jsonb I am slammed. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] nested hstore - large insert crashes server
* Erik Rijkers (e...@xs4all.nl) wrote: On Tue, February 4, 2014 18:56, Christian Kruse wrote: On 04/02/14 17:41, Erik Rijkers wrote: 2014-02-04 10:34:25.376 CET 29133 LOG: server process (PID 29459) was terminated by signal 9: Killed Did you check if this was the OOM killer? Should be logged in dmesg. I would be surprised if it wasn't. (no access to that machine at the moment) How do we regard such crashes? It seems to me this was rather eaasily 'provoked' (for want of a better word). I am inclined to blame the patch... It sounds like there is at least some investigation which should happen here to see why we're using so much memory (well beyond work_mem and even maint_work_mem it sounds like), but it would also be good to have the machine reconfigured to not allow OOM killing. Thanks, Stephen
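On the reconfiguration point: the usual Linux recipe (documented in the "Linux Memory Overcommit" section of the PostgreSQL docs) is to confirm the OOM killer in the kernel log and then disable overcommit, so the backend gets an ENOMEM it can report instead of a SIGKILL:

```
# confirm it was the OOM killer:
#   dmesg | grep -i -E 'out of memory|killed process'

# /etc/sysctl.conf (apply with: sysctl -p)
vm.overcommit_memory = 2
```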
Re: [HACKERS] PQputCopyData dont signal error
Jeff Davis pg...@j-davis.com writes: On Thu, 2011-04-14 at 10:50 +0300, Heikki Linnakangas wrote: I think you'll need to send all the data and finish the COPY until you get an error. If you have a lot of data to send, you might want to slice it into multiple COPY statements of say 50MB each, so that you can catch errors in between. If the remaining CopyData messages are dropped, I don't see why PQputCopyData can't return some kind of error indicating that further CopyData messages are useless so that it can stop sending them. An error from PQputCopyData indicates a connection-level problem, typically; that is it means that it couldn't send data not that the server had detected a problem with some earlier data. I'd not be surprised if applications would respond to such an error indication by just pre-emptively dropping the connection. (A quick trawl through our own code shows a precedent in pg_restore, in fact.) So the idea of having PQputCopyData start failing as soon as an error has arrived from the server doesn't sound that attractive. What'd be safer is to provide a way of detecting whether an error has arrived (without actually consuming it, of course) so that the case could be handled by adding something like if (PQerrorIsPending(conn)) break; to the send-data loop. This would allow the application code to know what is happening. It would also not impose the penalty of checking for errors on apps that prefer to optimize for the no-error case. A different approach would be for PQputCopyData to just internally suppress sending of further CopyData messages, *without* returning an error; this would be transparent and it would save the network traffic at least. However, my reading of the original example is that the OP was at least as interested in suppressing the creation of further data as in suppressing the sending, so this wouldn't really solve his problem completely. 
And it's also optimizing for the error case not the no-error case, in a way that doesn't give the app any say in the matter. So I'm not for this particularly, just mentioning it for completeness. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
Hi, I'm doing some benchmarks regarding this problem: one set with baseline and one set with your patch. Machine was a 32-core machine (4 CPUs with 8 cores), 252 GiB RAM. Both versions have the type align patch applied. pgbench-tools config:

  SCALES=100
  SETCLIENTS=1 4 8 16 32 48 64 96 128
  SETTIMES=2

I added -M prepared to the pgbench call in the benchwarmer script. The read-only tests are finished, I come to similar results as yours: http://wwwtech.de/pg/benchmarks-lwlock-read-only/ I think the small differences are caused by the fact that I use TCP connections and not Unix domain sockets. The results are pretty impressive… I will post the read-write results as soon as they are finished. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
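For context, each data point here boils down to a pgbench run roughly like the following. The -M prepared and read-only (-S, select-only) parts match the setup described above, while the client count, duration, and database name are placeholders, not benchwarmer's exact invocation:

```shell
# one read-only data point at scale 100; -c/-j/-T values are placeholders
pgbench -M prepared -S -c 64 -j 64 -T 60 pgbench
```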
Re: [HACKERS] Wait free LW_SHARED acquisition
On Tue, Feb 4, 2014 at 11:39 AM, Christian Kruse christ...@2ndquadrant.com wrote: I'm doing some benchmarks regarding this problem: one set with baseline and one set with your patch. Machine was a 32 core machine (4 CPUs with 8 cores), 252 gib RAM. Both versions have the type align patch applied. It certainly seems as if the interesting cases are where clients > cores. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
On Tue, Feb 4, 2014 at 11:39 AM, Christian Kruse christ...@2ndquadrant.com wrote: I added -M prepared to the pgbench call in the benchwarmer script. The read-only tests are finished, I come to similiar results as yours: http://wwwtech.de/pg/benchmarks-lwlock-read-only/ Note that Christian ran this test with max_connections=201, presumably to exercise the alignment problem. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
On 2014-02-04 11:48:14 -0800, Peter Geoghegan wrote: On Tue, Feb 4, 2014 at 11:39 AM, Christian Kruse christ...@2ndquadrant.com wrote: I added -M prepared to the pgbench call in the benchwarmer script. The read-only tests are finished, I come to similiar results as yours: http://wwwtech.de/pg/benchmarks-lwlock-read-only/ Note that Christian ran this test with max_connections=201, presumably to exercise the alignment problem. I think he has applied the patch to hack around the alignment issue I pushed to git for both branches. It's not nice enough to be applied yet, but it should fix the issue. I think the 201 is just a remembrance of debugging the issue. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
On Tue, Feb 4, 2014 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote: I think he has applied the patch to hack around the alignment issue I pushed to git for both branches. It's not nice enough to be applied yet, but it should fix the issue. I think the 201 is just a leftover from debugging the issue. I guess that given that *both* cases tested had the patch applied, that makes sense. However, I would have liked to see a real master baseline. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 11:17 AM, Andrew Dunstan wrote: prepared to declare the entire damn thing no longer supported. Although that is obviously your prerogative it is important to remember that Windows is easily the second most used version of PostgreSQL out there (behind Linux). [ shrug... ] If it's so widely used, why is it so hard to find somebody who's willing to put in some development effort for it? Because Windows developers get paid and have better things to do than interact with a bunch of Open Source people and their opinions. Let's be honest, all of us here are not the easiest to work with and we are here largely due to an ideology. Why would a Windows developer bother? I know lots of Windows Developers and most of them don't follow anywhere near the ideology we do. I know very few (read 0) Windows developers that do it for fun (I am not saying they don't have fun. I am saying they don't do it for fun). They do it for their career. Heck, look back through the years at our archives. Nobody really thought that we would somehow bring windows developers into the fold if we ported to that platform. It was a market share play and it worked. However, now we are paying the price for it. To be perfectly frank, the first thing I do when a customer calls and says we are on Windows is tell them I will spend our relationship pushing them to move PostgreSQL to Linux. It works over time for the most part but not on all. Sincerely, Joshua D. Drake -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
On Tue, Feb 4, 2014 at 11:39 AM, Christian Kruse christ...@2ndquadrant.com wrote: I'm doing some benchmarks regarding this problem: one set with baseline and one set with your patch. Machine was a 32 core machine (4 CPUs with 8 cores), 252 GiB RAM. What CPU model? Can you post /proc/cpuinfo? The distinction between logical and physical cores matters here. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
On February 4, 2014 8:53:36 PM CET, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 4, 2014 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote: I think he has applied the patch to hack around the alignment issue I pushed to git for both branches. It's not nice enough to be applied yet, but it should fix the issue. I think the 201 is just a leftover from debugging the issue. I guess that given that *both* cases tested had the patch applied, that makes sense. However, I would have liked to see a real master baseline. Christian, could you rerun with master (the commit the branch is based on), the alignment patch, and then the lwlock patch? Best with max_connections 200. That's probably more important than the write tests as a first step. Thanks, Andres -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On Tue, Feb 4, 2014 at 9:26 AM, Andrew Dunstan and...@dunslane.net wrote: On 02/04/2014 11:30 AM, Andres Freund wrote: We have details on how to build with Mingw/Msys on Windows on an Amazon VM http://wiki.postgresql.org/wiki/Building_With_MinGW which is either free or very cheap. Do I need to give instructions on how to do this for MSVC builds too? It's really not terribly hard. Err. It might not be very hard but it certainly is time consuming. And that's for people who don't care about Windows. If there were usable, regularly refreshed instances out there, it'd be slightly less bad. But this is still by far the most annoying and intrusive platform to care about. If someone volunteered to pay for the storage, I'd be prepared to make some time to create an AMI to reduce the startup time dramatically. Basically it would be boot the AMI and start testing your patches. I'd even make it as friendly as possible for people who don't like to get too far from unix-ish environments. Do you know about what it would cost? Could official community funds be used for it (it seems like something that is cheap, but which you wouldn't want to be forgotten about some month.) Having an AMI would help, but even with an AMI in place, MinGW is still insanely slow. Running make on already made PostgreSQL (so there was nothing to actually do) takes 1.5 minutes. And a make after a make clean takes half an hour. This is on an actual desktop, not an AWS micro instance. So doing a git bisect is just painful. Is the MSVC build faster? Cheers, Jeff
Re: [HACKERS] Wait free LW_SHARED acquisition
Hi, On 04/02/14 12:02, Peter Geoghegan wrote: On Tue, Feb 4, 2014 at 11:39 AM, Christian Kruse christ...@2ndquadrant.com wrote: I'm doing some benchmarks regarding this problem: one set with baseline and one set with your patch. Machine was a 32 core machine (4 CPUs with 8 cores), 252 GiB RAM. What CPU model? Can you post /proc/cpuinfo? The distinction between logical and physical cores matters here. model name : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz 32 physical cores, 64 logical cores. /proc/cpuinfo is attached. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services pgpZpWnKfQtb4.pgp Description: PGP signature
Re: [HACKERS] narwhal and PGDLLIMPORT
On Tue, Feb 4, 2014 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andrew Dunstan and...@dunslane.net writes: If someone volunteered to pay for the storage, I'd be prepared to make some time to create an AMI to reduce the startup time dramatically. Basically it would be boot the AMI and start testing your patches. I'd even make it as friendly as possible for people who don't like to get too far from unix-ish environments. My own opinion is that I've already wasted untold man-hours thanks to the random porting problems induced by Windows, a platform that I never have and never will care about personally. I will *not* spend my own time doing tests that someone else could do. If we can't get some effort contributed by someone who does use that platform, I'm personally prepared to declare the entire damn thing no longer supported. You know, I would really prefer to just stick a PGDLLIMPORT on this place and any others that need it, and any others that come up, than turn this into a political football. Having to sprinkle PGDLLIMPORT on the handful of variables that are accessed by contrib modules is, indeed, annoying. But it's not any more annoying than twelve other things that I have to do as a committer, like remembering to bump whichever of catversion, the xlog page magic, and the control data version are relevant to a particular commit, knowing which particular SGML files have to build standalone, and enforcing the project's code formatting, comment, and documentation conventions on everyone who submits a patch. So acting as if this particular annoyance is somehow unique or a good reason to desupport what is despite all of our misgivings one of our most popular platforms doesn't impress me one bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
Perhaps this type should be called pglsn, since it's an implementation-specific detail and not a universal concept like int, point, or uuid. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
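[Editorial note: whatever the type ends up being named, the textual form it would parse and emit is the usual rendering of a 64-bit XLogRecPtr as two 32-bit hex halves, e.g. 16/B374D848. A minimal Python sketch of that conversion; the helper names here are made up for the illustration, not PostgreSQL code:]

```python
def parse_lsn(text):
    """Parse PostgreSQL's textual LSN form 'hi/lo' (hex) into a 64-bit int."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(value):
    """Render a 64-bit LSN back into the familiar 'X/X' hex form."""
    return "%X/%X" % (value >> 32, value & 0xFFFFFFFF)

# Round-trip a typical WAL position:
lsn = parse_lsn("16/B374D848")
print(format_lsn(lsn))  # 16/B374D848
```

Having a real SQL type would make such positions comparable and sortable server-side instead of forcing clients to do this arithmetic.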
[HACKERS] Re: Viability of text HISTORY/INSTALL/regression README files (was Re: [COMMITTERS] pgsql: Document a few more regression test hazards.)
On Tue, Feb 4, 2014 at 1:38 AM, Tom Lane t...@sss.pgh.pa.us wrote: Noah Misch n...@leadboat.com writes: Robert Haas robertmh...@gmail.com writes: I wonder if these standalone things are really worthwhile. I wonder how difficult it would be to make sufficient link data available when building the standalone files. There would be no linking per se; we would just need the referent's text fragment emitted where the xref tag appears. IIRC, that's basically what the workaround is, except it's not very automated. Even if it were automated, though, there's still a problem: such links aren't really *useful* in flat text format. I think that forcing the author to actually think about what to put there in the flat text version is a good thing, if we're going to retain the flat text version at all. Right. I mean, a lot of the links say things like Section 26.2 which obviously makes no sense in a standalone text file. I agree with your comments upthread: INSTALL *might* still be useful to somebody, but I would be pretty surprised if anyone uses HISTORY or regress_README for anything any more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
Hi, On 04/02/14 21:03, Andres Freund wrote: Christian, could you rerun with master (the commit on which the branch is based on), the alignment patch, and then the lwlock patch? Best with max_connections 200. That's probably more important than the write tests as a first step.. Ok, benchmark for baseline+alignment patch is running. This will take a couple of hours and since I have to get up at about 05:00 I won't be able to post it before tomorrow. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services pgpsN5kcUiOgQ.pgp Description: PGP signature
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 03:08 PM, Jeff Janes wrote: Do you know about what it would cost? Could official community funds be used for it (it seems like something that is cheap, but which you wouldn't want to be forgotten about some month.) Having an AMI would help, but even with an AMI in place, MinGW is still insanely slow. Running make on already made PostgreSQL (so there was nothing to actually do) takes 1.5 minutes. And a make after a make clean takes half an hour. This is on an actual desktop, not an AWS micro instance. So doing a git bisect is just painful. Is the MSVC build faster? Would have to check with the same build options (cassert and debug have major timing effects.) I agree it's not lightning fast like make -j 4 on a decent linux box. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part2: fast scan
On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Sun, Jan 26, 2014 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/26/2014 08:24 AM, Tomas Vondra wrote: Hi! On 25.1.2014 22:21, Heikki Linnakangas wrote: Attached is a new version of the patch set, with those bugs fixed. I've done a bunch of tests with all the 4 patches applied, and it seems to work now. I've done tests with various conditions (AND/OR, number of words, number of conditions) and so far I did not get any crashes, infinite loops or anything like that. I've also compared the results to 9.3 - by dumping the database and running the same set of queries on both machines, and indeed I got a 100% match. I also did some performance tests, and that's when I started to worry. For example, I generated and ran 1000 queries that look like this: SELECT id FROM messages WHERE body_tsvector @@ to_tsquery('english','(header & 53 & 32 & useful & dropped)') ORDER BY ts_rank(body_tsvector, to_tsquery('english','(header & 53 & 32 & useful & dropped)')) DESC; i.e. in this case the query always was 5 words connected by AND. This query is a pretty common pattern for fulltext search - search for a list of words and give me the best ranked results. On 9.3, the script was running for ~23 seconds, on patched HEAD it was ~40. It's perfectly reproducible, I've repeated the test several times with exactly the same results. The test is CPU bound, there's no I/O activity at all. I got the same results with more queries (~100k). Attached is a simple chart with the x-axis used for durations measured on 9.3.2, y-axis used for durations measured on patched HEAD. A vast majority of queries is up to 2x slower - that's pretty obvious from the chart. Only about 50 queries are faster on HEAD, and 700 queries are more than 50% slower on HEAD (i.e.
if the query took 100ms on 9.3, it takes 150ms on HEAD). Typically, the EXPLAIN ANALYZE looks something like this (on 9.3): http://explain.depesz.com/s/5tv and on HEAD (same query): http://explain.depesz.com/s/1lI Clearly the main difference is in the Bitmap Index Scan, which takes 60ms on 9.3 and 120ms on HEAD. On 9.3 the perf top looks like this: 34.79% postgres [.] gingetbitmap 28.96% postgres [.] ginCompareItemPointers 9.36% postgres [.] TS_execute 5.36% postgres [.] check_stack_depth 3.57% postgres [.] FunctionCall8Coll while on 9.4 it looks like this: 28.20% postgres [.] gingetbitmap 21.17% postgres [.] TS_execute 8.08% postgres [.] check_stack_depth 7.11% postgres [.] FunctionCall8Coll 4.34% postgres [.] shimTriConsistentFn Not sure how to interpret that, though. For example, where did the ginCompareItemPointers go? I suspect it's thanks to inlining, and that it might be related to the performance decrease. Or maybe not. Yeah, inlining makes it disappear from the profile, and spreads that time to the functions calling it. The profile tells us that the consistent function is called a lot more than before. That is expected - with the fast scan feature, we're calling consistent not only for potential matches, but also to refute TIDs based on just a few entries matching. If that's effective, it allows us to skip many TIDs and avoid consistent calls, which compensates, but if it's not effective, it's just overhead. I would actually expect it to be fairly effective for that query, so that's a bit surprising. I added counters to see where the calls are coming from, and it seems that about 80% of the calls are actually coming from this little feature I explained earlier: In addition to that, I'm using the ternary consistent function to check if minItem is a match, even if we haven't loaded all the entries yet. That's less important, but I think for something like rare1 | (rare2 & frequent) it might be useful.
It would allow us to skip fetching 'frequent', when we already know that 'rare1' matches for the current item. I'm not sure if that's worth the cycles, but it seemed like an obvious thing to do, now that we have the ternary consistent function. So, that clearly isn't worth the cycles :-). At least not with an expensive consistent function; it might be worthwhile if we pre-build the truth-table, or cache the results of the consistent function. Attached is a quick patch to remove that, on top of all the other patches, if you want to test the effect. Every single change you did in fast scan seems to be reasonable, but testing shows that something went wrong.
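[Editorial note: the shortcut discussed above, consulting the ternary consistent function before all entries are loaded, boils down to three-valued logic. A small illustrative Python sketch, not GIN's actual C API; all names here are invented for the example:]

```python
# Three-valued logic: True, False, or MAYBE (entry not fetched yet).
MAYBE = "maybe"

def tri_and(a, b):
    # False dominates; only True & True is True; anything else is unknown.
    if a is False or b is False:
        return False
    if a is True and b is True:
        return True
    return MAYBE

def tri_or(a, b):
    # True dominates; only False | False is False; anything else is unknown.
    if a is True or b is True:
        return True
    if a is False and b is False:
        return False
    return MAYBE

# Query: rare1 | (rare2 & frequent).  Suppose we already know rare1
# matches the current item, but 'frequent' has not been fetched (MAYBE).
result = tri_or(True, tri_and(True, MAYBE))
print(result)  # True
```

Once rare1 is known TRUE the whole expression is TRUE regardless of the unfetched entry, which is exactly the skip described above; the cost is that every such check is one more consistent-function call, the overhead visible in the perf profile.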
Re: [HACKERS] Wait free LW_SHARED acquisition
On Tue, Feb 4, 2014 at 12:30 PM, Christian Kruse christ...@2ndquadrant.com wrote: Ok, benchmark for baseline+alignment patch is running. I see that you have enabled latency information. For this kind of thing I prefer to hack pgbench-tools to not collect this (i.e. to not pass the -l flag, Per-Transaction Logging). Just remove it and pgbench-tools rolls with it. It may well be that the overhead added is completely insignificant, but for something like this, where the latency information is unlikely to add any value, I prefer to not take the chance. This is a fairly minor point, however, especially since these are only 60 second runs where you're unlikely to accumulate enough transaction latency information to notice any effect. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 08:11:18PM +0100, Andres Freund wrote: On 2014-02-04 14:09:57 -0500, Bruce Momjian wrote: On Tue, Feb 4, 2014 at 01:28:38PM -0500, Robert Haas wrote: Meanwhile, in friendlier cases, like one short and one long field, no change, we're seeing big improvements. That particular case shows a speedup of 21% and a WAL reduction of 36%. That's a pretty big deal, and I think not unrepresentative of many real-world workloads. Some might well do better, having either more or longer unchanged fields. Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. Does this feature relate to compression of WAL page images at all? No. I guess it bothers me we are working on compressing row change sets while the majority(?) of WAL is page images. I know we had a page image compression patch that got stalled. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund and...@2ndquadrant.com wrote: Does this feature relate to compression of WAL page images at all? No. So the obvious question is: where, if anywhere, do the two efforts (this patch, and Fujii's patch) overlap? Does Fujii have any concerns about this patch as it relates to his effort to compress FPIs? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On February 4, 2014 10:50:10 PM CET, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund and...@2ndquadrant.com wrote: Does this feature relate to compression of WAL page images at all? No. So the obvious question is: where, if anywhere, do the two efforts (this patch, and Fujii's patch) overlap? Does Fujii have any concerns about this patch as it relates to his effort to compress FPIs? I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume. Andres -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 1:58 PM, Andres Freund and...@2ndquadrant.com wrote: I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume. I'd have thought so too, but I would not like to assume. Like many people commenting on this thread, I don't know very much about compression. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On 02/03/2014 07:27 AM, Andres Freund wrote: On 2014-02-03 09:22:52 -0600, Merlin Moncure wrote: I lost my stomach (or maybe it was the glass of red) somewhere in the middle, but I think this needs a lot of work. Especially the io code doesn't seem ready to me. I'd consider ripping out the send/recv code for 9.4, that seems the biggest can of worms. It will still be usable without. Not having type send/recv functions is somewhat dangerous; it can cause problems for libraries that run everything through the binary wire format. I'd give jsonb a pass on that, being a new type, but would be concerned if hstore had that ability revoked. Yea, removing it for hstore would be a compat problem... offhand note: hstore_send seems pretty simply written and clean; it's a simple nonrecursive iterator... But a send function is pretty pointless without the corresponding recv function... And imo recv simply is to dangerous as it's currently written. I am not saying that it cannot be made work, just that it's still nearly as ugly as when I pointed out several of the dangers some weeks back. Oleg, Teodor, any comments on the above? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb
On Tue, Jan 21, 2014 at 9:06 AM, Oskari Saarenmaa o...@ohmu.fi wrote: 09.01.2014 05:15, Peter Eisentraut kirjoitti: pg_upgrade creates a script analyze_new_cluster.{sh|bat} that runs vacuumdb --analyze-only in three stages with different statistics target settings to get a fresh cluster analyzed faster. I think this behavior is also useful for clusters or databases freshly created by pg_restore or any other loading mechanism, so it's suboptimal to have this constrained to pg_upgrade. I think the three stage analyze is a wrong solution to the slow analyze problem. It certainly is not the best possible solution. But it might be the best one that can be arrived at within a reasonable amount of time. If we really want to micromanage the process we could put a lot of work into it. Take the case of a small table of about 300 pages. We read the table 3 times (separated by enough time that it is probably no longer cached), first keeping one tuple per page, then 10, then 100. Instead it should probably just jump directly to sampling at statistics target of 100 and then forget those small tables, at least if all we are concerned about is IO costs. (Since the selected tuples are sorted once for each column, there might be a CPU reason to take a small sample at the first pass, if we are more concerned with CPU than IO.) But I do wonder what experience people have with the 3 stage process, how useful is it empirically? If you can't open the database for general use until the 3rd phase is done, then you would just jump to doing that stage, rather than working through all 3 of them. If you can open the database and muddle through without statistics for a while, why not muddle through for the little bit longer that it would take to collect the full set right off the bat, rather than making intermediate passes? So I agree that the current system in not optimal. 
But this patch is just moving existing behavior from a less general location to a more general one, so I don't think it should be held hostage to improvements that could theoretically be made but which no one has offered to do. I wouldn't want to put in a change that forces users to learn something new for 9.4 only to have it completely redone in 9.5 and then make them learn that. But the documentation and training burden of this change seems small enough that I wouldn't worry about that. (On the other hand, the benefit of the change also seems pretty small.) In my experience most of the analyze time goes to reading random blocks from the disk but we usually use only a small portion of that data (1 row per block.) If we were able to better utilize the data we read we could get good statistics with a lot less IO than we currently need. This was discussed at length at http://www.postgresql.org/message-id/CAM-w4HOjRbNPMW=shjhw_qfapcuu5ege1tmdr0zqu+kqx8q...@mail.gmail.com but it hasn't turned into patches so far. I don't think it is an accident that it hasn't turned into patches. You can't change the laws of statistics just by wanting it badly enough. Our current sampling method is already insufficiently random. We aren't going to fix things by making it even less random. But I guess that that is an empirical question again, have most statistics problems been due to the sample being insufficiently large, or insufficiently random? (Or insufficiently recent?) There could be a different stage-by-stage approach where default_statistics_target is fixed but in the first pass you just take the first 30,000 rows in each table, and then in a second pass you take a random 30,000 rows. But the knobs to do that currently do not exist, and I doubt they would be welcomed if their only use is to support pg_upgrade. So that idea is not a blocker to this patch, either. Cheers, Jeff
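[Editorial note: the small-table I/O argument above can be put in rough numbers. ANALYZE samples about 300 * statistics_target rows, so for a 300-page table each stage rereads essentially the whole table. A back-of-envelope Python sketch, under the simplifying assumption that each sampled row comes from a distinct page until the sample exceeds the table size:]

```python
def sample_rows(statistics_target):
    # ANALYZE's sample size is 300 rows per unit of statistics target.
    return 300 * statistics_target

def pages_read(table_pages, rows):
    # Rough model: one distinct page per sampled row, capped at table size.
    return min(table_pages, rows)

table_pages = 300  # the small table from the example above

# Three-stage analyze at targets 1, 10, 100 (the stages described above):
staged = sum(pages_read(table_pages, sample_rows(t)) for t in (1, 10, 100))

# Jumping straight to target 100:
direct = pages_read(table_pages, sample_rows(100))

print(staged, direct)  # 900 300
```

Under this model the staged approach reads the small table three times (900 page reads) where a single pass at the final target would read it once (300), which is the case for skipping straight to the full sample on small tables.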
[HACKERS] Minor performance improvement in transition to external sort
The attached patch replaces the existing siftup method for heapify with a siftdown method. Tested with random integers it does 18% fewer compares and takes 10% less time for the heapify, over the work_mem range 1024 to 1048576. Both algorithms appear to be O(n) (contradicting Wikipedia's claim that a siftup heapify is O(n log n)). -- Cheers, Jeremy

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 8b520c1..2ea7b2d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1211,7 +1211,8 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 			(void) grow_memtuples(state);
 			Assert(state->memtupcount < state->memtupsize);
 		}
-		state->memtuples[state->memtupcount++] = *tuple;
+		state->memtuples[state->memtupcount] = *tuple;
+		state->memtuples[state->memtupcount++].tupindex = 0;
 
 		/*
 		 * Check if it's time to switch over to a bounded heapsort. We do
@@ -1884,23 +1885,47 @@ inittapes(Tuplesortstate *state)
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
 
 	/*
-	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
-	 * marked as belonging to run number zero.
-	 *
-	 * NOTE: we pass false for checkIndex since there's no point in comparing
-	 * indexes in this step, even though we do intend the indexes to be part
-	 * of the sort key...
+	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple was
+	 * marked as belonging to run number zero on input; we don't compare that.
 	 */
 	ntuples = state->memtupcount;
-	state->memtupcount = 0;		/* make the heap empty */
-	for (j = 0; j < ntuples; j++)
 	{
-		/* Must copy source tuple to avoid possible overwrite */
-		SortTuple	stup = state->memtuples[j];
+		SortTuple  *m = state->memtuples;
 
-		tuplesort_heap_insert(state, stup, 0, false);
+		for (j = (ntuples-2)/2;	/* comb the array from the last heap parent */
+			 j >= 0;			/* to the start */
+			 j--)
+		{
+			int			root = j;
+			int			child,
+						swap = 0;
+			SortTuple	ptup = m[root];
+
+			while ((child = root*2 + 1) <= ntuples-1)
+			{					/* root has at least one child. Check left-child */
+				if (COMPARETUP(state, &ptup, &m[child]) < 0)
+				{				/* [root] < [lchild] */
+					if (++child <= ntuples-1	/* check right-child */
+						&& COMPARETUP(state, &ptup, &m[child]) > 0)
+						swap = child;
+					else
+						break;	/* [root] smallest */
+				}
+				else
+				{
+					swap = child;
+					if (++child <= ntuples-1	/* check right-child */
+						&& COMPARETUP(state, &m[swap], &m[child]) > 0)
+						swap = child;
+				}
+				/* [swap] is smallest of three; move into the parent slot */
+				m[root] = m[swap];
+				root = swap;	/* and repeat with the child subtree */
+			}
+			if (swap)
+				m[swap] = ptup;
+			/* This and all heap nodes after are now well-positioned */
+		}
 	}
-	Assert(state->memtupcount == ntuples);
 
 	state->currentRun = 0;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
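[Editorial note: the two heapify strategies can be restated side by side as a toy Python illustration with explicit comparison counters. This is not the patch itself, which operates on SortTuples via COMPARETUP; the function names are invented for the example:]

```python
import random

def heapify_top_down(a):
    """Old tuplesort approach: insert elements one at a time, sifting each up."""
    comparisons = 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            parent = (j - 1) // 2
            comparisons += 1
            if a[parent] <= a[j]:
                break
            a[parent], a[j] = a[j], a[parent]
            j = parent
    return comparisons

def heapify_bottom_up(a):
    """Patched approach (Floyd): sift each parent down, last parent first."""
    comparisons = 0
    n = len(a)
    for root in range((n - 2) // 2, -1, -1):
        j = root
        while 2 * j + 1 < n:
            child = 2 * j + 1
            if child + 1 < n:           # pick the smaller of the two children
                comparisons += 1
                if a[child + 1] < a[child]:
                    child += 1
            comparisons += 1
            if a[j] <= a[child]:        # parent already smallest: done here
                break
            a[j], a[child] = a[child], a[j]
            j = child
    return comparisons

def is_min_heap(a):
    return all(a[(i - 1) // 2] <= a[i] for i in range(1, len(a)))

random.seed(1)
data = [random.random() for _ in range(100000)]
up, down = data[:], data[:]
c_up, c_down = heapify_top_down(up), heapify_bottom_up(down)
print(c_up, c_down)
```

On random input the bottom-up build typically needs fewer comparisons, in line with the reduction reported above; the bottom-up version is also bounded by roughly 2n comparisons in the worst case, while repeated insertion degrades to O(n log n) on adversarial (e.g. descending) input.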
Re: [HACKERS] narwhal and PGDLLIMPORT
Robert Haas robertmh...@gmail.com writes: You know, I would really prefer to just stick a PGDLLIMPORT on this place and any others that need it, and any others that come up, than turn this into a political football. Having to sprinkle PGDLLIMPORT on the handful of variables that are accessed by contrib modules is, indeed, annoying. I'm not actually trying to turn this into a political football. What I want is a solution that we can trust, ie, that will allow us to ship Windows code that's not broken. We have failed to do so for at least the past year, and not even known it. I had been okay with the manual PGDLLIMPORT-sprinkling approach (not happy with it, of course, but prepared to tolerate it) as long as I believed the buildfarm would reliably tell us of the need for it. That assumption has now been conclusively disproven, though. The question therefore becomes, what are we going to do instead? Keep on doing what we were doing doesn't strike me as an acceptable answer. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
On 2014-02-04 13:42:51 -0800, Peter Geoghegan wrote: On Tue, Feb 4, 2014 at 12:30 PM, Christian Kruse christ...@2ndquadrant.com wrote: Ok, benchmark for baseline+alignment patch is running. I see that you have enabled latency information. For this kind of thing I prefer to hack pgbench-tools to not collect this (i.e. to not pass the -l flag, Per-Transaction Logging). Just remove it and pgbench-tools rolls with it. It may well be that the overhead added is completely insignificant, but for something like this, where the latency information is unlikely to add any value, I prefer to not take the chance. This is a fairly minor point, however, especially since these are only 60 second runs where you're unlikely to accumulate enough transaction latency information to notice any effect. Hm, I don't find that convincing. If you look at the results from the last run the latency information is actually quite interesting. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
Andrew Dunstan and...@dunslane.net writes: On 02/04/2014 03:08 PM, Jeff Janes wrote: Having an AMI would help, but even with an AMI in place, MinGW is still insanely slow. Running make on already made PostgreSQL (so there was nothing to actually do) takes 1.5 minutes. And a make after a make clean takes half an hour. This is on an actual desktop, not an AWS micro instance. So doing a git bisect is just painful. Is the MSVC build faster? Would have to check with the same build options (cassert and debug have major timing effects.) I agree it's not lightning fast like make -j 4 on a decent linux box. I wonder if ccache exists for Mingw. That thing makes a huge difference in the perceived build speed ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On Tue, Feb 4, 2014 at 4:21 AM, Andres Freund and...@2ndquadrant.com wrote: Which imo means fixing this got more important... I strongly agree. -- Peter Geoghegan
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 05:47 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 02/04/2014 03:08 PM, Jeff Janes wrote: Having an AMI would help, but even with an AMI in place, MinGW is still insanely slow. Running make on already made PostgreSQL (so there was nothing to actually do) takes 1.5 minutes. And a make after a make clean takes half an hour. This is on an actual desktop, not an AWS micro instance. So doing a git bisect is just painful. Is the MSVC build faster? Would have to check with the same build options (cassert and debug have major timing effects.) I agree it's not lightning fast like make -j 4 on a decent linux box. I wonder if ccache exists for Mingw. That thing makes a huge difference in the perceived build speed ... Indeed. But it's not really, AFAIK. Certainly it's not in the list of packages known to mingw-get on the machine i checked on (jacana). cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On Wed, Feb 5, 2014 at 5:26 AM, Peter Eisentraut pete...@gmx.net wrote: Perhaps this type should be called pglsn, since it's an implementation-specific detail and not a universal concept like int, point, or uuid. It makes sense. I'll update the patches according to that. -- Michael
Re: [HACKERS] [PERFORM] encouraging index-only scans
On Mon, Feb 3, 2014 at 8:55 AM, Robert Haas robertmh...@gmail.com wrote: I've also had some further thoughts about the right way to drive vacuum scheduling. I think what we need to do is tightly couple the rate at which we're willing to do vacuuming to the rate at which we're incurring vacuum debt. That is, if we're creating 100kB/s of pages needing vacuum, we vacuum at 2-3MB/s (with default settings). If we can tolerate 2-3MB/s without adverse impact on other work, then we can tolerate it. Do we gain anything substantial by sand-bagging it? If we're creating 10MB/s of pages needing vacuum, we *still* vacuum at 2-3MB/s. Not shockingly, vacuum gets behind, the database bloats, and everything goes to heck. (Your reference to bloat made me think your comments here are about vacuuming in general, not specific to IOS. If that isn't the case, then please ignore.) If we can only vacuum at 2-3MB/s without adversely impacting other activity, but we are creating 10MB/s of future vacuum need, then there are basically two possibilities I can think of. Either the 10MB/s represents a spike, and vacuum should tolerate it and hope to catch up on the debt later. Or it represents a new permanent condition, in which case I bought too few hard drives for the work load, and no scheduling decision that autovacuum can make will save me from my folly. Perhaps there is some middle ground between those possibilities, but I don't see room for much middle ground. I guess there might be entirely different possibilities not between those two; for example, I don't realize I'm doing something that is generating 10MB/s of vacuum debt, and would like to have this thing I'm doing be automatically throttled to the point it doesn't interfere with other processes (either directly, or indirectly by bloat). The rate of vacuuming needs to be tied somehow to the rate at which we're creating stuff that needs to be vacuumed. 
Right now we don't even have a way to measure that, let alone auto-regulate the aggressiveness of autovacuum on that basis. There is the formula used to decide when a table gets vacuumed. Isn't the time delta in this formula a measure of how fast we are creating stuff that needs to be vacuumed for bloat reasons? Is your objection that it doesn't include other reasons we might want to vacuum, or that it just doesn't work very well, or that is not explicitly exposed? Similarly, for marking of pages as all-visible, we currently make the same decision whether the relation is getting index-scanned (in which case the failure to mark those pages all-visible may be suppressing the use of index scans or making them less effective) or whether it's not being accessed at all (in which case vacuuming it won't help anything, and might hurt by pushing other pages out of cache). If it is not getting accessed at all because the database is not very active right now, that would be the perfect time to vacuum it. Between I can accurately project current patterns of (in)activity into the future and People don't build large tables just to ignore them forever, I think the latter is more likely to be true. If the system is busy but this particular table is not, then that would be a better reason to de-prioritise vacuuming that table. But can this degree of reasoning really be implemented in a practical way? In core? Again, if we had better statistics, we could measure this - counting heap fetches for actual index-only scans plus heap fetches for index scans that might have been planned index-only scans but for the relation having too few all-visible pages doesn't sound like an impossible metric to gather. My experience has been that if too few pages are all visible, it generally switches to a seq scan, not an index scan of a different index. 
But many things that are semantically possible to be index-only-scans would never be planned that way even if allvisible were 100%, so I think it would have to do two planning passes, one with the real allvisible, and a hypothetical one with allvisible set to 100%. And then there is the possibility that, while a high allvisible would be useful, the table is so active that no amount of vacuuming could ever keep it high. Cheers, Jeff
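For reference, the trigger condition Jeff alludes to ("the formula used to decide when a table gets vacuumed") can be sketched as follows. This is a simplified illustration of the documented autovacuum rule (vacuum when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples), not the actual backend code; 50 and 0.2 are the stock defaults for those two GUCs.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the autovacuum trigger test: a table becomes a
 * vacuum candidate once its dead-tuple count exceeds
 *     base_threshold + scale_factor * reltuples
 * where base_threshold mirrors autovacuum_vacuum_threshold (default 50)
 * and scale_factor mirrors autovacuum_vacuum_scale_factor (default 0.2).
 */
static int
needs_vacuum(double reltuples, double dead_tuples,
             int base_threshold, double scale_factor)
{
    return dead_tuples > base_threshold + scale_factor * reltuples;
}
```

With the defaults, a one-million-row table is vacuumed only after roughly 200,050 dead tuples accumulate, which is one way of reading Jeff's point: the time between threshold crossings is itself a crude measure of how fast vacuum debt is being created.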
Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On Mon, Feb 3, 2014 at 3:38 PM, Andres Freund and...@2ndquadrant.com wrote: A quick hack (attached) making BufferDescriptor 64byte aligned indeed restored performance across all max_connections settings. It's not surprising that a misaligned buffer descriptor causes problems - there'll be plenty of false sharing of the spinlocks otherwise. Curious that the Intel machine isn't hurt much by this. What fiddling are you thinking of? Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before returning from ShmemAlloc() (and thereby ShmemInitStruct). There is something you have not drawn explicit attention to that is very interesting. If we take REL9_3_STABLE tip to be representative (built with full -O2 optimization, no assertions just debugging symbols), setting max_connections to 91 from 90 does not have the effect of making the BufferDescriptors array aligned; it has the effect of making it *misaligned*. You reported that 91 was much better than 90. I think that the problem actually occurs when the array *is* aligned! I suspect that the scenario described in this article accounts for the quite noticeable effect reported: http://danluu.com/3c-conflict -- Peter Geoghegan
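The TYPEALIGN(CACHELINE_SIZE, addr) fix Andres describes amounts to rounding an allocation address up to the next cache-line boundary. A minimal standalone sketch (the macro mirrors PostgreSQL's TYPEALIGN; CACHELINE_SIZE = 64 is an assumption here, since the real value is platform-dependent):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache-line size; x86-64 commonly uses 64 bytes. */
#define CACHELINE_SIZE 64

/* Round LEN (a size or address) up to the next multiple of ALIGNVAL,
 * which must be a power of two. */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

/* What a ShmemAlloc()-style allocator would do before handing back a
 * pointer: bump it to a cache-line boundary so arrays of descriptors
 * holding spinlocks don't straddle (or falsely share) cache lines. */
static void *
cacheline_align(void *ptr)
{
    return (void *) TYPEALIGN(CACHELINE_SIZE, ptr);
}
```

Aligning each allocation this way wastes at most CACHELINE_SIZE - 1 bytes per allocation, which is cheap insurance against the false-sharing behavior discussed in this thread.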
Re: [HACKERS] Minor performance improvement in transition to external sort
On Wed, Feb 5, 2014 at 7:22 AM, Jeremy Harris j...@wizmail.org wrote: The attached patch replaces the existing siftup method for heapify with a siftdown method. Tested with random integers it does 18% fewer compares and takes 10% less time for the heapify, over the work_mem range 1024 to 1048576. Both algorithms appear to be O(n) (contradicting Wikipedia's claim that a siftup heapify is O(n log n)). It looks interesting but it is too late to have that in 9.4... You should definitely add this patch to the next commit fest so that it is not lost in the void: https://commitfest.postgresql.org/action/commitfest_view?id=22 Regards, -- Michael
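For readers unfamiliar with the two strategies being compared: bottom-up heapify sifts each internal node down, and the work sums to O(n), whereas inserting elements one at a time and sifting each up costs O(n log n) in the worst case. A standalone sketch of the sift-down approach (an illustration of the technique, not Jeremy's patch, which operates on tuplesort's heap):

```c
#include <assert.h>
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at i, assuming
 * both child subtrees are already valid heaps. */
static void
sift_down(int *a, size_t n, size_t i)
{
    for (;;)
    {
        size_t  largest = i;
        size_t  l = 2 * i + 1;
        size_t  r = 2 * i + 2;
        int     tmp;

        if (l < n && a[l] > a[largest])
            largest = l;
        if (r < n && a[r] > a[largest])
            largest = r;
        if (largest == i)
            break;              /* heap property holds here; done */
        tmp = a[i];
        a[i] = a[largest];
        a[largest] = tmp;
        i = largest;            /* continue sifting the displaced key */
    }
}

/* Bottom-up heapify: sift down every internal node, deepest first.
 * Most nodes sit near the leaves and sift only a short distance,
 * which is why the total work is O(n). */
static void
heapify(int *a, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n / 2; i-- > 0;)
        sift_down(a, n, i);
}
```

Counting comparisons in sift_down versus a sift-up loop over random input is an easy way to reproduce the kind of "18% fewer compares" measurement described above.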
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On Wed, Feb 5, 2014 at 9:38 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Feb 5, 2014 at 8:59 AM, Michael Paquier michael.paqu...@gmail.com wrote: I'll update the patches according to that. Here are the updated patches with the following changes (according to previous comments): - Datatype is renamed to pglsn, documentation, file names, regressions and APIs are updated as well. - The DatumGet* and *GetDatum APIs are renamed with PGLSN (Should be PgLsn? But that's a detail) - pg_create_physical_replication_slot uses PGLSNGetDatum for its 6th argument For pageinspect, only page_header is impacted and I think that this should be a separated patch as it makes necessary to dump it to 1.2. I can write it later once the core parts are decided. I just forgot to mention that the 2nd patch does not use context diffs but git diffs because of filterdiff not able to catch all the new content of slotfuncs.c. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
What - if anything - do I need to do to get this on the commitfest list for the next commitfest?
Re: [HACKERS] inherit support for foreign tables
(2014/02/04 20:56), Robert Haas wrote: On Sun, Feb 2, 2014 at 10:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: Allowing ALTER COLUMN SET STORAGE on foreign tables would make sense if for example, SELECT * INTO local_table FROM foreign_table did create a new local table of columns having the storage types associated with those of a foreign table? Seems like a pretty weak argument. It's not that we can't find strange corner cases where applying SET STORAGE to a foreign table doesn't do something; it's that they *are* strange corner cases. The options as we normally understand them just aren't sensible in this context, and a good deal of work has been put into an alternative options framework, which is what authors of FDWs ought to be using. I just wanted to discuss the possibility of allowing SET STORAGE on a foreign table, but I've got the point. I'll resume the patch review. Thanks, Best regards, Etsuro Fujita
Re: [HACKERS] could not create IPv6 socket (AI_ADDRCONFIG)
Hello, At Tue, 04 Feb 2014 02:07:08 -0500, Tom Lane t...@sss.pgh.pa.us wrote in 3176.1391497...@sss.pgh.pa.us One good reason not to trust this too much is that getaddrinfo() is fundamentally a userspace DNS access function, and as such it has no very good way to know if there's currently an IPv4 or IPv6 interface configured on the local system. At minimum there are obvious race conditions in that. A more common case is ::1 in /etc/hosts. I got the following error with this patch in such a case. | LOG: could not bind IPv4 socket: Address already in use | HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry. getaddrinfo returned two identical entries with the same address AF_INET 127.0.0.1:14357. One of them is for the ::1 entry in hosts. This is worse than the current behavior X-( At Tue, 04 Feb 2014 10:31:03 -0500, Tom Lane t...@sss.pgh.pa.us wrote in 12552.1391527...@sss.pgh.pa.us Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: LOG: could not create IPv6 socket: Address family not supported by protocol That's merely a harmless log message. How about just adding a HINT? Hmm ... maybe, but how would you phrase the hint exactly? Putting the 'exactly' aside, would it mean something like 'You will get this message when the feature to handle the address family is disabled', only for EAFNOSUPPORT? Though I don't know whether such a hint is helpful for those who tend to mind that kind of message. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] could not create IPv6 socket (AI_ADDRCONFIG)
Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: getaddrinfo returned two identical entries with the same address AF_INET 127.0.0.1:14357. One of them is for the ::1 entry in hosts. This is worse than the current behavior X-( Yeah, the fundamental issue is that getaddrinfo tends to return bogus info. How about just adding a HINT? Hmm ... maybe, but how would you phrase the hint exactly? Putting the 'exactly' aside, would it mean something like 'You will get this message when the feature to handle the address family is disabled', only for EAFNOSUPPORT? Though I don't know whether such a hint is helpful for those who tend to mind that kind of message. I still think the best thing might be to reduce the individual messages to DEBUG-something, and only produce a LOG entry if we are unable to bind to *any* of the addresses returned by getaddrinfo. regards, tom lane
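Tom's proposal can be sketched as follows. This is a hedged, simplified illustration rather than backend code: walk everything getaddrinfo() returns, treat per-address failures as debug-level noise, and complain only if nothing could be bound at all. bind_all and its behavior of closing each socket immediately (instead of keeping it for listen()) are artifacts of the sketch.

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to bind every address getaddrinfo() returns for host:port.
 * Returns the number of addresses successfully bound; logs only when
 * the count is zero, per Tom's suggestion above. */
static int
bind_all(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int     bound = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return 0;

    for (ai = res; ai != NULL; ai = ai->ai_next)
    {
        int     fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);

        if (fd < 0)
            continue;           /* would be a DEBUG message, not LOG */
        if (bind(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            bound++;            /* real code would keep fd for listen() */
        close(fd);
    }
    freeaddrinfo(res);

    if (bound == 0)
        fprintf(stderr, "LOG: could not bind to any address for %s:%s\n",
                host, port);
    return bound;
}
```

The key property is that a bogus extra entry (like the duplicate 127.0.0.1 Horiguchi-san observed) produces at worst a quiet per-address failure, not a user-visible complaint, as long as at least one address binds.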
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/05/2014 04:08 AM, Jeff Janes wrote: So doing a git bisect is just painful. Is the MSVC build faster? Yes, but not on EC2. I've found Windows EC2 instances so impossibly slow I just gave up working with it. It took 1.5 hours to do a build and regression check with msvc on a Medium EC2 instance; the same build takes 10 mins on my tiny Intel i3 based Windows test machine. It's possible that some of the larger instance types may perform better as they use different approaches to virtualization than simple Xen HVM. I haven't tested, as the cost of those instances rapidly becomes problematic. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/05/2014 02:53 AM, Tom Lane wrote: Joshua D. Drake j...@commandprompt.com writes: On 02/04/2014 09:34 AM, Tom Lane wrote: My own opinion is that I've already wasted untold man-hours thanks to the random porting problems induced by Windows, a platform that I never have and never will care about personally. I will *not* spend my own time doing tests that someone else could do. If we can't get some effort contributed by someone who does use that platform, I'm personally prepared to declare the entire damn thing no longer supported. Although that is obviously your prerogative it is important to remember that Windows is easily the second most used version of PostgreSQL out there (behind Linux). [ shrug... ] If it's so widely used, why is it so hard to find somebody who's willing to put in some development effort for it? Sadly, I'm now convinced that Windows users are just much less likely to contribute anything constructive to a project - code, documentation, anything. It's a real gimme world, and has a really strong ethic that the vendor does things with their software, you don't just go and get involved. That said, I think the fuss being made about the intrusiveness of Windows support and its impact is overblown here. These are a few macros that're noops on other platforms anyway, and some build code hidden away in src/tools . It's ugly. It's annoying. It's crap that users don't contribute back. It's also just not that big a deal; there are many other things that are similarly painful or more so. Expecting folks to fire up an AMI and hand-control the build with a GUI over a high latency connection is a waste of time better spent elsewhere, though, and will result in everyone continuing to avoid any sort of testing on Windows. Personally what I think we need is a *public* Jenkins instance, or similar, to which you can push a branch and have it automatically build and make check on Windows. 
I've got that running for internal use, but it's on a host I can't share access to (and an unreliable one, at that). I'd be happy to share the setup for the Jenkins instance and the Windows integration parts, along with the instructions I wrote on how to set up the Windows build test node(s) and the tooling I'm using to automate the Windows build. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/04/2014 10:48 PM, Craig Ringer wrote: On 02/05/2014 04:08 AM, Jeff Janes wrote: So doing a git bisect is just painful. Is the MSVC build faster? Yes, but not on EC2. I've found Windows EC2 instances so impossibly slow I just gave up working with it. It took 1.5 hours to do a build and regression check with msvc on a Medium EC2 instance; the same build takes 10 mins on my tiny Intel i3 based Windows test machine. It's possible that some of the larger instance types may perform better as they use different approaches to virtualization than simple Xen HVM. I haven't tested, as the cost of those instances rapidly becomes problematic. I typically use m1.medium. A spot instance for that is currently $0.033 / hour. When I was working on one such the other day it took nothing like 1.5 hours to build and test. I didn't time it so I can't tell you how long it took, but much less than that. Of course YMMV. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/05/2014 06:29 AM, Tom Lane wrote: I had been okay with the manual PGDLLIMPORT-sprinkling approach (not happy with it, of course, but prepared to tolerate it) as long as I believed the buildfarm would reliably tell us of the need for it. That assumption has now been conclusively disproven, though. The question therefore becomes, what are we going to do instead? Keep on doing what we were doing doesn't strike me as an acceptable answer. I'm in complete agreement here. Silent failures we can't test for that might sneak data corruption in are not cool. I'll have a look into ways to making sure that globals with incorrect linkage fail at runtime link time, as is the case for functions. I won't be able to spend much time on it immediately; will take a quick look and if I don't find anything, will follow up post-CF4. I'm kind of horrified that the dynamic linker doesn't throw its toys when it sees this. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/05/2014 12:06 AM, Andrew Dunstan wrote: On 02/04/2014 10:43 AM, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-02-04 02:10:47 -0500, Tom Lane wrote: Meh. It might be that the DateStyle usage in postgres_fdw would accidentally fail to malfunction if it saw a bogus value of the variable. But it's hard to believe that this would be true of MainLWLockArray. There's not that much lwlock usage in contrib. It's just pg_stat_statements and pg_buffercache. Neither has tests... So it very well could be that breakage simply hasn't been observed. Hm, you're right --- I'd have thought there were more of those. Ugh. This problem was bad enough when I thought that it would only lead to link-time errors detectable in the buildfarm. If it can lead to errors only observable at runtime --- and maybe not obvious even then --- then I think we *have to* do something about it. By that I mean that we must get rid of the need to manually plaster PGDLLIMPORT on global variables. Anybody with a Windows build environment want to test the #define extern trick? We have details on how to build with Mingw/Msys on Windows on an Amazon VM http://wiki.postgresql.org/wiki/Building_With_MinGW which is either free or very cheap. Do I need to give instructions on how to do this for MSVC builds too? It's really not terribly hard. I've got some guidance on that here: https://github.com/2ndQuadrant/pg_build_win from setting up a clean Windows instance for builds, on to the build process. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
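For context, the macro at issue looks roughly like this. It is a simplified sketch of the real definition in the backend headers, and the stand-in definition of DateStyle at the bottom exists only so the fragment links on its own; in reality the backend binary provides it. On Windows, data imported from another binary must be declared __declspec(dllimport), or the toolchain may silently resolve the reference to garbage rather than failing at link time, which is exactly the class of bug this thread is about. On every other platform the macro is empty, which is why non-Windows buildfarm members cannot catch a missing PGDLLIMPORT.

```c
#include <assert.h>

/* Sketch of the PGDLLIMPORT mechanism (simplified from c.h). */
#ifdef _WIN32
#define PGDLLIMPORT __declspec(dllimport)
#else
#define PGDLLIMPORT             /* empty on non-Windows platforms */
#endif

/* In a backend header, an exported global is declared like this: */
extern PGDLLIMPORT int DateStyle;

/* Stand-in definition so this sketch links by itself; in PostgreSQL
 * the definition lives in the backend, not in extension code. */
int     DateStyle = 1;
```

The "#define extern" trick mentioned above would effectively make every extern declaration carry the dllimport attribute automatically, removing the need to sprinkle the macro by hand.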
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On Wed, Feb 5, 2014 at 10:33 AM, Jon Nelson jnelson+pg...@jamponi.net wrote: What - if anything - do I need to do to get this on the commitfest list for the next commitfest? The list of instructions is here: http://wiki.postgresql.org/wiki/Submitting_a_Patch#Patch_submission Then the next commit fest (#1 for 9.5) will be in June and is here: https://commitfest.postgresql.org/action/commitfest_view?id=22 Note that you will need an account on postgresql.org to register your patch. Regards, -- Michael
[HACKERS] Re: Viability of text HISTORY/INSTALL/regression README files (was Re: [COMMITTERS] pgsql: Document a few more regression test hazards.)
On Tue, Feb 04, 2014 at 03:28:45PM -0500, Robert Haas wrote: On Tue, Feb 4, 2014 at 1:38 AM, Tom Lane t...@sss.pgh.pa.us wrote: Noah Misch n...@leadboat.com writes: Robert Haas robertmh...@gmail.com writes: I wonder if these standalone things are really worthwhile. I wonder how difficult it would be to make sufficient link data available when building the standalone files. There would be no linking per se; we would just need the referent's text fragment emitted where the xref tag appears. IIRC, that's basically what the workaround is, except it's not very automated. Even if it were automated, though, there's still a problem: such links aren't really *useful* in flat text format. I think that forcing the author to actually think about what to put there in the flat text version is a good thing, if we're going to retain the flat text version at all. Right. I mean, a lot of the links say things like Section 26.2 which obviously makes no sense in a standalone text file. For xrefs normally displayed that way, text output could emit a URL, either inline or in the form of a footnote. For link targets (e.g. SQL commands) having a friendly text fragment for xref sites, use the normal fragment. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PostgreSQL Failback without rebuild
Hello All, I have been reading through some of the recent discussions about failback when in a streaming replication setup. I define failback as: 1. Node A is master, Node B is slave 2. Node A crashes || Node A is stopped || nothing happens 3. Promote Node B to Master 4. Attach Node A as slave My understanding is currently to achieve step three you need to take a base backup of Node B and deploy it to Node A before starting streaming replication (or use rsync etc...). This is very undesirable for many users, especially if they have a very large database. From the discussions I can see that the problem is to do with Node A writing changes to disk that are not streamed to Node B before Node A crashes. Has there been any consensus on this issue? Are there any solutions which might make it into 9.4 or 9.5? I've seen some proposals and a tool (pg_rewind), but all seem to have drawbacks. I've been looking mainly at these threads: http://www.postgresql.org/message-id/CAF8Q-Gy7xa60HwXc0MKajjkWFEbFDWTG=ggyu1kmt+s2xcq...@mail.gmail.com http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com http://www.postgresql.org/message-id/519df910.4020...@vmware.com Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [HACKERS] narwhal and PGDLLIMPORT
Craig Ringer cr...@2ndquadrant.com writes: On 02/05/2014 06:29 AM, Tom Lane wrote: I had been okay with the manual PGDLLIMPORT-sprinkling approach (not happy with it, of course, but prepared to tolerate it) as long as I believed the buildfarm would reliably tell us of the need for it. That assumption has now been conclusively disproven, though. I'm kind of horrified that the dynamic linker doesn't throw its toys when it sees this. Indeed :-(. The truly strange part of this is that it seems that the one Windows buildfarm member that's telling the truth (or most nearly so, anyway) is narwhal, which appears to have the oldest and cruftiest toolchain of the lot. I'd really like to come out the other end of this investigation with a clear understanding of why the newer toolchains are failing to report a link problem, and yet not building working executables. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for hundred tiny fields, half nulled case Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. I have tried to stress the worst case more, as you are thinking of removing the table-level switch, and found that even if we increase the data by approx. 8 times (ten long fields, all changed, each field contains 80 byte data), the CPU overhead is still 5%, which clearly shows that the overhead doesn't increase much even if the length of unmatched data is increased by a much larger factor. So the data for the worst case adds more weight to your statement (remove table-level switch); however, there is no harm in keeping the table-level option with default as 'true', and if some users are really sure the updates in their system will have nothing in common, then they can set this new option to 'false'. Below is the data for the new case ten long fields, all changed added in the attached script file:

Unpatched

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    3473999520 | 45.0375978946686
 ten long fields, all changed |    3473999864 | 45.2536928653717
 ten long fields, all changed |    3474006880 | 45.1887288093567

After pgrb_delta_encoding_v8.patch

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    3474006456 | 47.5744359493256
 ten long fields, all changed |    3474000136 | 47.3830440044403
 ten long fields, all changed |    3474002688 | 46.9923310279846

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com wal-update-testsuite.sh Description: Bourne shell script
Re: [HACKERS] PostgreSQL Failback without rebuild
On Wed, Feb 5, 2014 at 10:30 AM, James Sewell james.sew...@lisasoft.com wrote: Hello All, I have been reading through some of the recent discussions about failback when in a streaming replication setup. I define failback as: Node A is master, Node B is slave Node A crashes || Node A is stopped || nothing happens Promote Node B to Master Attach Node A as slave My understanding is currently to achieve step three you need to take a base backup of Node B and deploy it to Node A before starting streaming replication (or use rsync etc...). I think in the above sentence you mean to say to achieve step *four*. This is very undesirable for many users, especially if they have a very large database. From the discussions I can see that the problem is to do with Node A writing changes to disk that Node B are not streamed before Node A crashes. Yes, this is right. Has there been any consensus on this issue? Are there any solutions which might make it into 9.4 or 9.5? As far as I know, there is still no solution provided in 9.4; I can't say anything for 9.5 with any certainty. However, in 9.4 there is a new parameter wal_log_hints which can be useful to overcome a drawback of pg_rewind. I've seen some proposals and a tool (pg_rewind), but all seem to have draw backs. As far as I remember, one of the main drawbacks of pg_rewind was related to hint bits, which can be avoided by wal_log_hints. pg_rewind is not part of core PostgreSQL code; however, if you wish, you can try that tool to see if it can solve your purpose. Note - James, in my previous reply I forgot to cc hackers, so I am sending it again. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] jsonb and nested hstore
On 02/03/2014 05:22 PM, Merlin Moncure wrote: I lost my stomach (or maybe it was the glass of red) somewhere in the middle, but I think this needs a lot of work. Especially the io code doesn't seem ready to me. I'd consider ripping out the send/recv code for 9.4, that seems the biggest can of worms. It will still be usable without. Not having type send/recv functions is somewhat dangerous; it can cause problems for libraries that run everything through the binary wire format. I'd give jsonb a pass on that, being a new type, but would be concerned if hstore had that ability revoked. send/recv functions are also needed for binary-format COPY. IMHO jsonb must have send/recv functions. All other built-in types have them, except for types like 'smgr', 'aclitem' and 'any*' that no-one should be using as column types. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL Failback without rebuild
On Wed, Feb 5, 2014 at 3:14 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Feb 5, 2014 at 10:30 AM, James Sewell james.sew...@lisasoft.com I've seen some proposals and a tool (pg_rewind), but all seem to have drawbacks. As far as I remember, one of the main drawbacks of pg_rewind was related to hint bits, which can be avoided by wal_log_hints. pg_rewind is not part of core PostgreSQL code; however, if you wish, you can try that tool to see if it can solve your purpose. For 9.3, pg_rewind is only safe with page checksums enabled. For 9.4, yes, wal_log_hints or checksums is mandatory. The code also contains some safety checks to ensure that a node not using those parameters cannot be rewound. Regards, -- Michael
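For anyone wanting to try the setup Michael describes, a hedged sketch of the relevant configuration (wal_log_hints is a real 9.4 parameter; the pg_rewind invocation in the comment follows the tool's documented options, but option spelling has varied between versions, so check the copy you have):

```
# postgresql.conf (9.4+): required on the old master for pg_rewind
# to be safe, unless the cluster was initdb'd with data checksums
wal_log_hints = on

# After promoting node B, node A could then be resynchronized with
# something like (illustrative paths and connection string):
#   pg_rewind --target-pgdata=/path/to/nodeA/data \
#             --source-server='host=nodeB port=5432 user=postgres'
```

This avoids the full base backup James was hoping to skip, at the cost of extra WAL volume from logging hint-bit changes.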
Re: [HACKERS] jsonb and nested hstore
Andrew provided us more information and we'll work on recv. What do people think about testing this stuff? Btw, we don't have any regression tests for this. Oleg On Wed, Feb 5, 2014 at 2:03 AM, Josh Berkus j...@agliodbs.com wrote: On 02/03/2014 07:27 AM, Andres Freund wrote: On 2014-02-03 09:22:52 -0600, Merlin Moncure wrote: I lost my stomach (or maybe it was the glass of red) somewhere in the middle, but I think this needs a lot of work. Especially the io code doesn't seem ready to me. I'd consider ripping out the send/recv code for 9.4, that seems the biggest can of worms. It will still be usable without. Not having type send/recv functions is somewhat dangerous; it can cause problems for libraries that run everything through the binary wire format. I'd give jsonb a pass on that, being a new type, but would be concerned if hstore had that ability revoked. Yea, removing it for hstore would be a compat problem... offhand note: hstore_send seems pretty simply written and clean; it's a simple nonrecursive iterator... But a send function is pretty pointless without the corresponding recv function... And imo recv simply is too dangerous as it's currently written. I am not saying that it cannot be made to work, just that it's still nearly as ugly as when I pointed out several of the dangers some weeks back. Oleg, Teodor, any comments on the above? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 02/04/2014 02:40 AM, Peter Geoghegan wrote: On Fri, Jan 31, 2014 at 9:09 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I refactored the loop in _bt_moveright to, well, not have that bug anymore. The 'page' and 'opaque' pointers are now fetched at the beginning of the loop. Did I miss something? I think so, yes. You still aren't assigning the value returned by _bt_getbuf() to 'buf'. D'oh, you're right. Since, as I mentioned, _bt_finish_split() ultimately unlocks *and unpins*, it may not be the same buffer as before, so even with the refactoring there are race conditions. Care to elaborate? Or are you just referring to the missing buf = ? A closely related issue is that you haven't mentioned anything about buffer pins/refcount side effects in comments above _bt_finish_split(), even though I believe you should. Ok. A minor stylistic concern is that I think it would be better to only have one pair of _bt_finish_split()/_bt_getbuf() calls regardless of the initial value of 'access'. Ok. I also changed _bt_moveright to never return a write-locked buffer, when the caller asked for a read-lock (an issue you pointed out earlier in this thread). Attached is a new version of the patch, with those issues fixed. btree-incomplete-split-4.patch is a complete patch against the latest fix-btree-page-deletion patch, and moveright-assign-fix.patch is just the changes to _bt_moveright, if you want to review just the changes since the previous patch I posted. - Heikki diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 03efc29..43ee75f 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -404,12 +404,34 @@ an additional insertion above that, etc). For a root split, the followon WAL entry is a new root entry rather than an insertion entry, but details are otherwise much the same. 
-Because insertion involves multiple atomic actions, the WAL replay logic -has to detect the case where a page split isn't followed by a matching -insertion on the parent level, and then do that insertion on its own (and -recursively for any subsequent parent insertion, of course). This is -feasible because the WAL entry for the split contains enough info to know -what must be inserted in the parent level. +Because splitting involves multiple atomic actions, it's possible that the +system crashes between splitting a page and inserting the downlink for the +new half to the parent. After recovery, the downlink for the new page will +be missing. The search algorithm works correctly, as the page will be found +by following the right-link from its left sibling, although if a lot of +downlinks in the tree are missing, performance will suffer. A more serious +consequence is that if the page without a downlink gets split again, the +insertion algorithm will fail to find the location in the parent level to +insert the downlink. + +Our approach is to create any missing downlinks on-the-fly, when +searching the tree for a new insertion. It could be done during searches, +too, but it seems best not to put any extra updates in what would otherwise +be a read-only operation (updating is not possible in hot standby mode +anyway). To identify missing downlinks, when a page is split, the left page +is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT). +When the downlink is inserted to the parent, the flag is cleared atomically +with the insertion. The child page is kept locked until the insertion in the +parent is finished and the flag in the child cleared, but can be released +immediately after that, before recursing up the tree, if the parent also +needs to be split. This ensures that incompletely split pages should not be +seen under normal circumstances; only when insertion to the parent fails +for some reason. 
+ +We flag the left page, even though it's the right page that's missing the +downlink, because it's more convenient to know already when following the +right-link from the left page to the right page that it will need to have +its downlink inserted to the parent. When splitting a non-root page that is alone on its level, the required metapage update (of the fast root link) is performed and logged as part @@ -422,6 +444,14 @@ page is a second record. If vacuum is interrupted for some reason, or the system crashes, the tree is consistent for searches and insertions. The next VACUUM will find the half-dead leaf page and continue the deletion. +Before 9.4, we used to keep track of incomplete splits and page deletions +during recovery and finish them immediately at end of recovery, instead of +doing it lazily at the next insertion or vacuum. However, that made the +recovery much more complicated, and only fixed the problem when crash +recovery was performed. An incomplete split can also occur if an otherwise +recoverable error, like out-of-memory or out-of-disk-space, happens while +inserting the downlink to