Re: [HACKERS] Question about behavior of snapshot too old feature

2016-10-16 Thread Masahiko Sawada
On Fri, Oct 14, 2016 at 11:29 PM, Kevin Grittner <kgri...@gmail.com> wrote:
> On Fri, Oct 14, 2016 at 8:53 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Fri, Oct 14, 2016 at 1:40 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>
>>> For example, I set old_snapshot_threshold = 1min and prepare a table
>>> and two terminals.
>>> And I did the following steps.
>>>
>>> 1. [Terminal 1] Begin transaction and get snapshot data and wait.
>>>  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
>>>  SELECT * FROM test;
>>>
>>> 2. [Terminal 2] Another session updates test table in order to make
>>> snapshot dirty.
>>>  BEGIN;
>>>  UPDATE test SET c = c + 100;
>>>  COMMIT;
>>>
>>> 3. [Terminal 1] 1 minute after, read the test table again in same
>>> transaction opened at #1. I got no error.
>>> SELECT * FROM test;
>>>
>>> 4. [Terminal 2] Another session reads the test table.
>>>  BEGIN;
>>>  SELECT * FROM test;
>>>  COMMIT;
>>>
>>> 5. [Terminal 1] 1 minute after, read the test table again, and got
>>> "snapshot error" error.
>>>  SELECT * FROM test;
>>>
>>> Since #2 makes a snapshot I got at #1 dirty, I expected to get
>>> "snapshot too old" error at #3 where I read test table again after
>>> enough time. But I could never get "snapshot too old" error at #3.
>>>
>>
>> Here, the basic idea is that this error won't occur until the
>> corresponding page has been pruned or vacuuming of the table has been
>> triggered.  So, I think what is happening here is that the table got
>> pruned during step #3 or #4, after which you started getting the error.
>
> The pruning might be one factor.  Another possible issue is that
> effectively it doesn't start timing that 1 minute until the clock
> hits the start of the next minute (i.e., 0 seconds after the next
> minute).  The old_snapshot_threshold does not attempt to guarantee
> that the snapshot too old error will happen at the earliest
> opportunity, but that the error will *not* happen until the
> snapshot is *at least* that old.  Keep in mind that the expected
> useful values for this parameter are from a small number of hours
> to a day or two, depending on the workload.  The emphasis was on
> minimizing overhead, even when it meant the cleanup might not be
> quite as "eager" as it could otherwise be.
>

Thanks! I understand now.
I tested with autovacuum = off, so the table must have been pruned at step #4.

When I set old_snapshot_threshold = 0 I got the error at step #3, which
means the error can occur without the table being pruned.
We have a regression test for this feature, but it sets
old_snapshot_threshold = 0, so I doubt whether it tests the behavior properly.
Am I missing something?
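
For what it's worth, here is a rough sketch (my own, not from the
regression tests) of forcing the pruning explicitly, so the error shows up
deterministically once the threshold has passed.  It assumes
old_snapshot_threshold = '1min', autovacuum = off, and the same table
"test" as above:

 -- [Terminal 1]
 BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
 SELECT * FROM test;

 -- [Terminal 2]
 UPDATE test SET c = c + 100;

 -- [Terminal 2] more than one minute later, prune explicitly
 VACUUM test;

 -- [Terminal 1]
 SELECT * FROM test;   -- expected: ERROR:  snapshot too old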

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




[HACKERS] Question about behavior of snapshot too old feature

2016-10-14 Thread Masahiko Sawada
Hi all,

I have a question about the behavior of the "snapshot too old" feature.

For example, I set old_snapshot_threshold = 1min, prepared a table,
and opened two terminals.
Then I did the following steps.

1. [Terminal 1] Begin transaction and get snapshot data and wait.
 BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
 SELECT * FROM test;

2. [Terminal 2] Another session updates test table in order to make
snapshot dirty.
 BEGIN;
 UPDATE test SET c = c + 100;
 COMMIT;

3. [Terminal 1] One minute later, read the test table again in the same
transaction opened at #1. I got no error.
 SELECT * FROM test;

4. [Terminal 2] Another session reads the test table.
 BEGIN;
 SELECT * FROM test;
 COMMIT;

5. [Terminal 1] One minute later, read the test table again, and got the
"snapshot too old" error.
 SELECT * FROM test;

Since #2 modifies data visible to the snapshot I took at #1, I expected
to get the "snapshot too old" error at #3, where I read the test table
again after enough time had passed. But I could never get the "snapshot
too old" error at #3.
On the other hand, when I set old_snapshot_threshold = 0, I do get the
error at #3.

Is this expected behavior?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-13 Thread Masahiko Sawada
On Fri, Oct 7, 2016 at 4:25 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Thu, Oct 6, 2016 at 2:52 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2016/10/06 17:45, Ashutosh Bapat wrote:
>>> On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>>> wrote:
>>>> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat 
>>>> <ashutosh.ba...@enterprisedb.com> wrote:
>>>>>> My understanding is that basically the local server can not return
>>>>>> COMMIT to the client until 2nd phase is completed.
>>>>>
>>>>> If we do that, the local server may not return to the client at all,
>>>>> if the foreign server crashes and never comes up. Practically, it may
>>>>> take much longer to finish a COMMIT, depending upon how long it takes
>>>>> for the foreign server to reply to a COMMIT message.
>>>>
>>>> Yes, I think that is how 2PC behaves; please refer to [1].
>>>> To prevent the local server from waiting forever due to a communication
>>>> failure, we could provide a timeout on the coordinator side or on the
>>>> participant side.
>>>
>>> This too, looks like a heuristic and shouldn't be the default
>>> behaviour and hence not part of the first version of this feature.
>>
>> At any rate, the coordinator should not return to the client until after
>> the 2nd phase is completed, which was the original point.  If COMMIT
>> taking longer is an issue, then it could be handled with one of the
>> approaches mentioned so far (even if not in the first version), but no
>> version of this feature should really return COMMIT to the client only
>> after finishing the first phase.  Am I missing something?
>
> There is small time window between actual COMMIT and a commit message
> returned. An actual commit happens when we insert a WAL saying
> transaction X committed and then we return to the client saying a
> COMMIT happened. Note that a transaction may be committed but we will
> never return to the client with a commit message, because connection
> was lost or the server crashed. I hope we agree on this.

Agree.

> COMMITTING the foreign prepared transactions happens after we COMMIT
> the local transaction. If we do it before COMMITTING local transaction
> and the local server crashes, we will roll back local transaction
> during subsequent recovery while the foreign segments have committed,
> resulting in an inconsistent state.
>
> If we are successful in COMMITTING foreign transactions during
> post-commit phase, COMMIT message will be returned after we have
> committed all foreign transactions. But in case we can not reach a
> foreign server, and request times out, we can not revert back our
> decision that we are going to commit the transaction. That's my answer
> to the timeout based heuristic.

IIUC, 2PC is a protocol that assumes all of the foreign servers are alive.
If we cannot reach a foreign server during the post-commit phase, the
transaction and the following transactions should basically stop until
the crashed server comes back. That is the place to start when
implementing 2PC for FDW, I think. The heuristic approach I mentioned is
one optimization to avoid holding up transactions when a foreign server
has crashed.

> I don't see much point in holding up post-commit processing for a
> non-responsive foreign server, which may not respond for days
> together. Can you please elaborate a use case? Which commercial
> transaction manager does that?

For example, suppose the client updates data on a foreign server and then
commits, and the next transaction from the same client selects the data
updated by the previous transaction. Because the first transaction has
committed, the second transaction should be able to see the updated data,
but with your idea it can see the old data. Since there is an obvious
ordering between the first transaction and the second one, I don't think
this is the (harder) problem of providing a consistent view.

I guess the transaction manager of Postgres-XC behaves that way, no?
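
To illustrate with a concrete (hypothetical) example, assume a foreign
table ft on the remote server accessed through postgres_fdw, and a
coordinator that returns COMMIT after only the first phase:

 -- Transaction 1 (same client)
 BEGIN;
 UPDATE ft SET c = c + 1 WHERE id = 1;
 COMMIT;   -- returns once the remote transaction is merely prepared

 -- Transaction 2 (same client, immediately afterwards)
 BEGIN;
 SELECT c FROM ft WHERE id = 1;   -- may still see the old value of c
 COMMIT;

If the remote COMMIT PREPARED has not been executed yet when the second
SELECT reaches the foreign server, the read sees the pre-update value
even though the first COMMIT has already returned to the client.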

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-10-11 Thread Masahiko Sawada
On Wed, Sep 28, 2016 at 5:14 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Sat, Sep 24, 2016 at 5:37 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> I still vote for changing behaviour of existing syntax 'k (n1, n2)' to
>> quorum commit.
>> That is,
>> 1.  'First k (n1, n2, n3)' means that the master server waits for ACKs
>> from k standby servers whose name appear earlier in the list.
>> 2.  'Any k (n1, n2, n3)' means that the master server waits for ACKs
>> from any k listed standby servers.
>> 3.  'n1, n2, n3' is the same as #1 with k=1.
>> 4.  '(n1, n2, n3)' is the same as #2 with k=1.
>
> OK, so I have done a review of this patch keeping that in mind as
> that's the consensus. I am still getting familiar with the code...

Thank you for reviewing!

> -transactions will wait until all the standby servers which are considered
> +transactions will wait until the multiple standby servers which
> are considered
> There is no real need to update this sentence.
>
> +FIRST means to control the standby servers with
> +different priorities. The synchronous standbys will be those
> +whose name appear earlier in this list, and that are both
> +currently connected and streaming data in real-time(as shown
> +by a state of streaming in the
> +
> +pg_stat_replication view). Other standby
> +servers appearing later in this list represent potential
> +synchronous standbys. If any of the current synchronous
> +standbys disconnects for whatever reason, it will be replaced
> +immediately with the next-highest-priority standby.
> +For example, a setting of FIRST 3 (s1, s2, s3, s4)
> +makes transaction commits wait until their WAL records are received
> +by three higher-priority standbys chosen from standby servers
> +s1, s2, s3 and s4.
> It does not seem necessary to me to enter in this level of details:
> The keyword FIRST, coupled with an integer number N, chooses the first
> N higher-priority standbys and makes transaction commit when their WAL
> records are received. For example FIRST 3 (s1, s2, s3, s4)
> makes transaction commits wait until their WAL records are received by
> the three high-priority standbys chosen from standby servers s1, s2,
> s3 and s4.

Will fix.

> +ANY means to control all of standby servers with
> +same priority. The master sever will wait for receipt from
> +at least num_sync
> +standbys, which is quorum commit in the literature. The all of
> +listed standbys are considered as candidate of quorum commit.
> +For example, a setting of  ANY 3 (s1, s2, s3, s4) makes
> +transaction commits wait until receiving receipts from at least
> +any three standbys of four listed servers s1,
> +s2, s3, s4.
>
> Similarly, something like that...
> The keyword ANY, coupled with an integer number N, chooses N standbys
> in a set of standbys with the same, lowest, priority and makes
> transaction commit when WAL records are received those N standbys. For
> example ANY 3(s1, s2, s3, s4) makes transaction commits wait until WAL
> records have been received from 3 servers in the set s1, s2, s3 and
> s4.

Will fix.

> It could be good also to mention that no keyword specified means ANY,
> which is incompatible with 9.6. The docs also miss the fact that if a
> simple list of servers is given, without parenthesis and keywords,
> this is equivalent to FIRST 1.

Right. I will add that documentation.

> -synchronous_standby_names = '2 (s1, s2, s3)'
> +synchronous_standby_names = 'First 2 (s1, s2, s3)'
> Nit here. It may be a good idea to just use upper-case characters in
> the docs, or just lower-case for consistency, but not mix both.
> Usually GUCs use lower-case characters.

Agree. Will fix.

> +  when standby is considered as a condidate of quorum commit.
> s/condidate/candidate/

Will fix.

> -syncrep_scanner.c: FLEXFLAGS = -CF -p
> +syncrep_scanner.c: FLEXFLAGS = -CF -p -i
> Hm... Is that actually a good idea? Now "NODE" and "node" are two
> different things for application_name, but with this patch both would
> have the same meaning. I am getting to think that we could just use
> the lower-case characters for the keywords any/first. Is this -i
> switch a problem for elements in standby_list?

Actually, the standby name strings themselves are not changed; it is only
the parser that no longer distinguishes between "NODE" and "node".
The values used for matching application_name will still work fine.
If we want to use "first" or "any" as a standby name, it has to be
double-quoted.
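
For example (hypothetical server names), if I understand the patched
grammar correctly, a setting like the following would still treat "first"
and "any" as ordinary standby names because they are double-quoted:

synchronous_standby_names = 'FIRST 2 ("first", "any", node_c)'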

Re: [HACKERS] Autovacuum launcher process launches worker process at high frequency

2016-10-10 Thread Masahiko Sawada
On Thu, Oct 6, 2016 at 12:11 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Wed, Oct 5, 2016 at 7:28 AM, Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> I found the kind of strange behaviour of the autovacuum launcher
>> process when XID anti-wraparound vacuum.
>>
>> Suppose that a database (say test_db) whose age of frozenxid is about
>> to reach max_autovacuum_max_age has three tables T1 and T2.
>> T1 is very large and is frequently updated, so vacuum takes long time
>> for vacuum.
>> T2 is static and already frozen table, thus vacuum can skip to vacuum
>> whole table.
>> And anti-wraparound vacuum was already executed on other databases.
>>
>> Once the age of datfrozenxid of test_db exceeded
>> max_autovacuum_max_age, autovacuum launcher launches worker process in
>> order to do anti-wraparound vacuum on testdb.
>> A worker process assigned to test_db begins to vacuum T1, it takes long
>> time.
>> Meanwhile another worker process is assigned to test_db and completes
>> to vacuum on T2 and exits.
>>
>> After for while, the autovacuum launcher launches new worker again and
>> worker is assigned to test_db again.
>> But that worker exits quickly because there is no table we need to
>> vacuum. (T1 is being vacuumed by another worker process).
>> When new worker process starts, worker process sends SIGUSR2 signal to
>> launcher process to wake up him.
>> Although the launcher process executes WaitLatch() after launched new
>> worker, it is woken up and launches another new worker process soon
>> again.
>
>
> See also this thread, which was never resolved:
>
> https://www.postgresql.org/message-id/flat/CAMkU%3D1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta%3DYPyFPQ%40mail.gmail.com#CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=ypy...@mail.gmail.com
>
>
>
>>
>> As a result, launcher process launches new worker process at extremely
>> high frequency regardless of autovacuum_naptime, which increase cpu
>> use rate.
>>
>> Why does auto vacuum worker need to wake up launcher process after
>> started?
>>
>> autovacuum.c:L1604
>> /* wake up the launcher */
>> if (AutoVacuumShmem->av_launcherpid != 0)
>>     kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);
>
>
>
> I think that that is so that the launcher can launch multiple workers in
> quick succession if it has fallen behind schedule. It can't launch them in a
> tight loop, because its signals to the postmaster would get merged into one
> signal, so it has to wait for one to get mostly set-up before launching the
> next.
>
> But it doesn't make any real difference to your scenario, as the short-lived
> worker will wake the launcher up a few microseconds later anyway, when it
> realizes it has no work to do and so exits.
>

Thank you for the reply.

I also thought it would be better to have information about how many
tables in each database have not been vacuumed yet.
But I'm not sure how to implement that, and the current optimistic
logic is safer in most situations.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] VACUUM's ancillary tasks

2016-10-07 Thread Masahiko Sawada
On Fri, Oct 7, 2016 at 6:46 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Oct 6, 2016 at 3:56 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>> Sure, I could handle each case separately, but the goal of this patch
>>> (as hinted at in the Subject) is to generalize all the different tasks
>>> we've been giving to VACUUM.  The only missing piece is what the first
>>> patch addresses; which is insert-mostly tables never getting vacuumed.
>>
>> I don't buy the argument that we should do this in order to be "general".
>> Those other pieces are present to achieve a specific job, not out of
>> generality.
>
> +1.
>
>> If we want to have something to vacuum insert-mostly tables for the sake of
>> the index-only-scans, then I don't think we can ignore the fact that the
>> visibility map is page based, not tuple based.  If we insert 10,000 tuples
>> into a full table and they all go into 100 pages at the end, that is very
>> different than inserting 10,000 tuples which each go into a separate page
>> (because each page has that amount of freespace).  In one case you have
>> 10,000 tuples not marked as all visible, in the other case you have
>> 1,000,000 not marked as all visible.
>
> +1.
>
>> Also, I think that doing more counts which get amalgamated into the same
>> threshold, rather than introducing another parameter, is a bad thing.  I
>> have insert-mostly, most of the time, tables which are never going to
>> benefit from index-only-scans, and I don't want to pay the cost of them
>> getting vacuumed just because of some inserts, when I will never get a
>> benefit out of it.  I could turn autovacuum off for those tables, but then I
>> would have to remember to manually intervene on the rare occasion they do
>> get updates or deletes.  I want to be able to manually pick which tables I
>> sign up for this feature (and then forget it), not manually pick which ones
>> to exempt from it and have to constantly review that.
>
> Of course, if you do that, then what will happen is eventually it will
> be time to advance relfrozenxid for that relation, and you'll get a
> giant soul-crushing VACUUM of doom, rather than a bunch of small,
> hopefully-innocuous VACUUMs.  I've been wondering if would make sense
> to trigger vacuums based on inserts based on a *fixed* number of
> pages, rather than a percentage of the table.  Say, for example,
> whenever we have 64MB of not-all-visible pages, we VACUUM.

+1

>
> There are some difficulties:
>
> 1. We don't want to vacuum too early.  For example, a bulk load may
> vastly inflate the size of a table, but vacuuming it while the load is
> in progress will be useless.  You can maybe avoid that problem by
> basing this on statistics only reported at the end of the transaction,
> but there's another, closely-related issue: vacuuming right after the
> transaction completes may be useless, too, if there are old snapshots
> still in existence that render the newly-inserted tuples
> not-all-visible.

If a dummy XID could be taken before starting such a vacuum (a
maintenance vacuum for the visibility map, triggered by the amount of
cleared VM pages), that vacuum could wait for the old transactions to
finish.


> 2. We don't want to trigger index vacuuming for a handful of dead
> tuples, or at least not too often.  I've previously suggested
> requiring a certain minimum number of dead index tuples that would be
> required before index vacuuming occurs; prior to that, we'd just
> truncate to dead line pointers and leave those for the next vacuum
> cycle.  This might need some refinement - should it be page-based? -
> but the basic idea still seems sound.
>

Where would we leave those dead line pointers?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] VACUUM's ancillary tasks

2016-10-07 Thread Masahiko Sawada
On Fri, Oct 7, 2016 at 9:40 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Thu, Oct 6, 2016 at 2:46 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>>
>> > Also, I think that doing more counts which get amalgamated into the same
>> > threshold, rather than introducing another parameter, is a bad thing.  I
>> > have insert-mostly, most of the time, tables which are never going to
>> > benefit from index-only-scans, and I don't want to pay the cost of them
>> > getting vacuumed just because of some inserts, when I will never get a
>> > benefit out of it.  I could turn autovacuum off for those tables, but
>> > then I would have to remember to manually intervene on the rare occasion
>> > they do get updates or deletes.  I want to be able to manually pick which
>> > tables I sign up for this feature (and then forget it), not manually pick
>> > which ones to exempt from it and have to constantly review that.
>>
>> Of course, if you do that, then what will happen is eventually it will
>> be time to advance relfrozenxid for that relation, and you'll get a
>> giant soul-crushing VACUUM of doom, rather than a bunch of small,
>> hopefully-innocuous VACUUMs.
>
>
> I think I will get the soul-crushing vacuum of doom anyway, if the database
> lasts that long.  For one reason, in my case while updates and deletes are
> rare, they are common enough that the frozen vm bits from early vacuums will
> be mostly cleared before the vacuum of doom comes around.
>
> For a second reason, I don't think the frozen bit in the vm actually gets
> set by much of anything other than an autovacuum-for-wraparound or a manual
> VACUUM FREEZE.

Yeah, the freeze map is especially effective for static tables.
Since we can skip vacuuming frozen pages, and freezing tuples is usually
not a big pain, we might want to make autovacuum freeze pages more
aggressively by reducing the default value of vacuum_freeze_min_age.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-06 Thread Masahiko Sawada
On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>>>
>>> No, the COMMIT returns after the first phase. It can not wait for all
>>> the foreign servers to complete their second phase
>>
>> Hm, it sounds like it's same as normal commit (not 2PC).
>> What's the difference?
>>
>> My understanding is that basically the local server can not return
>> COMMIT to the client until 2nd phase is completed.
>
>
> If we do that, the local server may not return to the client at all,
> if the foreign server crashes and never comes up. Practically, it may
> take much longer to finish a COMMIT, depending upon how long it takes
> for the foreign server to reply to a COMMIT message.

Yes, I think that is how 2PC behaves; please refer to [1].
To prevent the local server from waiting forever due to a communication
failure, we could provide a timeout on the coordinator side or on the
participant side.

>
>> Otherwise the next transaction can see data that is not committed yet
>> on remote server.
>
> 2PC doesn't guarantee transactional consistency all by itself. It only
> guarantees that all legs of a distributed transaction are either all
> rolled back or all committed. IOW, it guarantees that a distributed
> transaction is not rolled back on some nodes and committed on the
> other node.
> Providing a transactionally consistent view is a very hard problem.
> Trying to solve all those problems in a single patch would be very
> difficult and the amount of changes required may be really huge. Then
> there are many possible consistency definitions when it comes to
> consistency of distributed system. I have not seen a consensus on what
> kind of consistency model/s we want to support in PostgreSQL. That's
> another large debate. We have had previous attempts where people have
> tried to complete everything in one go and nothing has been completed
> yet.

Yes, providing atomic visibility is a hard problem, and it's a separate
issue [2].

> 2PC implementation OR guaranteeing that all the legs of a transaction
> commit or roll back, is an essential block of any kind of distributed
> transaction manager. So, we should at least support that one, before
> attacking further problems.

I agree.

[1] https://en.wikipedia.org/wiki/Two-phase_commit_protocol
[2] http://www.bailis.org/papers/ramp-sigmod2014.pdf

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




[HACKERS] Autovacuum launcher process launches worker process at high frequency

2016-10-05 Thread Masahiko Sawada
Hi all,

I found somewhat strange behaviour of the autovacuum launcher process
during XID anti-wraparound vacuum.

Suppose that a database (say test_db), whose datfrozenxid age is about
to reach autovacuum_freeze_max_age, has two tables, T1 and T2.
T1 is very large and frequently updated, so vacuuming it takes a long time.
T2 is a static, already-frozen table, so vacuum can skip scanning the
whole table.
Anti-wraparound vacuum has already been executed on the other databases.

Once the age of test_db's datfrozenxid exceeds autovacuum_freeze_max_age,
the autovacuum launcher launches a worker process to run anti-wraparound
vacuum on test_db.
A worker process assigned to test_db begins to vacuum T1, which takes a
long time.
Meanwhile, another worker process is assigned to test_db, finishes
vacuuming T2, and exits.

After a while, the autovacuum launcher launches a new worker again, and
that worker is assigned to test_db again.
But it exits quickly because there is no table left that needs vacuuming
(T1 is still being vacuumed by the other worker process).
When a new worker process starts, it sends SIGUSR2 to the launcher
process to wake it up.
So although the launcher process executes WaitLatch() after launching the
new worker, it is woken up immediately and soon launches yet another
worker process.
As a result, the launcher launches new worker processes at extremely high
frequency regardless of autovacuum_naptime, which increases CPU usage.

Why does an autovacuum worker need to wake up the launcher process after
it has started?

autovacuum.c:L1604
    /* wake up the launcher */
    if (AutoVacuumShmem->av_launcherpid != 0)
        kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-04 Thread Masahiko Sawada
On Tue, Oct 4, 2016 at 8:29 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Tue, Oct 4, 2016 at 1:11 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2016/10/04 16:10, Ashutosh Bapat wrote:
>>>>> Heuristics can not become the default behavior. A user should be given
>>>>> an option to choose a heuristic, and he should be aware of the
>>>>> pitfalls when using this heuristic. I guess, first, we need to get a
>>>>> solution which ensures that the transaction gets committed on all the
>>>>> servers or is rolled back on all the foreign servers involved. AFAIR,
>>>>> my patch did that. Once we have that kind of solution, we can think
>>>>> about heuristics.
>>>>
>>>> I meant that we could determine it heuristically only when remote server
>>>> crashed in 2nd phase of 2PC.
>>>> For example, what does the local server returns to client when no one 
>>>> remote
>>>> server returns OK to local server in 2nd phase of 2PC for more than
>>>> statement_timeout seconds? Ok or error?
>>>>
>>>
>>> The local server doesn't wait for the completion of the second phase
>>> to finish the currently running statement. Once all the foreign
>>> servers have responded to PREPARE request in the first phase, the
>>> local server responds to the client. Am I missing something?
>>
>> PREPARE sent to foreign servers involved in a given transaction is
>> *transparent* to the user who started the transaction, no?  That is, user
>> just says COMMIT and if it is found that there are multiple servers
>> involved in the transaction, it must be handled using two-phase commit
>> protocol *behind the scenes*.  So the aforementioned COMMIT should not
>> return to the client until after the above two-phase commit processing has
>> finished.
>
> No, the COMMIT returns after the first phase. It can not wait for all
> the foreign servers to complete their second phase

Hmm, that sounds the same as a normal commit (not 2PC).
What's the difference?

My understanding is that, basically, the local server cannot return
COMMIT to the client until the 2nd phase is completed.
Otherwise the next transaction could see data that is not yet committed
on the remote server.

> , which can take
> quite long (or never) if one of the servers has crashed in between.
>
>>
>> Or are you and Sawada-san talking about the case where the user issued
>> PREPARE and not COMMIT?
>
> I guess, Sawada-san is still talking about the user issued PREPARE.
> But my comment is applicable otherwise as well.
>

Yes, I'm considering the case where the local server tries to COMMIT but
a remote server crashes after the local server has completed the 1st
phase (PREPARE) on all of the remote servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




[HACKERS] Transactions involving multiple postgres foreign servers

2016-10-04 Thread Masahiko Sawada
On Tue, Oct 4, 2016 at 1:26 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>>>
>>> Why always rollback any dangling transaction? There can be a case that
>>> a foreign server has a dangling transaction which needs to be
>>> committed because the portions of that transaction on the other shards
>>> are committed.
>>
>> Right, we can heuristically make a decision whether we do COMMIT or
>> ABORT on local server.
>> For example, if COMMIT PREPARED succeeded on at least one foreign
>> server, the local server return OK to client and the other dangling
>> transactions should be committed later.
>> We can find out that we should do either commit or abort the dangling
>> transaction by checking CLOG.
>
> Heuristics can not become the default behavior. A user should be given
> an option to choose a heuristic, and he should be aware of the
> pitfalls when using this heuristic. I guess, first, we need to get a
> solution which ensures that the transaction gets committed on all the
> servers or is rolled back on all the foreign servers involved. AFAIR,
> my patch did that. Once we have that kind of solution, we can think
> about heuristics.

I meant that we could determine it heuristically only when a remote server
crashes in the 2nd phase of 2PC.
For example, what should the local server return to the client when none
of the remote servers has returned OK in the 2nd phase of 2PC for more
than statement_timeout seconds? OK or an error?

>>
>> But we need to handle the case where the CLOG file containing XID
>> necessary for resolving dangling transaction is truncated.
>> If the user does VACUUM FREEZE just after remote server crashed, it
>> could be truncated.
>
> Hmm, this needs to be fixed. Even my patch relied on XID to determine
> whether the transaction committed or rolled back locally and thus to
> decide whether it should be committed or rolled back on all the
> foreign servers involved. I think I had taken care of the issue you
> have pointed out here. Can you please verify the same?
>
>>
>>> The way gid is crafted, there is no way to check whether the given
>>> prepared transaction was created by the local server or not. Probably
>>> the local server needs to add a unique signature in GID to identify
>>> the transactions prepared by itself. That signature should be
>>> transferred to standby to cope up with the fail-over of local server.
>>
>> Maybe we can use database system identifier in control file.
>
> may be.
>
>>
>>> In this idea, one has to keep on polling the foreign server to find
>>> any dangling transactions. In usual scenario, we shouldn't have a
>>> large number of dangling transactions, and thus periodic polling might
>>> be a waste.
>>
>> We can optimize it by storing the XID that is resolved heuristically
>> into the control file or system catalog, for example.
>>
>
> There will be many such XIDs. We don't want to dump so many things in
> control file, esp. when that's not control data. System catalog is out
> of question since a rollback of local transaction would make those
> rows in the system catalog invisible. That's the reason, why I chose
> to write the foreign prepared transactions to files rather than a
> system catalog.
>

We could store the lowest in-doubt transaction ID (say, the in-doubt XID)
that needs to be resolved later in the control file, and never truncate
the CLOG segments containing XIDs greater than the in-doubt XID.
We would need to try to resolve such transactions only when the in-doubt
XID is not NULL.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-03 Thread Masahiko Sawada
On Wed, Sep 28, 2016 at 3:30 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Wed, Sep 28, 2016 at 10:43 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>>> wrote:
>>>> On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
>>>> <ashutosh.ba...@enterprisedb.com> wrote:
>>>>> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>>>>> wrote:
>>>>>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
>>>>>> <ashutosh.ba...@enterprisedb.com> wrote:
>>>>>>> My original patch added code to manage the files for 2 phase
>>>>>>> transactions opened by the local server on the remote servers. This
>>>>>>> code was mostly inspired from the code in twophase.c which manages the
>>>>>>> file for prepared transactions. The logic to manage 2PC files has
>>>>>>> changed since [1] and has been optimized. One of the things I wanted
>>>>>>> to do is see, if those optimizations are applicable here as well. Have
>>>>>>> you considered that?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Yeah, we're considering it.
>>>>>> After these changes are committed, we will post the patch incorporated
>>>>>> these changes.
>>>>>>
>>>>>> But what we need to do first is the discussion in order to get consensus.
>>>>>> Since current design of this patch is to transparently execute DCL of
>>>>>> 2PC on foreign server, this code changes lot of code and is
>>>>>> complicated.
>>>>>
>>>>> Can you please elaborate. I am not able to understand what DCL is
>>>>> involved here. According to [1], examples of DCL are GRANT and REVOKE
>>>>> command.
>>>>
>>>> I meant transaction management command such as PREPARE TRANSACTION and
>>>> COMMIT/ABORT PREPARED command.
>>>> The web page I refered might be wrong, sorry.
>>>>
>>>>>> Another approach I have is to push down DCL to only foreign servers
>>>>>> that support 2PC protocol, which is similar to DML push down.
>>>>>> This approach would be more simpler than current idea and is easy to
>>>>>> use by distributed transaction manager.
>>>>>
>>>>> Again, can you please elaborate, how that would be different from the
>>>>> current approach and how does it simplify the code.
>>>>>
>>>>
>>>> The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK
>>>> PREPARED to foreign servers that support 2PC.
>>>> With this idea, the client need to do following operation when foreign
>>>> server is involved with transaction.
>>>>
>>>> BEGIN;
>>>> UPDATE parent_table SET ...; -- update including foreign server
>>>> PREPARE TRANSACTION 'xact_id';
>>>> COMMIT PREPARED 'xact_id';
>>>>
>>>> The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed
>>>> down to foreign server.
>>>> That is, the client needs to execute PREPARE TRANSACTION and
>>>>
>>>> In this idea, I think that we don't need to do followings,
>>>>
>>>> * Providing the prepare id of 2PC.
>>>>   Current patch adds new API prepare_id_provider() but we can use the
>>>> prepare id of 2PC that is used on parent server.
>>>>
>>>> * Keeping track of status of foreign servers.
>>>>   Current patch keeps track of status of foreign servers involved with
>>>> transaction but this idea is just to push down transaction management
>>>> command to foreign server.
>>>>   So I think that we no longer need to do that.
>>>
>>>> COMMIT/ROLLBACK PREPARED explicitly.
>>>
>>> The problem with this approach is same as one previously stated. If
>>> the connection between local and foreign server is lost between
>>> PREPARE and COMMIT the prepared transaction on the foreign server
>>> remains dangling, none other than the local server knows what to do
>>> with it and the local server has lost track of the prepared
>>> transaction on the foreign server. So, just pushing down those
>>> commands doesn't work.

Re: [HACKERS] [GENERAL] pg_upgrade from 9.5 to 9.6 fails with "invalid argument"

2016-09-30 Thread Masahiko Sawada
On Fri, Sep 30, 2016 at 2:40 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Fri, Sep 30, 2016 at 1:26 PM, Kyotaro HORIGUCHI
> <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>> Hello,
>>
>> At Fri, 30 Sep 2016 13:11:21 +0900, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote in 

Re: [HACKERS] [GENERAL] pg_upgrade from 9.5 to 9.6 fails with "invalid argument"

2016-09-30 Thread Masahiko Sawada
On Fri, Sep 30, 2016 at 6:40 PM, Thomas Kellerer <spam_ea...@gmx.net> wrote:
> Tom Lane wrote on 2016-09-29 at 23:10:
>> Thomas Kellerer <spam_ea...@gmx.net> writes:
>>> for some reason pg_upgrade failed on Windows 10 for me, with an error 
>>> message that one specific _vm file couldn't be copied.
>>
>> Hmm ... a _vm file would go through rewriteVisibilityMap(), which is new
>> code for 9.6 and hasn't really gotten that much testing.  Its error
>> reporting is shamefully bad --- you can't tell which step failed, and
>> I wouldn't even put a lot of faith in the errno being meaningful,
>> considering that it does close() calls before capturing the errno.
>>
>> But what gets my attention in this connection is that it doesn't
>> seem to be taking the trouble to open the files in binary mode.
>> Could that lead to the reported failure?  Not sure, but it seems
>> like at the least it could result in corrupted VM files.
>
> I did this on two different computers, one with Windows 10 the other with 
> Windows 7.
> (only test-databases, so no real issue anyway)
>
> In both cases running a "vacuum full" for the table in question fixed the 
> problem and pg_upgrade finished without problems.

Because VACUUM FULL removes the _vm file, pg_upgrade completed its job
successfully.
If you still have the _vm file
("d:/Daten/db/pgdata95/base/16410/85358_vm") that led to the error, could
you check whether there are '\r\n' [0d 0a] byte sequences in that _vm
file, or share that _vm file with us?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] [GENERAL] pg_upgrade from 9.5 to 9.6 fails with "invalid argument"

2016-09-29 Thread Masahiko Sawada
On Fri, Sep 30, 2016 at 1:26 PM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello,
>
> At Fri, 30 Sep 2016 13:11:21 +0900, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote in 

Re: [HACKERS] [GENERAL] pg_upgrade from 9.5 to 9.6 fails with "invalid argument"

2016-09-29 Thread Masahiko Sawada
On Fri, Sep 30, 2016 at 7:08 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> I wouldn't even put a lot of faith in the errno being meaningful,
>>> considering that it does close() calls before capturing the errno.
>
>> So we do close() in a bunch of places while closing shop, which calls
>> _close() on Windows; this function sets errno.
>
> But only on failure, no?  The close()s usually shouldn't fail, and
> therefore shouldn't change errno, it's just that you can't trust that
> 100%.
>
> I think likely what's happening is that we're seeing a leftover value from
> some previous syscall that set GetLastError's result (and, presumably,
> wasn't fatal so far as pg_upgrade was concerned).
>

Does it mean that the read() syscall could not read BLCKSZ bytes, probably
because the _vm file is corrupted?

> if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)

Could it happen that the read() syscall stopped reading at the point where
it hit bytes representing '\n' characters (0x5c6e), because we don't open
the file with the O_BINARY flag?
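
For reference, a minimal sketch (not the actual pg_upgrade code) of what
opening in binary mode would look like.  Without binary mode, a text-mode
read() on Windows translates \r\n to \n and treats 0x1A as end-of-file,
so a BLCKSZ-sized read can come up short:

    /* hypothetical fix: open in binary mode so read() returns raw bytes;
     * PG_BINARY (from c.h) is O_BINARY on Windows and 0 elsewhere */
    src_fd = open(fromfile, O_RDONLY | PG_BINARY, 0);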

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-09-27 Thread Masahiko Sawada
On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>>> wrote:
>>>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
>>>> <ashutosh.ba...@enterprisedb.com> wrote:
>>>>> My original patch added code to manage the files for 2 phase
>>>>> transactions opened by the local server on the remote servers. This
>>>>> code was mostly inspired from the code in twophase.c which manages the
>>>>> file for prepared transactions. The logic to manage 2PC files has
>>>>> changed since [1] and has been optimized. One of the things I wanted
>>>>> to do is see, if those optimizations are applicable here as well. Have
>>>>> you considered that?
>>>>>
>>>>>
>>>>
>>>> Yeah, we're considering it.
>>>> After these changes are committed, we will post the patch incorporated
>>>> these changes.
>>>>
>>>> But what we need to do first is the discussion in order to get consensus.
>>>> Since current design of this patch is to transparently execute DCL of
>>>> 2PC on foreign server, this code changes lot of code and is
>>>> complicated.
>>>
>>> Can you please elaborate. I am not able to understand what DCL is
>>> involved here. According to [1], examples of DCL are GRANT and REVOKE
>>> command.
>>
>> I meant transaction management command such as PREPARE TRANSACTION and
>> COMMIT/ABORT PREPARED command.
>> The web page I refered might be wrong, sorry.
>>
>>>> Another approach I have is to push down DCL to only foreign servers
>>>> that support 2PC protocol, which is similar to DML push down.
>>>> This approach would be more simpler than current idea and is easy to
>>>> use by distributed transaction manager.
>>>
>>> Again, can you please elaborate, how that would be different from the
>>> current approach and how does it simplify the code.
>>>
>>
>> The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK
>> PREPARED to foreign servers that support 2PC.
>> With this idea, the client need to do following operation when foreign
>> server is involved with transaction.
>>
>> BEGIN;
>> UPDATE parent_table SET ...; -- update including foreign server
>> PREPARE TRANSACTION 'xact_id';
>> COMMIT PREPARED 'xact_id';
>>
>> The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed
>> down to foreign server.
>> That is, the client needs to execute PREPARE TRANSACTION and
>>
>> In this idea, I think that we don't need to do followings,
>>
>> * Providing the prepare id of 2PC.
>>   Current patch adds new API prepare_id_provider() but we can use the
>> prepare id of 2PC that is used on parent server.
>>
>> * Keeping track of status of foreign servers.
>>   Current patch keeps track of status of foreign servers involved with
>> transaction but this idea is just to push down transaction management
>> command to foreign server.
>>   So I think that we no longer need to do that.
>
>> COMMIT/ROLLBACK PREPARED explicitly.
>
> The problem with this approach is same as one previously stated. If
> the connection between local and foreign server is lost between
> PREPARE and COMMIT the prepared transaction on the foreign server
> remains dangling, none other than the local server knows what to do
> with it and the local server has lost track of the prepared
> transaction on the foreign server. So, just pushing down those
> commands doesn't work.

Yeah, my idea is one of the first step.
Mechanism that resolves the dangling foreign transaction and the
resolver worker process are necessary.

>>
>> * Adding max_prepared_foreign_transactions parameter.
>>   It means that the number of transaction involving foreign server is
>> the same as max_prepared_transactions.
>>
>
> That isn't true exactly. max_prepared_foreign_transactions indicates
> how many transactions can be prepared on the foreign server, which in
> the method you propose should have a cap of max_prepared_transactions
> * number of foreign servers.

Oh, I understood, thanks.

Consider sharding solution using postgres_fdw (that is, the parent
postgres server has multiple shard postgres servers

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-09-27 Thread Masahiko Sawada
On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> My original patch added code to manage the files for 2 phase
>>> transactions opened by the local server on the remote servers. This
>>> code was mostly inspired from the code in twophase.c which manages the
>>> file for prepared transactions. The logic to manage 2PC files has
>>> changed since [1] and has been optimized. One of the things I wanted
>>> to do is see, if those optimizations are applicable here as well. Have
>>> you considered that?
>>>
>>>
>>
>> Yeah, we're considering it.
>> After these changes are committed, we will post the patch incorporated
>> these changes.
>>
>> But what we need to do first is the discussion in order to get consensus.
>> Since current design of this patch is to transparently execute DCL of
>> 2PC on foreign server, this code changes lot of code and is
>> complicated.
>
> Can you please elaborate. I am not able to understand what DCL is
> involved here. According to [1], examples of DCL are GRANT and REVOKE
> command.

I meant transaction management commands such as PREPARE TRANSACTION and
COMMIT/ABORT PREPARED.
The web page I referred to might have been wrong, sorry.

>> Another approach I have is to push down DCL to only foreign servers
>> that support 2PC protocol, which is similar to DML push down.
>> This approach would be more simpler than current idea and is easy to
>> use by distributed transaction manager.
>
> Again, can you please elaborate, how that would be different from the
> current approach and how does it simplify the code.
>

The idea is just to push PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
down to foreign servers that support 2PC.
With this idea, the client needs to do the following when a foreign
server is involved in the transaction:

BEGIN;
UPDATE parent_table SET ...; -- update including foreign server
PREPARE TRANSACTION 'xact_id';
COMMIT PREPARED 'xact_id';

The above PREPARE TRANSACTION and COMMIT PREPARED commands are pushed
down to the foreign servers.
That is, the client needs to execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED explicitly.
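
To make the failure window discussed elsewhere in this thread (a
connection lost between PREPARE and COMMIT) concrete, here is a
hypothetical sketch with a single foreign server:

 BEGIN;
 UPDATE parent_table SET ...;        -- update touching the foreign server
 PREPARE TRANSACTION 'xact_id';      -- pushed down; the foreign server now
                                     -- holds a prepared transaction
 -- connection to the foreign server is lost here
 COMMIT PREPARED 'xact_id';          -- cannot reach the foreign server

 -- Later, directly on the foreign server, the dangling transaction is
 -- still visible and has to be resolved by hand (or by a resolver
 -- process):
 SELECT gid, prepared, owner FROM pg_prepared_xacts;
 COMMIT PREPARED 'xact_id';          -- or ROLLBACK PREPARED 'xact_id'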

With this idea, I think we would no longer need the following:

* Providing the prepare ID for 2PC.
  The current patch adds a new API, prepare_id_provider(), but we could
use the prepare ID of the 2PC that is used on the parent server.

* Keeping track of the status of foreign servers.
  The current patch keeps track of the status of the foreign servers
involved in a transaction, but this idea just pushes the transaction
management commands down to the foreign servers.
  So I think that we would no longer need to do that.

* Adding the max_prepared_foreign_transactions parameter.
  It would mean that the number of transactions involving foreign servers
is the same as max_prepared_transactions.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-09-26 Thread Masahiko Sawada
On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> My original patch added code to manage the files for 2 phase
> transactions opened by the local server on the remote servers. This
> code was mostly inspired from the code in twophase.c which manages the
> file for prepared transactions. The logic to manage 2PC files has
> changed since [1] and has been optimized. One of the things I wanted
> to do is see, if those optimizations are applicable here as well. Have
> you considered that?
>
>

Yeah, we're considering it.
After those changes are committed, we will post a patch that incorporates
them.

But what we need to do first is have the discussion in order to reach a
consensus.
Since the current design of this patch transparently executes the DCL of
2PC on the foreign servers, it changes a lot of code and is complicated.
Another approach I have is to push the DCL down only to foreign servers
that support the 2PC protocol, which is similar to DML push-down.
This approach would be simpler than the current idea and easy to use
from a distributed transaction manager.
I think that would be a good place to start.

I'd like to discuss what the best approach is for transactions involving
foreign servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-24 Thread Masahiko Sawada
On Wed, Sep 21, 2016 at 11:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Sep 21, 2016 at 5:54 AM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>>>> Reading again the thread, it seems that my previous post [1] was a bit
>>>> misunderstood. My position is to not introduce any new behavior
>>>> changes in 9.6, so we could just make the FIRST NUM grammar equivalent
>>>> to NUM.
>>>>
>>>> [1]: 
>>>> https://www.postgresql.org/message-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4njgmmgiagvcs...@mail.gmail.com
>>>
>>> I misunderstood your intent, then.  But I still stand by what I did
>>> understand, namely that 'k (...)'  should mean 'any k (...)'.  It's much
>>> more natural than having it mean 'first k (...)' and I also think it
>>> will be more frequent in practice.
>>>
>>
>> I think so as well.
>
> Well, I agree, but I think making behavior changes after rc1 is a
> non-starter.  It's better to live with the incompatibility than to
> change the behavior so close to release.  At least, that's my
> position.  Getting the release out on time with a minimal bug count is
> more important to me than a minor incompatibility in the meaning of
> one GUC.
>

As the release team announced, it's better to postpone changing the
syntax of the existing s_s_names.
I still vote for changing the behaviour of the existing syntax
'k (n1, n2)' to quorum commit.
That is,
1.  'First k (n1, n2, n3)' means that the master server waits for ACKs
from the k standby servers whose names appear earliest in the list.
2.  'Any k (n1, n2, n3)' means that the master server waits for ACKs
from any k of the listed standby servers.
3.  'n1, n2, n3' is the same as #1 with k=1.
4.  '(n1, n2, n3)' is the same as #2 with k=1.
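
Spelled out as (hypothetical) settings, the four forms would be:

synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'  # 1: wait for the two highest-priority standbys
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'    # 2: wait for any two of the listed standbys
synchronous_standby_names = 's1, s2, s3'            # 3: same as FIRST 1 (s1, s2, s3)
synchronous_standby_names = '(s1, s2, s3)'          # 4: same as ANY 1 (s1, s2, s3)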

Attached updated patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


quorum_commit_v3.patch
Description: Binary data



Re: [HACKERS] pg_ctl promote wait

2016-09-22 Thread Masahiko Sawada
On Thu, Sep 22, 2016 at 1:25 AM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 8/11/16 9:28 AM, Michael Paquier wrote:
>> I have looked at them and the changes are looking fine for me. So I
>> have switched the patch as ready for committer, aka you.
>>
>> Just a nit:
>> +   if (wait_seconds > 0)
>> +   {
>> +   sleep(1);
>> +   wait_seconds--;
>> +   continue;
>> +   }
>> This might be better with pg_usleep() instead of sleep().
>
> Committed with that adjustment.
>

Commit e7010ce seems to have missed updating the help message of pg_ctl.
Patch attached.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_pg_ctl_help.patch
Description: Binary data



Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-16 Thread Masahiko Sawada
On Fri, Sep 9, 2016 at 6:23 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 8 September 2016 at 10:26, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
>> "k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
>> compatibility but most users would think "k(n1, n2, n3)" as quorum
>> after introduced quorum.
>> I wish we can change the s_s_names syntax of 9.6 to "first k(n1, n2,
>> n3)" style before 9.6 releasing if we got consensus.
>
> Let's see the proposed patch, so we can evaluate the proposal.
>

I've attached 2 patches.
The 000 patch changes the syntax of s_s_names from 'k (n1, n2, n3)' to
'First k (n1, n2, n3)' for PG 9.6.
The 001 patch adds quorum commit using the syntax 'Any k (n1, n2, n3)'
for PG 10.

Since we have already released 9.6RC1, I understand that it's quite hard
to change the syntax in 9.6.
But considering that we are going to support quorum commit, this could be
one of the solutions that avoids breaking backward compatibility while
providing a useful user interface.
So I attached these patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
commit bd18dda9be5ab0341eca81de3c48ec6f7466dded
Author: Masahiko Sawada <sawada.m...@gmail.com>
Date:   Fri Sep 16 15:32:24 2016 -0700

Change syntax

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cd66abc..f0f510c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3037,7 +3037,7 @@ include_dir 'conf.d'
 This parameter specifies a list of standby servers using
 either of the following syntaxes:
 
-num_sync ( standby_name [, ...] )
+FIRST num_sync ( standby_name [, ...] )
 standby_name [, ...]
 
 where num_sync is
@@ -3048,7 +3048,9 @@ include_dir 'conf.d'
 3 (s1, s2, s3, s4) makes transaction commits wait
 until their WAL records are received by three higher-priority standbys
 chosen from standby servers s1, s2,
-s3 and s4.
+s3 and s4. FIRST is
+case-insensitive but the standby having name FIRST
+must be double-quoted.
 
 
 The second syntax was used before PostgreSQL
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..84ccb6e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
 An example of synchronous_standby_names for multiple
 synchronous standbys is:
 
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 
 In this example, if four standby servers s1, s2,
 s3 and s4 are running, the two standbys
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..b6d2f6c 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token  NAME NUM JUNK
+%token  NAME NUM JUNK FIRST
 
 %type  result standby_config
 %type  standby_list
@@ -61,7 +61,7 @@ result:
 
 standby_config:
 		standby_list{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $4); }
 	;
 
 standby_list:
@@ -70,7 +70,7 @@ standby_list:
 	;
 
 standby_name:
-		NAME		{ $$ = $1; }
+		NAME		{ $$ = $1;}
 		| NUM		{ $$ = $1; }
 	;
 %%
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..9dbdfbc 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -64,6 +64,11 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+first		{
+yylval.str = pstrdup(yytext);
+return FIRST;
+		}
+
 {xdstart}	{
 initStringInfo();
 BEGIN(xd);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f0f510c..0ad06ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3021,44 +3021,68 @@ include_dir 'conf.d'
 There will be one or more active synchronous standbys;
 transactions waiting for commit will be allowed to proceed after
 these standby servers confirm r

Re: [HACKERS] Block level parallel vacuum WIP

2016-09-16 Thread Masahiko Sawada
On Thu, Sep 15, 2016 at 11:44 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 7:21 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> I'm implementing this patch but I need to resolve the problem
>> regarding lock for extension by multiple parallel workers.
>> In parallel vacuum, multiple workers could try to acquire the
>> exclusive lock for extension on same relation.
>> Since acquiring the exclusive lock for extension by multiple workers
>> is regarded as locking from same locking group, multiple workers
>> extend fsm or vm at the same time and end up with error.
>> I thought that it might be involved with parallel update operation, so
>> I'd like to discuss about this in advance.
>
> Hmm, yeah.  This is one of the reasons why parallel queries currently
> need to be entirely read-only.  I think there's a decent argument that
> the relation extension lock mechanism should be entirely redesigned:
> the current system is neither particularly fast nor particularly
> elegant, and some of the services that the heavyweight lock manager
> provides, such as deadlock detection, are not relevant for relation
> extension locks.  I'm not sure if we should try to fix that right away
> or come up with some special-purpose hack for vacuum, such as having
> backends use condition variables to take turns calling
> visibilitymap_set().
>

Yeah, I don't have a good solution for this problem so far.
We might need to improve the group locking mechanism for updating
operations, or come up with another approach to resolve this problem.
For example, one possible idea is that the launcher process allocates
enough VM and FSM pages in advance so that parallel workers never need
to extend those relation forks, but that doesn't resolve the fundamental
problem.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Masahiko Sawada
On Thu, Sep 15, 2016 at 2:40 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 14 September 2016 at 11:19, Pavan Deolasee <pavan.deola...@gmail.com> 
> wrote:
>
>>>  In
>>> theory we could even start with the list of TIDs and switch to the
>>> bitmap if the TID list becomes larger than the bitmap would have been,
>>> but I don't know if it's worth the effort.
>>>
>>
>> Yes, that works too. Or may be even better because we already know the
>> bitmap size requirements, definitely for the tuples collected so far. We
>> might need to maintain some more stats to further optimise the
>> representation, but that seems like unnecessary detailing at this point.
>
> That sounds best to me... build the simple representation, but as we
> do maintain stats to show to what extent that set of tuples is
> compressible.
>
> When we hit the limit on memory we can then selectively compress
> chunks to stay within memory, starting with the most compressible
> chunks.
>
> I think we should use the chunking approach Robert suggests, though
> mainly because that allows us to consider how parallel VACUUM should
> work - writing the chunks to shmem. That would also allow us to apply
> a single global limit for vacuum memory rather than an allocation per
> VACUUM.
> We can then scan multiple indexes at once in parallel, all accessing
> the shmem data structure.
>

Yeah, the chunking approach Robert suggested seems like a good idea,
but with parallel vacuum in mind it would be more complicated, IMO.
I think it's better for each process to simply allocate its own memory
space than for a single process to allocate a huge memory space in a
complicated way.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Block level parallel vacuum WIP

2016-09-15 Thread Masahiko Sawada
On Sat, Sep 10, 2016 at 7:44 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Wed, Aug 24, 2016 at 3:31 AM, Michael Paquier <michael.paqu...@gmail.com>
> wrote:
>>
>> On Tue, Aug 23, 2016 at 10:50 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>> > On Tue, Aug 23, 2016 at 6:11 PM, Michael Paquier
>> > <michael.paqu...@gmail.com> wrote:
>> >> On Tue, Aug 23, 2016 at 8:02 PM, Masahiko Sawada
>> >> <sawada.m...@gmail.com> wrote:
>> >>> As for PoC, I implemented parallel vacuum so that each worker
>> >>> processes both 1 and 2 phases for particular block range.
>> >>> Suppose we vacuum 1000 blocks table with 4 workers, each worker
>> >>> processes 250 consecutive blocks in phase 1 and then reclaims dead
>> >>> tuples from heap and indexes (phase 2).
>> >>
>> >> So each worker is assigned a range of blocks, and processes them in
>> >> parallel? This does not sound performance-wise. I recall Robert and
>> >> Amit emails on the matter for sequential scan that this would suck
>> >> performance out particularly for rotating disks.
>> >>
>> >
>> > The implementation in patch is same as we have initially thought for
>> > sequential scan, but turned out that it is not good way to do because
>> > it can lead to inappropriate balance of work among workers.  Suppose
>> > one worker is able to finish it's work, it won't be able to do more.
>>
>> Ah, so it was the reason. Thanks for confirming my doubts on what is
>> proposed.
>> --
>
>
> I believe Sawada-san has got enough feedback on the design to work out the
> next steps. It seems natural that the vacuum workers are assigned a portion
> of the heap to scan and collect dead tuples (similar to what patch does) and
> the same workers to be responsible for the second phase of heap scan.

Yeah, thank you for the feedback.

> But as far as index scans are concerned, I agree with Tom that the best
> strategy is to assign a different index to each worker process and let them
> vacuum indexes in parallel.
> That way the work for each worker process is
> clearly cut out and they don't contend for the same resources, which means
> the first patch to allow multiple backends to wait for a cleanup buffer is
> not required. Later we could extend it further such multiple workers can
> vacuum a single index by splitting the work on physical boundaries, but even
> that will ensure clear demarkation of work and hence no contention on index
> blocks.

I also agree with this idea.
Each worker vacuums a different index, and then the leader process
should update all index statistics after parallel mode has exited.

I'm implementing this patch, but I need to resolve a problem regarding
the relation extension lock with multiple parallel workers.
In parallel vacuum, multiple workers could try to acquire the exclusive
extension lock on the same relation.
Since exclusive extension locks acquired by multiple workers are
regarded as coming from the same locking group, they don't conflict, so
multiple workers can extend the FSM or VM at the same time and end up
with an error.
I think this is also relevant to parallel update operations, so I'd
like to discuss it in advance.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] DISABLE_PAGE_SKIPPING option for vacuumdb

2016-09-13 Thread Masahiko Sawada
On Wed, Sep 14, 2016 at 2:17 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 8, 2016 at 1:32 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> Attached patch add --disable-page-skipping option to vacuumed command for 
>> 9.6.
>> If by any chance the freeze map causes the serious data corruption bug
>> then the user will want to run pg_check_visible() and vacuum with
>> DISABLE_PAGE_SKIPPING option to the multiple tables having problem.
>> In this case, I think that this option for vacuumdb would be convenient.
>>
>> Please give me feedback.
>
> I think this isn't really necessary.
>

Thank you for the comment.

Okay. I thought it would be useful for users who want to vacuum
specific tables with DISABLE_PAGE_SKIPPING, but it's non-essential.
I'll withdraw it. Thanks.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-09 Thread Masahiko Sawada
On Fri, Sep 9, 2016 at 12:33 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
>>
>>
>>
>> Making the vacuum possible to choose between two data representations
>> sounds good.
>> I implemented the patch that changes dead tuple representation to bitmap
>> before.
>> I will measure the performance of bitmap representation again and post
>> them.
>
>
> Sounds great! I haven't seen your patch, but what I would suggest is to
> compute page density (D) = relpages/(dead+live tuples) and experiment with
> bitmap of sizes of D to 2D bits per page. May I also suggest that instead of
> putting in efforts in implementing the overflow area,  just count how many
> dead TIDs would fall under overflow area for a given choice of bitmap size.
>

Isn't that formula "page density (D) = (dead+live tuples)/relpages"?

> It might be a good idea to experiment with different vacuum scale factor,
> varying between 2% to 20% (may be 2, 5, 10, 20). You can probably run a
> longish pgbench test on a large table and then save the data directory for
> repeated experiments, although I'm not sure if pgbench will be a good choice
> because HOT will prevent accumulation of dead pointers, in which case you
> may try adding another index on abalance column.

Thank you, I will experiment with this.

>
> It'll be worth measuring memory consumption of both representations as well
> as performance implications on index vacuum. I don't expect to see any major
> difference in either heap scans.
>

Yeah, it would mainly affect index vacuum speed and the number of
index vacuum passes.

The attached PoC patch changes the representation of dead tuple
locations to a hash map of per-block tuple bitmaps.
Each hash map entry consists of a block number and the TID bitmap of
the corresponding block, and the block number is the hash key.
The current implementation of this patch is not smart yet, because each
hash map entry allocates a tuple bitmap of fixed size
(LAZY_ALLOC_TUPLES), so each entry can store up to LAZY_ALLOC_TUPLES
(291 if the block size is 8kB) tuples.
When a block can hold only a few tens of tuples, most of those bits are
wasted.

After improving this patch as you suggested, I will measure the
performance benefit.
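To illustrate the intended O(1) lookup, here is a rough sketch (my own
code, not part of the attached patch) of a membership test against that
hash map; it assumes the DeadTupleMap/PageEntry structs and
BITS_PER_BITMAP_CHUNK from the patch below, that the block number alone
is the hash key, and the exact bit numbering is mine:

static bool
dead_tuple_map_lookup(DeadTupleMap *map, ItemPointer itemptr)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(itemptr);
    int         bit = ItemPointerGetOffsetNumber(itemptr) - 1; /* bit 0 == offset 1 */
    PageEntry  *entry;
    bool        found;

    entry = (PageEntry *) hash_search(map->pagetable, (void *) &blkno,
                                      HASH_FIND, &found);
    if (!found)
        return false;       /* no dead tuples recorded on this block */

    /* offsets beyond the fixed bitmap capacity are not handled in this sketch */
    return (entry->bm[bit / BITS_PER_BITMAP_CHUNK] &
            ((uint32) 1 << (bit % BITS_PER_BITMAP_CHUNK))) != 0;
}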

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From d1968d625ca1bb07711681a2d824737c53d27c8c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.m...@gmail.com>
Date: Thu, 8 Sep 2016 13:59:23 -0400
Subject: [PATCH] Collect dead tuple location as a bitmap.

---
 src/backend/commands/vacuumlazy.c | 232 +-
 1 file changed, 153 insertions(+), 79 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c 
b/src/backend/commands/vacuumlazy.c
index b5fb325..ce7bd7e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -98,6 +98,28 @@
  */
 #define SKIP_PAGES_THRESHOLD   ((BlockNumber) 32)
 
+#define BITS_PER_BITMAP_CHUNK 32
+#define BITMAP_CHUNKS_PER_PAGE \
+   ((int) ((LAZY_ALLOC_TUPLES / BITS_PER_BITMAP_CHUNK) + 1))
+#define BitoffsetToOffsetNumber(chunk, offset) \
+   (((chunk) * BITS_PER_BITMAP_CHUNK) + (offset) + 1)
+#define OffsetNumberToChunk(offset) \
+   ((offset) / BITS_PER_BITMAP_CHUNK)
+#define OffsetNumberToBitoffset(offset) \
+   ((offset) % BITS_PER_BITMAP_CHUNK)
+
+typedef struct PageEntry
+{
+   BlockNumber blockno;
+   uint32  bm[BITMAP_CHUNKS_PER_PAGE]; /* tuple bitmap of blockno */
+} PageEntry;
+
+typedef struct DeadTupleMap
+{
+   HTAB*pagetable;
+   int npages; /* # of blocks hashmap has */
+} DeadTupleMap;
+
 typedef struct LVRelStats
 {
/* hasindex = true means two-pass strategy; false means one-pass */
@@ -120,6 +142,9 @@ typedef struct LVRelStats
int num_dead_tuples;/* current # of entries 
*/
int max_dead_tuples;/* # slots allocated in 
array */
ItemPointer dead_tuples;/* array of ItemPointerData */
+
+   DeadTupleMap *dead_tuple_map; /* hash map if dead ItemPointerData */
+
int num_index_scans;
TransactionId latestRemovedXid;
boollock_waiter_detected;
@@ -148,20 +173,19 @@ static void lazy_vacuum_index(Relation indrel,
 static void lazy_cleanup_index(Relation indrel,
   IndexBulkDeleteResult *stats,
   LVRelStats *vacrelstats);
-static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-int tupindex, LVRelStats *vacrelstats, Buffer 
*vmbuffer);
+static void lazy_vacuum_page(Relation

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Masahiko Sawada
On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>>
>> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <st...@mit.edu> wrote:
>> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <si...@2ndquadrant.com>
>> > wrote:
>> >> On 6 September 2016 at 19:59, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> >>
>> >>> The idea of looking to the stats to *guess* about how many tuples are
>> >>> removable doesn't seem bad at all.  But imagining that that's going to
>> >>> be
>> >>> exact is folly of the first magnitude.
>> >>
>> >> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> >> so I think we agree that a reasonably accurate, yet imprecise
>> >> calculation is possible in most cases.
>> >
>> > That would all be well and good if it weren't trivial to do what
>> > Robert suggested. This is just a large unsorted list that we need to
>> > iterate throught. Just allocate chunks of a few megabytes and when
>> > it's full allocate a new chunk and keep going. There's no need to get
>> > tricky with estimates and resizing and whatever.
>>
>> I agree. While the idea of estimating the right size sounds promising
>> a priori, considering the estimate can go wrong and over or
>> underallocate quite severely, the risks outweigh the benefits when you
>> consider the alternative of a dynamic allocation strategy.
>>
>> Unless the dynamic strategy has a bigger CPU impact than expected, I
>> believe it's a superior approach.
>>
>
> How about a completely different representation for the TID array? Now this
> is probably not something new, but I couldn't find if the exact same idea
> was discussed before. I also think it's somewhat orthogonal to what we are
> trying to do here, and will probably be a bigger change. But I thought I'll
> mention since we are at the topic.
>
> What I propose is to use a simple bitmap to represent the tuples. If a tuple
> at <block, offset> is dead then the corresponding bit in the bitmap is set.
> So clearly the searches through dead tuples are O(1) operations, important
> for very large tables and large arrays.
>
> Challenge really is that a heap page can theoretically have MaxOffsetNumber
> of line pointers (or to be precise maximum possible offset number). For a 8K
> block, that comes be about 2048. Having so many bits per page is neither
> practical nor optimal. But in practice the largest offset on a heap page
> should not be significantly greater than MaxHeapTuplesPerPage, which is a
> more reasonable value of 291 on my machine. Again, that's with zero sized
> tuple and for real life large tables, with much wider tuples, the number may
> go down even further.
>
> So we cap the offsets represented in the bitmap to some realistic value,
> computed by looking at page density and then multiplying it by a small
> factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line
> pointers. That should practically represent majority of the dead tuples in
> the table, but we then keep an overflow area to record tuples beyond the
> limit set for per page. The search routine will do a direct lookup for
> offsets less than the limit and search in the sorted overflow area for
> offsets beyond the limit.
>
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.
>
> I guess we can make a choice between two representations at the start
> looking at various table stats. We can also be smart and change from bitmap
> to traditional representation as we scan the table and see many more tuples
> in the overflow region than we provisioned for. There will be some
> challenges in converting representation mid-way, especially in terms of
> memory allocation, but I think those can be sorted out if we 

[HACKERS] DISABLE_PAGE_SKIPPING option for vacuumdb

2016-09-08 Thread Masahiko Sawada
Hi,

The attached patch adds a --disable-page-skipping option to the
vacuumdb command for 9.6.
If by any chance the freeze map causes a serious data corruption bug,
the user will want to run pg_check_visible() and vacuum the affected
tables with the DISABLE_PAGE_SKIPPING option.
In that case, I think this option for vacuumdb would be convenient.

Please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


disable_page_skipping_option_for_vacuumdb_v1.patch
Description: Binary data



Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-08 Thread Masahiko Sawada
On Thu, Sep 8, 2016 at 11:27 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:
> On 9/8/16 3:03 AM, Masahiko Sawada wrote:
>>
>> Current manual vacuum doesn't output how may all_frozen pages we
>> skipped according to visibility map.
>> That's why I attached 0001 patch which makes the manual vacuum emit
>> such information.
>
>
> I think we should add that, and info about all-frozen skips, regardless of
> the rest of these changes. Can you submit that separately to the commitfest?
> (Or I can if you want.)

Yeah, autovacuum emits such information but manual vacuum doesn't.
I submitted such a patch before[1], but it was not committed because the
verbose output is already complicated.
I will submit it again and start a discussion about it.

[1] 
https://www.postgresql.org/message-id/CAD21AoCQGTKDxQ6YfrJ0%2B%3Dev6A3i3pt2ULdWxCtwPLKR2E77jg%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-08 Thread Masahiko Sawada
On Wed, Sep 7, 2016 at 4:03 AM, Josh Berkus <j...@agliodbs.com> wrote:
> On 08/29/2016 06:52 AM, Fujii Masao wrote:
>> Also I like the following Simon's idea.
>>
>> https://www.postgresql.org/message-id/canp8+jlhfbvv_pw6grasnupw+bdk5dctu7gwpnap-+-zwvk...@mail.gmail.com
>> ---
>> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
>> * any k (n1, n2, n3) – would release waiters as soon as we have the
>> responses from k out of N standbys. “any k” would be faster, so is
>> desirable for performance and resilience
>
> What are we going to do for backwards compatibility, here?
>
> So, here's the dilemma:
>
> If we want to keep backwards compatibility with 9.6, then:
>
> "k (n1, n2, n3)" == "first k (n1, n2, n3)"
>
> However, "first k" is not what most users will want, most of the time;
> users of version 13, years from now, will be getting constantly confused
> by "first k" behavior when they wanted quorum.  So the sensible default
> would be:
>
> "k (n1, n2, n3)" == "any k (n1, n2, n3)"
>

+1.

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility but most users would think "k(n1, n2, n3)" as quorum
after introduced quorum.
I wish we can change the s_s_names syntax of 9.6 to "first k(n1, n2,
n3)" style before 9.6 releasing if we got consensus.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-08 Thread Masahiko Sawada
On Wed, Sep 7, 2016 at 12:47 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 11:08 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 29 August 2016 at 14:52, Fujii Masao <masao.fu...@gmail.com> wrote:
>>> On Sat, Aug 6, 2016 at 6:36 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>>>> On 04/08/16 06:40, Masahiko Sawada wrote:
>>>>>
>>>>> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
>>>>> <michael.paqu...@gmail.com> wrote:
>>>>>>
>>>>>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.m...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> I was thinking that the syntax for quorum method would use '[ ... ]'
>>>>>>> but it will be confused with '( ... )' priority method used.
>>>>>>> 001 patch adds 'Any N ( ... )' style syntax but I know that we still
>>>>>>> might need to discuss about better syntax, discussion is very welcome.
>>>>>>> Attached draft patch, please give me feedback.
>>>>>>
>>>>>>
>>>>>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>>>>>> for the addition of a keyword in front of the integer defining for how
>>>>>> many nodes server need to wait for.
>>>>>
>>>>>
>>>>> Thank you for reply.
>>>>> "{}" or "[]" are not bad but because these are not intuitive, I
>>>>> thought that it will be hard for uses to use different method for each
>>>>> purpose.
>>>>>
>>>>
>>>> I think the "any" keyword is more explicit and understandable, also closer
>>>> to SQL. So I would be in favor of doing that.
>>>
>>> +1
>>>
>>> Also I like the following Simon's idea.
>>>
>>> https://www.postgresql.org/message-id/canp8+jlhfbvv_pw6grasnupw+bdk5dctu7gwpnap-+-zwvk...@mail.gmail.com
>>> ---
>>> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
>>> * any k (n1, n2, n3) – would release waiters as soon as we have the
>>> responses from k out of N standbys. “any k” would be faster, so is
>>> desirable for performance and resilience
>>> ---
>>
>> +1
>>
>> "synchronous_method" -> "synchronization_method"
>
> Thanks, will fix.
>
>> I'm concerned about the performance of this code. Can we work out a
>> way of measuring it, so we can judge how successful we are at
>> releasing waiters quickly? Thanks
>
> I will measure the performance effect of this code.
> I'm expecting that performances are,
> 'first 1 (n1, n2)'  > 'any 1(n1, n2)' > 'first 2(n1, n2)'
> 'first 1 (n1, n2)' will be highest throughput.
>

Sorry, that's wrong:
'any 1 (n1, n2)' will have the highest throughput, or the same as 'first 1 (n1, n2)'.

>> For 9.6 we implemented something that allows the DBA to define how
>> slow programs are. Previously, since 9.1 this was something specified
>> on the application side. I would like to put it back that way, so we
>> end up with a parameter on client e.g. commit_quorum = k. Forget the
>> exact parameters/user API for now, but I'd like to allow the code to
>> work with user defined settings. Thanks.
>
> I see. The parameter on client should effect for priority method as well.
> And similar to synchronous_commit, the client can specify the how much
> standbys the master waits to commit for according to synchronization
> method, even if s_s_names is defined.
>
> Regards,
>
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Masahiko Sawada
On Wed, Sep 7, 2016 at 2:39 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>>> The problem with this is that we allocate the entire amount of
>>> maintenance_work_mem even when the number of actual dead tuples turns
>>> out to be very small.  That's not so bad if the amount of memory we're
>>> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
>>> to remove the 1 GB limit, because somebody might have
>>> maintenance_work_mem set to tens or hundreds of gigabytes to speed
>>> index creation, and allocating that much space for a VACUUM that
>>> encounters 1 dead tuple does not seem like a good plan.
>>>
>>> What I think we need to do is make some provision to initially
>>> allocate only a small amount of memory and then grow the allocation
>>> later if needed.  For example, instead of having
>>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
>>> ItemPointer * and allocate the array progressively in segments.  I'd
>>> actually argue that the segment size should be substantially smaller
>>> than 1 GB, like say 64MB; there are still some people running systems
>>> which are small enough that allocating 1 GB when we may need only 6
>>> bytes can drive the system into OOM.
>>
>> This would however incur the cost of having to copy the whole GB-sized
>> chunk every time it's expanded. It woudln't be cheap.
>
> No, I don't want to end up copying the whole array; that's what I
> meant by allocating it progressively in segments.  Something like what
> you go on to propose.
>
>> I've monitored the vacuum as it runs and the OS doesn't map the whole
>> block unless it's touched, which it isn't until dead tuples are found.
>> Surely, if overcommit is disabled (as it should), it could exhaust the
>> virtual address space if set very high, but it wouldn't really use the
>> memory unless it's needed, it would merely reserve it.
>
> Yeah, but I've seen actual breakage from exactly this issue on
> customer systems even with the 1GB limit, and when we start allowing
> 100GB it's going to get a whole lot worse.
>
>> To fix that, rather than repalloc the whole thing, dead_tuples would
>> have to be an ItemPointer** of sorted chunks. That'd be a
>> significantly more complex patch, but at least it wouldn't incur the
>> memcpy.
>
> Right, this is what I had in mind.  I don't think this is actually
> very complicated, because the way we use this array is really simple.
> We basically just keep appending to the array until we run out of
> space, and that's not very hard to implement with an array-of-arrays.
> The chunks are, in some sense, sorted, as you say, but you don't need
> to do qsort() or anything like that.  You're just replacing a single
> flat array with a data structure that can be grown incrementally in
> fixed-size chunks.
>

If we replaced dead_tuples with an array of arrays, isn't there a
negative performance impact on lazy_tid_reaped()?
As chunks are added, that lookup would get slower.
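For concreteness, here is a rough sketch (names and layout are mine,
not an actual proposal) of what the lookup could look like with
fixed-size chunks; since dead TIDs are collected in heap order, each
chunk stays sorted and the chunks themselves are ordered, so the extra
cost over today's single bsearch is only a binary search over the chunk
boundaries:

typedef struct DeadTupleChunk
{
    int             num_tuples;                  /* used entries in tids[] */
    ItemPointerData tids[FLEXIBLE_ARRAY_MEMBER]; /* sorted, heap order */
} DeadTupleChunk;

typedef struct DeadTupleStore
{
    int              nchunks;
    DeadTupleChunk **chunks;     /* non-empty chunks in ascending TID order */
} DeadTupleStore;

/* binary search over chunks, then bsearch within the chosen chunk,
 * reusing vacuumlazy.c's existing vac_cmp_itemptr comparator */
static bool
chunked_tid_reaped(DeadTupleStore *store, ItemPointer itemptr)
{
    int low = 0;
    int high = store->nchunks - 1;

    while (low <= high)
    {
        int             mid = (low + high) / 2;
        DeadTupleChunk *chunk = store->chunks[mid];

        if (ItemPointerCompare(itemptr, &chunk->tids[0]) < 0)
            high = mid - 1;
        else if (ItemPointerCompare(itemptr,
                                    &chunk->tids[chunk->num_tuples - 1]) > 0)
            low = mid + 1;
        else
            return bsearch(itemptr, chunk->tids, chunk->num_tuples,
                           sizeof(ItemPointerData),
                           vac_cmp_itemptr) != NULL;
    }
    return false;
}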

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-08 Thread Masahiko Sawada
On Wed, Sep 7, 2016 at 4:11 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 7 September 2016 at 04:13, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
>> Since current HEAD could scan visibility map twice, the execution time
>> of Patched is approximately half of HEAD.
>
> Sounds good.
>
> To ensure we are doing exactly same amount of work as before, did you
> see the output of VACUUM VEROBOSE?

Sorry, the previous test results I posted were wrong.
I reran the performance test; the results are:

* 1TB Table(visibility map size is 32MB)
HEAD : 4853.250 ms (00:04.853)
Patched : 3805.789 ms (00:03.806)

* 8TB Table(visibility map size is 257MB)
HEAD : 37853.891 ms (00:37.854)
Patched : 30153.943 ms (00:30.154)

* 32TB Table(visibility map size is 1GB)
HEAD: 151908.344 ms (02:31.908)
Patched: 120560.037 ms (02:00.560)

Since the visibility map pages can be cached in shared buffers or the
OS cache by the first scan, the benefit of this patch does not seem to
be large.

Here are the outputs of VACUUM VERBOSE for the 32TB table.

* HEAD
INFO:  vacuuming "public.vm_skip_test"
INFO:  "vm_skip_test": found 0 removable, 0 nonremovable row versions
in 0 out of 4294967294 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
Skipped 4294967294 all-frozen pages according to visibility map.
0 pages are entirely empty.
CPU 1.06s/148.11u sec elapsed 149.20 sec.
VACUUM
Time: 151908.344 ms (02:31.908)

* Patched
INFO:  vacuuming "public.vm_skip_test"
INFO:  "vm_skip_test": found 0 removable, 0 nonremovable row versions
in 0 out of 4294967294 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
Skipped 4294967294 all-frozen pages according to visibility map.
0 pages are entirely empty.
CPU 0.65s/117.15u sec elapsed 117.81 sec.
VACUUM
Time: 120560.037 ms (02:00.560)

Current manual vacuum doesn't report how many all-frozen pages were
skipped according to the visibility map.
That's why I attached the 0001 patch, which makes manual vacuum emit
that information.

>
> Can we produce a test that verifies the result patched/unpatched?
>

I've attached a test shell script, but because I don't have such a
large disk, I measured the performance benefit in a somewhat unofficial
way.

To simulate an extremely large table and its corresponding visibility
map, I applied the 0002 patch and created a fake visibility map.
The attached 0002 patch adds a GUC parameter, cheat_vacuum_table_size,
which artificially defines the table size being vacuumed.
For example, if we do
  SET cheat_vacuum_table_size = 4;
  VACUUM vm_test;
then in lazy_scan_heap the vm_test table is processed as an 8TB
(MaxBlockNumber / 4) table.

The attached test shell script creates the fake visibility map files
and runs the performance tests for the 1TB, 8TB and 32TB tables.
Please confirm it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From bbda8e4144101a41804865d88592a15bf0150340 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.m...@gmail.com>
Date: Thu, 8 Sep 2016 11:24:21 -0700
Subject: [PATCH 1/2] lazy_scan_heap outputs how many all_frozen pages are
 skipped.

---
 src/backend/commands/vacuumlazy.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..20d8334 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1331,6 +1331,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	"Skipped %u pages due to buffer pins.\n",
 	vacrelstats->pinskipped_pages),
 	 vacrelstats->pinskipped_pages);
+	appendStringInfo(, ngettext("Skipped %u all-frozen page according to visibility map.\n",
+	"Skipped %u all-frozen pages according to visibility map.\n",
+	vacrelstats->frozenskipped_pages),
+	 vacrelstats->frozenskipped_pages);
 	appendStringInfo(, ngettext("%u page is entirely empty.\n",
 					"%u pages are entirely empty.\n",
 	empty_pages),
-- 
2.8.1

From 01e493a44ba6c5bb9b59a1016bfce4b9bcf75e95 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.m...@gmail.com>
Date: Thu, 8 Sep 2016 11:36:28 -0700
Subject: [PATCH 2/2] Add cheat_vacuum_table_size.

---
 contrib/pg_visibility/pg_visibility.c |  6 +-
 src/backend/commands/vacuumlazy.c |  9 +++--
 src/backend/utils/misc/guc.c  | 10 ++
 src/include/commands/vacuum.h |  1 +
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 7034066..37ace99 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -12,6 +1

Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-06 Thread Masahiko Sawada
On Wed, Sep 7, 2016 at 1:47 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Sep 5, 2016 at 8:57 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>>> What performance difference does this make, in a realistic use case?
>>
>> I have yet to measure performance effect but it would be effect for
>> very large frozen table.
>
> I think if you are proposing this patch because you think that the
> existing code won't perform well, you definitely need to submit some
> performance results showing that your approach is better.  If you
> can't show that your approach is practically better (rather than just
> theoretically better), then I'm not sure we should change anything.
> It's very easy to screw up the code in this area and we do not want to
> risk corrupting data for the sake of an optimization that isn't really
> needed in the first place.
>
> Of course, if you can prove that the change has a significant benefit,
> then it's definitely worth considering.
>

Understood, thank you.

I've measured the performance benefit of this patch with the following steps.
1. Create a very large table and set the all-visible flag on all of its blocks.
2. Measure the execution time of a vacuum that skips scanning all heap pages.

* 1TB Table(visibility map size is 32MB)
HEAD : 11012.274 ms (00:11.012)
Patched : 6742.481 ms (00:06.742)

* 8TB Table(visibility map size is 64MB)
HEAD : 69886.605 ms (01:09.887)
Patched : 35562.131 ms (00:35.562)

* 32TB Table(visibility map size is 258MB)
HEAD: 265901.096 ms (04:25.901)
Patched: 131779.082 ms (02:11.779)

Since current HEAD can scan the visibility map twice, the execution
time with the patch is approximately half that of HEAD.
But when the table is only several hundred gigabytes, the performance
benefit would not be large.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-06 Thread Masahiko Sawada
On Tue, Sep 6, 2016 at 11:08 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 29 August 2016 at 14:52, Fujii Masao <masao.fu...@gmail.com> wrote:
>> On Sat, Aug 6, 2016 at 6:36 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>>> On 04/08/16 06:40, Masahiko Sawada wrote:
>>>>
>>>> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
>>>> <michael.paqu...@gmail.com> wrote:
>>>>>
>>>>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.m...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> I was thinking that the syntax for quorum method would use '[ ... ]'
>>>>>> but it will be confused with '( ... )' priority method used.
>>>>>> 001 patch adds 'Any N ( ... )' style syntax but I know that we still
>>>>>> might need to discuss about better syntax, discussion is very welcome.
>>>>>> Attached draft patch, please give me feedback.
>>>>>
>>>>>
>>>>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>>>>> for the addition of a keyword in front of the integer defining for how
>>>>> many nodes server need to wait for.
>>>>
>>>>
>>>> Thank you for reply.
>>>> "{}" or "[]" are not bad but because these are not intuitive, I
>>>> thought that it will be hard for uses to use different method for each
>>>> purpose.
>>>>
>>>
>>> I think the "any" keyword is more explicit and understandable, also closer
>>> to SQL. So I would be in favor of doing that.
>>
>> +1
>>
>> Also I like the following Simon's idea.
>>
>> https://www.postgresql.org/message-id/canp8+jlhfbvv_pw6grasnupw+bdk5dctu7gwpnap-+-zwvk...@mail.gmail.com
>> ---
>> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
>> * any k (n1, n2, n3) – would release waiters as soon as we have the
>> responses from k out of N standbys. “any k” would be faster, so is
>> desirable for performance and resilience
>> ---
>
> +1
>
> "synchronous_method" -> "synchronization_method"

Thanks, will fix.

> I'm concerned about the performance of this code. Can we work out a
> way of measuring it, so we can judge how successful we are at
> releasing waiters quickly? Thanks

I will measure the performance effect of this code.
I'm expecting that performances are,
'first 1 (n1, n2)'  > 'any 1(n1, n2)' > 'first 2(n1, n2)'
'first 1 (n1, n2)' will be highest throughput.

> For 9.6 we implemented something that allows the DBA to define how
> slow programs are. Previously, since 9.1 this was something specified
> on the application side. I would like to put it back that way, so we
> end up with a parameter on client e.g. commit_quorum = k. Forget the
> exact parameters/user API for now, but I'd like to allow the code to
> work with user defined settings. Thanks.

I see. The parameter on client should effect for priority method as well.
And similar to synchronous_commit, the client can specify the how much
standbys the master waits to commit for according to synchronization
method, even if s_s_names is defined.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-05 Thread Masahiko Sawada
On Mon, Sep 5, 2016 at 6:27 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 4 August 2016 at 05:57, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
>> While reviewing freeze map code, Andres pointed out[1] that
>> lazy_scan_heap could accesses visibility map twice and its logic is
>> seems a bit tricky.
>> As discussed before, it's not nice especially when large relation is
>> entirely frozen.
>>
>> I posted the patch for that before but since this is an optimization,
>> not bug fix, I'd like to propose it again.
>> Please give me feedback.
>>
>> [1] 
>> https://www.postgresql.org/message-id/20160505000840.epatoq6d2e556446%40alap3.anarazel.de
>
> If we have a freeze map now, surely tables will no longer be entirely frozen?

Well, if the table is completely frozen, freezing is skipped for all of its pages.

> What performance difference does this make, in a realistic use case?

I have yet to measure the performance effect, but it should matter for
very large frozen tables.

> How would we test this to check it is exactly correct?

One possible idea is to emit the number of pages skipped according to
the visibility map as a VACUUM VERBOSE message, and check that.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




[HACKERS] VACUUM's ancillary tasks

2016-08-28 Thread Masahiko Sawada
On Monday, 29 August 2016, Andres Freund <and...@anarazel.de> wrote:

> Hi,
>
> On 2016-08-29 03:26:06 +0200, Vik Fearing wrote:
> > The attached two patches scratch two itches I've been having for a
> > while.  I'm attaching them together because the second depends on the
> first.
> >
> > Both deal with the fact that [auto]vacuum has taken on more roles than
> > its original purpose.
> >
> >
> > Patch One: autovacuum insert-heavy tables
> >
> > If you have a table that mostly receives INSERTs, it will never get
> > vacuumed because there are no (or few) dead rows.  I have added an
> > "inserts_since_vacuum" field to PgStat_StatTabEntry which works exactly
> > the same way as "changes_since_analyze" does.
> >
> > The reason such a table needs to be vacuumed is currently twofold: the
> > visibility map is not updated, slowing down index-only scans; and BRIN
> > indexes are not maintained, rendering them basically useless.
>
> It might be worthwhile to look at
> http://archives.postgresql.org/message-id/CAMkU%3D1zGu5Oshfz
> xKBqDmxxKcoDJu4pJux8UAo5h7k%2BGA_jS3Q%40mail.gmail.com
> there's definitely some overlap.
>
>
> > Patch Two: autovacuum after table rewrites
> >
> > This patch addresses the absurdity that a standard VACUUM is required
> > after a VACUUM FULL because the visibility map gets blown away.  This is
> > also the case for CLUSTER and some versions of ALTER TABLE that rewrite
> > the table.
>
> I think this should rather fixed by maintaining the VM during
> cluster.


+1

> IIRC there was an attempt late in the 9.5 cycle, but Bruce
> (IIRC) ran out of steam. And nobody picked it up again ... :(

It may be worth looking at
https://www.postgresql.org/message-id/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com

I've updated this patch to apply to current HEAD and can propose it for PG10.

Regards,

--
Masahiko Sawada




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-08-26 Thread Masahiko Sawada
On Fri, Aug 26, 2016 at 3:13 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>
>
> On Fri, Aug 26, 2016 at 11:37 AM, Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
>>
>> On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>> >
>> >
>> > On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada
>> > <sawada.m...@gmail.com>
>> > wrote:
>> >>
>> >> On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpok...@gmail.com>
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > Ashutosh proposed the feature 2PC for FDW for achieving atomic
>> >> > commits
>> >> > across multiple foreign servers.
>> >> > If a transaction make changes to more than two foreign servers the
>> >> > current
>> >> > implementation in postgres_fdw doesn't make sure that either all of
>> >> > them
>> >> > commit or all of them rollback their changes.
>> >> >
>> >> > We (Masahiko Sawada and me) reopen this thread and trying to
>> >> > contribute
>> >> > in
>> >> > it.
>> >> >
>> >> > 2PC for FDW
>> >> > 
>> >> > The patch provides support for atomic commit for transactions
>> >> > involving
>> >> > foreign servers. when the transaction makes changes to foreign
>> >> > servers,
>> >> > either all the changes to all the foreign servers commit or rollback.
>> >> >
>> >> > The new patch 2PC for FDW include the following things:
>> >> > 1. The patch 0001 introduces a generic feature. All kinds of FDW that
>> >> > support 2PC such as oracle_fdw, mysql_fdw, postgres_fdw etc. can
>> >> > involve
>> >> > in
>> >> > the transaction.
>> >> >
>> >> > Currently we can push some conditions down to shard nodes, especially
>> >> > in
>> >> > 9.6
>> >> > the directly modify feature has
>> >> > been introduced. But such a transaction modifying data on shard node
>> >> > is
>> >> > not
>> >> > executed surely.
>> >> > Using 0002 patch, that modify is executed with 2PC. It means that we
>> >> > almost
>> >> > can provide sharding solution using
>> >> > multiple PostgreSQL server (one parent node and several shared node).
>> >> >
>> >> > For multi master, we definitely need transaction manager but
>> >> > transaction
>> >> > manager probably can use this 2PC for FDW feature to manage
>> >> > distributed
>> >> > transaction.
>> >> >
>> >> > 2. 0002 patch makes postgres_fdw possible to use 2PC.
>> >> >
>> >> > 0002 patch makes postgres_fdw to use below APIs. These APIs are
>> >> > generic
>> >> > features which can be used by all kinds of FDWs.
>> >> >
>> >> > a. Execute PREAPRE TRANSACTION and COMMIT/ABORT PREAPRED instead
>> >> > of
>> >> > COMMIT/ABORT on foreign server which supports 2PC.
>> >> > b. Manage information of foreign prepared transactions resolver
>> >> >
>> >> > Masahiko Sawada will post the patch.
>> >> >
>> >> >
>> >>
>> >
>> > Thanks Vinayak and Sawada-san for taking this forward and basing your
>> > work
>> > on my patch.
>> >
>> >>
>> >> Still lot of work to do but attached latest patches.
>> >> These are based on the patch Ashutosh posted before, I revised it and
>> >> divided into two patches.
>> >> Compare with original patch, patch of pg_fdw_xact_resolver and
>> >> documentation are lacked.
>> >
>> >
>> > I am not able to understand the last statement.
>>
>> Sorry to confuse you.
>>
>> > Do you mean to say that your patches do not have pg_fdw_xact_resolver()
>> > and
>> > documentation that my patches had?
>>
>> Yes.
>> I'm confirming them that your patches had.
>
>
> Thanks for the clarification. I had added pg_fdw_xact_resolver() to resolve
> any transactions which can not be resolved immediately after they were
> prepared. There was a comment from Kevin (IIRC) that leaving transactions
> unresolved on the foreign server keeps the resources locked on those
> servers. That's not a very good situation. And nobody but the initiating
> server can resolve those. That functionality is important to make it a
> complete 2PC solution. So, please consider it to be included in your first
> set of patches.
>

Yeah, I know why pg_fdw_xact_resolver is required.
I will add it as a separate patch.

Regards,

--
Masahiko Sawada




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-08-26 Thread Masahiko Sawada
On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>
>
> On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
>>
>> On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpok...@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > Ashutosh proposed the feature 2PC for FDW for achieving atomic commits
>> > across multiple foreign servers.
>> > If a transaction make changes to more than two foreign servers the
>> > current
>> > implementation in postgres_fdw doesn't make sure that either all of them
>> > commit or all of them rollback their changes.
>> >
>> > We (Masahiko Sawada and me) reopen this thread and trying to contribute
>> > in
>> > it.
>> >
>> > 2PC for FDW
>> > 
>> > The patch provides support for atomic commit for transactions involving
>> > foreign servers. when the transaction makes changes to foreign servers,
>> > either all the changes to all the foreign servers commit or rollback.
>> >
>> > The new patch 2PC for FDW include the following things:
>> > 1. The patch 0001 introduces a generic feature. All kinds of FDW that
>> > support 2PC such as oracle_fdw, mysql_fdw, postgres_fdw etc. can involve
>> > in
>> > the transaction.
>> >
>> > Currently we can push some conditions down to shard nodes, especially in
>> > 9.6
>> > the directly modify feature has
>> > been introduced. But such a transaction modifying data on shard node is
>> > not
>> > executed surely.
>> > Using 0002 patch, that modify is executed with 2PC. It means that we
>> > almost
>> > can provide sharding solution using
>> > multiple PostgreSQL server (one parent node and several shared node).
>> >
>> > For multi master, we definitely need transaction manager but transaction
>> > manager probably can use this 2PC for FDW feature to manage distributed
>> > transaction.
>> >
>> > 2. 0002 patch makes postgres_fdw possible to use 2PC.
>> >
>> > 0002 patch makes postgres_fdw to use below APIs. These APIs are generic
>> > features which can be used by all kinds of FDWs.
>> >
>> > a. Execute PREAPRE TRANSACTION and COMMIT/ABORT PREAPRED instead of
>> > COMMIT/ABORT on foreign server which supports 2PC.
>> > b. Manage information of foreign prepared transactions resolver
>> >
>> > Masahiko Sawada will post the patch.
>> >
>> >
>>
>
> Thanks Vinayak and Sawada-san for taking this forward and basing your work
> on my patch.
>
>>
>> Still lot of work to do but attached latest patches.
>> These are based on the patch Ashutosh posted before, I revised it and
>> divided into two patches.
>> Compare with original patch, patch of pg_fdw_xact_resolver and
>> documentation are lacked.
>
>
> I am not able to understand the last statement.

Sorry for the confusion.

> Do you mean to say that your patches do not have pg_fdw_xact_resolver() and
> documentation that my patches had?

Yes.
I'm checking that against what your patches had.

Regards,

--
Masahiko Sawada




Re: [HACKERS] Optimization for lazy_scan_heap

2016-08-25 Thread Masahiko Sawada
On Thu, Aug 25, 2016 at 1:41 AM, Anastasia Lubennikova
<lubennikov...@gmail.com> wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> Hi,
> I haven't tested the performance yet, but the patch itself looks pretty good
> and reasonable improvement.
> I have a question about removing the comment. It seems to be really tricky
> moment. How do we know that all-frozen block hasn't changed since the
> moment we checked it?

I think we don't need to check whether an all-frozen block has changed
since we checked it.
Even if an all-frozen block has changed after we saw it as an
all-frozen page, we can still update relfrozenxid, because any new XIDs
added to that page are newer than the GlobalXmin we computed.

Am I missing something?

In this patch, since we count the number of all-frozen pages even
during a non-aggressive scan, I thought we don't need to re-check
whether those blocks are still all-frozen pages.

> I'm going to test the performance this week.
> I wonder if you could send a test script or describe the steps to test it?

This optimization reduces visibility map scanning when the table is
almost all-visible or all-frozen.
So it would be effective for a very large table (more than several
hundred GB) whose pages are almost all all-visible or all-frozen.

For example:
1. Create a very large table.
2. Execute VACUUM FREEZE to freeze all pages of the table.
3. Measure the execution time of another VACUUM FREEZE.
I hope the second VACUUM FREEZE becomes a little faster.
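As a minimal sketch of those steps (the table name and row count are
arbitrary placeholders):

  -- 1. create a very large table
  CREATE TABLE freeze_test AS
      SELECT g AS id, md5(g::text) AS filler
      FROM generate_series(1, 100000000) g;

  -- 2. freeze all pages so the visibility map is fully set
  VACUUM FREEZE freeze_test;

  -- 3. time a second VACUUM FREEZE; with the patch it should be able
  --    to skip most of the work based on the visibility map
  \timing on
  VACUUM FREEZE freeze_test;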

I modified the comment and attached a v2 patch.

Regards,

--
Masahiko Sawada
diff --git a/src/backend/commands/vacuumlazy.c 
b/src/backend/commands/vacuumlazy.c
index 231e92d..226dc83 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -471,6 +471,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats 
*vacrelstats,
PROGRESS_VACUUM_MAX_DEAD_TUPLES
};
int64   initprog_val[3];
+   BlockNumber n_skipped;
+   BlockNumber n_all_frozen;
 
pg_rusage_init();
 
@@ -501,6 +503,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats 
*vacrelstats,
initprog_val[2] = vacrelstats->max_dead_tuples;
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
+   n_skipped = 0;
+   n_all_frozen = 0;
+
/*
 * Except when aggressive is set, we want to skip pages that are
 * all-visible according to the visibility map, but only when we can 
skip
@@ -558,14 +563,20 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats 
*vacrelstats,
{
if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
+   n_all_frozen++;
}
else
{
if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
break;
+
+   /* Count the number of all-frozen pages */
+   if (vmstatus & VISIBILITYMAP_ALL_FROZEN)
+   n_all_frozen++;
}
vacuum_delay_point();
next_unskippable_block++;
+   n_skipped++;
}
}
 
@@ -574,7 +585,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats 
*vacrelstats,
else
skipping_blocks = false;
 
-   for (blkno = 0; blkno < nblocks; blkno++)
+   blkno = 0;
+   while (blkno < nblocks)
{
Buffer  buf;
Pagepage;
@@ -592,18 +604,22 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats 
*vacrelstats,
TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
/* see note above about forcing scanning of last page */
-#define FORCE_CHECK_PAGE() \
-   (blkno == nblocks - 1 && should_attempt_truncation(vacrelstats))
+#define FORCE_CHECK_PAGE(blk) \
+   ((blk) == nblocks - 1 && should_attempt_truncation(vacrelstats))
 
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, 
blkno);
 
if (blkno == next_unskippable_block)
{
+   n_skipped = 0;
+   n_all_frozen = 0;
+
/* Time to advance next_unskippable_block */
next_unskippable_block++;
if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
{
-   while (next_unskippable

Re: [HACKERS] Block level parallel vacuum WIP

2016-08-23 Thread Masahiko Sawada
On Tue, Aug 23, 2016 at 10:50 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Aug 23, 2016 at 7:02 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> I'd like to propose block level parallel VACUUM.
>> This feature makes VACUUM possible to use multiple CPU cores.
>
> Great.  This is something that I have thought about, too.  Andres and
> Heikki recommended it as a project to me a few PGCons ago.
>
>> As for PoC, I implemented parallel vacuum so that each worker
>> processes both 1 and 2 phases for particular block range.
>> Suppose we vacuum 1000 blocks table with 4 workers, each worker
>> processes 250 consecutive blocks in phase 1 and then reclaims dead
>> tuples from heap and indexes (phase 2).
>> To use visibility map efficiency, each worker scan particular block
>> range of relation and collect dead tuple locations.
>> After each worker finished task, the leader process gathers these
>> vacuum statistics information and update relfrozenxid if possible.
>
> This doesn't seem like a good design, because it adds a lot of extra
> index scanning work.  What I think you should do is:
>
> 1. Use a parallel heap scan (heap_beginscan_parallel) to let all
> workers scan in parallel.  Allocate a DSM segment to store the control
> structure for this parallel scan plus an array for the dead tuple IDs
> and a lock to protect the array.
>
> 2. When you finish the heap scan, or when the array of dead tuple IDs
> is full (or very nearly full?), perform a cycle of index vacuuming.
> For now, have each worker process a separate index; extra workers just
> wait.  Perhaps use the condition variable patch that I posted
> previously to make the workers wait.  Then resume the parallel heap
> scan, if not yet done.
>
> Later, we can try to see if there's a way to have multiple workers
> work together to vacuum a single index.  But the above seems like a
> good place to start.

Thank you for the advice.
That's what I had considered as an alternative design; I will change the
patch to this design.

>> I also changed the buffer lock infrastructure so that multiple
>> processes can wait for cleanup lock on a buffer.
>
> You won't need this if you proceed as above, which is probably a good thing.

Right.

>
>> And the new GUC parameter vacuum_parallel_workers controls the number
>> of vacuum workers.
>
> I suspect that for autovacuum there is little reason to use parallel
> vacuum, since most of the time we are trying to slow vacuum down, not
> speed it up.  I'd be inclined, for starters, to just add a PARALLEL
> option to the VACUUM command, for when people want to speed up
> parallel vacuums.  Perhaps
>
> VACUUM (PARALLEL 4) relation;
>
> ...could mean to vacuum the relation with the given number of workers, and:
>
> VACUUM (PARALLEL) relation;
>
> ...could mean to vacuum the relation in parallel with the system
> choosing the number of workers - 1 worker per index is probably a good
> starting formula, though it might need some refinement.

It looks convenient.
I was thinking that we could manage the number of parallel workers per
table for autovacuum using a reloption, like
ALTER TABLE relation SET (parallel_vacuum_workers = 2)

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Block level parallel vacuum WIP

2016-08-23 Thread Masahiko Sawada
On Tue, Aug 23, 2016 at 9:40 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> On Tue, Aug 23, 2016 at 3:32 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>
>> Claudio Freire <klaussfre...@gmail.com> writes:
>> > Not only that, but from your description (I haven't read the patch,
>> > sorry), you'd be scanning the whole index multiple times (one per
>> > worker).
>>
>> What about pointing each worker at a separate index?  Obviously the
>> degree of concurrency during index cleanup is then limited by the
>> number of indexes, but that doesn't seem like a fatal problem.
>
>
> +1
> We could eventually need some effective way of parallelizing vacuum of
> single index.
> But pointing each worker at separate index seems to be fair enough for
> majority of cases.
>

Or we can improve vacuuming of a single index by changing the data
representation of dead tuples to a bitmap.
That can reduce the number of whole-index scans during vacuum and make
comparing index items against the dead tuples faster.
This is listed on the TODO list, and I've implemented it.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Block level parallel vacuum WIP

2016-08-23 Thread Masahiko Sawada
Hi all,

I'd like to propose block level parallel VACUUM.
This feature makes VACUUM possible to use multiple CPU cores.

Vacuum Processing Logic
===

PostgreSQL VACUUM processing logic consists of 2 phases:
1. Collecting dead tuple locations on the heap.
2. Reclaiming dead tuples from the heap and indexes.
Phases 1 and 2 are executed alternately: once the collected dead tuple
locations reach maintenance_work_mem in phase 1, phase 2 is executed
(with maintenance_work_mem = 64MB, for example, roughly 11 million
6-byte dead tuple TIDs can be accumulated before a phase-2 cycle is
triggered).

Basic Design
==

As a PoC, I implemented parallel vacuum so that each worker
processes both phases 1 and 2 for a particular block range.
Suppose we vacuum a 1000-block table with 4 workers: each worker
processes 250 consecutive blocks in phase 1 and then reclaims dead
tuples from the heap and indexes (phase 2).
To use the visibility map efficiently, each worker scans a particular
block range of the relation and collects dead tuple locations.
After each worker has finished its task, the leader process gathers the
vacuum statistics and updates relfrozenxid if possible.

I also changed the buffer lock infrastructure so that multiple
processes can wait for cleanup lock on a buffer.
And the new GUC parameter vacuum_parallel_workers controls the number
of vacuum workers.

Performance(PoC)
=

I ran parallel vacuum on 13GB table (pgbench scale 1000) with several
workers (on my poor virtual machine).
The result is,

1. Vacuum whole table without index (disable page skipping)
  1 worker   : 33 sec
  2 workers : 27 sec
  3 workers : 23 sec
  4 workers : 22 sec

2. Vacuum table and index (after 1 transaction executed)
  1 worker   : 12 sec
  2 workers : 49 sec
  3 workers : 54 sec
  4 workers : 53 sec

As a result of my test, since multiple processes could frequently try to
acquire the cleanup lock on the same index buffer, the execution time of
parallel vacuum got worse.
It seems to be effective only for table vacuum so far, and even that is
not improved as much as expected (maybe a disk bottleneck).

Another Design

ISTM that processing index vacuum with multiple processes is not a good
idea in most cases, because many index items can be stored in a page and
multiple vacuum workers could try to acquire the cleanup lock on the
same index buffer.
It's better that multiple workers each process a particular block range
of the heap, and then one worker per index performs the index vacuum.

There is still lots of work to do, but a PoC patch is attached.
Feedback and suggestions are very welcome.

Regards,

--
Masahiko Sawada
From b25d491a05a43fb7adf014b2580c71ec7adb75a2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.m...@gmail.com>
Date: Mon, 8 Aug 2016 16:43:35 -0700
Subject: [PATCH 1/2] Allow muliple backends to wait for cleanup lock.

---
 src/backend/storage/buffer/buf_init.c |  3 +-
 src/backend/storage/buffer/bufmgr.c   | 57 +++
 src/include/storage/buf_internals.h   |  4 ++-
 src/include/storage/proc.h|  2 ++
 4 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/buffer/buf_init.c 
b/src/backend/storage/buffer/buf_init.c
index a4163cf..2aad030 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -134,7 +134,8 @@ InitBufferPool(void)
CLEAR_BUFFERTAG(buf->tag);
 
	pg_atomic_init_u32(&buf->state, 0);
-   buf->wait_backend_pid = 0;
+   dlist_init(&buf->pin_count_waiters);
+   pg_atomic_write_u32(&buf->nwaiters, 0);
 
buf->buf_id = i;
 
diff --git a/src/backend/storage/buffer/bufmgr.c 
b/src/backend/storage/buffer/bufmgr.c
index 76ade37..f2f4ab9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "catalog/storage.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
+#include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -1730,15 +1731,19 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
 */
buf_state = LockBufHdr(buf);
 
-   if ((buf_state & BM_PIN_COUNT_WAITER) &&
-   BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+   if (buf_state & BM_PIN_COUNT_WAITER)
{
-   /* we just released the last pin other than the 
waiter's */
-   int wait_backend_pid = 
buf->wait_backend_pid;
+   dlist_mutable_iter iter;
 
-   buf_state &= ~BM_PIN_COUNT_WAITER;
+   if (pg_atomic_read_u32(&buf->nwaiters) == 1)
+  

Re: [HACKERS] synchronous_commit = remote_flush

2016-08-19 Thread Masahiko Sawada
On Fri, Aug 19, 2016 at 5:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 12:22 AM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> To do something about the confusion I keep seeing about what exactly
>> "on" means, I've often wished we had "remote_flush".  But it's not
>> obvious how the backwards compatibility could work, ie how to keep the
>> people happy who use "local" vs "on" to control syncrep, and also the
>> people who use "off" vs "on" to control asynchronous commit on
>> single-node systems.  Is there any sensible way to do that, or is it
>> not broken and I should pipe down, or is it just far too entrenched
>> and never going to change?
>
> I don't see why we can't add "remote_flush" as a synonym for "on".  Do
> you have something else in mind?
>

+1 for adding "remote_flush" as a synonym for "on".
It doesn't break backward compatibility.
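
With the proposed synonym, a postgresql.conf setting would simply read
(illustrative only):

  synchronous_commit = remote_flush   # same behavior as 'on'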

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup wish list

2016-08-18 Thread Masahiko Sawada
On Fri, Aug 19, 2016 at 7:06 AM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 7/12/16 9:55 PM, Masahiko Sawada wrote:
>> And what I think is pg_baseback never remove the directory specified
>> by -D option even if execution is failed. initdb command behaves so.
>> I think it's helpful for backup operation.
>
> This has been bothering me as well.
>
> How about the attached patch as a start?
>

Thank you for the patch!

I agree with adding this as an option and removing the directory by default.
It looks good to me except for a missing newline in the usage output:

printf(_("  -l, --label=LABEL  set backup label\n"));
+   printf(_("  -n, --noclean  do not clean up after errors"));
printf(_("  -P, --progress show progress information\n"));

Registered this patch to CF1.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] new autovacuum criterion for visible pages

2016-08-12 Thread Masahiko Sawada
On Fri, Aug 12, 2016 at 9:01 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paqu...@gmail.com> writes:
>> In short, autovacuum will need to scan by itself the VM of each
>> relation and decide based on that.
>
> That seems like a worthwhile approach to pursue.  The VM is supposed to be
> small, and if you're worried it isn't, you could sample a few pages of it.
> I do not think any of the ideas proposed so far for tracking the
> visibility percentage on-the-fly are very tenable.
>

One visibility map page can store the information of 32672 heap pages
(about 255MB of heap, since each heap page needs 2 bits and a VM page
has 8192 - 24 = 8168 usable bytes), but it would be costly if autovacuum
scanned the whole visibility map for all tables.
So I think that it's better to provide
autovacuum_vacuum_pagevisible_factor as a reloption, and have autovacuum
scan or sample the visibility map only for tables on which
autovacuum_vacuum_pagevisible_factor is set.
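
As a sketch of the proposed interface (this reloption does not exist in
core; the name and value are illustrative):

  ALTER TABLE big_table SET (autovacuum_vacuum_pagevisible_factor = 0.2);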

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-08-09 Thread Masahiko Sawada
On Tue, Aug 9, 2016 at 5:13 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> On 9 August 2016 at 15:59, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
>>
>>
>> The logical replication launcher process and the apply process are
>> implemented as a bgworker. Isn't better to have them as an auxiliary
>> process like checkpointer, wal writer?
>
>
> I don't think so. The checkpointer, walwriter, autovacuum, etc predate
> bgworkers. I strongly suspect that if they were to be implemented now they'd
> use bgworkers.
>
> Now, perhaps we want a new bgworker "kind" for system workers or some other
> minor tweaks. But basically I think bgworkers are exactly what we should be
> using here.

I understood. Thanks!

>>
>> IMO the number of logical replication connections should not be
>> limited by max_worker_processes.
>
>
> Well, they *are* worker processes... but I take your point, that that
> setting has been "number of bgworkers the user can run" and it might not be
> expected that logical replication would use the same space.
>
> max_worker_progresses isn't just a limit, it controls how many shmem slots
> we allocate.
>
> I guess we could have a separate max_logical_workers or something, but I'm
> inclined to think that adds complexity without really making things any
> nicer. We'd just add them together to decide how many shmem slots to
> allocate and we'd have to keep track of how many slots were used by which
> types of backend. Or create a near-duplicate of the bgworker facility for
> logical rep.
>
> Sure, you can go deeper down the rabbit hole here and say that we need to
> add bgworker "categories" with reserved pools of worker slots for each
> category. But do we really need that?

If we change these processes to bgworkers, we can categorize them into
two groups: auxiliary processes (checkpointer, wal sender, etc.) and other
worker processes.
And max_worker_processes controls the latter.

> max_connections includes everything, both system and user backends. It's not
> like we don't do this elsewhere. It's at worst a mild wart.
>
> The only argument I can see for not using bgworkers is for the supervisor
> worker. It's a singleton that launches the per-database workers, and
> arguably is a job that the postmaster could do better. The current design
> there stems from its origins as an extension. Maybe worker management could
> be simplified a bit as a result. I'd really rather not invent yet another
> new and mostly duplicate category of custom workers to achieve that though.
>
>

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-08-09 Thread Masahiko Sawada
On Sat, Aug 6, 2016 at 2:04 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 5 August 2016 at 16:22, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-08-05 17:00:13 +0200, Petr Jelinek wrote:
>>> as promised here is WIP version of logical replication patch.
>>
>> Yay!
>
> Yay2
>

Thank you for working on this!

I've applied these patches to current HEAD, but got the following error.

libpqwalreceiver.c:48: error: redefinition of typedef ‘WalReceiverConnHandle’
../../../../src/include/replication/walreceiver.h:137: note: previous
declaration of ‘WalReceiverConnHandle’ was here
make[2]: *** [libpqwalreceiver.o] Error 1
make[1]: *** [install-backend/replication/libpqwalreceiver-recurse] Error 2
make: *** [install-src-recurse] Error 2

After fixing this issue with the attached patch, I used logical replication a little.
Some random comments and questions follow.

The logical replication launcher process and the apply process are
implemented as bgworkers. Isn't it better to have them as auxiliary
processes like the checkpointer and wal writer?
IMO the number of logical replication connections should not be
limited by max_worker_processes.

--
We need to set the publication up with at least the CREATE PUBLICATION
and ALTER PUBLICATION commands.
Can we make it possible for CREATE PUBLICATION to define tables as well?
For example,
CREATE PUBLICATION mypub [ TABLE table_name, ...] [WITH options]
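
A concrete invocation under this proposed syntax might then look like
(table names are made up for illustration):

  CREATE PUBLICATION mypub TABLE users, orders;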

--
This patch cannot drop the subscription:

=# drop subscription sub;
ERROR:  unrecognized object class: 6102

-- 
+/*-
+ *
+ * proto.c
+ * logical replication protocol functions
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *

The copyright year in the added files is out of date.

And this patch has some whitespace problems.
Please run "git show --check" or "git diff origin/master --check"

Regards,

--
Masahiko Sawada
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 94648c7..e4aaba4 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -40,12 +40,12 @@
 
 PG_MODULE_MAGIC;
 
-typedef struct WalReceiverConnHandle {
+struct WalReceiverConnHandle {
 	/* Current connection to the primary, if any */
 	PGconn *streamConn;
 	/* Buffer for currently read records */
 	char   *recvBuf;
-} WalReceiverConnHandle;
+};
 
 PGDLLEXPORT WalReceiverConnHandle *_PG_walreceirver_conn_init(WalReceiverConnAPI *wrcapi);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Optimization for lazy_scan_heap

2016-08-03 Thread Masahiko Sawada
Hi all,

While reviewing the freeze map code, Andres pointed out[1] that
lazy_scan_heap can access the visibility map twice and that its logic
seems a bit tricky.
As discussed before, that's not nice, especially when a large relation
is entirely frozen.

I posted a patch for this before, but since it is an optimization, not a
bug fix, I'd like to propose it again.
Please give me feedback.

[1] 
https://www.postgresql.org/message-id/20160505000840.epatoq6d2e556446%40alap3.anarazel.de

Regards,

--
Masahiko Sawada
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..e7cdd2c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -471,6 +471,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	BlockNumber	n_skipped;
+	BlockNumber n_all_frozen;
 
	pg_rusage_init(&ru0);
 
@@ -501,6 +503,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	initprog_val[2] = vacrelstats->max_dead_tuples;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
+	n_skipped = 0;
+	n_all_frozen = 0;
+
 	/*
 	 * Except when aggressive is set, we want to skip pages that are
 	 * all-visible according to the visibility map, but only when we can skip
@@ -558,14 +563,20 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 			{
 if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
 	break;
+n_all_frozen++;
 			}
 			else
 			{
 if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
 	break;
+
+/* Count the number of all-frozen pages */
+if (vmstatus & VISIBILITYMAP_ALL_FROZEN)
+	n_all_frozen++;
 			}
 			vacuum_delay_point();
 			next_unskippable_block++;
+			n_skipped++;
 		}
 	}
 
@@ -574,7 +585,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	else
 		skipping_blocks = false;
 
-	for (blkno = 0; blkno < nblocks; blkno++)
+	blkno = 0;
+	while (blkno < nblocks)
 	{
 		Buffer		buf;
 		Page		page;
@@ -592,18 +604,22 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
 		/* see note above about forcing scanning of last page */
-#define FORCE_CHECK_PAGE() \
-		(blkno == nblocks - 1 && should_attempt_truncation(vacrelstats))
+#define FORCE_CHECK_PAGE(blk) \
+		((blk) == nblocks - 1 && should_attempt_truncation(vacrelstats))
 
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 		if (blkno == next_unskippable_block)
 		{
+			n_skipped = 0;
+			n_all_frozen = 0;
+
 			/* Time to advance next_unskippable_block */
 			next_unskippable_block++;
 			if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 			{
-while (next_unskippable_block < nblocks)
+while (next_unskippable_block < nblocks &&
+	   !FORCE_CHECK_PAGE(next_unskippable_block))
 {
 	uint8		vmskipflags;
 
@@ -614,14 +630,22 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	{
 		if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
 			break;
+
+		/* Count the number of all-frozen pages */
+		n_all_frozen++;
 	}
 	else
 	{
 		if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
 			break;
+
+		/* Count the number of all-frozen pages */
+		if (vmskipflags & VISIBILITYMAP_ALL_FROZEN)
+			n_all_frozen++;
 	}
 	vacuum_delay_point();
 	next_unskippable_block++;
+	n_skipped++;
 }
 			}
 
@@ -646,26 +670,23 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		else
 		{
 			/*
-			 * The current block is potentially skippable; if we've seen a
-			 * long enough run of skippable blocks to justify skipping it, and
-			 * we're not forced to check it, then go ahead and skip.
-			 * Otherwise, the page must be at least all-visible if not
-			 * all-frozen, so we can set all_visible_according_to_vm = true.
+			 * The current block is the first block of continuous skippable blocks,
+			 * and we know that how many blocks we can skip to scan. If we've
+			 * seen a long enough run of skippable blocks to justify skipping it,
+			 * then go ahead and skip. Otherwise, the page must be at least all-visible
+			 * if not all-frozen, so we can set all_visible_according_to_vm = true.
 			 */
-			if (skipping_blocks && !FORCE_CHECK_PAGE())
+			if (skipping_blocks)
 			{
 /*
- * Tricky, tricky.  If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more.  We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid.  If it's not an aggressive vacuum, we don't
- * know whether it was all-frozen, so we have to rec

Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-08-03 Thread Masahiko Sawada
On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> I was thinking that the syntax for quorum method would use '[ ... ]'
>> but it will be confused with '( ... )' priority method used.
>> 001 patch adds 'Any N ( ... )' style syntax but I know that we still
>> might need to discuss about better syntax, discussion is very welcome.
>> Attached draft patch, please give me feedback.
>
> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
> for the addition of a keyword in front of the integer defining for how
> many nodes server need to wait for.

Thank you for the reply.
"{}" or "[]" are not bad, but because they are not intuitive, I
thought it would be hard for users to use a different method for each
purpose.

> -foreach(cell, sync_standbys)
> +foreach (cell, sync_standbys)
>  {
> -WalSnd   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
> +    WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
> This patch has some noise.

Will fix.

-- 
Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Quorum commit for multiple synchronous replication.

2016-08-02 Thread Masahiko Sawada
Hi all,

In the 9.6 development cycle, we discussed the configuration syntax
for a long time while keeping future expansion in mind.
As a result, we have a new dedicated language for multiple synchronous
replication, but it supports only the priority method.
We know that quorum commit is very useful for many users, and the
dedicated language can easily be extended to support it.
So I'd like to propose quorum commit for multiple synchronous replication here.

The following are the changes the attached patches make.
- Add new syntax 'Any N ( node1, node2, ... )' to
synchronous_standby_names for quorum commit.
- In quorum commit, the master can return a commit to the client after
receiving ACKs from *at least* N of the listed standby servers.
- The sync_priority of all listed servers is the same, 1.
- Add a regression test for quorum commit.

I was thinking that the syntax for the quorum method would use '[ ... ]',
but it could be confused with the '( ... )' used for the priority method.
The 001 patch adds an 'Any N ( ... )' style syntax, but I know that we
still might need to discuss a better syntax; discussion is very welcome.
A draft patch is attached; please give me feedback.
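
For example, under the proposed syntax (details may still change), the
following setting would make the master wait for ACKs from at least two
of the three listed standbys before returning a commit to the client:

  synchronous_standby_names = 'Any 2 (node1, node2, node3)'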

Regards,

--
Masahiko Sawada
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 67249d8..0ce5399 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -76,9 +76,9 @@ char	   *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+SyncRepConfigData *SyncRepConfig = NULL;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
+static bool announce_next_takeover = true;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
@@ -89,7 +89,12 @@ static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 		   XLogRecPtr *flushPtr,
 		   XLogRecPtr *applyPtr,
 		   bool *am_sync);
+static bool SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr,
+		   XLogRecPtr *flushPtr,
+		   XLogRecPtr *applyPtr,
+		   int pos, bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -391,7 +396,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -418,11 +423,16 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-			&applyPtr, &am_sync);
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+			 &applyPtr, &am_sync);
+	else /* SYNC_REP_QUORUM */
+		got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+			  &applyPtr, SyncRepConfig->num_sync,
+			  &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -440,7 +450,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,6 +486,88 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
+ * Calculate the 'pos' newest Write, Flush and Apply positions among sync standbys.
+ *
+ * Return false if the number of sync standbys is less than
+ * synchronous_standby_names specifies. Otherwise return true and
+ * store the 'pos' newest positions into *writePtr, *flushPtr, *applyPtr.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static bool
+SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+		XLogRecPtr *applyPtr, int pos, bool *am_sync)
+{
+	XLogRecPtr	*write_array;
+	XLogRecPtr	*flush_array;
+	XLogRecPtr	*apply_array;
+	List	   *sync_standbys;
+	ListCell   *cell;
+	int			len;
+	int			i = 0;
+
+	*writePtr = InvalidXLogRecPtr;
+	*flushPtr = InvalidXLogRecPtr;
+	*applyPtr = InvalidXLogRecPtr;
+	*am_sync = false;
+
+	/* Get standbys that are considered as synchronous at this moment */
+	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+
+	/*
+	 * Quick exit if we are not managing a sync standby or there are not
+	 * enough synchronous standbys.
+	 */
+	if (!(*am_sync) ||
+		SyncRepConfig == NULL ||
+		list_length(sync_standbys) < SyncRepConfig->num_sync)
+	{
+		list_free(sync_standbys);
+		return false;
+	}
+
+	len = list_length(sync_standbys);
+	write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) *

Re: [HACKERS] Adjust recovery test file name

2016-07-20 Thread Masahiko Sawada
On Wed, Jul 20, 2016 at 5:08 AM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Masahiko Sawada wrote:
>> Hi all,
>>
>> The file 006_logical_decoding_timelines.pl was removed by the commit c1543a8.
>> But currently 005_***.pl and 007_***.pl exist on source tree.
>> Should we change its file number to 006?
>
> I don't think we need to bother about this.  Whenever somebody submits a
> new test script they can fill the 006 gap.
>

I understood.
Thank you for the reply!

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Adjust recovery test file name

2016-07-19 Thread Masahiko Sawada
Hi all,

The file 006_logical_decoding_timelines.pl was removed by commit c1543a8,
but currently 005_***.pl and 007_***.pl exist in the source tree.
Should we change the latter's file number to 006?

The attached patch renames 007_sync_rep.pl to 006_sync_rep.pl.

Regards,

--
Masahiko Sawada


0001-Rename-file-number-of-sync_rep.pl.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fix typo in postgres_fdw/deparse.c

2016-07-13 Thread Masahiko Sawada
Hi all,

The attached patch fixes a small typo in contrib/postgres_fdw/deparse.c:

s/whenver/whenever/g

Regards,

--
Masahiko Sawada


fix_small_typo_in_deparse_c.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup wish list

2016-07-12 Thread Masahiko Sawada
On Wed, Jul 13, 2016 at 1:53 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> I've been having some adventures with pg_basebackup lately, and had
> some suggestions based on those.

One more thing: pg_basebackup never removes the directory specified by
the -D option even if execution fails, whereas the initdb command does
clean up after itself. I think doing the same would be helpful for
backup operations.

Regards,

Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-11 Thread Masahiko Sawada
On Fri, Jul 8, 2016 at 10:24 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> Than you for reviewing!
>>
>> On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <and...@anarazel.de> wrote:
>>> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
>>>> diff --git a/src/backend/access/heap/heapam.c 
>>>> b/src/backend/access/heap/heapam.c
>>>> index 57da57a..fd66527 100644
>>>> --- a/src/backend/access/heap/heapam.c
>>>> +++ b/src/backend/access/heap/heapam.c
>>>> @@ -3923,6 +3923,17 @@ l2:
>>>>
>>>>   if (need_toast || newtupsize > pagefree)
>>>>   {
>>>> + /*
>>>> +  * To prevent data corruption due to updating old tuple by
>>>> +  * other backends after released buffer
>>>
>>> That's not really the reason, is it? The prime problem is crash safety /
>>> replication. The row-lock we're faking (by setting xmax to our xid),
>>> prevents concurrent updates until this backend died.
>>
>> Fixed.
>>
>>>> , we need to emit that
>>>> +  * xmax of old tuple is set and clear visibility map bits if
>>>> +  * needed before releasing buffer. We can reuse xl_heap_lock
>>>> +  * for this purpose. It should be fine even if we crash 
>>>> midway
>>>> +  * from this section and the actual updating one later, since
>>>> +  * the xmax will appear to come from an aborted xid.
>>>> +  */
>>>> + START_CRIT_SECTION();
>>>> +
>>>>   /* Clear obsolete visibility flags ... */
>>>>   oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
>>>>   oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
>>>> @@ -3936,6 +3947,46 @@ l2:
>>>>   /* temporarily make it look not-updated */
>>>>   oldtup.t_data->t_ctid = oldtup.t_self;
>>>>   already_marked = true;
>>>> +
>>>> + /* Clear PD_ALL_VISIBLE flags */
>>>> + if (PageIsAllVisible(BufferGetPage(buffer)))
>>>> + {
>>>> + all_visible_cleared = true;
>>>> + PageClearAllVisible(BufferGetPage(buffer));
>>>> + visibilitymap_clear(relation, 
>>>> BufferGetBlockNumber(buffer),
>>>> + vmbuffer);
>>>> + }
>>>> +
>>>> + MarkBufferDirty(buffer);
>>>> +
>>>> + if (RelationNeedsWAL(relation))
>>>> + {
>>>> + xl_heap_lock xlrec;
>>>> + XLogRecPtr recptr;
>>>> +
>>>> + /*
>>>> +  * For logical decoding we need combocids to 
>>>> properly decode the
>>>> +  * catalog.
>>>> +  */
>>>> + if (RelationIsAccessibleInLogicalDecoding(relation))
>>>> + log_heap_new_cid(relation, );
>>>
>>> Hm, I don't see that being necessary here. Row locks aren't logically
>>> decoded, so there's no need to emit this here.
>>
>> Fixed.
>>
>>>
>>>> + /* Clear PD_ALL_VISIBLE flags */
>>>> + if (PageIsAllVisible(page))
>>>> + {
>>>> + Buffer  vmbuffer = InvalidBuffer;
>>>> + BlockNumber block = BufferGetBlockNumber(*buffer);
>>>> +
>>>> + all_visible_cleared = true;
>>>> + PageClearAllVisible(page);
>>>> + visibilitymap_pin(relation, block, );
>>>> + visibilitymap_clear(relation, block, vmbuffer);
>>>> + }
>>>> +
>>>
>>> That clears all-visible unnecessarily, we only need to clear all-frozen.
>>>
>>
>> Fixed.
>>
>>>
>>>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
>>>>   }
>>>>   HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
>>>>   HeapTupleHeaderSetCmax(htup, Firs

Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Masahiko Sawada
Thank you for reviewing!

On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
>> diff --git a/src/backend/access/heap/heapam.c 
>> b/src/backend/access/heap/heapam.c
>> index 57da57a..fd66527 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -3923,6 +3923,17 @@ l2:
>>
>>   if (need_toast || newtupsize > pagefree)
>>   {
>> + /*
>> +  * To prevent data corruption due to updating old tuple by
>> +  * other backends after released buffer
>
> That's not really the reason, is it? The prime problem is crash safety /
> replication. The row-lock we're faking (by setting xmax to our xid),
> prevents concurrent updates until this backend died.

Fixed.

>> , we need to emit that
>> +  * xmax of old tuple is set and clear visibility map bits if
>> +  * needed before releasing buffer. We can reuse xl_heap_lock
>> +  * for this purpose. It should be fine even if we crash midway
>> +  * from this section and the actual updating one later, since
>> +  * the xmax will appear to come from an aborted xid.
>> +  */
>> + START_CRIT_SECTION();
>> +
>>   /* Clear obsolete visibility flags ... */
>>   oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
>>   oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
>> @@ -3936,6 +3947,46 @@ l2:
>>   /* temporarily make it look not-updated */
>>   oldtup.t_data->t_ctid = oldtup.t_self;
>>   already_marked = true;
>> +
>> + /* Clear PD_ALL_VISIBLE flags */
>> + if (PageIsAllVisible(BufferGetPage(buffer)))
>> + {
>> + all_visible_cleared = true;
>> + PageClearAllVisible(BufferGetPage(buffer));
>> + visibilitymap_clear(relation, 
>> BufferGetBlockNumber(buffer),
>> + vmbuffer);
>> + }
>> +
>> + MarkBufferDirty(buffer);
>> +
>> + if (RelationNeedsWAL(relation))
>> + {
>> + xl_heap_lock xlrec;
>> + XLogRecPtr recptr;
>> +
>> + /*
>> +  * For logical decoding we need combocids to properly 
>> decode the
>> +  * catalog.
>> +  */
>> + if (RelationIsAccessibleInLogicalDecoding(relation))
>> + log_heap_new_cid(relation, );
>
> Hm, I don't see that being necessary here. Row locks aren't logically
> decoded, so there's no need to emit this here.

Fixed.

>
>> + /* Clear PD_ALL_VISIBLE flags */
>> + if (PageIsAllVisible(page))
>> + {
>> + Buffer  vmbuffer = InvalidBuffer;
>> + BlockNumber block = BufferGetBlockNumber(*buffer);
>> +
>> + all_visible_cleared = true;
>> + PageClearAllVisible(page);
>> + visibilitymap_pin(relation, block, );
>> + visibilitymap_clear(relation, block, vmbuffer);
>> + }
>> +
>
> That clears all-visible unnecessarily, we only need to clear all-frozen.
>

Fixed.

>
>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
>>   }
>>   HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
>>   HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
>> +
>> + /* The visibility map need to be cleared */
>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
>> + {
>> + RelFileNode rnode;
>> + Buffer  vmbuffer = InvalidBuffer;
>> + BlockNumber blkno;
>> + Relationreln;
>> +
>> + XLogRecGetBlockTag(record, 0, , NULL, );
>> + reln = CreateFakeRelcacheEntry(rnode);
>> +
>> + visibilitymap_pin(reln, blkno, );
>> + visibilitymap_clear(reln, blkno, vmbuffer);
>> + PageClearAllVisible(page);
>> + }
>> +
>
>
>>   PageSetLSN(page, lsn);
>>   MarkBufferDirty(buffer);
>> 

[HACKERS] Fix typo in jsonb.c

2016-07-05 Thread Masahiko Sawada
Hi all,

The attached patch fixes a typo in jsonb.c.
Please find it attached.

Regards,

--
Masahiko Sawada
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 256ef80..ab46823 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -1849,7 +1849,7 @@ jsonb_object_agg_transfn(PG_FUNCTION_ARGS)
 	single_scalar = false;
 
 	/*
-	 * values can be anything, including structured and null, so we treate
+	 * values can be anything, including structured and null, so we treat
 	 * them as in json_agg_transfn, except that single scalars are always
 	 * pushed as WJB_VALUE items.
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-05 Thread Masahiko Sawada
On Mon, Jul 4, 2016 at 5:44 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapil...@gmail.com> 
>>> wrote:
>>>>
>>>> Why do you think IndexOnlyScan will return wrong result?  If the
>>>> server crash in the way as you described, the transaction that has
>>>> made modifications will anyway be considered aborted, so the result of
>>>> IndexOnlyScan should not be wrong.
>>>>
>>>
>>> Ah, you're right, I misunderstood.
>>>
>>> Attached updated patch incorporating your comments.
>>> I've changed it so that heap_xlog_lock clears vm flags if page is
>>> marked all frozen.
>>>
>>
>> I think we should make a similar change in heap_lock_tuple API as
>> well.
>> Also, currently by default heap_xlog_lock tuple tries to clear
>> the visibility flags, isn't it better to handle it as we do at all
>> other places (ex. see log_heap_update), by logging the information
>> about same.
>
> I will deal with them.
>
>> Though, it is important to get the patch right, but I feel in the
>> meantime, it might be better to start benchmarking.  AFAIU, even if
>> change some part of information while WAL logging it, the benchmark
>> results won't be much different.
>
> Okay, I will do the benchmark test as well.
>

I measured the throughput and the amount of WAL output with HEAD and
HEAD+patch (attached) in my virtual environment.
I used pgbench with the attached custom script file, which sets a
3200-character string in the filler column in order to generate TOAST data.
The scale factor is 1000 and the pgbench options are -c 4 -T 600 -f toast_test.sql.
After changing the filler column to the text data type, I ran it.
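
For reference, a sketch of what such a custom script could look like
(the actual attached toast_test.sql may differ):

  \set aid random(1, 100000 * :scale)
  UPDATE pgbench_accounts SET filler = repeat('a', 3200) WHERE aid = :aid;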

* Throughput
HEAD : 1833.204172
Patched : 1827.399482

* Output quantity of WAL
HEAD :  7771 MB
Patched : 8082 MB

The throughput is almost the same, but the amount of WAL output is
slightly increased (about 4%).

Regards,

--
Masahiko Sawada


toast_test.sql
Description: Binary data
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..fd66527 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,17 @@ l2:
 
 	if (need_toast || newtupsize > pagefree)
 	{
+		/*
+		 * To prevent data corruption due to updating old tuple by
+		 * other backends after released buffer, we need to emit that
+		 * xmax of old tuple is set and clear visibility map bits if
+		 * needed before releasing buffer. We can reuse xl_heap_lock
+		 * for this purpose. It should be fine even if we crash midway
+		 * from this section and the actual updating one later, since
+		 * the xmax will appear to come from an aborted xid.
+		 */
+		START_CRIT_SECTION();
+
 		/* Clear obsolete visibility flags ... */
 		oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
 		oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3947,46 @@ l2:
 		/* temporarily make it look not-updated */
 		oldtup.t_data->t_ctid = oldtup.t_self;
 		already_marked = true;
+
+		/* Clear PD_ALL_VISIBLE flags */
+		if (PageIsAllVisible(BufferGetPage(buffer)))
+		{
+			all_visible_cleared = true;
+			PageClearAllVisible(BufferGetPage(buffer));
+			visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+vmbuffer);
+		}
+
+		MarkBufferDirty(buffer);
+
+		if (RelationNeedsWAL(relation))
+		{
+			xl_heap_lock xlrec;
+			XLogRecPtr recptr;
+
+			/*
+			 * For logical decoding we need combocids to properly decode the
+			 * catalog.
+			 */
+			if (RelationIsAccessibleInLogicalDecoding(relation))
+log_heap_new_cid(relation, &oldtup);
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+			xlrec.locking_xid = xmax_old_tuple;
+			xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+  oldtup.t_data->t_infomask2);
+			if (all_visible_cleared)
+xlrec.infobits_set |= XLHL_ALL_VISIBLE_CLEARED;
+			XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+			recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 		/*
@@ -4140,7 +4191,8 @@ l2:
 		 */
 		if (RelationIsAccessibleInLogicalDecoding(relation))
 		{
-			log_heap_new_cid(relation, &oldtup);
+			if (!already_marked)
+log_heap_new_cid(relation, &oldtup);
 			log_heap_new_cid(relation, heaptup);
 		}
 
@@ -4513,6 +4565,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 new_infomask2;
 	bool		first_time = true;
 	bool		have_tuple_lock =

Re: [HACKERS] Reviewing freeze map code

2016-07-04 Thread Masahiko Sawada
On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
>>> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>>> wrote:
>>> > Ah, you're right, I misunderstood.
>>> >
>>> > Attached updated patch incorporating your comments.
>>> > I've changed it so that heap_xlog_lock clears vm flags if page is
>>> > marked all frozen.
>>>
>>> I believe that this should be separated into two patches, since there
>>> are two issues here:
>>>
>>> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
>>> 2. heap_update releases the buffer content lock without logging the
>>> changes it has made.
>>>
>>> With respect to #1, there is no need to clear the all-visible bit,
>>> only the all-frozen bit.  However, that's a bit tricky given that we
>>> removed PD_ALL_FROZEN.  Should we think about putting that back again?
>>
>> I think it's fine to just do the vm lookup.
>>
>>> Should we just clear all-visible and call it good enough?
>>
>> Given that we need to do that in heap_lock_tuple, which entirely
>> preserves all-visible (but shouldn't preserve all-frozen), ISTM we
>> better find something that doesn't invalidate all-visible.
>>
>
> Sounds logical, considering that we have a way to set all-frozen and
> vacuum does that as well.  So probably either we need to have a new
> API or add a new parameter to visibilitymap_clear() to indicate the
> same.  If we want to go that route, isn't it better to have
> PD_ALL_FROZEN as well?
>

Can't we call visibilitymap_set with the all-visible bit but not the
all-frozen bit, instead of clearing the flags?

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-04 Thread Masahiko Sawada
On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>
>>> Why do you think IndexOnlyScan will return wrong result?  If the
>>> server crash in the way as you described, the transaction that has
>>> made modifications will anyway be considered aborted, so the result of
>>> IndexOnlyScan should not be wrong.
>>>
>>
>> Ah, you're right, I misunderstood.
>>
>> Attached updated patch incorporating your comments.
>> I've changed it so that heap_xlog_lock clears vm flags if page is
>> marked all frozen.
>>
>
> I think we should make a similar change in heap_lock_tuple API as
> well.
> Also, currently by default heap_xlog_lock tuple tries to clear
> the visibility flags, isn't it better to handle it as we do at all
> other places (ex. see log_heap_update), by logging the information
> about same.

I will deal with them.

> Though, it is important to get the patch right, but I feel in the
> meantime, it might be better to start benchmarking.  AFAIU, even if
> change some part of information while WAL logging it, the benchmark
> results won't be much different.

Okay, I will do the benchmark test as well.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comment and function argument names are mismatched in bugmgr.c.

2016-07-01 Thread Masahiko Sawada
On Sat, Jul 2, 2016 at 6:34 AM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2016-06-23 16:24:12 +0900, Masahiko Sawada wrote:
>> By commit 428b1d6b, function WritebackContextInit is added but the
>> comment of this function seems to be incorrect.
>> *max_coalesce variable doesn't exist at anywhere.
>> Also, I think it should be fixed that the argument name of this
>> function does not match function declare in buf_internal.h.
>> Patch for that attached.
>
> Fix looks good to me, and your 'bugmgr.c' typo in $subject made me laugh
> ;). Pushed.

Oops, I realized it now :)
Thanks!

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-01 Thread Masahiko Sawada
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <and...@anarazel.de> wrote:
>>>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>>>>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <and...@anarazel.de> 
>>>>> wrote:
>>>>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>>>>> >> There is nothing in this record which recorded the information about
>>>>> >> visibility clear flag.
>>>>> >
>>>>> > I think we can actually defer the clearing to the lock release?
>>>>>
>>>>> How about the case if after we release the lock on page, the heap page
>>>>> gets flushed, but not vm and then server crashes?
>>>>
>>>> In the released branches there's no need to clear all visible at that
>>>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
>>>> fine there, and that's the part where reusing an existing record is
>>>> important (for compatibility).
>>>>
>>>
>>> For back branches, I agree that heap_lock_tuple is sufficient,
>>
>> Even if we use heap_lock_tuple, If server crashed after flushed heap
>> but not vm, after crash recovery the heap is still marked all-visible
>> on vm.
>
> So, in this case both vm and page will be marked as all_visible.
>
>> This case could be happen even on released branched, and could make
>> IndexOnlyScan returns wrong result?
>>
>
> Why do you think IndexOnlyScan will return wrong result?  If the
> server crash in the way as you described, the transaction that has
> made modifications will anyway be considered aborted, so the result of
> IndexOnlyScan should not be wrong.
>

Ah, you're right, I misunderstood.

Attached is an updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears the vm flags if the page is
marked all-frozen.

Regards,

--
Masahiko Sawada


emit_wal_already_marked_true_case_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-30 Thread Masahiko Sawada
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <and...@anarazel.de> wrote:
>>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>>> >> There is nothing in this record which recorded the information about
>>> >> visibility clear flag.
>>> >
>>> > I think we can actually defer the clearing to the lock release?
>>>
>>> How about the case if after we release the lock on page, the heap page
>>> gets flushed, but not vm and then server crashes?
>>
>> In the released branches there's no need to clear all visible at that
>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
>> fine there, and that's the part where reusing an existing record is
>> important (for compatibility).
>>
>
> For back branches, I agree that heap_lock_tuple is sufficient,

Even if we use heap_lock_tuple, if the server crashed after the heap page
was flushed but the vm page was not, then after crash recovery the page is
still marked all-visible in the vm.
This case could happen even on released branches; couldn't it make
IndexOnlyScan return a wrong result?

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-28 Thread Masahiko Sawada
On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
>>> Andres Freund wrote:
>>>
>>> > I'm looking into three approaches right now:
>>> >
>>> > 3) Use WAL logging for the already_marked = true case.
>>>
>>>
>>> > 3) This approach so far seems the best. It's possible to reuse the
>>> > xl_heap_lock record (in an afaics backwards compatible manner), and in
>>> > most cases the overhead isn't that large.  It's of course annoying to
>>> > emit more WAL, but it's not that big an overhead compared to extending a
>>> > file, or to toasting.  It's also by far the simplest fix.
>>>
>
> +1 for proceeding with Approach-3.
>
>>> I suppose it's fine if we crash midway from emitting this wal record and
>>> the actual heap_update one, since the xmax will appear to come from an
>>> aborted xid, right?
>>
>> Yea, that should be fine.
>>
>>
>>> I agree that the overhead is probably negligible, considering that this
>>> only happens when toast is invoked.  It's probably not as great when the
>>> new tuple goes to another page, though.
>>
>> I think it has to happen in both cases unfortunately. We could try to
>> add some optimizations (e.g. only release lock & WAL log if the target
>> page, via fsm, is before the current one), but I don't really want to go
>> there in the back branches.
>>
>
> You are right, I think we can try such an optimization in Head and
> that too if we see a performance hit with adding this new WAL in
> heap_update.
>
>

+1 for the #3 approach; a draft patch for it is attached.
I think the attached patch would fix this problem, but please let me know
if it is not what you had in mind.

Regards,

--
Masahiko Sawada
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..2f3fd83 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,28 @@ l2:
 
 	if (need_toast || newtupsize > pagefree)
 	{
+		/*
+		 * To prevent data corruption due to updating old tuple by
+		 * other backends after released buffer, we need to emit that
+		 * xmax of old tuple is set and clear visibility map bits if
+		 * needed before releasing buffer. We can reuse xl_heap_lock
+		 * for this purpose. It should be fine even if we crash midway
+		 * from this section and the actual updating one later, since
+		 * the xmax will appear to come from an aborted xid.
+		 */
+		START_CRIT_SECTION();
+
+		/* Clear PD_ALL_VISIBLE flags */
+		if (PageIsAllVisible(BufferGetPage(buffer)))
+		{
+			all_visible_cleared = true;
+			PageClearAllVisible(BufferGetPage(buffer));
+			visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+vmbuffer);
+		}
+
+		MarkBufferDirty(buffer);
+
 		/* Clear obsolete visibility flags ... */
 		oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
 		oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3958,26 @@ l2:
 		/* temporarily make it look not-updated */
 		oldtup.t_data->t_ctid = oldtup.t_self;
 		already_marked = true;
+
+		if (RelationNeedsWAL(relation))
+		{
+			xl_heap_lock xlrec;
+			XLogRecPtr recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+			xlrec.locking_xid = xid;
+			xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+  oldtup.t_data->t_infomask2);
+			XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+			recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 		/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-28 Thread Masahiko Sawada
On Tue, Jun 28, 2016 at 8:06 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <and...@anarazel.de> wrote:
>>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>>> >> What exactly is the point of all of that already_marked stuff?
>>> >
>>> > Preventing the old tuple from being locked/updated by another backend,
>>> > while unlocking the buffer.
>>> >
>>> >> I
>>> >> mean, suppose we just don't do any of that before we go off to do
>>> >> toast_insert_or_update and RelationGetBufferForTuple.  Eventually,
>>> >> when we reacquire the page lock, we might find that somebody else has
>>> >> already updated the tuple, but couldn't that be handled by
>>> >> (approximately) looping back up to l2 just as we do in several other
>>> >> cases?
>>> >
>>> > We'd potentially have to undo a fair amount more work: the toasted data
>>> > would have to be deleted and such, just to retry. Which isn't going to
>>> > super easy, because all of it will be happening with the current cid (we
>>> > can't just increase CommandCounterIncrement() for correctness reasons).
>>>
>>> Why would we have to delete the TOAST data?  AFAIUI, the tuple points
>>> to the TOAST data, but not the other way around.  So if we change our
>>> mind about where to put the tuple, I don't think that requires
>>> re-TOASTing.
>>
>> Consider what happens if we, after restarting at l2, notice that we
>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>> branch. If the caller doesn't error out - and there's certainly callers
>> doing that - we'd "leak" a toasted datum.
>
> Sorry for interrupt you, but I have a question about this case.
> Is there case where we back to l2 after we created toasted
> datum(called toast_insert_or_update)?
> IIUC, after we stored toast datum we just insert heap tuple and log
> WAL (or error out for some reasons).
>

I understood now, sorry for the noise.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-28 Thread Masahiko Sawada
On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <and...@anarazel.de> wrote:
>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>> >> What exactly is the point of all of that already_marked stuff?
>> >
>> > Preventing the old tuple from being locked/updated by another backend,
>> > while unlocking the buffer.
>> >
>> >> I
>> >> mean, suppose we just don't do any of that before we go off to do
>> >> toast_insert_or_update and RelationGetBufferForTuple.  Eventually,
>> >> when we reacquire the page lock, we might find that somebody else has
>> >> already updated the tuple, but couldn't that be handled by
>> >> (approximately) looping back up to l2 just as we do in several other
>> >> cases?
>> >
>> > We'd potentially have to undo a fair amount more work: the toasted data
>> > would have to be deleted and such, just to retry. Which isn't going to
>> > super easy, because all of it will be happening with the current cid (we
>> > can't just increase CommandCounterIncrement() for correctness reasons).
>>
>> Why would we have to delete the TOAST data?  AFAIUI, the tuple points
>> to the TOAST data, but not the other way around.  So if we change our
>> mind about where to put the tuple, I don't think that requires
>> re-TOASTing.
>
> Consider what happens if we, after restarting at l2, notice that we
> can't actually insert, but return in the !HeapTupleMayBeUpdated
> branch. If the caller doesn't error out - and there's certainly callers
> doing that - we'd "leak" a toasted datum.

Sorry for interrupting, but I have a question about this case.
Is there a case where we go back to l2 after we have created the toasted
datum (i.e., after calling toast_insert_or_update)?
IIUC, after we store the toast datum we just insert the heap tuple and
log WAL (or error out for some reason).

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Comment and function argument names are mismatched in bufmgr.c.

2016-06-23 Thread Masahiko Sawada
Hi all,

Commit 428b1d6b added the function WritebackContextInit, but the comment
on this function seems to be incorrect: the *max_coalesce variable it
mentions doesn't exist anywhere.
Also, the argument name in the declaration in buf_internals.h doesn't
match the function definition and should be fixed.
A patch for that is attached.

bufmgr.c:L4160
/*
 * Initialize a writeback context, discarding potential previous state.
 *
 * *max_coalesce is a pointer to a variable containing the current maximum
 * number of writeback requests that will be coalesced into a bigger one. A
 * value <= 0 means that no writeback control will be performed. max_pending
 * is a pointer instead of an immediate value, so the coalesce limits can
 * easily changed by the GUC mechanism, and so calling code does not have to
 * check the current configuration.
 */
void
WritebackContextInit(WritebackContext *context, int *max_pending)
{
Assert(*max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);

context->max_pending = max_pending;
context->nr_pending = 0;
}

buf_internal.h:L303
/*
 * Internal routines: only called by bufmgr
 * Internal buffer management routines
 */
/* bufmgr.c */
extern void WritebackContextInit(WritebackContext *context, int *max_coalesce);
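
As a side note, the reason the argument is a pointer becomes clear from how
callers use it; here is a minimal sketch (not part of the attached patch; the
checkpointer-style usage with the checkpoint_flush_after GUC and a bufHdr
variable are assumed context):

/*
 * Sketch only: WritebackContextInit() stores the pointer, so later GUC
 * changes to checkpoint_flush_after take effect without re-initializing
 * the context.
 */
WritebackContext wb_context;

WritebackContextInit(&wb_context, &checkpoint_flush_after);

/* ... then, for each buffer written out ... */
ScheduleBufferTagForWriteback(&wb_context, &bufHdr->tag);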


Regards,

--
Masahiko Sawada
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a67e518..76ade37 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4160,12 +4160,10 @@ ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
 /*
  * Initialize a writeback context, discarding potential previous state.
  *
- * *max_coalesce is a pointer to a variable containing the current maximum
- * number of writeback requests that will be coalesced into a bigger one. A
- * value <= 0 means that no writeback control will be performed. max_pending
- * is a pointer instead of an immediate value, so the coalesce limits can
- * easily changed by the GUC mechanism, and so calling code does not have to
- * check the current configuration.
+ * *max_pending is a pointer instead of an immediate value, so the coalesce
+ * limits can easily be changed by the GUC mechanism, and so calling code does
+ * not have to check the current configuration. A value of 0 means that no
+ * writeback control will be performed.
  */
 void
 WritebackContextInit(WritebackContext *context, int *max_pending)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 511d740..e0dfb2f 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -300,7 +300,7 @@ extern CkptSortItem *CkptBufferIds;
  * Internal buffer management routines
  */
 /* bufmgr.c */
-extern void WritebackContextInit(WritebackContext *context, int *max_coalesce);
+extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Actuall row count of Parallel Seq Scan in EXPLAIN ANALYZE .

2016-06-20 Thread Masahiko Sawada
On Mon, Jun 20, 2016 at 3:42 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Mon, Jun 20, 2016 at 11:48 AM, Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> My colleague noticed that the output of EXPLAIN ANALYZE doesn't work
>> fine for parallel seq scan.
>>
>> postgres(1)=# explain analyze verbose select count(*) from
>> pgbench_accounts ;
>> QUERY PLAN
>>
>> -
>>  Finalize Aggregate  (cost=217018.55..217018.56 rows=1 width=8)
>> (actual time=2640.015..2640.015 rows=1 loops=1)
>>Output: count(*)
>>->  Gather  (cost=217018.33..217018.54 rows=2 width=8) (actual
>> time=2639.064..2640.002 rows=3 loops=1)
>>  Output: (PARTIAL count(*))
>>  Workers Planned: 2
>>  Workers Launched: 2
>>  ->  Partial Aggregate  (cost=216018.33..216018.34 rows=1
>> width=8) (actual time=2632.714..2632.715 rows=1 loops=3)
>>Output: PARTIAL count(*)
>>Worker 0: actual time=2632.583..2632.584 rows=1 loops=1
>>Worker 1: actual time=2627.517..2627.517 rows=1 loops=1
>>->  Parallel Seq Scan on public.pgbench_accounts
>> (cost=0.00..205601.67 rows=417 width=0) (actual
>> time=0.042..1685.542 rows=3333333 loops=3)
>>  Worker 0: actual time=0.033..1657.486 rows=3457968
>> loops=1
>>  Worker 1: actual time=0.039..1702.979 rows=3741069
>> loops=1
>>  Planning time: 1.026 ms
>>  Execution time: 2640.225 ms
>> (15 rows)
>>
>> For example, the above result shows,
>> Parallel Seq Scan : actual rows = 3333333
>> worker 0   : actual rows = 3457968
>> worker 1   : actual rows = 3741069
>> Summation of these is 10532370, but actual total rows is 10000000.
>> I think that Parallel Seq Scan should show actual rows =
>> 10000000(total rows) or actual rows = 2800963(rows collected by
>> itself). (10000000 maybe better)
>>
>
> You have to read the rows at Parallel Seq Scan nodes as total count of rows,
> but you have to consider the loops parameter as well.
>

In the following case, it looks to me as though no one collected the tuple.
But that's obviously incorrect; this query actually returns a tuple (aid = 10).

postgres(1)=# explain analyze verbose select * from pgbench_accounts
where aid = 10;
 QUERY PLAN

 Gather  (cost=1000.00..217018.43 rows=1 width=97) (actual
time=0.541..2094.773 rows=1 loops=1)
   Output: aid, bid, abalance, filler
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.pgbench_accounts
(cost=0.00..216018.34 rows=0 width=97) (actual time=1390.109..2088.103
rows=0 loops=3)
 Output: aid, bid, abalance, filler
 Filter: (pgbench_accounts.aid = 10)
     Rows Removed by Filter: 3333333
 Worker 0: actual time=2082.681..2082.681 rows=0 loops=1
 Worker 1: actual time=2087.532..2087.532 rows=0 loops=1
 Planning time: 0.126 ms
 Execution time: 2095.564 ms
(12 rows)

How should we interpret actual rows and nloops in this case?

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Actuall row count of Parallel Seq Scan in EXPLAIN ANALYZE .

2016-06-20 Thread Masahiko Sawada
Hi all,

My colleague noticed that the output of EXPLAIN ANALYZE is misleading
for parallel sequential scans.

postgres(1)=# explain analyze verbose select count(*) from pgbench_accounts ;
QUERY PLAN
-
 Finalize Aggregate  (cost=217018.55..217018.56 rows=1 width=8)
(actual time=2640.015..2640.015 rows=1 loops=1)
   Output: count(*)
   ->  Gather  (cost=217018.33..217018.54 rows=2 width=8) (actual
time=2639.064..2640.002 rows=3 loops=1)
 Output: (PARTIAL count(*))
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=216018.33..216018.34 rows=1
width=8) (actual time=2632.714..2632.715 rows=1 loops=3)
   Output: PARTIAL count(*)
   Worker 0: actual time=2632.583..2632.584 rows=1 loops=1
   Worker 1: actual time=2627.517..2627.517 rows=1 loops=1
   ->  Parallel Seq Scan on public.pgbench_accounts
(cost=0.00..205601.67 rows=417 width=0) (actual
time=0.042..1685.542 rows=3333333 loops=3)
 Worker 0: actual time=0.033..1657.486 rows=3457968 loops=1
 Worker 1: actual time=0.039..1702.979 rows=3741069 loops=1
 Planning time: 1.026 ms
 Execution time: 2640.225 ms
(15 rows)

For example, the above result shows:
Parallel Seq Scan : actual rows = 3333333
worker 0   : actual rows = 3457968
worker 1   : actual rows = 3741069
The summation of these is 10532370, but the actual total number of rows
is 10000000.
I think that Parallel Seq Scan should show actual rows =
10000000 (total rows) or actual rows = 2800963 (rows collected by the
leader itself). (10000000 is maybe better.)

After spending some time investigating this behaviour, ISTM that the problem
is the nloops of Parallel Seq Scan.
The Parallel Seq Scan is done only once, but nloops is incremented to 3.
So its "actual rows" is calculated as 3333333 (10000000 / 3) at explain.c:L1223.
Is it a bug?
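
For what it's worth, the arithmetic can be illustrated like this (a
standalone sketch of the averaging only, not the actual explain.c code):

#include <stdio.h>

int
main(void)
{
	/* Totals from the plan above: the node returned 10000000 tuples in
	 * total, and nloops is 3 (the leader and the two workers each ran
	 * the node once). */
	double	ntuples = 10000000.0;
	double	nloops = 3.0;

	/* explain.c divides the instrumentation counters by nloops before
	 * printing them. */
	printf("displayed rows = %.0f\n", ntuples / nloops);		/* 3333333 */

	/* To recover a node's total from EXPLAIN output, multiply back. */
	printf("total rows = %.0f\n", (ntuples / nloops) * nloops);	/* 10000000 */
	return 0;
}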

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] forcing a rebuild of the visibility map

2016-06-16 Thread Masahiko Sawada
On Thu, Jun 16, 2016 at 10:03 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 16, 2016 at 2:33 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> Option name DISABLE_PAGE_SKIPPING is good to me.
>> I'm still working on this, but here are some review comments.
>>
>> ---
>> +CREATE FUNCTION pg_truncate_visibility_map(regclass)
>> +RETURNS void
>> +AS 'MODULE_PATHNAME', 'pg_truncate_visibility_map'
>> +LANGUAGE C STRICT
>> +PARALLEL UNSAFE;  -- let's not make this any more dangerous
>> +
>>
>> "REVOKE ALL ON FUNCTION pg_truncate_visibility_map(regclass) FROM
>> PUBLIC;" is missing in pg_visibility/pg_visibility--1.0--1.1.sql.
>
> OK, thanks.  I'll fix that.
>
>> I think that VACUUM with VERBOSE option can show the information for
>> how many frozen pages we skipped like autovacuum does. Thought?
>> Patch attached.
>
> I'm not sure.  The messages VACUUM emits are already quite long and
> hard to read, and adding more lines to them might make that problem
> worse.  On the other hand, having more details can be helpful, too.

Because this information is emitted only when the VERBOSE option is
specified, I think it would be better to output it rather than nothing,
even though it makes the messages more complicated.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] forcing a rebuild of the visibility map

2016-06-16 Thread Masahiko Sawada
On Thu, Jun 16, 2016 at 12:06 PM, Robert Haas <robertmh...@gmail.com> wrote:
> [ Changing subject line in the hopes of keeping separate issues in
> separate threads. ]
>
> On Mon, Jun 6, 2016 at 11:00 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmh...@gmail.com> writes:
>>> I'm intuitively sympathetic to the idea that we should have an option
>>> for this, but I can't figure out in what case we'd actually tell
>>> anyone to use it.  It would be useful for the kinds of bugs listed
>>> above to have VACUUM (rebuild_vm) to blow away the VM fork and rebuild
>>> it, but that's different semantics than what we proposed for VACUUM
>>> (even_frozen_pages).  And I'd be sort of inclined to handle that case
>>> by providing some other way to remove VM forks (like a new function in
>>> the pg_visibilitymap contrib module, maybe?) and then just tell people
>>> to run regular VACUUM afterwards, rather than putting the actual VM
>>> fork removal into VACUUM.
>>
>> There's a lot to be said for that approach.  If we do it, I'd be a bit
>> inclined to offer an option to blow away the FSM as well.
>
> After having been scared out of my mind by the discovery of
> longstanding breakage in heap_update[1], it occurred to me that this
> is an excellent example of a case in which the option for which Andres
> was agitating - specifically forcing VACUUM to visit ever single page
> in the heap - would be useful.  Even if there's functionality
> available to blow away the visibility map forks, you might not want to
> just go remove them all and re-vacuum everything, because while you
> were rebuilding the VM forks, index-only scans would stop being
> index-only, and that might cause an operational problem.  Being able
> to simply force every page to be visited is better.  Patch for that
> attached.  I decided to call the option disable_page_skipping rather
> than even_frozen_pages because it should also force visiting pages
> that are all-visible but not all-frozen.  (I was sorely tempted to go
> with the competing proposal of calling it VACUUM (THEWHOLEDAMNTHING)
> but I refrained.)
>
> However, I also think that the approach described above - providing a
> way to excise VM forks specifically - is useful.  Patch for that
> attached, too.  It turns out that can't be done without either adding
> a new WAL record type or extending an existing one; I chose to add a
> "flags" argument to XLOG_SMGR_TRUNCATE, specifying the forks to be
> truncated.  Since this will require bumping the XLOG version, if we're
> going to do it, and I think we should, it would be good to try to get
> it done before beta2 rather than closer to release.  However, I don't
> want to commit this without some review, so please review this.
> Actually, please review both patches.[2]  The same core support could
> be used to add an option to truncate the FSM, but I didn't write a
> patch for that because I'm incredibly { busy | lazy }.
>

The option name DISABLE_PAGE_SKIPPING looks good to me.
I'm still working on this, but here are some review comments.

---
+CREATE FUNCTION pg_truncate_visibility_map(regclass)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_truncate_visibility_map'
+LANGUAGE C STRICT
+PARALLEL UNSAFE;  -- let's not make this any more dangerous
+

"REVOKE ALL ON FUNCTION pg_truncate_visibility_map(regclass) FROM
PUBLIC;" is missing in pg_visibility/pg_visibility--1.0--1.1.sql.

---
I think that VACUUM with the VERBOSE option could report how many frozen
pages were skipped, as autovacuum does. Thoughts?
Patch attached.

Regards,

--
Masahiko Sawada
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 7a67fa5..ddbb835 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1329,6 +1329,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	"Skipped %u pages due to buffer pins.\n",
 	vacrelstats->pinskipped_pages),
 	 vacrelstats->pinskipped_pages);
+	appendStringInfo(&buf, _("Skipped %u frozen pages.\n"),
+	 vacrelstats->frozenskipped_pages);
 	appendStringInfo(, ngettext("%u page is entirely empty.\n",
 	"%u pages are entirely empty.\n",
 	empty_pages),

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-15 Thread Masahiko Sawada
On Wed, Jun 15, 2016 at 9:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> I spent some time chasing down the exact circumstances.  I suspect
>> that there may be an interlocking problem in heap_update.  Using the
>> line numbers from cae1c788 [1], I see the following interaction
>> between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
>> in reference to the same block number:
>>
>>   [VACUUM] sets all visible bit
>>
>>   [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 
>> xmax_old_tuple);
>>   [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>>
>>   [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
>>   [SELECT] observes VM_ALL_VISIBLE as true
>>   [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
>>   [SELECT] barfs
>>
>>   [UPDATE] heapam.c:4116 visibilitymap_clear(...)
>
> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> and CTID without logging anything or clearing the all-visible flag and
> then releases the lock on the heap page to go do some more work that
> might even ERROR out.  Only if that other work all goes OK do we
> relock the page and perform the WAL-logged actions.
>
> That doesn't seem like a good idea even in existing releases, because
> you've taken a tuple on an all-visible page and made it not
> all-visible, and you've made a page modification that is not
> necessarily atomic without logging it.  This is is particularly bad in
> 9.6, because if that page is also all-frozen then XMAX will eventually
> be pointing into space and VACUUM will never visit the page to
> re-freeze it the way it would have done in earlier releases.  However,
> even in older releases, I think there's a remote possibility of data
> corruption.  Backend #1 makes these changes to the page, releases the
> lock, and errors out.  Backend #2 writes the page to the OS.  DBA
> takes a hot backup, tearing the page in the middle of XMAX.  Oops.
>
> I'm not sure what to do about this: this part of the heap_update()
> logic has been like this forever, and I assume that if it were easy to
> refactor this away, somebody would have done it by now.
>

How about changing collect_corrupt_items to acquire an
AccessExclusiveLock so that the check can be done safely?
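
Something like this is what I mean (a sketch only; it assumes the
relation_open() call in collect_corrupt_items simply switches to a
stronger lock level, and is not a reviewed change):

/* Sketch: with AccessExclusiveLock no concurrent heap_update can be in
 * the middle of its two-step marking while we inspect the pages. */
rel = relation_open(relid, AccessExclusiveLock);	/* instead of a weaker lock */

/* ... check the visibility map and heap pages as before ... */

relation_close(rel, AccessExclusiveLock);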

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-10 Thread Masahiko Sawada
On Fri, Jun 10, 2016 at 1:11 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 9, 2016 at 5:48 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Attached patch implements the above 2 functions.  I have addressed the
>> comments by Sawada San and you in latest patch and updated the documentation
>> as well.
>
> I made a number of changes to this patch.  Here is the new version.
>
> 1. The algorithm you were using for growing the array size is unsafe
> and can easily overrun the array.  Suppose that each of the first two
> pages have some corrupt tuples, more than 50% of MaxHeapTuplesPerPage
> but less than the full value of MaxTuplesPerPage.  Your code will
> conclude that the array does need to be enlarged after processing the
> first page.  I switched this to what I consider the normal coding
> pattern for such problems.
>
> 2. The all-visible checks seemed to me to be incorrect and incomplete.
> I made the check match the logic in lazy_scan_heap.
>
> 3. Your 1.0 -> 1.1 upgrade script was missing copies of the REVOKE
> statements you added to the 1.1 script.  I added them.
>
> 4. The tests as written were not safe under concurrency; they could
> return spurious results if the page changed between the time you
> checked the visibility map and the time you actually examined the
> tuples.  I think people will try running these functions on live
> systems, so I changed the code to recheck the VM bits after locking
> the page.  Unfortunately, there's either still a concurrency-related
> problem here or there's a bug in the all-frozen code itself because I
> once managed to get pg_check_frozen('pgbench_accounts') to return a
> TID while pgbench was running concurrently.  That's a bit alarming,
> but since I can't reproduce it I don't really have a clue how to track
> down the problem.
>
> 5. I made various cosmetic improvements.
>
> If there are not objections, I will go ahead and commit this tomorrow,
> because even if there is a bug (see point #4 above) I think it's
> better to have this in the tree than not.  However, code review and/or
> testing with these new functions seems like it would be an extremely
> good idea.
>

Thank you for working on this.
Here are some minor comments.

---
+/*
+ * Return the TIDs of not-all-visible tuples in pages marked all-visible

If there is even one non-visible tuple in a page marked all-visible,
the database might be corrupted.
Would "not-visible" or "non-visible" be better than "not-all-visible"?
---
Do we need to check the page header flag as well?
I think the database might also be corrupted if there is a non-visible
tuple in a page that has PD_ALL_VISIBLE set.
We could emit a WARNING in such a case, as sketched below.
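
A rough sketch of what I have in mind (it assumes the per-page loop in
collect_corrupt_items, where the heap page is already read and locked;
the message wording is illustrative):

/* Sketch only: if a page that holds a tuple which is not all-visible
 * also has PD_ALL_VISIBLE set in its header, that is corruption as
 * well, so report it. */
if (PageIsAllVisible(page))
	ereport(WARNING,
			(errmsg("relation \"%s\" page %u has PD_ALL_VISIBLE set but contains a not-all-visible tuple",
					RelationGetRelationName(rel), blkno)));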

Also, using the attached tool, which allows us to set a spurious
visibility map status without actually modifying the tuples, I manually
created several situations in which the database is corrupted and tested
against them; ISTM that the check tool works fine.
I'm not proposing it as a new feature, of course, but please use it as
appropriate.

Regards,

--
Masahiko Sawada


corrupted_test.sql
Description: Binary data


set_spurious_vm_status.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-08 Thread Masahiko Sawada
On Wed, Jun 8, 2016 at 12:19 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Jun 7, 2016 at 10:10 PM, Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
>>
>> Thank you for implementing the patch.
>>
>> I've not test it deeply but here are some comments.
>> This check tool only checks if the frozen page has live-unfrozen tuple.
>> That is, it doesn't care in case where the all-frozen page mistakenly
>> has dead-frozen tuple.
>>
>
> Do you mean to say that we should have a check for ItemIdIsDead() and then
> if item is found to be dead, then add it to array of non_frozen items?

Yes.

>  If so, earlier I thought we might not need this check as we are already using
> heap_tuple_needs_eventual_freeze(),

You're right. Sorry, I had misunderstood.

> but now again looking at it, it seems
> wise to check for dead items separately as those won't be covered by other
> check.

Sounds good.

>>
>> +   /* Clean up. */
>> +   if (vmbuffer != InvalidBuffer)
>> +   ReleaseBuffer(vmbuffer);
>>
>> I think that we should use BufferIsValid() here.
>>
>
> We can use BufferIsValid() as well, but I am trying to be consistent with
> nearby code, refer collect_visibility_data().  We can change at all places
> together if people prefer that way.
>

In vacuumlazy.c we use it as BufferIsValid(vmbuffer), so I think we can
replace all of these for safety unless there is a specific reason not
to; see the sketch below.
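
(A trivial sketch of the idiom I mean:)

/* Sketch: use the BufferIsValid() test, as vacuumlazy.c does, instead
 * of comparing against InvalidBuffer directly. */
if (BufferIsValid(vmbuffer))
{
	ReleaseBuffer(vmbuffer);
	vmbuffer = InvalidBuffer;
}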

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-07 Thread Masahiko Sawada
On Tue, Jun 7, 2016 at 11:19 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <and...@anarazel.de> wrote:
>>
>>
>> > I'd also be ok with adding & documenting (beta release notes)
>> > CREATE EXTENSION pg_visibility;
>> > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT
>> > pg_check_visibility(oid);
>> > or something olong those lines.
>>
>> That wouldn't be too useful as-written in my book, because it gives
>> you no detail on what exactly the problem was.  Maybe it could be
>> "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
>> TIDs are non-frozen TIDs on frozen pages.  Then I think something like
>> this would work:
>>
>> SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
>> IN ('r', 't', 'm');
>>
>
> I have implemented the above function in attached patch.  Currently, it
> returns SETOF tupleids, but if we want some variant of same, that should
> also be possible.
>
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

Thank you for implementing the patch.

I've not tested it deeply, but here are some comments.
This check tool only checks whether an all-frozen page has a live
unfrozen tuple; that is, it doesn't cover the case where an all-frozen
page mistakenly contains a dead frozen tuple.
I think the tool should check that case as well; otherwise the function
name would need to be changed.

+   /* Clean up. */
+   if (vmbuffer != InvalidBuffer)
+   ReleaseBuffer(vmbuffer);

I think that we should use BufferIsValid() here.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Improve tab completion for USER MAPPING

2016-06-06 Thread Masahiko Sawada
Hi all,

I found that the tab completion for USER MAPPING doesn't work properly.
For example,

The case below is fine.
postgres=# create user[TAB]
user  user mapping for

But this doesn't work properly. (Note that there is a space between
'user' and [TAB].)
postgres=# create user [TAB]
hoge_user  masahiko   pg_signal_backend

After typing 'mapping' manually, 'for' is added by tab completion.
This means that the completion of 'mapping' itself is not working; a
sketch of the kind of change needed is below.
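
The gist of the change is along these lines (a sketch of the kind of rule
involved in psql's tab-complete.c; the exact rule placement and macro
arity in the attached patch may differ):

/* Sketch only: after "CREATE USER", offer "MAPPING FOR" in addition to
 * role names, so that 'mapping' (and then 'for') can be completed.  The
 * Matches/COMPLETE_WITH_QUERY macros and Query_for_list_of_roles are the
 * existing tab-complete.c machinery; treat the rule below as illustrative. */
else if (Matches2("CREATE", "USER") || Matches2("CREATE", "ROLE"))
	COMPLETE_WITH_QUERY(Query_for_list_of_roles
						" UNION SELECT 'MAPPING FOR'");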

Patch attached.
Please review it.

Regards,

--
Masahiko Sawada


fix_tab_completion_for_user_mapping.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-06 Thread Masahiko Sawada
On Mon, Jun 6, 2016 at 6:34 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
>>> Attached is a sample patch that controls full page vacuum by new GUC 
>>> parameter.
>>
>> Don't we want a reloption for that? Just wondering...
>
> Why?  Just for consistency?  I think the bigger question here is
> whether we need to do anything at all.  It's true that, without some
> new option, we'll lose the ability to forcibly vacuum every page in
> the relation, even if all-frozen.  But there's not much use case for
> that in the first place.  It will be potentially helpful if it turns
> out that we have a bug that sets the all-frozen bit on pages that are
> not, in fact, all-frozen.  Otherwise, what's the use?
>

I don't think this parameter should be a reloption.
We would set it to true only when a serious bug has been discovered and
we want to regenerate the visibility maps of specific tables.
I thought that controlling it with a GUC parameter would be more
convenient than adding a new option.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-06 Thread Masahiko Sawada
On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>>> Can you submit that part as a separate patch?
>>>
>>> Attached.
>>
>> Thanks, committed.
>>
>>>>> I'm address the review comment of 7087166 commit, and will post the patch.
>>>>
>>>> When?
>>>
>>> On Saturday.
>>
>> Great.  Will that address everything for this open item, then?
>>
>
> Attached patch for commit 7087166 on another mail.
> I think that only the test tool for visibility map is remaining and
> under the discussion.
> Even if we have verification tool or function for visibility map, we
> cannot repair the contents of visibility map if we turned out that
> contents of visibility map is something wrong.
> So I think we should have the way that re-generates the visibility map.
> For this purpose, doing vacuum while ignoring visibility map by a new
> option or new function is one idea.
> But IMHO, it's not good idea to allow a function to do vacuum, and
> expanding the VACUUM syntax might be somewhat overkill.
>
> So other idea is to have GUC parameter, vacuum_even_frozen_page for example.
> If this parameter is set true (false by default), we do vacuum whole
> table forcibly and re-generate visibility map.
> The advantage of this idea is that we don't necessary to expand VACUUM
> syntax and relatively easily can remove this parameter if it's not
> necessary anymore.
>

Attached is a sample patch that controls full-page vacuuming with a new GUC parameter.

Regards,

--
Masahiko Sawada
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 784c3e9..03f026d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -125,6 +125,8 @@ typedef struct LVRelStats
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+/* GUC parameter */
+bool vacuum_even_frozen_page; /* should we scan all pages forcibly? */
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -138,7 +140,7 @@ static BufferAccessStrategy vac_strategy;
 
 /* non-export function prototypes */
 static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive);
+			   Relation *Irel, int nindexes, bool aggressive, bool scan_all);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
 static void lazy_vacuum_index(Relation indrel,
@@ -246,7 +248,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	vacrelstats->hasindex = (nindexes > 0);
 
 	/* Do the vacuuming */
-	lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive);
+	lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive,
+				   vacuum_even_frozen_page);
 
 	/* Done with indexes */
 	vac_close_indexes(nindexes, Irel, NoLock);
@@ -261,7 +264,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
 		< vacrelstats->rel_pages)
 	{
-		Assert(!aggressive);
+		Assert(!aggressive && !vacuum_even_frozen_page);
 		scanned_all_unfrozen = false;
 	}
 	else
@@ -442,7 +445,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  */
 static void
 lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive)
+			   Relation *Irel, int nindexes, bool aggressive, bool scan_all)
 {
 	BlockNumber nblocks,
 				blkno;
@@ -513,6 +516,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	 * such pages do not need freezing and do not affect the value that we can
 	 * safely set for relfrozenxid or relminmxid.
 	 *
+	 * When scan_all is set, we have to scan all pages forcibly while ignoring
+	 * visibility map status, and then we can safely set for relfrozenxid or
+	 * relminmxid.
+	 *
 	 * Before entering the main loop, establish the invariant that
 	 * next_unskippable_block is the next block number >= blkno that's not we
 	 * can't skip based on the visibility map, either all-visible for a
@@ -639,11 +646,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	/*
 	 * The current block is potentially sk

Re: [HACKERS] Reviewing freeze map code

2016-06-03 Thread Masahiko Sawada
On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>>> Can you submit that part as a separate patch?
>>
>> Attached.
>
> Thanks, committed.
>
>>>> I'm address the review comment of 7087166 commit, and will post the patch.
>>>
>>> When?
>>
>> On Saturday.
>
> Great.  Will that address everything for this open item, then?
>

I attached the patch for commit 7087166 in another mail.
I think that only the test tool for the visibility map remains, and it
is still under discussion.
Even if we have a verification tool or function for the visibility map,
we cannot repair its contents if they turn out to be wrong.
So I think we should have a way to regenerate the visibility map.
For this purpose, one idea is to do a vacuum that ignores the visibility
map, via a new option or a new function.
But IMHO it's not a good idea to let a function perform a vacuum, and
expanding the VACUUM syntax might be somewhat overkill.

So another idea is to have a GUC parameter, vacuum_even_frozen_page for example.
If this parameter is set to true (false by default), we vacuum the whole
table forcibly and regenerate the visibility map.
The advantage of this idea is that we don't need to expand the VACUUM
syntax, and we can relatively easily remove the parameter if it's no
longer necessary.

Thought?
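
If we went this way, the GUC itself would only be one more boolean entry
in guc.c; a rough sketch (the category and description text below are
assumptions, not a finished patch):

/* Sketch of the ConfigureNamesBool[] entry for the proposed parameter. */
{
	{"vacuum_even_frozen_page", PGC_USERSET, CLIENT_CONN_STATEMENT,
		gettext_noop("Forces VACUUM to scan all pages, ignoring the visibility map."),
		NULL
	},
	&vacuum_even_frozen_page,
	false,
	NULL, NULL, NULL
},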

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-03 Thread Masahiko Sawada
On Sat, Jun 4, 2016 at 12:41 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Jun 3, 2016 at 10:25 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>>>> +   charnew_vmbuf[BLCKSZ];
>>>> +   char   *new_cur = new_vmbuf;
>>>> +   boolempty = true;
>>>> +   boolold_lastpart;
>>>> +
>>>> +   /* Copy page header in advance */
>>>> +   memcpy(new_vmbuf, , 
>>>> SizeOfPageHeaderData);
>>>>
>>>> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
>>>> with old_lastpart && !empty, right?
>>>
>>> Oh, dear.  That seems like a possible data corruption bug.  Maybe we'd
>>> better fix that right away (although I don't actually have time before
>>> the wrap).
>
> Actually, on second thought, I'm not seeing the bug here.  It seems to
> me that the loop commented this way:
>
> /* Process old page bytes one by one, and turn it into new page. 
> */
>
> ...should always write to every byte in new_vmbuf, because we process
> exactly half the bytes in the old block at a time, and so that's going
> to generate exactly one full page of new bytes.  Am I missing
> something?

Yeah, you're right.
rewriteVisibilityMap() always writes the whole of new_vmbuf exactly.

>
>> Since the force is always set true, I removed the force from argument
>> of copyFile() and rewriteVisibilityMap().
>> And destination file is always opened with O_RDWR, O_CREAT, O_TRUNC flags .
>
> I'm not happy with this.  I think we should always open with O_EXCL,
> because the new file is not expected to exist and if it does,
> something's probably broken.  I think we should default to the safe
> behavior (which is failing) rather than the unsafe behavior (which is
> clobbering data).

I specified O_EXCL instead of O_TRUNC.

Attached is the updated patch.
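
With that change the destination open looks roughly like this (a sketch;
variable names follow the quoted function, the actual patch may differ in
details):

/* O_EXCL (rather than O_TRUNC) makes the open fail if the destination
 * already exists, so we never silently clobber an existing file. */
if ((dst_fd = open(tofile, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
{
	close(src_fd);
	return getErrorText();
}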

Regards,

--
Masahiko Sawada


fix_freeze_map_7087166_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-03 Thread Masahiko Sawada
On Sat, May 7, 2016 at 5:42 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, May 5, 2016 at 2:20 PM, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>>> 7087166 pg_upgrade: Convert old visibility map format to new format.
>>
>> +const char *
>> +rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
>> ...
>>
>> +   while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
>> +   {
>> ..
>>
>> Uh, shouldn't we actually fail if we read incompletely? Rather than
>> silently ignoring the problem? Ok, this causes no corruption, but it
>> indicates that something went significantly wrong.
>
> Sure, that's reasonable.
>

Fixed.
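
(Roughly along these lines; the exact error reporting in the attached
patch may differ:)

/* Sketch: after the read loop, a non-zero leftover means a partial page
 * (or a read error) in the old visibility map file -- fail instead of
 * silently ignoring it.  bytesRead, src_fd and dst_fd are the variables
 * from the quoted code. */
if (bytesRead != 0)
{
	close(dst_fd);
	close(src_fd);
	return "found partial page while rewriting visibility map";
}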

>> +   charnew_vmbuf[BLCKSZ];
>> +   char   *new_cur = new_vmbuf;
>> +   boolempty = true;
>> +   boolold_lastpart;
>> +
>> +   /* Copy page header in advance */
>> +   memcpy(new_vmbuf, , SizeOfPageHeaderData);
>>
>> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
>> with old_lastpart && !empty, right?
>
> Oh, dear.  That seems like a possible data corruption bug.  Maybe we'd
> better fix that right away (although I don't actually have time before
> the wrap).

Since force is always set to true, I removed it from the arguments of
copyFile() and rewriteVisibilityMap().
The destination file is now always opened with the O_RDWR, O_CREAT and
O_TRUNC flags.

>> +   if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), 
>> S_IRUSR | S_IWUSR)) < 0)
>> +   {
>> +   close(src_fd);
>> +   return getErrorText();
>> +   }
>>
>> I know you guys copied this, but what's the force thing about?
>> Expecially as it's always set to true by the callers (i.e. what is the
>> parameter even about?)?  Wouldn't we at least have to specify O_TRUNC in
>> the force case?
>
> I just work here.
>
>> +   old_cur += BITS_PER_HEAPBLOCK_OLD;
>> +   new_cur += BITS_PER_HEAPBLOCK;
>>
>> I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD
>> stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not
>> be able to have differing ones anyway, should we decide to add a third
>> bit?
>
> I think that's just a matter of style.

So this comment is not incorporated.

Patch attached; please review it.

Regards,

--
Masahiko Sawada


fix_freeze_map_7087166.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-03 Thread Masahiko Sawada
On Sat, Jun 4, 2016 at 12:08 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Jun 3, 2016 at 10:49 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> That patch also incorporates the following review comment.
>> We can push at least this fix.
>
> Can you submit that part as a separate patch?

Attached.

>> I'm address the review comment of 7087166 commit, and will post the patch.
>
> When?
>

On Saturday.

Regards,

--
Masahiko Sawada


fix_freeze_map_fd31cd2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-03 Thread Masahiko Sawada
On Fri, Jun 3, 2016 at 11:03 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 2, 2016 at 11:24 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> Attached patch optimises skipping pages logic so that blkno can jump to
>> next_unskippable_block directly while counting the number of all_visible
>> and all_frozen pages. So we can avoid double checking visibility map.
>
> I think this is 9.7 material.  This patch has already won the
> "scariest patch" tournament.  Changing the logic more than necessary
> at this late date seems like it just increases the scariness.  I think
> this is an opportunity for further optimization, not a defect.
>

I agree with you.
I'll submit this as an improvement for 9.7.
That patch also incorporates the following review comment.
We can push at least this fix.
>> /*
>>  * Compute whether we actually scanned the whole relation. If we 
>> did, we
>>  * can adjust relfrozenxid and relminmxid.
>>  *
>>  * NB: We need to check this before truncating the relation, because 
>> that
>>  * will change ->rel_pages.
>>  */
>>
>> Comment is out-of-date now.

I'm addressing the review comments on commit 7087166 and will post the patch.
The testing feature for the freeze map is still under discussion.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-02 Thread Masahiko Sawada
 how to make it easier
>> though without just ripping out the SKIP_PAGES_THRESHOLD stuff.
>
> Yep, I had the same problem.
>> Hm. This also doubles the number of VM accesses. While I guess that's
>> not noticeable most of the time, it's still not nice; especially when a
>> large relation is entirely frozen, because it'll mean we'll sequentially
>> go through the visibilityma twice.
>
> Compared to what we're saving, that's obviously a trivial cost.
> That's not to say that we might not want to improve it, but it's
> hardly a disaster.
>
> In short: wah, wah, wah.
>

The attached patch optimises the page-skipping logic so that blkno can
jump directly to next_unskippable_block while counting the number of
all-visible and all-frozen pages, so we can avoid checking the
visibility map twice.

Regards,

--
Masahiko Sawada


fix_freeze_map_fd31cd2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rename synchronous_standby_names?

2016-06-01 Thread Masahiko Sawada
On Wed, Jun 1, 2016 at 9:49 AM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Wed, Jun 1, 2016 at 3:56 AM, David G. Johnston
> <david.g.johns...@gmail.com> wrote:
>> On Tue, May 31, 2016 at 2:19 PM, Peter Eisentraut
>> <peter.eisentr...@2ndquadrant.com> wrote:
>>>
>>> On 5/31/16 1:47 PM, Jaime Casanova wrote:
>>>>
>>>> Are we going to change synchronous_standby_names? Certainly the GUC is
>>>> not *only* a list of names anymore.
>>>>
>>>> synchronous_standby_config?
>>>> synchronous_standbys (adjust to correct english if necesary)?
>>>
>>>
>>> If the existing values are still going to be accepted, then I would leave
>>> it as is.
>>
>>
>> +1
>
> +1. We've made quite a lot of deal to take an approach for the N-sync
> that is 100% backward-compatible, it would be good to not break that
> effort.

+1

-- 
Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-01 Thread Masahiko Sawada
  * NOTE: This function is typically called without a lock on the heap page,
>>   * so somebody else could change the bit just after we look at it.  In fact,
>> @@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, 
>> Buffer heapBuf,
>>
>> I'm not seing what flags the above comment change is referring to?
>
> Ugh.  I think that's leftover cruft from an earlier patch version that
> should have been excised from what got committed.

Fixed.

>> /*
>> -* A single-bit read is atomic.  There could be memory-ordering 
>> effects
>> +* A single byte read is atomic.  There could be memory-ordering 
>> effects
>>  * here, but for performance reasons we make it the caller's job to 
>> worry
>>  * about that.
>>  */
>> -   result = (map[mapByte] & (1 << mapBit)) ? true : false;
>> -
>> -   return result;
>> +   return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
>>  }
>>
>> Not a new issue, and *very* likely to be irrelevant in practice (given
>> the value is only referenced once): But there's really no guarantee
>> map[mapByte] is only read once here.
>
> Meh.  But we can fix if you want to.

Fixed.

>> -BlockNumber
>> -visibilitymap_count(Relation rel)
>> +void
>> +visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber 
>> *all_frozen)
>>
>> Not really a new issue again: The parameter types (previously return
>> type) to this function seem wrong to me.
>
> Not this patch's job to tinker.

This comment is not incorporated in this patch yet.

>> @@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, 
>> TransactionId *visibility_cut
>> }
>> +   /*
>> +* We don't bother clearing *all_frozen when the page is discovered 
>> not
>> +* to be all-visible, so do that now if necessary.  The page might 
>> fail
>> +* to be all-frozen for other reasons anyway, but if it's not 
>> all-visible,
>> +* then it definitely isn't all-frozen.
>> +*/
>> +   if (!all_visible)
>> +   *all_frozen = false;
>> +
>>
>> Why don't we just set *all_frozen to false when appropriate? It'd be
>> just as many lines and probably easier to understand?
>
> I thought that looked really easy to mess up, either now or down the
> road.  This way seemed more solid to me.  That's a judgement call, of
> course.

To make it easier to understand, I changed it that way.

>> +   /*
>> +* If the page is marked as all-visible but not all-frozen, 
>> we should
>> +* so mark it.  Note that all_frozen is only valid if 
>> all_visible is
>> +* true, so we must check both.
>> +*/
>>
>> This kinda seems to imply that all-visible implies all_frozen. Also, why
>> has that block been added to the end of the if/else if chain? Seems like
>> it belongs below the (all_visible && !all_visible_according_to_vm) block.
>
> We can adjust the comment a bit to make it more clear, if you like,
> but I doubt it's going to cause serious misunderstanding.  As for the
> placement, the reason I put it at the end is because I figured that we
> did not want to mark it all-frozen if any of the "oh crap, emit a
> warning" cases applied.
>

Fixed the comment.
I think that we should take care of the all-visible problem first, and
then the all-frozen problem.
So this patch doesn't change the placement.

The attached patch fixes only the above comments; the others are being
addressed now.

-- 
Regards,

--
Masahiko Sawada
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index eaab4be..05422f1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -33,7 +33,7 @@
  * is set, we know the condition is true, but if a bit is not set, it might or
  * might not be true.
  *
- * Clearing both visibility map bits is not separately WAL-logged.  The callers
+ * Clearing visibility map bits is not separately WAL-logged.  The callers
  * must make sure that whenever a bit is cleared, the bit is cleared on WAL
  * replay of the updating operation as well.
  *
@@ -104,13 +104,16 @@
  */
 #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
 
+/* Number of heap blocks we can represent in one byte */
+#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
+
 /* Number of heap blocks we can represent in one visibility map page. */
 #define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
 
 /* Mapping from heap block number to the right

Re: [HACKERS] Reviewing freeze map code

2016-05-29 Thread Masahiko Sawada
On Sun, May 29, 2016 at 2:44 PM, Noah Misch <n...@leadboat.com> wrote:
> On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote:
>> On Thu, May 5, 2016 at 2:20 PM, Andres Freund <and...@anarazel.de> wrote:
>> > On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>> > +   charnew_vmbuf[BLCKSZ];
>> > +   char   *new_cur = new_vmbuf;
>> > +   boolempty = true;
>> > +   boolold_lastpart;
>> > +
>> > +   /* Copy page header in advance */
>> > +   memcpy(new_vmbuf, , 
>> > SizeOfPageHeaderData);
>> >
>> > Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
>> > with old_lastpart && !empty, right?
>>
>> Oh, dear.  That seems like a possible data corruption bug.  Maybe we'd
>> better fix that right away (although I don't actually have time before
>> the wrap).
>
> [This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> efforts toward speedy resolution.  Thanks.
>
> [1] 
> http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com

Thank you for the notification.

The check tool for the visibility map is still under discussion.
I'm going to address the other review comments and send the patch ASAP.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-05-17 Thread Masahiko Sawada
On Tue, May 17, 2016 at 4:34 PM, Joshua D. Drake <j...@commandprompt.com> wrote:
> On 05/17/2016 12:32 PM, Alvaro Herrera wrote:
>
>>   Syntaxes are;
>>   VACUUM (SCAN_ALL) table_name;
>>   VACUUM (SCAN_ALL);  -- for all tables on database
>>
>> Is SCAN_ALL really the best we can do here?  The business of having an
>> underscore in an option name has no precedent (other than
>> CURRENT_DATABASE and the like).  How about COMPLETE, TOTAL, or WHOLE?
>>
>
> VACUUM (ANALYZE, VERBOSE, WHOLE)
> 
>
> That seems reasonable? I agree that SCAN_ALL doesn't fit. I am not trying to
> pull a left turn but is there a technical reason we don't just make FULL do
> this?
>

The FULL option requires an AccessExclusiveLock, which could be a problem.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-05-17 Thread Masahiko Sawada
On Tue, May 17, 2016 at 3:32 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Masahiko Sawada wrote:
>> On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
>> > We should support scan_all only with the new-style options syntax for
>> > VACUUM; that is, vacuum (scan_all) rename.  That doesn't require
>> > making scan_all a keyword, which is good: this is a minor feature, and
>> > we don't want to bloat the parsing tables for it.
>>
>> I agree with having new-style options syntax.
>> Isn't it better to have SCAN_ALL option without parentheses?
>>
>> Syntaxes are;
>> VACUUM SCAN_ALL table_name;
>> VACUUM SCAN_ALL;  -- for all tables on database
>
> No, I agree with Robert that we shouldn't add any more such options to
> avoid keyword proliferation.
>
>  Syntaxes are;
>  VACUUM (SCAN_ALL) table_name;
>  VACUUM (SCAN_ALL);  -- for all tables on database

Okay, I agree with this.

> Is SCAN_ALL really the best we can do here?  The business of having an
> underscore in an option name has no precedent (other than
> CURRENT_DATABASE and the like).

Another way is to have a tool or function that removes the _vm file safely, for example.

> How about COMPLETE, TOTAL, or WHOLE?

IMHO, I don't have a strong opinion about SCAN_ALL as long as we
document the option and the name doesn't confuse users.
But ISTM that COMPLETE or TOTAL might mislead users into thinking that a
normal vacuum is somehow incomplete.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-05-17 Thread Masahiko Sawada
On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, May 10, 2016 at 10:40 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> Or second way I came up with is having tool to remove particular _vm
>> file safely, which is executed via SQL or client tool like
>> pg_resetxlog.
>>
>> Attached updated VACUUM SCAN_ALL patch.
>> Please find it.
>
> We should support scan_all only with the new-style options syntax for
> VACUUM; that is, vacuum (scan_all) rename.  That doesn't require
> making scan_all a keyword, which is good: this is a minor feature, and
> we don't want to bloat the parsing tables for it.
>

I agree with having the new-style options syntax.
Wouldn't it be better to have the SCAN_ALL option without parentheses?

The syntaxes would be:
VACUUM SCAN_ALL table_name;
VACUUM SCAN_ALL;  -- for all tables on database

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-05-10 Thread Masahiko Sawada
On Tue, May 10, 2016 at 11:30 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, May 9, 2016 at 7:40 PM, Ants Aasma <ants.aa...@eesti.ee> wrote:
>> On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>>> wrote:
>>>> Attached draft patch adds SCANALL option to VACUUM in order to scan
>>>> all pages forcibly while ignoring visibility map information.
>>>> The option name is SCANALL for now but we could change it after got 
>>>> consensus.
>>>
>>> If we're going to go that way, I'd say it should be scan_all rather
>>> than scanall.  Makes it clearer, at least IMHO.
>>
>> Just to add some diversity to opinions, maybe there should be a
>> separate command for performing integrity checks. Currently the best
>> ways to actually verify database correctness do so as a side effect.
>> The question that I get pretty much every time after I explain why we
>> have data checksums, is "how do I check that they are correct" and we
>> don't have a nice answer for that now. We could also use some ways to
>> sniff out corrupted rows that don't involve crashing the server in a
>> loop. Vacuuming pages that supposedly don't need vacuuming just to
>> verify integrity seems very much in the same vein.
>>
>> I know right now isn't exactly the best time to hastily slap on such a
>> feature, but I just wanted the thought to be out there for
>> consideration.
>
> I think that it's quite reasonable to have ways of performing an
> integrity check that are separate from VACUUM, but this is about
> having a way to force VACUUM to scan all-frozen pages

Or, a second way I came up with is to have a tool that removes a
particular _vm file safely, executed via SQL or a client tool like
pg_resetxlog.
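
As a point of reference, pg_visibility already exposes the visibility
map from SQL, so if a truncation function such as
pg_truncate_visibility_map() becomes available there, the SQL route
could look roughly like this sketch (the function's availability is an
assumption, not something in the current tree):

 -- sketch only: assumes pg_visibility provides pg_truncate_visibility_map()
 CREATE EXTENSION IF NOT EXISTS pg_visibility;
 SELECT pg_truncate_visibility_map('test'::regclass);
 VACUUM VERBOSE test;   -- the next vacuum rebuilds the visibility map from scratch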

An updated VACUUM SCAN_ALL patch is attached.
Please find it below.

Regards,

--
Masahiko Sawada
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 19fd748..8f63fad 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -21,9 +21,9 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE | SCAN_ALL } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCAN_ALL ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCAN_ALL ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -120,6 +120,17 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
    </varlistentry>
 
    <varlistentry>
+    <term><literal>SCAN_ALL</literal></term>
+    <listitem>
+     <para>
+      Selects forcibly full page scanning vacuum while ignoring visibility map.
+      Forcibly full page scanning vacuum is always performed when the table is
+      rewritten so this option is redundant when FULL is specified.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><literal>ANALYZE</literal></term>
     <listitem>
      <para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..eee93c4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -138,7 +138,7 @@ static BufferAccessStrategy vac_strategy;
 
 /* non-export function prototypes */
 static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive);
+			   Relation *Irel, int nindexes, bool aggressive, bool scan_all);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
 static void lazy_vacuum_index(Relation indrel,
@@ -185,6 +185,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	double		read_rate,
 write_rate;
 	bool		aggressive;		/* should we scan all unfrozen pages? */
+	bool		scan_all;		/* should we scan all pages forcibly? */
 	bool		scanned_all_unfrozen;	/* actually scanned all such pages? */
 	TransactionId xidFullScanLimit;
 	MultiXactId mxactFullScanLimit;
@@ -233,6 +234,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
 			  mxactFullScanLimit);
 
+	/* If SCAN_ALL option is specified, we have to scan all pages forcibly */
+	scan_all = options & VACOPT_SCANALL;
+
 	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
 	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
@@ -246,14 +250,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	vacrelstats->hasindex = (nindexes > 0);
 
 	/* Do the vacuuming */
-	lazy_scan_heap(onerel, vacrelstats, Ire

Re: [HACKERS] Reviewing freeze map code

2016-05-08 Thread Masahiko Sawada
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <and...@anarazel.de> wrote:
> fd31cd2 Don't vacuum all-frozen pages.

-			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
 							 vacrelstats->pages_removed,
 							 vacrelstats->rel_pages,
-							 vacrelstats->pinskipped_pages);
+							 vacrelstats->pinskipped_pages,
+							 vacrelstats->frozenskipped_pages);

The verbose information about skipped frozen pages is emitted only by
autovacuum.
But I think this information is also helpful for a manual VACUUM.
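
To illustrate what that would mean in practice, a manual run such as
the one below would then report the skipped-frozen count in its INFO
output (the exact counts obviously depend on the table):

 -- illustrative: after the patch, the verbose report of a manual vacuum
 -- would also include a "Skipped N frozen pages." line
 VACUUM (FREEZE, VERBOSE) test;   -- freeze everything first
 VACUUM (VERBOSE) test;           -- a second run can skip the all-frozen pages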

Please find attached a patch which fixes that.

Regards,

--
Masahiko Sawada
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..fa6e5fa 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1316,6 +1316,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	"Skipped %u pages due to buffer pins.\n",
 	vacrelstats->pinskipped_pages),
 	 vacrelstats->pinskipped_pages);
+	appendStringInfo(, _("Skipped %u frozen pages.\n"),
+	 vacrelstats->frozenskipped_pages);
 	appendStringInfo(, ngettext("%u page is entirely empty.\n",
 	"%u pages are entirely empty.\n",
 	empty_pages),



Re: [HACKERS] Reviewing freeze map code

2016-05-08 Thread Masahiko Sawada
On Sun, May 8, 2016 at 3:18 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <j...@commandprompt.com> 
>> wrote:
>>> On 05/06/2016 01:58 PM, Stephen Frost wrote:
>>>>
>>>> * Joshua D. Drake (j...@commandprompt.com) wrote:
>>>>>
>>>>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>>>>> When you use FORCE there is an assumption that no matter what, it
>>>>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>>>>> but FORCE doesn't work either.
>>>>
>>>>
>>>> Isn't that exactly what this FORCE option being contemplated would do
>>>> though?  Plow through the entire relation, regardless of what the VM
>>>> says is all frozen or not?
>>>>
>>>> Seems like FORCE is a good word for that to me.
>>>
>>>
>>> Except that we aren't FORCING a vacuum. That is the part I have contention
>>> with. To me, FORCE means:
>>>
>>> No matter what else is happening, we are vacuuming this relation (think
>>> locks).
>>>
>>> But I am also not going to dig in my heals. If that is truly what -hackers
>>> come up with, thank you at least considering what I said.
>>>
>>> Sincerely,
>>>
>>> JD
>>>
>>
>> As Joshua mentioned, FORCE word might imply doing VACUUM while plowing
>> through locks.
>> I guess that it might confuse the users.
>> IMO, since this option will be a way for emergency, SCANALL word works for 
>> me.
>>
>> Or other ideas are,
>> VACUUM IGNOREVM
>> VACUUM RESCURE
>>
>
> Oops, VACUUM RESCUE is correct.
>

The attached draft patch adds a SCANALL option to VACUUM in order to
forcibly scan all pages while ignoring the visibility map information.
The option name is SCANALL for now, but we could change it once we get consensus.
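
One way to see the effect of such an option, sketched with the
pg_visibility functions added earlier in this cycle (the VACUUM line is
hypothetical, since SCANALL exists only in the attached draft):

 -- sketch: assumes the pg_visibility extension is installed; check how much
 -- of the table is currently marked all-visible/all-frozen, then force a
 -- full scan regardless of those bits
 SELECT * FROM pg_visibility_map_summary('test'::regclass);
 VACUUM (SCANALL, VERBOSE) test;   -- hypothetical, per the attached draft patch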

Regards,

--
Masahiko Sawada
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 19fd748..130a70e 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -21,9 +21,9 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE | SCANALL } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCANALL ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCANALL ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -120,6 +120,17 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
    </varlistentry>
 
    <varlistentry>
+    <term><literal>SCANALL</literal></term>
+    <listitem>
+     <para>
+      Selects forcibly full page scanning vacuum while ignoring visibility map.
+      Forcibly full page scanning vacuum is always performed when the table is
+      rewritten so this option is redundant when FULL is specified.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><literal>ANALYZE</literal></term>
     <listitem>
      <para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..85e04ac 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -138,7 +138,7 @@ static BufferAccessStrategy vac_strategy;
 
 /* non-export function prototypes */
 static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive);
+			   Relation *Irel, int nindexes, bool aggressive, bool scan_all);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
 static void lazy_vacuum_index(Relation indrel,
@@ -185,6 +185,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	double		read_rate,
 write_rate;
 	bool		aggressive;		/* should we scan all unfrozen pages? */
+	bool		scan_all;		/* should we scan all pages forcibly? */
 	bool		scanned_all_unfrozen;	/* actually scanned all such pages? */
 	TransactionId xidFullScanLimit;
 	MultiXactId mxactFullScanLimit;
@@ -233,6 +234,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
 			  mxactFullScanLimit);
 
+	/* If SCANALL option is specified, we have to scan all pages forcibly */
+	scan_all = options & VACOPT_SCANALL;
+
 	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
 	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
@@ -246,14 +250,14 @@ lazy_vacuum_rel(Relation onerel, i

Re: [HACKERS] Reviewing freeze map code

2016-05-08 Thread Masahiko Sawada
On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <j...@commandprompt.com> 
> wrote:
>> On 05/06/2016 01:58 PM, Stephen Frost wrote:
>>>
>>> * Joshua D. Drake (j...@commandprompt.com) wrote:
>>>>
>>>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>>>> When you use FORCE there is an assumption that no matter what, it
>>>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>>>> but FORCE doesn't work either.
>>>
>>>
>>> Isn't that exactly what this FORCE option being contemplated would do
>>> though?  Plow through the entire relation, regardless of what the VM
>>> says is all frozen or not?
>>>
>>> Seems like FORCE is a good word for that to me.
>>
>>
>> Except that we aren't FORCING a vacuum. That is the part I have contention
>> with. To me, FORCE means:
>>
>> No matter what else is happening, we are vacuuming this relation (think
>> locks).
>>
>> But I am also not going to dig in my heals. If that is truly what -hackers
>> come up with, thank you at least considering what I said.
>>
>> Sincerely,
>>
>> JD
>>
>
> As Joshua mentioned, FORCE word might imply doing VACUUM while plowing
> through locks.
> I guess that it might confuse the users.
> IMO, since this option will be a way for emergency, SCANALL word works for me.
>
> Or other ideas are,
> VACUUM IGNOREVM
> VACUUM RESCURE
>

Oops, VACUUM RESCUE is correct.

Regards,

--
Masahiko Sawada




Re: [HACKERS] Reviewing freeze map code

2016-05-07 Thread Masahiko Sawada
On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <j...@commandprompt.com> wrote:
> On 05/06/2016 01:58 PM, Stephen Frost wrote:
>>
>> * Joshua D. Drake (j...@commandprompt.com) wrote:
>>>
>>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>>> When you use FORCE there is an assumption that no matter what, it
>>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>>> but FORCE doesn't work either.
>>
>>
>> Isn't that exactly what this FORCE option being contemplated would do
>> though?  Plow through the entire relation, regardless of what the VM
>> says is all frozen or not?
>>
>> Seems like FORCE is a good word for that to me.
>
>
> Except that we aren't FORCING a vacuum. That is the part I have contention
> with. To me, FORCE means:
>
> No matter what else is happening, we are vacuuming this relation (think
> locks).
>
> But I am also not going to dig in my heals. If that is truly what -hackers
> come up with, thank you at least considering what I said.
>
> Sincerely,
>
> JD
>

As Joshua mentioned, the word FORCE might imply doing VACUUM while
plowing through locks.
I guess that it might confuse users.
IMO, since this option will be a way out in an emergency, the word
SCANALL works for me.

Or other ideas are,
VACUUM IGNOREVM
VACUUM RESCURE

Regards,

--
Masahiko Sawada




Re: [HACKERS] Initial release notes created for 9.6

2016-05-05 Thread Masahiko Sawada
On Fri, May 6, 2016 at 2:32 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I've pushed a first cut at release notes for 9.6.  There's a good deal
> of work to do yet:
>
> * The section about parallel query could probably stand to be fleshed out,
> but I'm unsure what to say.  Somebody who's worked on that should provide
> some text.
>
> * Bruce usually likes to sprinkle the notes with a whole lot of links
> to the main docs.  I've only bothered with links for new GUCs and system
> views.
>
> * As is somewhat customary for early drafts of the notes, I've made no
> attempt to call out which are the most significant changes.  I've not
> tried to isolate the non-backwards-compatible items, either.
>
> Please review and comment before Monday, if you can.
>

+Avoid re-vacuuming pages containing only frozen tuples
+(Masahiko Sawada, Robert Haas)

+Support synchronous replication with multiple synchronous standby
+servers, not just one (Sawada Masahiko, Beena Emerson, Michael
+Paquier, Fujii Masao, Kyotaro Horiguchi)

A very minor comment, but I'd like to unify my name to First Last
(i.e., Masahiko Sawada).

Regards,

--
Masahiko Sawada




Re: [HACKERS] Reviewing freeze map code

2016-05-03 Thread Masahiko Sawada
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> The freeze map changes, besides being very important, seem to be one of
> the patches with a high risk profile in 9.6.  Robert had asked whether
> I'd take a look.  I thought it'd be a good idea to review that while
> running tests for
> http://www.postgresql.org/message-id/CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+pce93s...@mail.gmail.com

Thank you for reviewing.

> For starters, I'm just going through the commits. It seems the relevant
> pieces are:
>
> a892234 Change the format of the VM fork to add a second bit per page.
> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
> fd31cd2 Don't vacuum all-frozen pages.
> 7087166 pg_upgrade: Convert old visibility map format to new format.
> ba0a198 Add pg_visibility contrib module.
>
> did I miss anything important?
>

That's all.

Regards,

--
Masahiko Sawada




Re: [HACKERS] Re: [COMMITTERS] pgsql: pg_upgrade: Convert old visibility map format to new format.

2016-04-26 Thread Masahiko Sawada
On Wed, Apr 27, 2016 at 10:00 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Apr 26, 2016 at 8:45 PM, Bruce Momjian <br...@momjian.us> wrote:
>> On Fri, Mar 11, 2016 at 05:36:34PM +, Robert Haas wrote:
>>> pg_upgrade: Convert old visibility map format to new format.
>>>
>>> Commit a892234f830e832110f63fc0a2afce2fb21d1584 added a second bit per
>>> page to the visibility map, but pg_upgrade has been unaware of it up
>>> until now.  Therefore, a pg_upgrade from an earlier major release of
>>> PostgreSQL to any commit preceding this one and following the one
>>> mentioned above would result in invalid visibility map contents on the
>>> new cluster, very possibly leading to data corruption.  This plugs
>>> that hole.
>>>
>>> Masahiko Sawada, reviewed by Jeff Janes, Bruce Momjian, Simon Riggs,
>>> Michael Paquier, Andres Freund, me, and others.
>>
>> I have tested the current git trees of all supported versions of
>> Postgres in all possible pg_upgrade combinations and they all worked
>> perfectly.

Thank you for testing!
That's good news for me.

>
> That's good!
>
> All the commits related to this topic could use extra-careful review
> and testing.  If Masahiko-san got anything wrong, or I did, this could
> eat people's data in ways that are very hard to track down.
>

Yeah, we should test this feature in various ways, and I'm going to do so.
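
For checking upgraded clusters, one sketch is to lean on the
pg_visibility contrib module added in this cycle (the extension is
assumed to be installed on the new cluster):

 -- sketch: both functions should return zero rows if the converted
 -- visibility map agrees with the heap for the given table
 CREATE EXTENSION IF NOT EXISTS pg_visibility;
 SELECT * FROM pg_check_frozen('test'::regclass);
 SELECT * FROM pg_check_visible('test'::regclass);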

Regards,

--
Masahiko Sawada



