RE: [PATCH] Speedup truncates of relation forks

2019-09-24 Thread Jamison, Kirk
On Tuesday, September 24, 2019 5:41 PM (GMT+9), Fujii Masao wrote:
> On Thu, Sep 19, 2019 at 9:42 AM Jamison, Kirk 
> wrote:
> >
> > On Wednesday, September 18, 2019 8:38 PM, Fujii Masao wrote:
> > > On Tue, Sep 17, 2019 at 10:44 AM Jamison, Kirk
> > > 
> > > wrote:
> > > >
> > > > On Friday, September 13, 2019 10:06 PM (GMT+9), Fujii Masao wrote:
> > > > > On Fri, Sep 13, 2019 at 9:51 PM Alvaro Herrera
> > > > > 
> > > > > wrote:
> > > > > >
> > > > > > On 2019-Sep-13, Fujii Masao wrote:
> > > > > >
> > > > > > > On Mon, Sep 9, 2019 at 3:52 PM Jamison, Kirk
> > > > > > > 
> > > > > wrote:
> > > > > >
> > > > > > > > > Please add a preliminary patch that removes the function.
> > > > > > > > > Dead code is good, as long as it is gone.  We can get it
> > > > > > > > > pushed ahead of
> > > > > the rest of this.
> > > > > > > >
> > > > > > > > Alright. I've attached a separate patch removing the
> > > smgrdounlinkfork.
> > > > > > >
> > > > > > > Per the past discussion, some people want to keep this "dead"
> > > > > > > function for some reasons. So, in my opinion, it's better to
> > > > > > > just enclose the function with #if NOT_USED and #endif, to
> > > > > > > keep the function itself as it is, and then to start new
> > > > > > > discussion on hackers about the removal of that separatedly from
> this patch.
> > > > > >
> > > > > > I searched for anybody requesting to keep the function.  I
> > > > > > couldn't find anything.  Tom said in 2012:
> > > > > > https://www.postgresql.org/message-id/1471.1339106...@sss.pgh.
> > > > > > pa.u
> > > > > > s
> > > > >
> > > > > Yes. And I found Andres.
> > > > > https://www.postgresql.org/message-id/20180621174129.hogefyopje4
> > > > > xazn
> > > > > u@al
> > > > > ap3.anarazel.de
> > > > >
> > > > > > > As committed, the smgrdounlinkfork case is actually dead
> > > > > > > code; it's never called from anywhere.  I left it in place
> > > > > > > just in case we want it someday.
> > > > > >
> > > > > > but if no use has appeared in 7 years, I say it's time to kill it.
> > > > >
> > > > > +1
> > > >
> > > > The consensus is we remove it, right?
> > > > Re-attaching the patch that removes the deadcode: smgrdounlinkfork().
> > > >
> > > > ---
> > > > I've also fixed Fujii-san's comments below in the latest attached
> > > > speedup
> > > truncate rel patch (v8).
> > >
> > > Thanks for updating the patch!
> > >
> > > + block = visibilitymap_truncate_prepare(rel, 0); if
> > > + (BlockNumberIsValid(block))
> > >   {
> > > - xl_smgr_truncate xlrec;
> > > + fork = VISIBILITYMAP_FORKNUM;
> > > + smgrtruncate(rel->rd_smgr, &fork, 1, &block);
> > > +
> > > + if (RelationNeedsWAL(rel))
> > > + {
> > > + xl_smgr_truncate xlrec;
> > >
> > > I don't think this fix is right. Originally, WAL is generated even
> > > in the case where visibilitymap_truncate_prepare() returns
> > > InvalidBlockNumber. But the patch unexpectedly changed the logic so
> > > that WAL is not generated in that case.
> > >
> > > + if (fsm)
> > > + FreeSpaceMapVacuumRange(rel, first_removed_nblocks,
> > > + InvalidBlockNumber);
> > >
> > > This code means that FreeSpaceMapVacuumRange() is called if FSM
> > > exists even if FreeSpaceMapLocateBlock() returns InvalidBlockNumber.
> > > This seems not right. Originally, FreeSpaceMapVacuumRange() is not
> > > called in the case where InvalidBlockNumber is returned.
> > >
> > > So I updated the patch based on yours and fixed the above issues.
> > > Attached. Could you review this one? If there is no issue in that,
> > > I'm thinking to commit that.
> >
> > Oops. Thanks for the catch to correct my fix and revision of some
> descriptions.
> > I also noticed you reordered the truncation of forks,  by which main
> > fork will be truncated first instead of FSM. I'm not sure if the order
> > matters now given that we're truncating the forks simultaneously, so I'm
> ok with that change.
> 
> I changed that order so that DropRelFileNodeBuffers() can scan shared_buffers
> more efficiently. Usually the number of buffers for MAIN fork is larger than
> the others, in shared_buffers. So it's better to compare MAIN fork first for
> performance, during full scan of shared_buffers.
> 
> > Just one minor comment:
> > + * Return the number of blocks of new FSM after it's truncated.
> >
> > "after it's truncated" is quite confusing.
> > How about, "as a result of previous truncation" or just end the sentence
> after new FSM?
> 
> Thanks for the comment!
> I adopted the latter and committed the patch. Thanks!

Thank you very much, Fujii-san, for taking the time to review
as well as for committing this patch!

Regards,
Kirk Jamison


RE: [PATCH] Speedup truncates of relation forks

2019-09-18 Thread Jamison, Kirk
On Wednesday, September 18, 2019 8:38 PM, Fujii Masao wrote:
> On Tue, Sep 17, 2019 at 10:44 AM Jamison, Kirk 
> wrote:
> >
> > On Friday, September 13, 2019 10:06 PM (GMT+9), Fujii Masao wrote:
> > > On Fri, Sep 13, 2019 at 9:51 PM Alvaro Herrera
> > > 
> > > wrote:
> > > >
> > > > On 2019-Sep-13, Fujii Masao wrote:
> > > >
> > > > > On Mon, Sep 9, 2019 at 3:52 PM Jamison, Kirk
> > > > > 
> > > wrote:
> > > >
> > > > > > > Please add a preliminary patch that removes the function.
> > > > > > > Dead code is good, as long as it is gone.  We can get it
> > > > > > > pushed ahead of
> > > the rest of this.
> > > > > >
> > > > > > Alright. I've attached a separate patch removing the
> smgrdounlinkfork.
> > > > >
> > > > > Per the past discussion, some people want to keep this "dead"
> > > > > function for some reasons. So, in my opinion, it's better to
> > > > > just enclose the function with #if NOT_USED and #endif, to keep
> > > > > the function itself as it is, and then to start new discussion
> > > > > on hackers about the removal of that separatedly from this patch.
> > > >
> > > > I searched for anybody requesting to keep the function.  I
> > > > couldn't find anything.  Tom said in 2012:
> > > > https://www.postgresql.org/message-id/1471.1339106...@sss.pgh.pa.u
> > > > s
> > >
> > > Yes. And I found Andres.
> > > https://www.postgresql.org/message-id/20180621174129.hogefyopje4xazn
> > > u@al
> > > ap3.anarazel.de
> > >
> > > > > As committed, the smgrdounlinkfork case is actually dead code;
> > > > > it's never called from anywhere.  I left it in place just in
> > > > > case we want it someday.
> > > >
> > > > but if no use has appeared in 7 years, I say it's time to kill it.
> > >
> > > +1
> >
> > The consensus is we remove it, right?
> > Re-attaching the patch that removes the deadcode: smgrdounlinkfork().
> >
> > ---
> > I've also fixed Fujii-san's comments below in the latest attached speedup
> truncate rel patch (v8).
> 
> Thanks for updating the patch!
> 
> + block = visibilitymap_truncate_prepare(rel, 0); if
> + (BlockNumberIsValid(block))
>   {
> - xl_smgr_truncate xlrec;
> + fork = VISIBILITYMAP_FORKNUM;
> + smgrtruncate(rel->rd_smgr, &fork, 1, &block);
> +
> + if (RelationNeedsWAL(rel))
> + {
> + xl_smgr_truncate xlrec;
> 
> I don't think this fix is right. Originally, WAL is generated even in the
> case where visibilitymap_truncate_prepare() returns InvalidBlockNumber. But
> the patch unexpectedly changed the logic so that WAL is not generated in that
> case.
> 
> + if (fsm)
> + FreeSpaceMapVacuumRange(rel, first_removed_nblocks,
> + InvalidBlockNumber);
> 
> This code means that FreeSpaceMapVacuumRange() is called if FSM exists even
> if FreeSpaceMapLocateBlock() returns InvalidBlockNumber.
> This seems not right. Originally, FreeSpaceMapVacuumRange() is not called
> in the case where InvalidBlockNumber is returned.
> 
> So I updated the patch based on yours and fixed the above issues.
> Attached. Could you review this one? If there is no issue in that, I'm 
> thinking
> to commit that.

Oops. Thanks for catching and correcting my fix, and for revising some of the descriptions.
I also noticed you reordered the truncation of forks, so that the main fork will be
truncated first instead of the FSM. I'm not sure the order matters now, given that
we're truncating the forks simultaneously, so I'm OK with that change.

Just one minor comment:
+ * Return the number of blocks of new FSM after it's truncated.

"after it's truncated" is quite confusing. 
How about, "as a result of previous truncation" or just end the sentence after 
new FSM?


Thank you for committing the other patch as well!


Regards,
Kirk Jamison


RE: [PATCH] Speedup truncates of relation forks

2019-09-16 Thread Jamison, Kirk
On Friday, September 13, 2019 10:06 PM (GMT+9), Fujii Masao wrote:
> On Fri, Sep 13, 2019 at 9:51 PM Alvaro Herrera 
> wrote:
> >
> > On 2019-Sep-13, Fujii Masao wrote:
> >
> > > On Mon, Sep 9, 2019 at 3:52 PM Jamison, Kirk 
> wrote:
> >
> > > > > Please add a preliminary patch that removes the function.  Dead
> > > > > code is good, as long as it is gone.  We can get it pushed ahead of
> the rest of this.
> > > >
> > > > Alright. I've attached a separate patch removing the smgrdounlinkfork.
> > >
> > > Per the past discussion, some people want to keep this "dead"
> > > function for some reasons. So, in my opinion, it's better to just
> > > enclose the function with #if NOT_USED and #endif, to keep the
> > > function itself as it is, and then to start new discussion on
> > > hackers about the removal of that separatedly from this patch.
> >
> > I searched for anybody requesting to keep the function.  I couldn't
> > find anything.  Tom said in 2012:
> > https://www.postgresql.org/message-id/1471.1339106...@sss.pgh.pa.us
> 
> Yes. And I found Andres.
> https://www.postgresql.org/message-id/20180621174129.hogefyopje4xaznu@al
> ap3.anarazel.de
> 
> > > As committed, the smgrdounlinkfork case is actually dead code; it's
> > > never called from anywhere.  I left it in place just in case we want
> > > it someday.
> >
> > but if no use has appeared in 7 years, I say it's time to kill it.
> 
> +1

The consensus is that we remove it, right?
Re-attaching the patch that removes the dead code: smgrdounlinkfork().

---
I've also fixed Fujii-san's comments below in the latest attached speedup 
truncate rel patch (v8).
> Here are other comments for the latest patch:
> 
> + block = visibilitymap_truncate_prepare(rel, 0); if
> + (BlockNumberIsValid(block)) fork = VISIBILITYMAP_FORKNUM;
> +
> + smgrtruncate(rel->rd_smgr, &fork, 1, &block);
> 
> If visibilitymap_truncate_prepare() returns InvalidBlockNumber,
> smgrtruncate() should not be called.
> 
> + FreeSpaceMapVacuumRange(rel, first_removed_nblocks,
> + InvalidBlockNumber);

Thank you again for the review!

Regards,
Kirk Jamison


v1-0001-Remove-deadcode-smgrdounlinkfork.patch
Description: v1-0001-Remove-deadcode-smgrdounlinkfork.patch


v8-0001-Speedup-truncates-of-relation-forks.patch
Description: v8-0001-Speedup-truncates-of-relation-forks.patch


RE: [PATCH] Speedup truncates of relation forks

2019-09-09 Thread Jamison, Kirk
On Friday, September 6, 2019 11:51 PM (GMT+9), Alvaro Herrera wrote:

Hi Alvaro,
Thank you very much for the review!

> On 2019-Sep-05, Jamison, Kirk wrote:
> 
> > I also mentioned it from my first post if we can just remove this dead code.
> > If not, it would require to modify the function because it would also
> > need nforks as input argument when calling DropRelFileNodeBuffers. I
> > kept my changes in the latest patch.
> > So should I remove the function now or keep my changes?
> 
> Please add a preliminary patch that removes the function.  Dead code is good,
> as long as it is gone.  We can get it pushed ahead of the rest of this.

Alright. I've attached a separate patch removing the smgrdounlinkfork.


> What does it mean to "mark" a dirty page in FSM?  We don't have the concept
> of marking pages as far as I know (and I don't see that the patch does any
> sort of mark).  Do you mean to find where it is and return its block number?

Yes. "Mark" is probably not a proper way to describe it, so I temporarily
changed it to "locate" and renamed the function to FreeSpaceMapLocateBlock().
If anyone could suggest a more appropriate name, that'd be appreciated.


> If so, I wonder how this handles concurrent table extension: are we keeping
> some sort of lock that prevents it?
> (... or would we lose any newly added pages that receive tuples while this
> truncation is ongoing?)

I moved the description about acquiring AccessExclusiveLock
from FreeSpaceMapLocateBlock() and visibilitymap_truncate_prepare() to the
smgrtruncate description instead.
IIUC, in lazy_truncate_heap() we still acquire AccessExclusiveLock on the relation
before calling RelationTruncate(), which then calls smgrtruncate().
While holding the exclusive lock, the following are also called to check
that the relation has not been extended and to verify that the end pages contain
no tuples while we were vacuuming (with a non-exclusive lock):
  new_rel_pages = RelationGetNumberOfBlocks(onerel);
  new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
I assume that the above would give the correct number of pages.
We then release the exclusive lock as soon as we have truncated the pages.
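Roughly, the flow I have in mind looks like this (a simplified sketch, not the exact
vacuumlazy.c code; "old_rel_pages" here just stands for the page count observed earlier
during the vacuum):

	/* Simplified sketch of the truncation flow in lazy_truncate_heap() */
	BlockNumber	new_rel_pages;

	LockRelation(onerel, AccessExclusiveLock);	/* blocks concurrent extension */

	/* Re-check the size: did the relation grow while we held only a weaker lock? */
	new_rel_pages = RelationGetNumberOfBlocks(onerel);
	if (new_rel_pages != old_rel_pages)
	{
		/* someone extended the relation meanwhile; give up on truncating */
		UnlockRelation(onerel, AccessExclusiveLock);
		return;
	}

	/* Verify that the tail pages still contain no tuples, then truncate */
	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
	RelationTruncate(onerel, new_rel_pages);	/* ends up in smgrtruncate() */

	UnlockRelation(onerel, AccessExclusiveLock);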


> I think the new API of smgrtruncate() is fairly confusing.  Would it be better
> to define a new struct { ForkNum forknum; BlockNumber blkno; } and pass an
> array of those?

This is for readability, right? However, I think there's no need to define a
new structure for it, so I kept my changes:
  smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nblocks)
I also declared *forknum and nforks next to each other as suggested by Sawada-san.
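Just to make the intended usage concrete, here is a rough sketch of how a caller
collects the forks to shrink and truncates them in one call (illustration only;
it assumes FreeSpaceMapLocateBlock() returns the new FSM length or InvalidBlockNumber,
and the variable names are only for this example):

	ForkNumber	forks[MAX_FORKNUM];
	BlockNumber	blocks[MAX_FORKNUM];
	BlockNumber	fsm_blocks;
	int			nforks = 0;

	/* main fork is always truncated */
	forks[nforks] = MAIN_FORKNUM;
	blocks[nforks] = nblocks;
	nforks++;

	/* FSM fork, only if it actually needs shrinking */
	fsm_blocks = FreeSpaceMapLocateBlock(rel, nblocks);
	if (BlockNumberIsValid(fsm_blocks))
	{
		forks[nforks] = FSM_FORKNUM;
		blocks[nforks] = fsm_blocks;
		nforks++;
	}

	/* one call: one shared_buffers scan, one smgr invalidation */
	smgrtruncate(rel->rd_smgr, forks, nforks, blocks);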


What do you think about these changes?


Regards,
Kirk Jamison


v1-0001-Remove-deadcode-smgrdounlinkfork.patch
Description: v1-0001-Remove-deadcode-smgrdounlinkfork.patch


v7-0001-Speedup-truncates-of-relation-forks.patch
Description: v7-0001-Speedup-truncates-of-relation-forks.patch


RE: [PATCH] Speedup truncates of relation forks

2019-09-05 Thread Jamison, Kirk
On Tuesday, September 3, 2019 9:44 PM (GMT+9), Fujii Masao wrote:
> Thanks for the patch!

Thank you as well for the review!

> -smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo)
> +smgrdounlinkfork(SMgrRelation reln, ForkNumber *forknum, int nforks,
> bool isRedo)
> 
> smgrdounlinkfork() is dead code. Per the discussion [1], this unused function
> was left intentionally. But it's still dead code since 2012, so I'd like to
> remove it. Or, even if we decide to keep that function for some reasons, I
> don't think that we need to update that so that it can unlink multiple forks
> at once. So, what about keeping
> smgrdounlinkfork() as it is?
> 
> [1]
> https://www.postgresql.org/message-id/1471.1339106...@sss.pgh.pa.us 

I also mentioned in my first post that we could just remove this dead code.
If not, it would require modifying the function, because it would also
need nforks as an input argument when calling DropRelFileNodeBuffers. I kept my
changes in the latest patch.
So should I remove the function now, or keep my changes?


> + for (int i = 0; i < nforks; i++)
> 
> The variable "i" should not be declared in for loop per PostgreSQL coding
> style.

Fixed.


> + /* Check with the lower bound block number and skip the loop */ if
> + (bufHdr->tag.blockNum < minBlock) continue; /* skip checking the
> + buffer pool scan */
> 
> Because of the above code, the following source comment in bufmgr.c should
> be updated.
> 
> * We could check forkNum and blockNum as well as the rnode, but the
> * incremental win from doing so seems small.
> 
> And, first of all, is this check really useful for performance?
> Since firstDelBlock for FSM fork is usually small, minBlock would also be
> small. So I'm not sure how much this is helpful for performance.

This was a suggestion from Sawada-san in the previous email,
but he also thought that the performance benefit might be small,
so I just removed the related code block in this patch.


> When relation is completely truncated at all (i.e., the number of block to
> delete first is zero), can RelationTruncate() and smgr_redo()  just call
> smgrdounlinkall() like smgrDoPendingDeletes() does, instead of calling
> MarkFreeSpaceMapTruncateRel(), visibilitymap_truncate_prepare() and
> smgrtruncate()? ISTM that smgrdounlinkall() is faster and simpler.

I haven't applied this in my patch yet.
If my understanding is correct, smgrdounlinkall() is used for deleting
relation forks. However, we only truncate (not delete) relations
in RelationTruncate() and smgr_redo(). I'm not sure if it's correct to
use it here. Could you expand a bit more on your idea of using smgrdounlinkall()?


Regards,
Kirk Jamison


v6-0001-Speedup-truncates-of-relation-forks.patch
Description: v6-0001-Speedup-truncates-of-relation-forks.patch


RE: Resume vacuum and autovacuum from interruption and cancellation

2019-08-26 Thread Jamison, Kirk
On Monday, August 19, 2019 10:39 AM (GMT+9), Masahiko Sawada wrote:
> Fixed.
> 
> Attached the updated version patch.

Hi Sawada-san,

I haven't tested it with heavily updated large tables, but I think the patch
is reasonable, as it helps shorten the execution time of vacuum by removing
the redundant vacuuming and prioritizing reclaiming the garbage instead.
I'm not sure how commonly problems are reported from repeatedly vacuuming
already-vacuumed blocks, but I think it's a reasonable improvement.

I skimmed the patch and have a few comments. If they seem fit, feel free to
follow them, but it's also OK if you don't.
1.
>+ Block number to resume vacuuming from
Perhaps you could drop "from".

2.
>+  . This behavior is helpful
>+  when to resume vacuuming from interruption and cancellation.The default
when resuming vacuum run from interruption and cancellation.
There should also be a space between the period and "The".

3.
>+  set to true. This option is ignored if either the 
>FULL,
>+  the FREEZE or 
>DISABLE_PAGE_SKIPPING
>+  option is used.
..if either of the FULL, FREEZE, or 
DISABLE_PAGE_SKIPPING options is used.

4.
>+  next_fsm_block_to_vacuum,
>+  next_block_to_resume;
A clearer one would be "next_block_to_resume_vacuum"?
You may disregard if that's too long.

5.
>+  Assert(start_blkno <= nblocks); /* both are the same iif it's empty */
iif -> if there are no blocks / if nblocks is 0

6.
>+   * If not found a valid saved block number, resume from the
>+   * first block.
>+   */
>+  if (!found ||
>+  tabentry->vacuum_resume_block >= 
>RelationGetNumberOfBlocks(onerel))
This describes the case when vacuum_resume_block > RelationGetNumberOfBlocks(), doesn't it?
Perhaps a better framing would be
"If the saved block number is found to be invalid, ...".

7.
>+  boolvacuum_resume;  /* enables vacuum to resuming 
>from last
>+   * 
>vacuumed block. */
resuming --> resume


Regards,
Kirk Jamison


RE: Multivariate MCV list vs. statistics target

2019-07-31 Thread Jamison, Kirk
Hi, 

> From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> Sent: Monday, July 29, 2019 10:53 AM
> To: 'Tomas Vondra' 
> Cc: Dean Rasheed ; PostgreSQL Hackers
> 
> Subject: RE: Multivariate MCV list vs. statistics target
> 
> On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:
> > On Fri, Jul 26, 2019 at 07:03:41AM +, Jamison, Kirk wrote:
> > >On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
> > >
> > >> >+   /* XXX What if the target is set to 0? Reset the 
> > >> >statistic?
> > >> */
> > >> >
> > >> >This also makes me wonder. I haven't looked deeply into the code,
> > >> >but since 0 is a valid value, I believe it should reset the stats.
> > >>
> > >> I agree resetting the stats after setting the target to 0 seems
> > >> quite reasonable. But that's not what we do for attribute stats,
> > >> because in that case we simply skip the attribute during the future
> > >> ANALYZE runs
> > >> - we don't reset the stats, we keep the existing stats. So I've
> > >> done the same thing here, and I've removed the XXX comment.
> > >>
> > >> If we want to change that, I'd do it in a separate patch for both
> > >> the regular and extended stats.
> > >
> > >Hi, Tomas
> > >
> > >Sorry for my late reply.
> > >You're right. I have no strong opinion whether we'd want to change
> > >that
> > behavior.
> > >I've also confirmed the change in the patch where setting statistics
> > >target 0 skips the statistics.
> > >
> >
> > OK, thanks.
> >
> > >Maybe only some minor nitpicks in the source code comments below:
> > >1. "it's" should be "its":
> > >> + * Compute statistic target, based on what's set for the
> > statistic
> > >> + * object itself, and for it's attributes.
> > >
> > >2. Consistency whether you'd use either "statistic " or "statisticS ".
> > >Ex. statistic target vs statisticS target, statistics object vs
> > >statistic
> > object, etc.
> > >
> > >> Attached is v4 of the patch, with a couple more improvements:
> > >>
> > >> 1) I've renamed the if_not_exists flag to missing_ok, because
> > >> that's more consistent with the "IF EXISTS" clause in the grammar
> > >> (the old flag was kinda the exact opposite), and I've added a NOTICE
> about the skip.
> > >
> > >+  boolmissing_ok;  /* do nothing if statistics does
> > not exist */
> > >
> > >Confirmed. So we ignore if statistic does not exist, and skip the error.
> > >Maybe to make it consistent with other data structures in
> > >parsernodes.h, you can change the comment of missing_ok to:
> > >/* skip error if statistics object does not exist */
> > >
> >
> > Thanks, I've fixed all those places in the attached v5.
> 
> I've confirmed the fix.
> 
> > >> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows,
> > >> because that's what the function was doing anyway (computing sample
> size).
> > >>
> > >> 3) I've added a couple of regression tests to stats_ext.sql
> > >>
> > >> Aside from that, I've cleaned up a couple of places and improved a
> > >> bunch of comments. Nothing huge.
> > >
> > >I have a question though regarding ComputeExtStatisticsRows.
> > >I'm just curious with the value 300 when computing sample size.
> > >Where did this value come from?
> > >
> > >+  /* compute sample size based on the statistic target */
> > >+  return (300 * result);
> > >
> > >Overall, the patch is almost already in good shape for commit.
> > >I'll wait for the next update.
> > >
> >
> > That's how we compute number of rows to sample, based on the statistics
> target.
> > See std_typanalyze() in analyze.c, which also cites the paper where
> > this comes from.
> Noted. Found it. Thank you for the reference.
> 
> 
> There's just a small whitespace (extra space) below after running git diff
> --check.
> >src/bin/pg_dump/pg_dump.c:7226: trailing whitespace.
> >+
> It would be better to post an updated patch, but other than that, I've 
> confirmed
> the fixes.
> The patch passed the make-world and regression tests as well.
> I've marked this as "ready for committer".

The patch does not apply anymore.
In addition to the whitespace issue detected above,
kindly rebase the patch, as recent commits have changed
the following files:
   src/backend/commands/analyze.c
   src/bin/pg_dump/pg_dump.c
   src/bin/pg_dump/t/002_pg_dump.pl
   src/test/regress/expected/stats_ext.out
   src/test/regress/sql/stats_ext.sql

Regards,
Kirk Jamison




RE: Multivariate MCV list vs. statistics target

2019-07-28 Thread Jamison, Kirk
On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:
> On Fri, Jul 26, 2019 at 07:03:41AM +0000, Jamison, Kirk wrote:
> >On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
> >
> >> >+ /* XXX What if the target is set to 0? Reset the statistic?
> >> */
> >> >
> >> >This also makes me wonder. I haven't looked deeply into the code,
> >> >but since 0 is a valid value, I believe it should reset the stats.
> >>
> >> I agree resetting the stats after setting the target to 0 seems quite
> >> reasonable. But that's not what we do for attribute stats, because in
> >> that case we simply skip the attribute during the future ANALYZE runs
> >> - we don't reset the stats, we keep the existing stats. So I've done
> >> the same thing here, and I've removed the XXX comment.
> >>
> >> If we want to change that, I'd do it in a separate patch for both the
> >> regular and extended stats.
> >
> >Hi, Tomas
> >
> >Sorry for my late reply.
> >You're right. I have no strong opinion whether we'd want to change that
> behavior.
> >I've also confirmed the change in the patch where setting statistics
> >target 0 skips the statistics.
> >
> 
> OK, thanks.
> 
> >Maybe only some minor nitpicks in the source code comments below:
> >1. "it's" should be "its":
> >> +   * Compute statistic target, based on what's set for the
> statistic
> >> +   * object itself, and for it's attributes.
> >
> >2. Consistency whether you'd use either "statistic " or "statisticS ".
> >Ex. statistic target vs statisticS target, statistics object vs statistic
> object, etc.
> >
> >> Attached is v4 of the patch, with a couple more improvements:
> >>
> >> 1) I've renamed the if_not_exists flag to missing_ok, because that's
> >> more consistent with the "IF EXISTS" clause in the grammar (the old
> >> flag was kinda the exact opposite), and I've added a NOTICE about the skip.
> >
> >+boolmissing_ok;  /* do nothing if statistics does
> not exist */
> >
> >Confirmed. So we ignore if statistic does not exist, and skip the error.
> >Maybe to make it consistent with other data structures in
> >parsernodes.h, you can change the comment of missing_ok to:
> >/* skip error if statistics object does not exist */
> >
> 
> Thanks, I've fixed all those places in the attached v5.

I've confirmed the fix.

> >> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because
> >> that's what the function was doing anyway (computing sample size).
> >>
> >> 3) I've added a couple of regression tests to stats_ext.sql
> >>
> >> Aside from that, I've cleaned up a couple of places and improved a
> >> bunch of comments. Nothing huge.
> >
> >I have a question though regarding ComputeExtStatisticsRows.
> >I'm just curious with the value 300 when computing sample size.
> >Where did this value come from?
> >
> >+/* compute sample size based on the statistic target */
> >+return (300 * result);
> >
> >Overall, the patch is almost already in good shape for commit.
> >I'll wait for the next update.
> >
> 
> That's how we compute number of rows to sample, based on the statistics 
> target.
> See std_typanalyze() in analyze.c, which also cites the paper where this comes
> from.
Noted. Found it. Thank you for the reference.
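(For reference: with the default statistics target of 100, that works out to 300 * 100 = 30,000 sampled rows.)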


There's just a small whitespace issue (an extra space), shown below, found by running
git diff --check:
>src/bin/pg_dump/pg_dump.c:7226: trailing whitespace.
>+ 
It would be better to post an updated patch,
but other than that, I've confirmed the fixes.
The patch passed make world and the regression tests as well.
I've marked this as "Ready for Committer".


Regards,
Kirk Jamison




RE: Multivariate MCV list vs. statistics target

2019-07-26 Thread Jamison, Kirk
On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:

> >+/* XXX What if the target is set to 0? Reset the statistic?
> */
> >
> >This also makes me wonder. I haven't looked deeply into the code, but
> >since 0 is a valid value, I believe it should reset the stats.
> 
> I agree resetting the stats after setting the target to 0 seems quite
> reasonable. But that's not what we do for attribute stats, because in that
> case we simply skip the attribute during the future ANALYZE runs - we don't
> reset the stats, we keep the existing stats. So I've done the same thing here,
> and I've removed the XXX comment.
> 
> If we want to change that, I'd do it in a separate patch for both the regular
> and extended stats.

Hi, Tomas

Sorry for my late reply.
You're right. I have no strong opinion on whether we'd want to change that behavior.
I've also confirmed the change in the patch where setting the statistics target to 0
skips the statistics.

Maybe only some minor nitpicks in the source code comments below:
1. "it's" should be "its":
> +  * Compute statistic target, based on what's set for the 
> statistic
> +  * object itself, and for it's attributes.

2. Be consistent in whether you use "statistic " or "statisticS ".
Ex. statistic target vs statisticS target, statistics object vs statistic object, etc.

> Attached is v4 of the patch, with a couple more improvements:
>
> 1) I've renamed the if_not_exists flag to missing_ok, because that's more
> consistent with the "IF EXISTS" clause in the grammar (the old flag was kinda
> the exact opposite), and I've added a NOTICE about the skip.

+   bool        missing_ok;     /* do nothing if statistics does not exist */

Confirmed. So we ignore it if the statistics object does not exist, and skip the error.
Maybe, to make it consistent with other data structures in parsenodes.h,
you can change the comment of missing_ok to:
/* skip error if statistics object does not exist */

> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because that's
> what the function was doing anyway (computing sample size).
>
> 3) I've added a couple of regression tests to stats_ext.sql
> 
> Aside from that, I've cleaned up a couple of places and improved a bunch of
> comments. Nothing huge.

I have a question though regarding ComputeExtStatisticsRows.
I'm just curious about the value 300 used when computing the sample size.
Where did this value come from?

+   /* compute sample size based on the statistic target */
+   return (300 * result);

Overall, the patch is already almost in good shape for commit.
I'll wait for the next update.

Regards,
Kirk Jamison




RE: [PATCH] Speedup truncates of relation forks

2019-07-23 Thread Jamison, Kirk
Hi,

I repeated the recovery performance test from before and found out that I had made
a wrong measurement.
Using the same steps indicated in the previous email (24GB shared_buffers in my case),
the recovery time is still significantly improved compared to head:
from 13 minutes down to 4 minutes 44 seconds (not 30 seconds as reported earlier).
That is expected, because the vacuum execution time (no failover)
which I reported in the first email is in the same range (although around 5 minutes):
> HEAD results
> 3) 24GB shared_buffers = 14 min 13.598 s
> PATCH results
> 3) 24GB shared_buffers = 5 min 35.848 s


Reattaching the patch here again. V5 of the patch fixes the compile error
mentioned before and mainly addresses the comments/advice of Sawada-san:
- updated comments to more accurately describe only the current behavior, not the history
- updated function name: visibilitymap_truncate_prepare()
- moved the setting of values for smgr_{fsm,vm}_nblocks inside smgrtruncate()

I'd be grateful if anyone could provide comments, advice, or insights.
Thank you again in advance.

Regards,
Kirk Jamison


v5-0001-Speedup-truncates-of-relation-forks.patch
Description: v5-0001-Speedup-truncates-of-relation-forks.patch


RE: Multivariate MCV list vs. statistics target

2019-07-19 Thread Jamison, Kirk
On Tuesday, July 9, 2019, Tomas Vondra wrote:
> >apparently v1 of the ALTER STATISTICS patch was a bit confused about
> >length of the VacAttrStats array passed to statext_compute_stattarget,
> >causing segfaults. Attached v2 patch fixes that, and it also makes sure
> >we print warnings about ignored statistics only once.
> >
> 
> v3 of the patch, adding pg_dump support - it works just like when you tweak
> statistics target for a column, for example. When the value stored in the
> catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting
> it (for the already created statistics object).
> 

Hi Tomas, I stumbled upon your patch.

According to the CF bot, your patch applies cleanly, builds successfully, and
passes make world. Meaning, the pg_dump TAP test passed, but there is no
test for the new SET STATISTICS yet. So you might want to add a regression
test for that and integrate it into the existing alter_generic file.

Upon quick read-through, the syntax and docs are correct because it's similar
to the format of ALTER TABLE/INDEX... SET STATISTICS... :
ALTER [ COLUMN ] column_name SET STATISTICS integer

+   /* XXX What if the target is set to 0? Reset the statistic? */

This also makes me wonder. I haven't looked deeply into the code, but since 0 is
a valid value, I believe it should reset the stats.
After looking it up, though, this is how it's tested for ALTER TABLE:
/test/regress/sql/stats_ext.sql: -- Ensure things work sanely with SET STATISTICS 0

> I've considered making it part of CREATE STATISTICS itself, but it seems a
> bit cumbersome and we don't do it for columns either. I'm not against adding
> it in the future, but at this point I don't see a need.

I agree. Perhaps that's for another patch, should you decide to add it in the future.

Regards,
Kirk Jamison




RE: [PATCH] Speedup truncates of relation forks

2019-07-08 Thread Jamison, Kirk
Hi Thomas,

Thanks for checking.

> On Fri, Jul 5, 2019 at 3:03 PM Jamison, Kirk  wrote:
> > I updated the patch which is similar to V3 of the patch, but
> > addressing my problem in (5) in the previous email regarding
> FreeSpaceMapVacuumRange.
> > It seems to pass the regression test now. Kindly check for validation.
> 
> Hi Kirk,
> 
> FYI there are a couple of compiler errors reported:

Attached is the updated patch (V5) fixing the compiler errors.

Comments and reviews about the patch/tests are very much welcome.

Regards,
Kirk Jamison


v5-0001-Speedup-truncates-of-relation-forks.patch
Description: v5-0001-Speedup-truncates-of-relation-forks.patch


RE: [PATCH] Speedup truncates of relation forks

2019-07-04 Thread Jamison, Kirk
Hi,

> I updated the patch based from comments, but it still fails the regression
> test as indicated in (5) above.
> Kindly verify if I correctly addressed the other parts as what you intended.
> 
> Thanks again for the review!
> I'll update the patch again after further comments.

I updated the patch; it is similar to V3 of the patch,
but addresses my problem in (5) of the previous email regarding FreeSpaceMapVacuumRange.
It seems to pass the regression test now. Kindly check for validation.
Thank you!

Regards,
Kirk Jamison


v4-0001-Speedup-truncates-of-relation-forks.patch
Description: v4-0001-Speedup-truncates-of-relation-forks.patch


RE: [PATCH] Speedup truncates of relation forks

2019-07-04 Thread Jamison, Kirk
On Tuesday, July 2, 2019 4:59 PM (GMT+9), Masahiko Sawada wrote:
> Thank you for updating the patch. Here is the review comments for v2 patch.

Thank you so much for review!
I indicated the addressed parts below and attached the updated patch.

---
visibilitymap.c: visibilitymap_truncate()

> I don't think we should describe about the previous behavior here.
> Rather we need to describe what visibilitymap_mark_truncate does and what
> it returns to the caller.
>
> I'm not sure that visibilitymap_mark_truncate function name is appropriate
> here since it actually truncate map bits, not only marking. Perhaps we can
> still use visibilitymap_truncate or change to
> visibilitymap_truncate_prepare, or something? Anyway, this function
> truncates only tail bits in the last remaining map page and we can have a
> rule that the caller must call smgrtruncate() later to actually truncate
> pages.
> 
> The comment of second paragraph is now out of date since this function no
> longer sends smgr invalidation message.

(1) I updated the function name to "visibilitymap_truncate_prepare()" as suggested.
I think that name fits, unless other reviewers suggest a better one.
I also updated its description to be more accurate: it now describes the current behavior
(the caller must eventually call smgrtruncate() to actually truncate the VM pages),
and I removed the outdated description.


> Is it worth to leave the current visibilitymap_truncate function as a shortcut
> function, instead of replacing? That way we don't need to change
> pg_truncate_visibility_map function.

(2) Yeah, it's kinda displeasing that I had to add lines in pg_truncate_visibility_map.
By any chance, re: the shortcut function, did you mean to retain the function
visibilitymap_truncate() and just add another visibilitymap_truncate_prepare()?
I'm not sure it's worth the additional lines of adding
another function in visibilitymap.c; that's why I just updated the function itself,
which only adds 10 lines to pg_truncate_visibility_map anyway.
Hmm. If it's not wrong to do it this way, then I will retain this change.
OTOH, if pg_visibility.c *must* not be modified, then I'll follow your advice.
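(For clarity, a hypothetical shortcut could look roughly like the sketch below; this is
just to illustrate what I understood by "shortcut", not something that is in the patch.)

	/* hypothetical wrapper keeping the old single-fork entry point */
	void
	visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
	{
		ForkNumber	fork = VISIBILITYMAP_FORKNUM;
		BlockNumber	block;

		block = visibilitymap_truncate_prepare(rel, nheapblocks);
		if (BlockNumberIsValid(block))
			smgrtruncate(rel->rd_smgr, &fork, 1, &block);
	}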



pg_visibility.c: pg_truncate_visibility_map()

> @@ -383,6 +383,10 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
>  {
> Oid relid = PG_GETARG_OID(0);
> Relationrel;
> +   ForkNumber  forks[MAX_FORKNUM];
> +   BlockNumber blocks[MAX_FORKNUM];
> +   BlockNumber newnblocks = InvalidBlockNumber;
> +   int nforks = 0;
> 
> Why do we need the array of forks and blocks here? I think it's enough to
> have one fork and one block number.

(3) Thanks for the catch. Updated.



freespace.c: MarkFreeSpaceMapTruncateRel()

> The same comments are true for MarkFreeSpaceMapTruncateRel.

> +   BlockNumber new_nfsmblocks = InvalidBlockNumber;/* FSM
> blocks */
> +   BlockNumber newnblocks = InvalidBlockNumber;/* VM
> blocks */
> +   int nforks = 0;
> 
> I think that we can have new_nfsmblocks and new_nvmblocks and wipe out the
> comments.

(4) I updated the description accordingly, describing only the current behavior.
The caller must eventually call smgrtruncate() to actually truncate fsm pages.
I also removed the outdated description and irrelevant comments.


--
storage.c: RelationTruncate()

> +* We might as well update the local smgr_fsm_nblocks and
> smgr_vm_nblocks
> +* setting. smgrtruncate sent an smgr cache inval message,
> which will cause
> +* other backends to invalidate their copy of smgr_fsm_nblocks and
> +* smgr_vm_nblocks, and this one too at the next command
> boundary. But this
> +* ensures it isn't outright wrong until then.
> +*/
> +   if (rel->rd_smgr)
> +   {
> +   rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks;
> +   rel->rd_smgr->smgr_vm_nblocks = newnblocks;
> +   }
> 
> new_nfsmblocks and newnblocks could be InvalidBlockNumber when the forks are
> already enough small. So the above code sets incorrect values to
> smgr_{fsm,vm}_nblocks.
> 
> Also, I wonder if we can do the above code in smgrtruncate. Otherwise we 
> always
> need to set smgr_{fsm,vm}_nblocks after smgrtruncate, which is inconvenient.

(5)
In my patch, did you mean that there's a possibility that these values
were already set to InvalidBlockNumber even before I did the setting?
IIUC, the point of the above code is to make sure that
smgr_{fsm,vm}_nblocks are not InvalidBlockNumber until the next command
boundary, and while we haven't reached that boundary yet, we make sure
these values are valid within that window. Is my understanding correct?
Maybe following your advice of putting it inside the smgrtruncate loop
will make these values correct.
For example, like below?

void smgrtruncate
{
...
CacheInvalidateSmgr(reln->smgr_rnode);

/* Do 

RE: [PATCH] Speedup truncates of relation forks

2019-07-01 Thread Jamison, Kirk
On Wednesday, June 26, 2019 6:10 PM(GMT+9), Adrien Nayrat wrote:
> As far as I remember, you should see "relation" wait events (type lock) on
> standby server. This is due to startup process acquiring AccessExclusiveLock
> for the truncation and other backend waiting to acquire a lock to read the
> table.

Hi Adrien, thank you for taking time to reply.

I understand that RelationTruncate() can block read-only queries on the
standby during redo. However, it's difficult for me to reproduce the
test case where I need to catch that wait on the relation lock, because
one has to execute a SELECT within the few milliseconds of redoing the
truncation of a given table.

Instead, I just measured the whole recovery time, smgr_redo(),
to show the recovery improvement compared to head. Please refer below.

[Recovery Test]
I used the same stored functions and configurations in the previous email
& created "test" db.

$ createdb test
$ psql -d test

1. [Primary] Create 10,000 relations.
test=# SELECT create_tables(1);

2. [P] Insert one row in each table.
test=# SELECT insert_tables(1);

3. [P] Delete row of each table.
test=# SELECT delfrom_tables(1);

4. [Standby] WAL application is stopped at Standby server.
test=# SELECT pg_wal_replay_pause();

5. [P] VACUUM is executed on the primary side, and its execution time is measured.

test=# \timing on
test=# VACUUM;

Alternatively, you may use:
$ time psql -d test -c 'VACUUM;'
(Note: WAL has not replayed on standby because it's been paused.)

6. [P] Wait until VACUUM has finished executing. Then stop the primary server.
test=# pg_ctl stop -w

7. [S] Resume WAL replay, then promote the standby (failover).
I used a shell script to execute recovery and promote the standby server,
because it's kinda difficult to measure the recovery time. Please refer to the script below.
- "SELECT pg_wal_replay_resume();" is executed and WAL application is resumed.
- "pg_ctl promote" is run to promote the standby.
- The time difference of "select pg_is_in_recovery();" changing from "t" to "f" is measured.

shell script:

PGDT=/path_to_storage_directory/

if [ "$1" = "resume" ]; then
psql -c "SELECT pg_wal_replay_resume();" test
date +%Y/%m/%d_%H:%M:%S.%3N
pg_ctl promote -D ${PGDT}
set +x
date +%Y/%m/%d_%H:%M:%S.%3N
while [ 1 ]
do
RS=`psql -Atc "select pg_is_in_recovery();" test`   
if [ ${RS} = "f" ]; then
break
fi
done
date +%Y/%m/%d_%H:%M:%S.%3N
set -x
exit 0
fi


[Test Results]
shared_buffers = 24GB

1. HEAD
(wal replay resumed)
2019/07/01_08:48:50.326
server promoted
2019/07/01_08:49:50.482
2019/07/01_09:02:41.051

 Recovery Time:
 13 min 50.725 s -> Time difference from WAL replay to complete recovery
 12 min 50.569 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"

2. PATCH
(wal replay resumed)
2019/07/01_07:34:26.766
server promoted
2019/07/01_07:34:57.790
2019/07/01_07:34:57.809

 Recovery Time: 
 31.043 s -> Time difference from WAL replay to complete recovery
 00.019 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
 
[Conclusion]
The recovery time is significantly improved compared to head,
from 13 minutes down to 30 seconds.

Any thoughts?
I'd really appreciate your comments/feedback about the patch and/or test.


Regards,
Kirk Jamison


RE: [PATCH] Speedup truncates of relation forks

2019-06-17 Thread Jamison, Kirk
Hi all,

Attached is v2 of the patch. I added the optimization that Sawada-san
suggested for DropRelFileNodeBuffers, although I did not acquire the lock
when comparing minBlock and the target block.

There's actually a comment in the source code saying that we could
pre-check the buffer tag for forkNum and blockNum, but given that the FSM and VM
forks are small compared to the main fork, the additional benefit of doing so
would be small.

>* We could check forkNum and blockNum as well as the rnode, but the
>* incremental win from doing so seems small.

I personally think it's alright not to include the suggested pre-checking.
If that's the case, we can just follow the patch v1 version.

Thoughts?

Comments and reviews from other parts of the patch are also very much welcome.

Regards,
Kirk Jamison


v2-0001-Speedup-truncates-of-relation-forks.patch
Description: v2-0001-Speedup-truncates-of-relation-forks.patch


RE: [PATCH] Speedup truncates of relation forks

2019-06-13 Thread Jamison, Kirk
Hi Sawada-san,

On Thursday, June 13, 2019 8:01 PM, Masahiko Sawada wrote:
> On Thu, Jun 13, 2019 at 6:30 PM Jamison, Kirk 
> wrote:
> >
> > On Wednesday, June 12, 2019 4:29 PM (GMT+9), Masahiko Sawada wrote:
> > > ...
> > > I've not look at this patch deeply but in DropRelFileNodeBuffer I
> > > think we can get the min value of all firstDelBlock and use it as
> > > the lower bound of block number that we're interested in. That way
> > > we can skip checking the array during scanning the buffer pool.
> >
> > I'll take note of this suggestion.
> > Could you help me expound more on this idea, skipping the internal
> > loop by comparing the min and buffer descriptor (bufHdr)?
> >
> 
> Yes. For example,
> 
> BlockNumber minBlock = InvalidBlockNumber;
> (snip)
> /* Get lower bound block number we're interested in */
> for (i = 0; i < nforks; i++)
> {
> if (!BlockNumberIsValid(minBlock) ||
> minBlock > firstDelBlock[i])
> minBlock = firstDelBlock[i];
> }
> 
> for (i = 0; i < NBuffers; i++)
> {
> (snip)
> buf_state = LockBufHdr(bufHdr);
> 
> /* check with the lower bound and skip the loop */
> if (bufHdr->tag.blockNum < minBlock)
> {
> UnlockBufHdr(bufHdr, buf_state);
> continue;
> }
> 
> for (k = 0; k < nforks; k++)
> {
> if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> bufHdr->tag.forkNum == forkNum[k] &&
> bufHdr->tag.blockNum >= firstDelBlock[k])
> 
> But since we acquire the buffer header lock after all and the number of the
> internal loops is small (at most 3 for now)  the benefit will not be big.

Thank you very much for your kind and detailed explanation.
I'll still consider your suggestions in the next patch and optimize it further,
so that we possibly would not need to call LockBufHdr() anymore.


> > > Don't we use each elements of nblocks for each fork? That is, each
> > > fork uses an element at its fork number in the nblocks array and
> > > sets InvalidBlockNumber for invalid slots, instead of passing the
> > > valid number of elements. That way the following code that exist at
> > > many places,
> > >
> > > blocks[nforks] = visibilitymap_mark_truncate(rel, nblocks);
> > >if (BlockNumberIsValid(blocks[nforks]))
> > >{
> > >forks[nforks] = VISIBILITYMAP_FORKNUM;
> > >nforks++;
> > >}
> > >
> > > would become
> > >
> > > blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel,
> > > nblocks);
> >
> > In the patch, we want to truncate all forks' blocks simultaneously, so
> > we optimize the invalidation of buffers and reduce the number of loops
> > using those values.
> > The suggestion above would have to remove the forks array and its
> > forksize (nforks), is it correct? But I think we’d need the fork array
> > and nforks to execute the truncation all at once.
> 
> I meant that each forks can use the its forknumber'th element of
> firstDelBlock[]. For example, if firstDelBlock = {1000, InvalidBlockNumber,
> 20, InvalidBlockNumber}, we can invalid buffers pertaining both greater than
> block number 1000 of main and greater than block number 20 of vm. Since
> firstDelBlock[FSM_FORKNUM] == InvalidBlockNumber we don't invalid buffers
> of fsm.
> 
> As Tsunakawa-san mentioned, since your approach would reduce the loop count
> your idea might be better than mine which always takes 4 loop counts.

Understood. Thank you again for the kind and detailed explanations. 
I'll reconsider these approaches.

Regards,
Kirk Jamison


RE: [PATCH] Speedup truncates of relation forks

2019-06-13 Thread Jamison, Kirk
On Wednesday, June 12, 2019 4:29 PM (GMT+9), Masahiko Sawada wrote:
> On Wed, Jun 12, 2019 at 12:25 PM Tsunakawa, Takayuki
>  wrote:
> >
> > From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> > > Years ago I've implemented an optimization for many DROP TABLE
> > > commands in a single transaction - instead of scanning buffers for
> > > each relation, the code now accumulates a small number of relations
> > > into an array, and then does a bsearch for each buffer.
> > >
> > > Would something like that be applicable/useful here? That is, if we
> > > do multiple TRUNCATE commands in a single transaction, can we
> > > optimize it like this?
> >
> > Unfortunately not.  VACUUM and autovacuum handles each table in a different
> transaction.
> 
> We do RelationTruncate() also when we truncate heaps that are created in the
> current transactions or has a new relfilenodes in the current transaction.
> So I think there is a room for optimization Thomas suggested, although I'm
> not sure it's a popular use case.

I couldn't think of a use case either.

> I've not look at this patch deeply but in DropRelFileNodeBuffer I think we
> can get the min value of all firstDelBlock and use it as the lower bound of
> block number that we're interested in. That way we can skip checking the array
> during scanning the buffer pool.

I'll take note of this suggestion.
Could you help me expand a bit more on this idea of skipping the internal loop by
comparing the minimum block number with the buffer descriptor (bufHdr)?

In the current patch, I've implemented the following in DropRelFileNodeBuffers:
for (i = 0; i < NBuffers; i++)
{
...
buf_state = LockBufHdr(bufHdr);
for (k = 0; k < nforks; k++)
{
if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
bufHdr->tag.forkNum == forkNum[k] &&
bufHdr->tag.blockNum >= firstDelBlock[k])
{
                        InvalidateBuffer(bufHdr); /* releases spinlock */
break;
}

> Don't we use each elements of nblocks for each fork? That is, each fork uses
> an element at its fork number in the nblocks array and sets InvalidBlockNumber
> for invalid slots, instead of passing the valid number of elements. That way
> the following code that exist at many places,
> 
> blocks[nforks] = visibilitymap_mark_truncate(rel, nblocks);
>if (BlockNumberIsValid(blocks[nforks]))
>{
>forks[nforks] = VISIBILITYMAP_FORKNUM;
>nforks++;
>}
> 
> would become
> 
> blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel,
> nblocks);

In the patch, we want to truncate all forks' blocks simultaneously, so
we optimize the invalidation of buffers and reduce the number of loops
using those values.
The suggestion above would mean removing the forks array and its
size (nforks), is that correct? But I think we'd need the fork array
and nforks to execute the truncation all at once.
If I'm missing something, I'd really appreciate your further comments.
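Just to spell out the two array layouts being compared (illustration only;
vm_new_nblocks and the block numbers below are arbitrary):

	/* (a) Packed arrays, as in the current patch: only forks that actually
	 * need truncation are appended, and nforks says how many. */
	forks[nforks] = VISIBILITYMAP_FORKNUM;
	blocks[nforks] = vm_new_nblocks;
	nforks++;

	/* (b) Fork-number-indexed array, as suggested: one slot per fork,
	 * with InvalidBlockNumber marking forks that should be left alone. */
	firstDelBlock[MAIN_FORKNUM] = 1000;
	firstDelBlock[FSM_FORKNUM] = InvalidBlockNumber;	/* skip FSM */
	firstDelBlock[VISIBILITYMAP_FORKNUM] = 20;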

--
Thank you everyone for taking a look at my thread.
I've also already added this patch to the CommitFest app.

Regards,
Kirk Jamison


RE: [PATCH] Speedup truncates of relation forks

2019-06-12 Thread Jamison, Kirk
On Tuesday, June 11, 2019 7:23 PM (GMT+9), Adrien Nayrat wrote:

> > Attached is a patch to speed up the performance of truncates of relations.
> 
> Thanks for working on this!

Thank you also for taking a look at my thread. 

> > If you want to test with large number of relations,
> > you may use the stored functions I used here:
> > http://bit.ly/reltruncates
> 
> You should post these functions in this thread for the archives ;)
This is noted. Pasting it below: 

create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'create table tab_' || i::text || ' (a int);';
execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function delfrom_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'delete from tab_' || i::text;
execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function insert_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'insert into tab_' || i::text || ' VALUES (5);' ;
execute query_string;
  end loop;
end;
$$ language plpgsql;


> From a user POW, the main issue with relation truncation is that it can block
> queries on standby server during truncation replay.
> 
> It could be interesting if you can test this case and give results of your
> path.
> Maybe by performing read queries on standby server and counting wait_event
> with pg_wait_sampling?

Thanks for the suggestion. I tried using the pg_wait_sampling extension,
but I wasn't sure that I could replicate the problem of blocked queries on the
standby server.
Could you advise?
Here's what I did for now, similar to my previous test with the hot standby setup,
but with additional read queries of the wait events on the standby server.

128MB shared_buffers
SELECT create_tables(1);
SELECT insert_tables(1);
SELECT delfrom_tables(1);

[Before VACUUM]
Standby: SELECT the following view from pg_stat_waitaccum

wait_event_type |   wait_event| calls | microsec
-+-+---+--
 Client  | ClientRead  | 2 | 20887759
 IO  | DataFileRead|   175 | 2788
 IO  | RelationMapRead | 4 |   26
 IO  | SLRURead| 2 |   38

Primary: Execute VACUUM (induces relation truncates)

[After VACUUM]
Standby:
 wait_event_type |   wait_event| calls | microsec
-+-+---+--
 Client  | ClientRead  | 7 | 77662067
 IO  | DataFileRead|   284 | 4523
 IO  | RelationMapRead |10 |   51
 IO  | SLRURead| 3 |   57

Regards,
Kirk Jamison


RE: libpq debug log

2019-04-23 Thread Jamison, Kirk
Hi Aya-san,

I tested your v3 patch and it seemed to work on my Linux environment.
However, the CF Patch Tester detected a build failure (probably on Windows).
Refer to: http://commitfest.cputube.org/

Docs:
It would be better to have a reference to the documentation of the
Frontend/Backend Protocol's "Message Format".

Code:
There are some protocol message types from the frontend
that you missed indicating (non-BYTE1 types):
CancelRequest (F), StartupMessage (F), SSLRequest (F).

Although I haven't tested those actual messages,
I assume they will be printed as follows, since the length and
message will still be recognized:
ex. Timestamp 8 80877103

So you need to indicate these protocol message types as intended:
ex. Timestamp > SSLRequest 8 80877103
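For reference, these three startup-style messages carry no one-byte type code; they are
just a length word followed by a magic request code. A rough summary of their wire
layout (per the protocol documentation; the codes below are the standard ones):

	/*
	 * SSLRequest:     int32 length = 8,  int32 code = 80877103
	 * CancelRequest:  int32 length = 16, int32 code = 80877102,
	 *                 int32 backend PID, int32 cancel key
	 * StartupMessage: int32 length, int32 protocol version = 196608 (3.0),
	 *                 then "name\0value\0" pairs ended by a final '\0'
	 */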


Regards,
Kirk Jamison





RE: Speedup of relation deletes during recovery

2019-04-15 Thread Jamison, Kirk
Hello Fujii-san, 

On April 18, 2018, Fujii Masao wrote: 

> On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki 
>  wrote:
>> Furthermore, TRUNCATE has a similar and worse issue.  While DROP TABLE 
>> scans the shared buffers once for each table, TRUNCATE does that for 
>> each fork, resulting in three scans per table.  How about changing the 
>> following functions
>>
>> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber 
>> nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
>>BlockNumber firstDelBlock)
>>
>> to
>>
>> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber 
>> *nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, 
>> ForkNumber *forkNum,
>>BlockNumber *firstDelBlock, 
>> int nforks)
>>
>> to perform the scan only once per table?  If there doesn't seem to be a 
>> problem,
>> I think I'll submit the patch next month.  I'd welcome if anyone could do 
>> that.
>
> Yeah, it's worth working on this problem. To decrease the number of scans of
> shared_buffers, you would need to change the order of truncations of files 
> and WAL
> logging. In RelationTruncate(), currently WAL is logged after FSM and VM are 
> truncated.
> IOW, with the patch, FSM and VM would need to be truncated after WAL logging. 
> You would
> need to check whether this reordering is valid.

I know it's been almost a year since the previous message, but if you can still
recall your suggestion above, I'd like to ask a question.
Could you explain your idea a bit further, on how the reordering of WAL writing and
file truncation could reduce the number of shared_buffers scans?
Thanks a lot in advance.

Regards,
Kirk Jamison


Minor fix in reloptions.c comments

2019-04-11 Thread Jamison, Kirk
Hi,

I found a minor grammar mistake while reading the reloptions.c code comments.
Attached is the fix.
I just changed "affect" to "effect", for both n_distinct and vacuum_truncate.
  - * values has no affect until the ...
  + * values has no effect until the ...

Regards,
Kirk Jamison


reloptions-comment-fix.patch
Description: reloptions-comment-fix.patch


RE: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-04-07 Thread Jamison, Kirk
On Monday, April 8, 2019 9:04 AM (GMT+9), Haribabu Kommi wrote:

>On Thu, Apr 4, 2019 at 3:29 PM Amit Kapila 
>mailto:amit.kapil...@gmail.com>> wrote:
>On Wed, Apr 3, 2019 at 10:45 AM Haribabu Kommi 
>mailto:kommi.harib...@gmail.com>> wrote:
>>
>> Thanks for the review.
>>
>> While changing the approach to use the is_parallel_worker_flag, I thought
>> that the rest of the stats are also not required to be updated and also those
>> are any way write operations and those values are zero anyway for parallel
>> workers.
>>
>> Instead of expanding the patch scope, I just changed to avoid the commit
>> or rollback stats as discussed, and later we can target the handling of all 
>> the
>> internal transactions and their corresponding stats.
>>

> The patch looks good to me.  I have changed the commit message and ran
> the pgindent in the attached patch.  Can you once see if that looks
> fine to you?  Also, we should backpatch this till 9.6.  So, can you
> once verify if the change is fine in all bank branches?   Also, test
> with a force_parallel_mode option.  I have already tested it with
> force_parallel_mode = 'regress' in HEAD, please test it in back
> branches as well.
>
>
> Thanks for the updated patch.
> I tested in back branches even with force_parallel_mode and it is working
> as expected. But the patches fail to apply in back branches, so I attached
> the patches for those branches. For v11 it applies with hunks.


There are 3 patches for this thread:
_v5: for PG v11 to current head
_10: for PG10 branch
_96: for PG9.6

I have also tried applying these latest patches.
The patch set applies, builds, and compiles correctly.
I also tested with force_parallel_mode, and it works as intended.

So I am marking this thread as “Ready for Committer”.
I hope this makes it into v12 before the feature freeze.


Regards,
Kirk Jamison


RE: Timeout parameters

2019-04-04 Thread Jamison, Kirk
On Thursday, April 4, 2019 5:20PM (GMT+9), Ryohei Nagaura wrote:

> > From: Fabien COELHO 
> > * About socket_timeout v12 patch, I'm not sure there is a consensus.
> I think so too. I just made the patch so that it can be committed anytime.
> 
> Finally, I rebased all the patches because they no longer applied to the
> updated PG.

I just checked and confirmed that the TCP USER TIMEOUT patch set v20 works.
Although you should capitalize "linux" to "Linux", as already mentioned before.
The committer can also just fix that very minor part, if the patch is deemed
committable.

Note to committer: The "Ready for Committer" status is mainly intended for
tcp user timeout parameter.

OTOH, unless there is consensus on socket_timeout,
the socket_timeout patch status remains "Needs Review" for now.

Regards,
Kirk Jamison





RE: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-04-02 Thread Jamison, Kirk
On Thursday, March 28, 2019 3:13 PM (GMT+9), Haribabu Kommi wrote:
> I tried the approach you suggested: by not counting the actual parallel
> worker transactions and just releasing the resources without touching the
> counters, the counts are not reduced much.
>
> HEAD - With 4 parallel workers running query generates 13 stats ( 4 * 3  + 1)
> Patch -  With 4 parallel workers running query generates 9 stats ( 4 * 2 + 1)
> Old approach patch - With 4 parallel workers running query generates 1 stat 
> (1)
>
> Currently the parallel worker start transaction 3 times in the following 
> places.
> 1. InitPostgres
> 2. ParallelWorkerMain (2)
>
> with the attached patch, we reduce one count from ParallelWorkerMain.

I'm sorry for the late review of the patch.
Patch applies, compiles cleanly, make check passes too.
I tested the recent patch again using the same method above
and confirmed that the generated stats increase by 9 with 4 parallel workers.

postgres=# BEGIN;
postgres=# select xact_commit from pg_stat_database where datname = 'postgres';
 xact_commit
-------------
          60
(1 row)
postgres=# explain analyze select * from tab where b = 6;
[snipped]
postgres=# COMMIT;
postgres=# select xact_commit from pg_stat_database where datname = 'postgres';
 xact_commit
-------------
          69

The intention of the latest patch is to fix the stats by not counting
parallel cooperating transactions, i.e. the transactions started
by StartParallelWorkerTransaction, based on Amit's advice.
After testing, the current patch works as intended.

However, the behavior of how parallel xact stats should be shown is
also currently being discussed in the later mails.
So the goal eventually is to improve how we present the
stats for parallel workers by differentiating between
internally-initiated transactions (bgworker xacts, autovacuum, etc.) and
user/application-initiated transactions, apart from keeping
the overall xact stats.
...So would fixing that latter issue change this current thread's fix?

I personally have no problems with committing this fix first
before fixing the latter problem discussed above.
But I think there should be a consensus regarding that one.

Regards,
Kirk Jamison


RE: Timeout parameters

2019-04-01 Thread Jamison, Kirk
Hi again,

Since Nagaura-san seems to have addressed the additional comments on
the tcp user timeout patches, I am retaining the status of the patch set as
"Ready for Committer".

As for socket_timeout, I agree that this can be discussed further.
>Fabien Coelho wrote:
>> I still think that this parameter should be preserved on psql's
>> reconnections when explicitly set to non zero.

The default value of socket_timeout parameter is 0.
Shouldn't it be back to default value on psql reconnection?
May I ask why and how should the previous non-zero setting be preserved
on psql reconnection?

Regards,
Kirk Jamison





RE: Timeout parameters

2019-03-29 Thread Jamison, Kirk
Hi Nagaura-san,

Thank you for the updated patches.
It became a long thread now, but it's okay, you've done a good amount of work.
There are 3 patches in total: 2 for tcp_user_timeout parameter, 1 for 
socket_timeout.


A. TCP_USER_TIMEOUT
Since I've been following the updates, I compiled a summary of the changes
you made to address all the reviewers' comments for TCP_USER_TIMEOUT:

For both client and server: apply ok, build ok, compile ok.

[DOCS]
I confirmed that docs were patterned from keepalives in both
config.sgml (backend) and libpq.sgml (interface). 
- changed "forcefully" to "forcibly"

It seems OK now to me.

[CODE]
1. SERVER  TCP_backend patch (v19)

- as per advice of Fabien, default postgres configuration file
has been modified (postgresql.conf.sample);
- as per advice of Horiguchi-san, I confirmed that the variable
is now grouped along with keepalives* under the TCP timeout settings.

- confirmed the removal of "errno != ENOPROTOOPT" in backend's 
pq_settcpusertimeout()

- confirmed the hook from show_tcp_user_timeout to pq_gettcpusertimeout(), 
based from the comment of Horiguchi-san to follow the tcp_keepalive* options' 
behavior.

- as per advice of Tsunakawa-san, GUC_NOT_IN_SAMPLE has been removed
and the default value is changed to 0.

- confirmed the modification of pq_settcpusertimeout():
a. to follow keepalives' pq_set* behavior as well
b. corrected the erroneous '<<' in the sizeof(timeout) comparison to a single '<';
'port->keepalives_count = count' changed to 'port->tcp_user_timeout = timeout'


2. CLIENT   TCP_interface patch (v18)

- confirmed that tcp_user_timeout is placed after keepalives options
in src/interfaces/libpq/fe-connect.c, 

- confirmed the change / integration for the comment below in 
fe-connect.c:PQconnectPoll():
>I don't see a point in the added part having own "if
>(!IS_AF_UNIX" block separately from tcp_keepalive options.

- confirmed the removal of the following:
a. "errno != ENOPROTOOPT" in setTCPUserTimeout()
b. unused parameter: char *tcp_user_timeout;
c. error for non-supported systems


Overall, I tested the patch using the same procedure I did in the above email,
and it worked well. Given that you addressed the comments from all
the reviewers, I've judged these 2 patches as committable now.



B. socket_timeout parameter

> FYI, I summarized a use case of this parameter.
> The connection is built successfully.
> Suppose that the server is hit by some kind of accident (e.g., as I or
> Tsunakawa-san suggested).
> At this time, there is no guarantee that the server OS can pass control to
> the postmaster.
> Therefore, it is considered natural to disconnect the connection from the
> client side.
> The place to implement the disconnection is pqWait(), because it is where
> users' infinite wait occurs.

apply, build, compile --> OK
Doc seems OK now to me.
The patch looks good to me as you addressed the code comments from 
Tsunakawa-san.
Although it's still the committer who will have the final say.



That said, I am now marking this one "Ready for Committer".
If others have new comments by weekend, feel free to change the status.


Regards,
Kirk Jamison


RE: Timeout parameters

2019-03-28 Thread Jamison, Kirk
Hi, 

>The socket_timeout patch needs the following fixes.  Now that others have
>already tested these patches successfully, they appear committable to me.

In addition, regarding socket_timeout parameter.
I referred to the doc in libpq.sgml, corrected misspellings,
and rephrased the doc a little bit as below:


Maximum wait in seconds (write as a decimal integer, e.g. 10) for socket 
read/write operation before closing the connection. A value of zero (the 
default) turns this off, which means wait indefinitely. The minimum allowed 
timeout is 2 seconds, so a value of 1 is interpreted as 2.

Although this can be used as a stopgap timeout measure, it is recommended
to set a value higher than the other timeout parameters
(connect_timeout, statement_timeout,
TCP_KEEP_ALIVES, TCP_USER_TIMEOUT)
because setting a smaller value will make the other configured timeout
parameters meaningless and can cause undesirable disconnection.

Regards,
Kirk Jamison






RE: Timeout parameters

2019-03-28 Thread Jamison, Kirk
Hi Nagaura-san,

>I updated my patches.
Thanks.

>In TCP_USER_TIMEOUT backend patch:
> 1) linux ver 2.6.26 -> 2.6.36
"Linux" should be capitalized.


I confirmed that you followed Horiguchi-san's advice
to base the doc from keepalives*. 
About your question:
> 3) Same as keepalives*, I used both "0" and zero.
> I can't recognize why both of them are used.
> Is there anyone who knows?

In config.sgml it uses both "zero" and "0",
while in libpq.sgml it only uses "zero".
Since you patterned it from keepalives* docs for both files,
I think it's alright doing it that way.

Regards,
Kirk Jamison





RE: Timeout parameters

2019-03-26 Thread Jamison, Kirk
On Tuesday, March 26, 2019 2:35 PM (GMT+9), Ryohei Nagaura wrote:

>> Your patch applies, however in TCP_backend_v10 patch, your 
>> documentation is missing a closing tag  so it could not be 
>> tested.
>> When that's fixed, it passes the make check.
>Oops! Fixed.

Ok. Confirmed the fix.
Minor nitpick: the semicolon is on a separate line when it shouldn't be.
+This parameter is supported only on systems that support 
TCP_USER_TIMEOUT
+; on other systems such as Windows, it has no effect and must be zero.

Anyway, for now maybe wait for other reviewers to confirm the latest doc
before you update the patches again.

Documentations aside, I tested the patch and below are the results.

1.) apply OK, build OK, make check ok, install ok.

2.) Testing
A. Configuration Setting
[pg_hba.conf]   Allow remote connection from client address
[postgresql.conf]
  listen_addresses = '*'
  # Optional just for checking logs in server 
  logging_collector = on

B. Tests
[Client-Side]
1. $ psql 
postgresql://USERNAME:PASSWORD@server_host:server_port/dbname?tcp_user_timeout=15000

2. postgres=# select inet_client_port();
 inet_client_port
------------------
            59396
3. (Via root user of client, other console window)
root# iptables -I INPUT -p tcp --dport 59396 -j DROP

4. postgres=# select pg_sleep(10);

5. Error output is displayed. Query is cancelled.
could not receive data from server: Connection timed out
postgres#=

#Inputting a query would attempt to reconnect to server

postgres=# select inet_client_port();
no connection to the server
The connection to the server was lost. Attempting reset: Succeeded.
postgres=# 

6. Next test; tested again but switching #3 & #4
(Note: There should be a new client port number by then.)
postgres=# select inet_client_port();
postgres=# set tcp_user_timeout=15000;
postgres=# select pg_sleep(10);
root# iptables -I INPUT -p tcp --dport 59397 -j DROP

# Client hangs. Server could not receive data from client, so connection timed 
out.
# Need to reconnect again from client.
^Z
[3]+  Stopped


Below are the logs in the server.

[Server-Side Logs]
#Test#1
[12949] LOG:  statement: select inet_client_port();
[12949] LOG:  statement: select pg_sleep(10);
[12949] LOG:  could not receive data from client: Connection timed out
#Test#2
[13163] LOG:  statement: select inet_client_port();
[13163] LOG:  statement: set tcp_user_timeout=15000;
[13163] LOG:  statement: select pg_sleep(10);
[13163] LOG:  could not receive data from client: Connection timed out

---
I also tested invalid keywords for tcp_user_timeout and errors were detected.
ERROR:  invalid value for parameter "tcp_user_timeout": "kkk"
ERROR:  -999 ms is outside the valid range for parameter "tcp_user_timeout" (0 
.. 2147483647)


Regards,
Kirk Jamison


RE: Timeout parameters

2019-03-25 Thread Jamison, Kirk
Hi Nagaura-san,

On Monday, March 25, 2019 2:26 PM (GMT+9), Ryohei Nagaura wrote:

>Yes, I want to commit TCP_USER_TIMEOUT patches in PG12.
>Also I'd like to continue to discuss about socket_timeout after this CF.
Ok. So I'd only take a look at TCP_USER_TIMEOUT parameter for now (this 
CommitFest),
and maybe we could resume the discussion on socket_timeout in the future.

> About TCP_USER_TIMEOUT:
> 1) Documentation and checking on UNIX systems: As far as I surveyed
> (Solaris and the BSD family), these UNIX OSes don't have the
> "TCP_USER_TIMEOUT" parameter.
> Accordingly I have not checked on a real machine.
> Also, I modified documentations to remove "equivalent to socket option"

Your patch applies; however, in the TCP_backend_v10 patch,
your documentation is missing a closing tag,
so it could not be tested.
When that's fixed, it passes make check.

Also regarding the docs,
what's the reason for emphasizing that Windows is not supported in a separate paragraph?
How about simplifying and combining the two paragraphs as below?

Specifies the number of milliseconds after which a TCP connection can be
aborted by the operating system due to network problems when sending or
receiving data through this connection. A value of zero uses the system default.
For connections made via a Unix-domain socket, this parameter is ignored. This
parameter is supported only on systems that support 
TCP_USER_TIMEOUT;
on other systems such as Windows, it has no effect and must be zero.

Or if you really want to emphasize which systems are supported and which are 
not,
you may remove the last sentence above, then indicate that in the note tag.
Example:
This parameter is supported only on systems that support 
TCP_USER_TIMEOUT such as Linux version 2.6.37 or later;
on other systems such as Windows, it has no effect and must be zero.

(Note: I haven't checked which Linux versions are supported,
I got it from your previous patch version.)

Regards,
Kirk Jamison


RE: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-19 Thread Jamison, Kirk
I tried to confirm the patch with the following configuration:
max_parallel_workers_per_gather = 2
autovacuum = off


postgres=# BEGIN;
BEGIN
postgres=# select xact_commit from pg_stat_database where datname = 'postgres';
 xact_commit
-------------
         118
(1 row)

postgres=# explain analyze select * from tab where b = 6;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..102331.58 rows=5 width=8) (actual time=0.984..1666.932 rows=99815 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tab  (cost=0.00..96331.58 rows=20833 width=8) (actual time=0.130..1587.463 rows=33272 loops=3)
         Filter: (b = 6)
         Rows Removed by Filter: 3300062
 Planning Time: 0.146 ms
 Execution Time: 1682.155 ms
(8 rows)

postgres=# COMMIT;
COMMIT
postgres=# select xact_commit from pg_stat_database where datname = 'postgres';
 xact_commit
-------------
         119
(1 row)

-

[Conclusion]

I have confirmed that with the patch (with autovacuum=off,
max_parallel_workers_per_gather = 2), the increment is only 1.
Without the patch, I have also confirmed that
the increment in xact_commit > 1.

It seems Amit also confirmed that this should be the case,
so the patch works as intended.

>> Is the same applied to parallel worker transaction
>> commits also?
>
> I don't think so.  It seems to me that we should consider it as a
> single transaction.

However, I cannot say whether counting parallel worker activity
as a single xact_commit relates to good performance.
But ISTM that with this improvement we have more accurate stats for xact_commit.

Regards,
Kirk Jamison


RE: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-19 Thread Jamison, Kirk
Hi Hari-san,

On Sunday, February 10, 2019 2:25 PM (GMT+9), Haribabu Kommi wrote:
> I try to fix it by adding a check for parallel worker or not and based on it
> count them into stats. Patch attached.
>
> With this patch, currently it doesn't count parallel worker transactions, and
> rest of the stats like seqscan and etc are still get counted. IMO they still
> may need to be counted as those stats represent the number of tuples
> returned and etc.
>
> Comments?

I took a look at your patch, and it’s pretty straightforward.
However, currently the patch does not apply, so I reattached an updated one
to keep the CFbot happy.

The previous patch also had a missing header to detect parallel workers
so I added that. --> #include "access/parallel.h"

Regards,
Kirk Jamison


0001-Avoid-counting-parallel-worker-transactions-stats_v2.patch
Description: 0001-Avoid-counting-parallel-worker-transactions-stats_v2.patch


RE: Timeout parameters

2019-03-17 Thread Jamison, Kirk
On Saturday, March 16, 2019 5:40 PM (GMT+9), Fabien COELHO wrote:

> > Fabien, I was wondering whether you can apply TCP_USER_TIMEOUT patch 
> > and continue discussion about 'socket_timeout'?

> I can apply nothing, I'm just a small-time reviewer.

> Committers on the thread are Michaël-san and Robert, however I'm not sure 
> that they are very sensitive to "please apply this patch" requests: they 
> are the lone judges of their own priorities.


Regarding user timeout parameters:

Based on the previous reviews of the code (it seems good) and the reviewers'
comments, everyone seems to agree that the user timeout parameters are
needed, so we can just wait for the updated patch.
The author, Nagaura-san, has gotten feedback from Robert for the doc part.
So if an updated patch is posted with the comments addressed, then I think we
can review it again for the final round.

---
As for socket_timeout parameters:

The use case for the socket timeout parameter is that it's a stop-gap approach
to prevent applications from waiting indefinitely for the DB server when other
timeout parameters such as keepalives and tcp_user_timeout fail to detect
the connection error. (That's why I thought of it as a network problem detector.)

The main argument here is about the security risk of allowing socket timeout
to cancel valid connections, right? Then to address that, I agree with
Tsunakawa-san to document that the value should at least be equal to or higher
than the other timeout parameters.
If documenting it is not enough, then we can enforce that within the code by
making sure that the socket_timeout value is greater than the other timeout
parameters (keepalives, tcp user timeout, statement timeout, etc.). Otherwise,
the socket_timeout parameter should not take effect even if it was switched on.
Or is that too much enforcement?

Regards,
Kirk Jamison


RE: pg_upgrade: Pass -j down to vacuumdb

2019-03-12 Thread Jamison, Kirk
Hi Jesper,

> Thanks Kirk !
>
> > On 3/12/19 2:20 PM, Robert Haas wrote:
> > The words 'by default' should be removed here, because there is also 
> > no non-default way to get that behavior, either.
> > 
> 
> Here is v9 based on Kirk's and your input.

Thanks! Although there was some trailing whitespace.
After fixing that, it works as intended (built and passed tests successfully).
That aside, the v9 patch looks good and the comment was addressed.


Regards,
Kirk Jamison


RE: pg_upgrade: Pass -j down to vacuumdb

2019-03-11 Thread Jamison, Kirk
Hi Jesper,

Sorry, I almost completely forgot to get back to you on this.
Actually, your patch worked when I tested it before,
and I understand the intention.
Although a point was raised by other developers about making
--jobs optional in the suggested line by using the env variable instead.

> > Kirk, can you provide a delta patch to match what you are thinking ?

I was thinking of something like the attached, 
but the user needs to edit the variable value though. 
I am not sure how to implement the suggestion to allow the 
script to absorb a value from the environment,
so that it doesn't need to be edited at all.

Regards,
Kirk Jamison


v8_0001-Use-vacuumdb-opt-env-variable-to-pass-vacuumdb-options.patch
Description: v8_0001-Use-vacuumdb-opt-env-variable-to-pass-vacuumdb-options.patch


analyze_new_cluster.sh
Description: analyze_new_cluster.sh


RE: pgbench - doCustom cleanup

2019-03-04 Thread Jamison, Kirk
Hi Fabien, 

>> See the attached latest patch.
>> The attached patch applies, builds cleanly, and passes the regression 
>> tests.
>
> No problems on my part as I find the changes logical.
> This also needs a confirmation from Alvaro.
>
> Ok.
>
> You switched the patch to "waiting on author": What are you waiting from me?
>
> If you think that it is ok and that it should be considered by a committer,
> probably Alvaro, it can be marked as "ready".

Indeed, it was my mistake.
I have already marked it as "Ready for Committer".

Regards,
Kirk Jamison




RE: pgbench - doCustom cleanup

2019-03-03 Thread Jamison, Kirk
Hi Fabien and Alvaro, 

I found that I had already reviewed this thread before,
so I tried to apply the patch, but part of a hunk failed
because of the unused line below, which was already removed in the
recent related commit.
>   PGresult   *res;
I removed the line and fixed the other trailing whitespace.
See the attached latest patch.
The attached patch applies, builds cleanly,
and passes the regression tests.

On Saturday, November 24, 2018 5:58PM (GMT+9), Fabien Coelho wrote:
> About the patch you committed, a post-commit review:
>
>  - the state and function renamings are indeed a good thing.
>
>  - I'm not fond of "now = func(..., now)", I'd have just passed a
>reference.
>
>  - I'd put out the meta commands, but keep the SQL case and the state
>assignment in the initial function, so that all state changes are in
>one function… which was the initial aim of the submission.
>Kind of a compromise:-)

I have confirmed the following changes:

1.
>  - I'm not fond of "now = func(..., now)", I'd have just passed a
>reference. 

1.1. advanceConnectionState(): 
Removed > now = doExecuteCommand(thread, st, now);
1.2. executeMetaCommand(): direct reference to state
Before:
>-  st->state = CSTATE_ABORTED;
>-  return now;
After:
>+  return CSTATE_ABORTED;

2. SQL_COMMAND type is executed in initial function advanceConnectionState(),
while META_COMMAND is executed in the subroutine executeMetaCommand().
This seems reasonable to me.

3. The function name also changed, which describes the subroutine better.
-static instr_time doExecuteCommand(TState *thread, CState *st, instr_time now);
+static ConnectionStateEnum executeMetaCommand(TState *thread, CState *st, instr_time *now);

No problems on my part as I find the changes logical.
This also needs a confirmation from Alvaro.


Regards,
Kirk Jamison


pgbench-state-change-followup-2.patch
Description: pgbench-state-change-followup-2.patch


RE: Timeout parameters

2019-03-03 Thread Jamison, Kirk
On Sunday, March 3, 2019 4:09PM (GMT+9), Fabien COELHO wrote:
>Basically same thing about the tcp_user_timeout guc v8, especially:
> do you have any advice about how I can test the feature, i.e. 
> trigger a timeout?
>
>> Patch applies & compiles cleanly. Global check is ok, although there 
>> are no specific tests.
>
>> Documentation English can be improved. Could a native speaker help, please?
+1. This needs help from a native English speaker, because I am not one,
although my name sounds like it. ;)

>> ISTM that the documentation both states that it works and does not 
>> work on Windows. One assertion must be false.
>>
>> Syntax error, eg "tcp_user_timeout=2bad", are detected, good.
>
>> I could not really test the feature, i.e. I could not trigger a 
>> timeout. Do you have a suggestion on how to test it?

I have also tested the previous patches and the most recent one.
I just followed the test instructions above, and maybe you can do the same.
Here's how I did it.

Setting:
- Before the test via server root user, add the port and client source
using the firewall-cmd to allow client to connect to DB server.
> systemctl start firewalld 
> firewall-cmd --add-source={clientipaddress}/32 --add-port={server_port}/tcp 
> --zone=public --permanent
> firewall-cmd --reload

Testing (v8 of user timeout parameters):

[Client-Side]
1. $ psql 
postgresql://USERNAME:PASSWORD@server_host:server_port/dbname?tcp_user_timeout=15000
2. postgres=# select inet_client_port();
 inet_client_port
------------------
            34819
3. (Via root user of client, other console window)
root# iptables -I INPUT -p tcp --dport 34819 -j DROP
4. postgres=# select pg_sleep(10);
5. Error output is displayed.
could not receive data from server: Connection timed out
--
Tested again but switching #3 & #4.
There should be a new client port number by then.
Below are the logs in the server.

[Server-Side Logs]
Test#1
[4736] LOG:  statement: select inet_client_port();
[4736] LOG:  statement: select pg_sleep(10);
[4736] LOG:  could not receive data from client: Connection timed out
Test#2
[5594] LOG:  statement: select inet_client_port();
[5594] LOG:  statement: set tcp_user_timeout=15000;
[5594] LOG:  statement: select pg_sleep(10);
[5594] LOG:  could not receive data from client: Connection timed out

Regards,
Kirk Jamison




RE: Timeout parameters

2019-02-26 Thread Jamison, Kirk
Hi Nagaura-san,

Your socket_timeout patch still does not apply, with either
git or the patch command. It says it's still corrupted.
I'm not sure about a workaround, because --ignore-space-change
and --ignore-whitespace did not work for me.
Maybe it has something to do with your editor when creating
the patch. Could you confirm?
The CFbot is also another way to check if your latest patches work.
http://commitfest.cputube.org/

On Tuesday, February 26, 2019 5:16PM (GMT+9), Nagaura, Ryohei wrote:
> "communication inactivity" seems to be a little extreme.
> If the communication layer is truly dead you will use keepalive.
> This use case is when socket option is not available for some reason.
> So it would be better "terminating the connection" in my thought.

About the doc: alright.
- There's a typo in the doc: connectoin --> connection
- In the code, there's a part where a value between 0 and 2 seconds
is interpreted as 2 because that's the minimum time. I think
the comment should be improved, using "insist" (force) instead
of "need". This should also be added in the documentation.

From:
/*
 * Rounding could cause communication to fail; need at least 2 secs
 */

To:
/*
 * Rounding could cause communication to fail;
 * insist on at least two seconds.
 */

Docs
- added how it should be written as decimal integer, similar
to connect_timeout
- used closing instead of terminating, to be consistent with
other params
- added the minimum allowed timeout

socket_timeout

Controls the number of seconds (write as a decimal integer, e.g. 10) of the
client's waiting time for individual socket read/write operations
before closing the connection. This can be used both as a brute-force global
query timeout and a network problem detector. A value of zero (the
default) turns this off, which means wait indefinitely. The minimum
allowed timeout is 2 seconds, so a value of 1 is interpreted as 2.

> About TCP_USER_TIMEOUT patches, there are only miscellaneous changes:
> removing trailing spaces and making comments of parameters lower case
> as you pointed out.
Thanks for the update.
Confirmed the change.


Regards,
Kirk Jamison



RE: Timeout parameters

2019-02-25 Thread Jamison, Kirk
On Monday, February 25, 2019 1:49 PM (GMT+9), Nagaura, Ryohei wrote:

> Thank you for discussion.
> I made documentation about socket_timeout and reflected Tsunakawa-san's
> comment in the new patch.
> # Mainly only fixing documentation...
> The current documentation of socket_timeout is as follows:
> socket_timeout
> Controls the number of second of client's waiting time for individual socket
> read/write operations. This can be used both as a force global query timeout
> and network problems detector. A value of zero (the default) turns this off.

Thanks for the update.
However, your socket_timeout patch currently does not apply.
$ git apply ../socket_timeout_v5.patch
fatal: corrupt patch at line 24
--> l24: diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c

>socket_timeout
>
> Controls the number of second of client's waiting time for individual socket 
> read/write
> operations. This can be used both as a force global query timeout and network 
> problems
> detector. A value of zero (the default) turns this off.

Got the doc fix. I wonder if we need to document what effect the parameter has:
terminating the connection. How about:

Controls the number of seconds of client-server communication inactivity
before forcibly closing the connection in order to prevent the client from
waiting indefinitely on individual socket read/write operations. This can
be used both as a brute-force global query timeout and a network problem
detector, i.e. hardware failure and dead connection. A value of zero
(the default) turns this off.

Well, you may remove the "i.e. hardware failure and dead connection" if
that's not necessary.

I know you've only added the doc recommendation.
But regarding the code, you may also fix other parts:
a. Use parse_int_param instead of atoi
>+  conn->socket_timeout = atoi(conn->pgsocket_timeout);

b. proper spacing after comma
>+  {"socket_timeout", NULL, NULL, NULL,
>+  "Socket-timeout","",10, /* strlen(INT32_MAX) == 10 */
>+  offsetof(struct pg_conn, pgsocket_timeout)},
... 
>+  result = pqSocketCheck(conn,forRead,forWrite,finish_time);

There are probably other parts I missed.
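For point (a) above, a sketch of what the replacement could look like,
assuming parse_int_param keeps the signature it already has in fe-connect.c
for the keepalive options:

    /* Sketch only: reject garbage values with a proper error message
     * instead of letting atoi() silently turn them into 0. */
    if (conn->pgsocket_timeout)
    {
        if (!parse_int_param(conn->pgsocket_timeout, &conn->socket_timeout,
                             conn, "socket_timeout"))
            return false;   /* error already recorded in conn->errorMessage;
                             * the surrounding function decides how to bail out */
    }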

Regards,
Kirk Jamison



RE: libpq debug log

2019-02-21 Thread Jamison, Kirk
On Wednesday, February 20, 2019 12:56 PM GMT+9, Robert Haas wrote:

> On Mon, Feb 18, 2019 at 10:06 PM Jamison, Kirk  
> wrote:
> > It sounds more logical to me if there is a parameter that switches 
> > on/off the logging similar to other postgres logs. I suggest trace_log as 
> > the parameter name.

> I'm not really sure that I like the design of this patch in any way.
> But leaving that aside, trace_log seems like a poor choice of name, 
> because it's got two words telling you what kind of thing we are doing 
> (tracing, logging) and zero words describing the nature of the thing 
> being traced or logged (the wire protocol communication with the client).
> A tracing facility for the planner, or the executor, or any other part
> of the system would have an equally good claim on the same name.  That
> can't be right.

Agreed with the argument about the trace_log parameter name. I just threw out
a quick idea. I didn't think about it too deeply, but just thought of a
switch that would enable or disable the feature. So there are
definitely better names than that. And as you suggested, it should
describe specifically what the feature does.

Regards,
Kirk Jamison


RE: Timeout parameters

2019-02-21 Thread Jamison, Kirk
On Friday, February 22, 2019 9:46 AM (GMT+9), Tsunakawa, Takayuki wrote:

> > Terminate any session that has been idle for more than the 
> > specified number of seconds to prevent client from infinite 
> > waiting for server due to dead connection. This can be used both as a 
> > brute force global query timeout and detecting network problems.
> > A value of zero (the default) turns this off.
>
> Looks better than the original one.  However,
> 
> * The unit is not millisecond, but second.
> * Doesn't restart the session.
> * Choose words that better align with the existing libpq parameters.
> e.g. "connection" should be used instead of "session" as in 
> keepalive_xxx parameters, because there's not a concept of database
> session at socket level.
> * Indicate that the timeout is for individual socket operations.

My bad. I was looking at the wrong documentation, so seconds should be used.
I guess we should just choose socket_timeout as the parameter name.


socket_timeout (integer)

Terminate any connection that has been inactive for more than the specified 
number of seconds to prevent client from infinite waiting for individual socket 
read operations due to dead connection. This can be used both as a force global 
query timeout and network problems detector. A value of zero (the default) 
turns this off.

or

Controls the number of seconds of connection inactivity to prevent client from 
infinite waiting for individual socket read operations due to dead connection. 
This can be used both as a force global query timeout and network problems 
detector. A value of zero (the default) turns this off. 


Regards,
Kirk Jamison



RE: Timeout parameters

2019-02-21 Thread Jamison, Kirk
Hi,

> > tcp_socket_timeout (integer)
> >
> > Terminate and restart any session that has been idle for more than
> > the specified number of milliseconds to prevent client from infinite
> > waiting for server due to dead connection. This can be used both as
> > a brute force global query timeout and detecting network problems.
> > A value of zero (the default) turns this off.

On Friday, February 22, 2019 12:01 AM (GMT+9), 
mikalaike...@ibagroup.eu
wrote:
> I am not very familiar with the PostgreSQL source code. Nevertheless,
> the main idea of this parameter is clear to me - closing a connection
> when the PostgreSQL server does not respond for any reason. However,
> I have not found in the discussion a reference that this parameter can
> be applied to the TCP as well as to the UNIX-domain sockets. Moreover,
> this parameter works outside the communication layer. When we consider TCP
> communication, the failover is covered by the keepalive and tcp_user_timeout
> parameters.
>
> According to it, we should not use 'tcp' prefix in this parameter name,
> 'socket' sub string is not suitable too.
>
> 'timeout' is OK.
>
> This parameter works on the client side. So the word 'client' is a good
> candidate for using in this parameter name.
> This parameter affects only when we send a 'query' to the pg server.

Yeah, there should be no "tcp" in that parameter name; that was my mistake.
Regarding whether we use "socket_timeout" or "client_query_timeout",
why is socket not suitable?
Although I'm not arguing against client_query_timeout, as it's also
a good candidate parameter name.

> Based on it, we can build a name for this parameter 'client_query_timeout'.
>
> The suggested explanation of this parameter does not follow the aim of
> integrating this parameter:
>
> client_query_timeout
>
> Specifies the number of seconds to prevent client from infinite waiting
> for server acknowledge to the sent query due to dead connection. This
> can be used both as a force global query timeout and network problems
> detector. A value of zero (the default) turns this off.

Thanks for the fix. On the client side though, other parameters are specified
in milliseconds, so I used the same.
Regards,
Kirk Jamison



RE: Timeout parameters

2019-02-21 Thread Jamison, Kirk
On Thursday, February 21, 2019 2:56 PM (GMT+9), Tsunakawa, Takayuki wrote:

>> 1) tcp_user_timeout parameter
>> I think this can be "committed" separately when it's finalized.

> Do you mean you've reviewed and tested the patch by simulating a 
> communication failure in the way Nagaura-san suggested?

Although I did review and follow the suggested way in a previous email
(which uses the root user) and it worked as intended, I'd also like
to hear feedback from Fabien on whether it's alright without the test
script, or if there's another way we can test it (maybe none?).
Then I think it's in good shape.
However, what I meant by my statement above was to have a separate
decision on committing the two parameters, because based on the thread,
tcp_user_timeout has made more progress when it comes to reviews.
So once it's ready, it can be committed separately without waiting for
the socket_timeout param to be ready. (I don't think that requires a
separate thread.)

>> 2) tcp_socket_timeout parameter
>> - whether (a) it should abort the connection from pqWait() or
>>   other means, or
>> (b) cancel the statement similar to how psql
>>   does it as suggested by Fabien

> We have no choice but to terminate the connection, because we can't tell
> whether we can recover from the problem and continue to use the connection
> (e.g. long-running query) or not (permanent server or network failure).
>
> Regarding the place, pqWait() is the best (and possibly only) place. 
> The purpose of this feature is to avoid waiting for response from the
> server forever (or too long) in any case, as a last resort.

Thank you for explaining further. I think that clarifies the implementation
argument on why it needs to terminate the connection rather than cancel the query.
In short, I think its intended purpose is to guard against dead connections.
It addresses the cases where a network or server failure occurs and
where the DBMS is terminated abruptly. So far, no existing parameter covers that.

As for the place, my knowledge is limited, so I won't be able to provide
substantial help on it. Fabien pointed out some comments about it that need
clarification though.
Quoting what Fabien wrote:
"The implementation looks awkward, because part of the logic of pqWaitTimed
is reimplemented in pqWait. Also, I do not understand the computation
of finish_time, which seems to assume that pqWait is going to be called
immediately after sending a query, which may or may not be the case, and
if it is not, the time() call is not the start of the statement."

> Oracle has similar parameters called SQLNET.RECV_TIMEOUT and 
> SQLNET.SEND_TIMEOUT.
> From those names, I guess they use SO_RCVTIMEO and SO_SNDTIMEO socket 
> options. 
> However, we can't use them because we use non-blocking sockets and poll(), while
> SO_RCV/SND_TIMEO do not have an effect for poll():
> [..snipped]
>
>> - proper parameter name
>
> I think the name is good because it indicates the socket-level timeout.
> That's just like PgJDBC and Oracle, and I didn't feel strange when I read 
> their manuals.

Agreed. I'm actually +1 on tcp_socket_timeout.


>> Perhaps you could also clarify a bit more through documentation on how 
>> socket_timeout handles the timeout differently from statement_timeout 
>> and tcp_user_timeout.

> Maybe.  Could you suggest good description?

Ok. I just realized that requires careful explanation.
How about the one below? 
I got it from combined explanations above. I did not include the 
differences though. This can be improved as the discussion
continues along the way.

tcp_socket_timeout (integer)

Terminate and restart any session that has been idle for more than
the specified number of milliseconds to prevent client from infinite
waiting for server due to dead connection. This can be used both as
a brute force global query timeout and detecting network problems.
A value of zero (the default) turns this off.


Regards,
Kirk Jamison



RE: Timeout parameters

2019-02-19 Thread Jamison, Kirk
Hi,

I tried to re-read the whole thread.
Based on what I read, there are two proposed timeout parameters,
which I think can be discussed and committed separately:
(1) tcp_user_timeout
(2) tcp_socket_timeout (or the suggested client_statement_timeout,
send_timeout/recv_timeout)

Regarding the use-case of each parameter, Tsunakawa-san briefly
explained them above. Quoting again:
> * TCP keepalive and TCP user (retransmission) timeout: for network problems
> * statement_timeout: for long-running queries
> * socket_timeout (or send/recv_timeout): for saturated servers

The already existing statement_timeout mainly limits how long
each statement should run. [1]
However, even if statement_timeout is configured, it does not
handle the timeout for cases where a network failure occurs,
so the application would not recover from the error.
Therefore, there's a need for these features, to cover the cases
that statement_timeout currently does not handle.

1) tcp_user_timeout parameter
As for the user_timeout param, there seems to be a common agreement
with regard to its need.

Just minor nitpick:
+   char   *tcp_user_timeout;   /* TCP USER TIMEOUT */
I think that's unnecessary capitalization in the user timeout part.

The latest tcp_user_timeout patch seems to be almost in good shape;
the feedback about the docs (clearer description from Andrei)
and the code (whitespace, C style, parse_int_param, handling of old kernel
versions) was addressed.

I think this can be "committed" separately when it's finalized.
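As background, on Linux the option the patch wraps boils down to a plain
setsockopt() call, roughly like the sketch below (error handling elided; the
wrapper name is illustrative):

    #include <sys/socket.h>
    #include <netinet/tcp.h>        /* TCP_USER_TIMEOUT, Linux only */

    /* Sketch: timeout_ms is the maximum time transmitted data may stay
     * unacknowledged before the kernel forcibly closes the connection;
     * 0 keeps the system default. */
    static int
    set_tcp_user_timeout(int sockfd, unsigned int timeout_ms)
    {
        return setsockopt(sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                          &timeout_ms, sizeof(timeout_ms));
    }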


2) tcp_socket_timeout parameter
On the other hand, there needs to be further discussion and design
improvement with regard to the implementation of socket_timeout:
- whether it should (a) abort the connection from pqWait() or by
  other means, or (b) cancel the statement, similar to how psql
  does it, as suggested by Fabien
- a proper parameter name
 
Based on your description below, I agree with Fabien that it is, in effect,
an application/client-side query timeout:
> "socket_timeout" is the application layer timeout parameter from when 
> frontend issues SQL query to when frontend receives the execution result 
> from backend. When this parameter is active and timeout occurs, frontend 
> close the socket. It is a merit for client to set the maximum time to 
> wait for SQL.  

In PgJDBC, it serves two purposes though: query timeout and network problem
detection. The use of socketTimeout aborts the connection. [2]
> The timeout value used for socket read operations. If reading from 
> the server takes longer than this value, the connection is closed. 
> This can be used as both a brute force global query timeout and a
> method of detecting network problems. The timeout is specified in 
> seconds and a value of zero means that it is disabled.
 
Perhaps you could also clarify a bit more through documentation on how
socket_timeout handles the timeout differently from statement_timeout
and tcp_user_timeout.
Then we can decide on which parameter name is better once the
implementation becomes clearer.

[1] https://www.postgresql.org/docs/devel/runtime-config-client.html
[2] 
https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters


Regards,
Kirk Jamison



RE: libpq debug log

2019-02-18 Thread Jamison, Kirk
Hi Iwata-san,

Currently, the patch fails to build according to the CF app.
As you know, it has something to do with the misspelling of a function name:
GetTimezoneInformation --> GetTimeZoneInformation

It sounds more logical to me if there is a parameter that switches the logging
on/off, similar to other postgres logs. I suggest trace_log as the parameter name.
That is, this parameter needs to be enabled for logdir, logsize, loglevel, etc.
to work.
The default is off.

If you're going to introduce that parameter, then the docs should be updated as 
well.
i.e. "When trace_log is enabled, this parameter..."

How about changing the following parameter names:
logdir --> log_directory
logsize --> log_size
logminlevel --> log_min_level

If it's helpful, you might want to look into how the other postgres logs (e.g.
the syslogger) allow setting either an absolute or relative path for the log
directory, and how their parameters cover some of the comments above by
Ramanarayana.

Regards,
Kirk Jamison


RE: libpq debug log

2019-02-17 Thread Jamison, Kirk
On February 14, 2019 6:16 PM +, Andres Freund wrote:

> Hi,

> On 2018-11-28 23:20:03 +0100, Peter Eisentraut wrote:
> > This does not excite me.  It seems mostly redundant with using tcpdump.

> I think the one counter-argument to this is that using tcpdump in real-world
> scenarios has become quite hard, due to encryption. Even with access to the
> private key you cannot decrypt the stream.  Wonder if the solution to that
> would be an option to write out the decrypted data into a .pcap or such.

I agree that network debug trace logs would be useful for users not
knowledgeable about postgres internals, so I understand the value of the feature,
as long as only necessary/digestible information is output.
I'll also check the patch later.

For Andres: I haven't looked into tcpdump yet, but I'd like to ask whether
or not the decrypted output to .pcap (if implemented) would be useful to
application users. What could be the limitations?
Could you explain the idea a bit further?

Regards,
Kirk Jamison




RE: idle-in-transaction timeout error does not give a hint

2019-02-14 Thread Jamison, Kirk
Hi,

On Monday, February 4, 2019 2:15 AM +, Michael Paquier wrote:
> On Tue, Dec 04, 2018 at 04:07:34AM +, Ideriha, Takeshi wrote:
> > Sure. I didn't have a strong opinion about it, so it's ok.

> From what I can see this is waiting input from a native English speaker, so 
> for now I have moved this entry to next CF.

Although I also use English in daily life,
I am not from a native English-speaking country.
But I'm giving it a try, since the current patch is not that complex to check.

> HINT: In a moment you should be able to reconnect to the
>   database and restart your transaction.

I think this error hint message makes sense for idle-in-transaction timeout.
It's grammatically correct and gives a clearer hint for that.

I agree that "reconnect to database" implies "restart session",
so it may not be the best word for errhint message.

Now the next point raised by Ideriha-san is whether the
other places in the code should also be changed.

I also checked the source code where the other errhints appear,
besides idle-in-transaction.
(ERRCODE_CRASH_SHUTDOWN and ERRCODE_T_R_SERIALIZATION_FAILURE)

Those errcodes use "repeat your command" for the hint.
> errhint("In a moment you should be able to reconnect to the"
> " database and repeat your command.")));

(1) errcode(ERRCODE_CRASH_SHUTDOWN)
 As per its errdetail, the transaction is rolled back,
 so "restart transaction" also makes sense.

(2) ERRCODE_T_R_SERIALIZATION_FAILURE
 The second one is under the clause below...
 > if (RecoveryConflictPending && DoingCommandRead)
 ...so using "repeat your command" makes sense.
 
I'm fine with retaining the other two hint messages as they are.
Regardless of whether we want to change the other similarly written
errhint messages or add errhint messages for other errcodes
that don't have one, I think the latest patch that provides an errhint
message for the idle-in-transaction timeout may be changed to
"Ready for Committer", as it still applies and the tests
pass successfully.
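For reference, the change under review essentially amounts to attaching the
hint to the existing FATAL report, along these lines (a sketch of the idea;
the exact wording is what is being reviewed):

    ereport(FATAL,
            (errcode(ERRCODE_IDLE_IN_TRANSACTION_SESSION_TIMEOUT),
             errmsg("terminating connection due to idle-in-transaction timeout"),
             errhint("In a moment you should be able to reconnect to the"
                     " database and restart your transaction.")));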

Regards,
Kirk Jamison



RE: Cache relation sizes?

2019-02-12 Thread Jamison, Kirk
On February 6, 2019, 8:57 AM +, Andres Freund wrote:
> Maybe I'm missing something here, but why is it actually necessary to
> have the sizes in shared memory, if we're just talking about caching
> sizes?  It's pretty darn cheap to determine the filesize of a file that
> has been recently stat()/lseek()/ed, and introducing per-file shared
> data adds *substantial* complexity, because the amount of shared memory
> needs to be pre-determined.  The reason I want to put per-relation data
> into shared memory is different, it's about putting the buffer mapping
> into shared memory, and that, as a prerequisite, also need per-relation
> data. And there's a limit of the number of relations that need to be
> open (one per cached page at max), and space can be freed by evicting
> pages.

Ahh.. You are right about the logic of putting it in shared memory.
As for Thomas' toy patch, multiple files share one counter in shmem.
Although it currently works, it is likely to miss.
His eventual plan for the idea is to use an array of N counters
and map relation OIDs onto them.
But as your point about complexity says, in shared memory we cannot
share the same area among multiple files, so an area needs to be
allocated depending on the number of files.

Regarding the allocation of per-relation data in shared memory, I
thought it could be a separate component at first, so I asked about the
validity of the idea. But now I see the point raised.

Regards,
Kirk Jamison




RE: Cache relation sizes?

2019-02-06 Thread Jamison, Kirk
On February 6, 2019, 08:25AM +, Kyotaro HORIGUCHI wrote:

>At Wed, 6 Feb 2019 06:29:15 +, "Tsunakawa, Takayuki" 
> wrote:
>> Although I haven't looked deeply at Thomas's patch yet, there's currently no 
>> place to store the size per relation in shared memory.  You have to wait for 
>> the global metacache that Ideriha-san is addressing.  Then, you can store 
>> the relation size in the RelationData structure in relcache.

>Just one counter in the patch *seems* to give a significant gain compared to
>the complexity, given that lseek is so complex or brings latency,
>especially on workloads where the file is scarcely changed. Though I didn't run
>it on a test bench.

> > > (2) Is the MdSharedData temporary or permanent in shared memory?
> > Permanent in shared memory.
> I'm not sure the duration of the 'permanent' there, but it disappears when 
> server stops. Anyway it doesn't need to be permanent beyond a server restart.


Thank you for the insights.
I did a simple test in the previous email using simple syscall tracing;
the patch significantly reduced the number of lseek syscalls
(though that simple test might not be enough to demonstrate the performance benefit).

Regarding Tsunakawa-san's comment,
in Thomas' patch, he made a place in shared memory that stores the
relsize_change_counter, so I am thinking of utilizing the same,
but for caching the relsize itself.

Perhaps I should explain further the intention for the design.

The first step is to cache the file size in shared memory. Similar to the
intention or purpose of the patch written by Thomas, this reduces the
number of lseek(SEEK_END) calls by caching the relation size without using
additional locks. The difference is that the relation size itself is cached
in shared memory. I wonder if there are problems that you can see with
this approach.

Eventually, the next step is to have a structure in shared memory
that caches file addresses along with their sizes (Andres' idea of
putting an open-relations table in shared memory). With a
structure that groups buffers into file relation units, we can get
the file size directly from shared memory, so the assumption is it would
be faster whenever we truncate/extend our relations, because we can
track the offset of the changes in size and use ranges for managing
the truncation, etc.
The next step is a complex direction that needs serious discussion,
but I am wondering if we can proceed with the first step for now if
the idea and direction are valid.
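To make the first step a bit more concrete, here is a very rough sketch of
the kind of shared structure I have in mind. The names are made up for
illustration, and locking, invalidation, and sizing are all open questions:

    /* Sketch only: a fixed-size table in shared memory caching the last
     * known size of each relation fork, keyed by its file node.  Real
     * code would need locking (or atomics) and an eviction policy. */
    typedef struct CachedRelSize
    {
        RelFileNode rnode;                      /* which relation file */
        BlockNumber nblocks[MAX_FORKNUM + 1];   /* cached size per fork */
        bool        valid;
    } CachedRelSize;

    typedef struct RelSizeCacheData
    {
        int             num_entries;
        CachedRelSize   entries[FLEXIBLE_ARRAY_MEMBER];
    } RelSizeCacheData;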

Regards,
Kirk Jamison




RE: pg_upgrade: Pass -j down to vacuumdb

2019-02-03 Thread Jamison, Kirk
Hi, 

On February 1, 2019 8:14 PM +, Jesper Pedersen wrote:

> On 2/1/19 4:58 AM, Alvaro Herrera wrote:
>> On 2019-Feb-01, Jamison, Kirk wrote:
>>> I'm not sure if misunderstood the purpose of $VACUUMDB_OPTS. I 
>>> thought what the other developers suggested is to utilize it so that 
>>> --jobs for vacuumdb would be optional and just based from user-specified 
>>> option $VACUUMDB_OPTS.
>>> By which it would not utilize the amount of jobs used for pg_upgrade.
>>> To address your need of needing a parallel, the script would just 
>>> then suggest if the user wants a --jobs option. The if/else for 
>>> number of jobs is not needed in that case, maybe only for ifndef WIN32 else 
>>> case.
>> 
>> No Kirk, you got it right -- (some of) those ifdefs are not needed, as 
>> adding the --jobs in the command line is not needed.  I do think some 
>> extra whitespace in the format string is needed.


> The point of the patch is to highlight to the user that even though he/she 
> does "pg_upgrade -j 8" the "-j 8" option isn't passed down to vacuumdb.
> Default value is 1, so in that case the echo command doesn't highlight that 
> fact, otherwise it is. The user can then cancel the script and use the 
> suggested command line.
> The script then executes the vaccumdb command with the $VACUUMDB_OPTS 
> argument as noted in the documentation.
> Sample script attached as well.

Sorry, I think I now understand where you are coming from:
your script clarifies that the option is not necessarily passed down.
If committers let it pass, maybe it's worth highlighting in the script,
something like:
"If you would like default statistics as quickly as possible (e.g. run in
parallel)..."

The difference is perhaps that your use of the --jobs option is
"explicitly" embedded in the code, compared to the suggestion of
making it optional by using the $VACUUMDB_OPTS variable, so that
--jobs need not be written explicitly unless the user wants it
(which I also think is a better improvement because it's optional).

A quick code check; sorry, I don't have much time to write a patch for it
today.
Perhaps you could write it similarly to what you did in the --analyze-in-stages
part.
Maybe something like (not sure if correct):
+   fprintf(script, "echo %s\"%s/vacuumdb\" %s $VACUUMDB_OPTS --all 
%s%s\n", ECHO_QUOTE,
+   new_cluster.bindir, user_specification.data,
+   /* Did we copy the free space files? */
+   (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+   "--analyze-only" : "--analyze", ECHO_QUOTE);

I'll continue to review though from time to time.

Regards,
Kirk Jamison


RE: pg_upgrade: Pass -j down to vacuumdb

2019-02-01 Thread Jamison, Kirk
On January 31, 2019, 9:29PM +, Jesper Pedersen wrote:

>>> I added most of the documentation back, as requested by Kirk.
>> 
>> Actually, I find it useful to be documented. However, major contributors 
>> have higher opinions in terms of experience, so I think it's alright with me 
>> not to include the doc part if they finally say so. 
>
> I think we need to leave it to the Committer to decide, as both Peter and 
> Michael are committers; provided the patch reaches RfC.

Agreed.

>>> 1) You still enforce -j to use the number of jobs that the caller of 
>>> pg_upgrade provides, and we agreed that both things are separate 
>>> concepts upthread, no?  What has been suggested by Alvaro is to add 
>>> a comment so as one can use VACUUM_OPTS with -j optionally, instead 
>>> of suggesting a full-fledged vacuumdb command which depends on what 
>>> pg_upgrade uses.  So there is no actual need for the if/else 
>>> complication business.
> 
>> I think it is ok for the echo command to highlight to the user that 
>> running --analyze-only using the same amount of jobs will give a faster 
>> result.

Since you used user_opts.jobs (which comes from pg_upgrade, does it not?),
could you clarify the statement above a bit more? Or did you mean that
it won't be a problem for vacuumdb to use the same value?
Though it is arguable correctness-wise, if the committers let it pass
based on your answer, then I think it's alright.

I'm not sure if I misunderstood the purpose of $VACUUMDB_OPTS. I thought what
the other developers suggested was to utilize it so that --jobs for vacuumdb
would be optional and based only on the user-specified $VACUUMDB_OPTS,
which means it would not reuse the number of jobs used for pg_upgrade.
To address your need for parallelism, the script would then just
suggest a --jobs option if the user wants one. The if/else for the number of
jobs is not needed in that case, except perhaps for the ifndef WIN32 / else case.

Regards,
Kirk Jamison




RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-01-31 Thread Jamison, Kirk
On February 1, 2019, Tsunakawa, Takayuki wrote: 

>> As most people seem to agree adding the reloption, here's the patch.  
>> It passes make check, and works like this:

>Sorry, I forgot to include the modified header file.  Revised patch attached.

Thank you for this.
I applied the patch. It builds successfully and passes the regression tests.
I also tried testing with the parameter enabled and disabled,
and it works as intended for CREATE TABLE and ALTER TABLE a SET (shrink_enabled
= on/off) and RESET (shrink_enabled).
I have yet to run the performance benchmark.

I wonder if there is a better reloption name than shrink_enabled.
(truncate_enabled, vacuum_enabled? Hmm. No?)
On the other hand, shrink_enabled seems to describe well what it's supposed to
do when vacuuming tables.
Besides, there's a similarly named autovacuum_enabled option.

I think if most seem to agree to have this solution in place
and to review it further to cover what might be missing,
then shall we register this in the next CF?

Regards,
Kirk Jamison


RE: pg_upgrade: Pass -j down to vacuumdb

2019-01-30 Thread Jamison, Kirk
On January 29, 2019 8:19 PM +, Jesper Pedersen wrote:
>On 1/29/19 12:08 AM, Michael Paquier wrote:
>> I would document the optional VACUUM_OPTS on the page of pg_upgrade.
>> If Peter thinks it is fine to not do so, that's fine for me as well.
>> 
..
>I added most of the documentation back, as requested by Kirk.

Actually, I find it useful to have it documented. However, the major contributors'
opinions carry more weight given their experience, so it's alright with me not to
include the doc part if they ultimately say so.

>> It seems to me that the latest patch sent is incorrect for multiple
>> reasons:
>> 1) You still enforce -j to use the number of jobs that the caller of 
>> pg_upgrade provides, and we agreed that both things are separate 
>> concepts upthread, no?  What has been suggested by Alvaro is to add a 
>> comment so as one can use VACUUM_OPTS with -j optionally, instead of 
>> suggesting a full-fledged vacuumdb command which depends on what 
>> pg_upgrade uses.  So there is no actual need for the if/else 
>> complication business.

> I think it is ok for the echo command to highlight to the user that running 
> --analyze-only using the same amount of jobs will give a faster result.

I missed that part. 
IIUC, using the $VACUUMDB_OPTS variable would simplify and correct the usage of jobs
for vacuumdb (pg_upgrade's jobs setting is not equivalent to vacuumdb's). Plus, it
might reduce the number of additional lines.
Tom Lane also suggested above making the script absorb the value from the environment.
Would that address your concern about getting a faster result?


Regards,
Kirk Jamison




RE: pg_upgrade: Pass -j down to vacuumdb

2019-01-28 Thread Jamison, Kirk
Hi Jesper,

Thanks for updating your patches quickly.

>On 1/28/19 3:50 PM, Peter Eisentraut wrote:
>>> Done, and v4 attached.
>> 
>> I would drop the changes in pgupgrade.sgml.  We don't need to explain 
>> what doesn't happen, when nobody before now ever thought that it would 
>> happen.
>> 
>> Also, we don't need the versioning stuff.  The new cluster in 
>> pg_upgrade is always of the current version, and we know what that one 
>> supports.
>> 

>Done.

I just checked the patch.
As per the advice, you removed the versioning and specified --jobs.
The patch still applies, builds, and passes the tests successfully.

I'd argue, though, that it would be helpful to have a short note in the documentation
clarifying that the option is not passed down to vacuumdb by default. I think the
initial doc change from the previous patch would be good:
+ machine. Note, that this option isn't passed to the
+ vacuumdb application by default.
Just remove the comma from "Note, that...".


Regards,
Kirk Jamison


RE: pg_upgrade: Pass -j down to vacuumdb

2019-01-27 Thread Jamison, Kirk
Hi Jesper,

On Friday, January 25, 2019, Jesper Pedersen wrote:
> Thanks for your feedback !
>
> As per Peter's comments I have changed the patch (v2) to not pass down the -j 
> option to vacuumdb.
>
> Only an update to the documentation and console output is made in order to 
> make it more clear.

Yeah, I understood from the references that the parallel options for vacuumdb and
pg_upgrade are used differently. I was also seeking clarification on whether
pg_upgrade should pass it down to vacuumdb.
Based on Michael's and Peter's replies, I now understand it.

There were also helpful comments from the developers above, pointing in the right
direction.
Following up on Peter's comment ("...if you want to do this as fast as possible,
don't use this script.  That comment could be enhanced to suggest the use of the
-j option."), I think you should also update the following in
create_script_for_cluster_analyze():

 fprintf(script, "echo %sIf you would like default statistics as quickly as possible, cancel%s\n",
         ECHO_QUOTE, ECHO_QUOTE);

There was also advice from Tom and Alvaro that you can incorporate in your next patch.

>Alvaro Herrera  writes:
>> So let's have it write with a $VACUUMDB_OPTS variable, which is by 
>> default defined as empty but with a comment suggesting that maybe the 
>> user wants to add the -j option.  This way, if they have to edit it, 
>> they only have to edit the VACUUMDB_OPTS line instead of each of the 
>> two vacuumdb lines.
>
Tom Lane wrote:
>Even better, allow the script to absorb a value from the environment, so that 
>it needn't be edited at all.

Regards,
Kirk Jamison


RE: pg_upgrade: Pass -j down to vacuumdb

2019-01-24 Thread Jamison, Kirk
Hi,

According to the CF app, this patch needs review, so I took a look at it.
Currently, your patch applies and builds cleanly, and all tests pass according to
the CF bot patch tester.

I'm not sure if I have understood the argument raised by Peter correctly.
Quoting Peter: "it's not clear that pg_upgrade and vacuumdb are bound the same way,
so it's not a given that the same -j number should be used."
I take it the question is whether the number of jobs for pg_upgrade would be used
the same way for parallel vacuum.

According to the official docs, the recommended setting for pg_upgrade's -j option
is the maximum of the number of CPU cores and tablespaces. [1]
As for vacuumdb, the recommended setting for its parallel option (-j) is based on
your max_connections [2], which is the maximum number of concurrent connections to
your DB server.

[1] https://www.postgresql.org/docs/current/pgupgrade.html
[2] https://www.postgresql.org/docs/current/app-vacuumdb.html


Regards,
Kirk Jamison


RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-01-16 Thread Jamison, Kirk
>On Thu, Nov 15, 2018 at 2:30 PM Alvaro Herrera  
>wrote:
>>
>> On 2018-Nov-15, Laurenz Albe wrote:
>>
> > > This new option would not only mitigate the long shared_buffers 
> > > scan, it would also get rid of the replication conflict caused by 
> > > the AccessExclusiveLock taken during truncation, which is discussed 
> > > in 
> > > https://www.postgresql.org/message-id/c9374921e50a5e8fb1ecf04eb8c6eb
> > > c3%40postgrespro.ru and seems to be a more difficult problem than 
> > > anticipated.
> >
> > FWIW I was just reminded yesterday that the AEL-for-truncation has 
> > been diagnosed to be a severe problem in production, and with no other 
> > solution in sight, I propose to move forward with the stop-gap.

I just want to ask whether we could proceed with this approach for now, and whether
it would be possible to have this solution integrated before PG12 development ends.

Regards,
Kirk Jamison


some minor comment fix in md.c

2019-01-09 Thread Jamison, Kirk
Hi,

Here are a few minor fixes in the md.c comments:
src/backend/storage/smgr/md.c

1. @L174 - removed the unnecessary word "is".
- […] Note that this is breaks mdnblocks() and related functionality [...]
+ […] Note that this breaks mdnblocks() and related functionality [...]

2. @L885 - grammar fix
- We used to pass O_CREAT here, but that's has the disadvantage that it might [...]
+ We used to pass O_CREAT here, but that has the disadvantage that it might [...]

Regards,
Kirk J.


0001-minor-comment-fix-in-md.c.patch
Description: 0001-minor-comment-fix-in-md.c.patch


RE: Cache relation sizes?

2019-01-08 Thread Jamison, Kirk
Hi Thomas,

On Friday, December 28, 2018 6:43 AM Thomas Munro 
 wrote:
> [...]if you have ideas about the validity of the assumptions, the reason it 
> breaks initdb, or any other aspect of this approach (or alternatives), please 
> don't let me stop you, and of course please feel free to submit this, an 
> improved version or an alternative proposal [...]

Sure. Thanks. I'd like to try to work on the idea. I also took a look at the 
code, and I hope you don't mind if I ask for clarifications 
(explanation/advice/opinions) on the following, since my postgres experience is 
not substantial enough yet.

(1) I noticed that you store a "relsize_change_counter" in MdSharedData. Is my
understanding below correct?

The intention is to cache the relation size per backend (lock-free), with an array of
relsize_change_counter values so that the lseek syscall can be skipped when the counter
has not changed.
In _mdnblocks(), if the counter did not change, the cached relation size (nblocks) is
used and the call to FileSize() (an lseek to get and cache the relation size) is
skipped. That means that on current Postgres master, the lseek syscall (through
FileSize()) is issued every time we want to get the relation size (nblocks). A rough
sketch of my reading of this logic follows.
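
Just to confirm my reading, a sketch along these lines (hypothetical field names such
as cached_nblocks; this is not the actual patch and assumes the usual backend headers):

    static BlockNumber
    _mdnblocks_cached(MdfdVec *seg)
    {
        uint32      counter = pg_atomic_read_u32(&MdShared->relsize_change_counter);

        /* if nothing changed since we last looked, reuse the cached size */
        if (seg->cached_nblocks_valid && seg->last_counter == counter)
            return seg->cached_nblocks;     /* no lseek needed */

        /* otherwise pay for the lseek (FileSize) and refresh the cache */
        seg->cached_nblocks = (BlockNumber) (FileSize(seg->mdfd_vfd) / BLCKSZ);
        seg->last_counter = counter;
        seg->cached_nblocks_valid = true;
        return seg->cached_nblocks;
    }

If that is correct, the lseek only happens after some backend has bumped
relsize_change_counter (e.g. on extension or truncation).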

On the other hand, the simplest method I thought could also work is to cache only the
file size (nblocks) in shared memory, not in the backend process, since both nblocks
and relsize_change_counter are of the uint32 data type anyway. If
relsize_change_counter can be changed without a lock, then nblocks can be changed
without a lock too, right? In that case, nblocks can be read directly from shared
memory. Is it then still necessary to cache the relation size in the backend?

(2) Is MdSharedData temporary or permanent in shared memory?
From the patch:
 typedef struct MdSharedData
 {
     /* XXX could have an array of these, and use rel OID % nelements? */
     pg_atomic_uint32    relsize_change_counter;
 } MdSharedData;
 
 static MdSharedData *MdShared;

What I intend to have is a permanent hashtable that keeps the file size (and
eventually, in future development, table addresses) in shared memory for faster access
by backend processes. The idea is to keep track of the unallocated blocks, based on how
much the relation has been extended or truncated. Memory for this hashtable would be
dynamically allocated. A minimal sketch of what I have in mind follows.
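
A minimal sketch of the kind of shared hash table I have in mind (names are
hypothetical; this is not a patch, and the fixed sizes are placeholders for the
dynamic allocation mentioned above):

    typedef struct RelSizeEntry
    {
        RelFileNode rnode;      /* hash key */
        uint32      nblocks;    /* cached relation size in blocks */
    } RelSizeEntry;

    static HTAB *RelSizeHash;

    static void
    RelSizeShmemInit(void)
    {
        HASHCTL     info;

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(RelFileNode);
        info.entrysize = sizeof(RelSizeEntry);

        RelSizeHash = ShmemInitHash("relation size cache",
                                    1024, 1024,     /* placeholder sizes */
                                    &info,
                                    HASH_ELEM | HASH_BLOBS);
    }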

Thanks, 
Kirk Jamison


RE: Cache relation sizes?

2018-12-26 Thread Jamison, Kirk
Hello,

I also find this proposed feature beneficial for performance, especially when we want
to extend or truncate large tables.
As mentioned by David, there is currently a query latency spike when we build a generic
plan for a partitioned table with many partitions.
I tried to apply Thomas' patch for that use case. Aside from measuring the planning and
execution time, I also monitored the lseek calls using a simple strace, with and
without the patch.

Below are the test results.
Setup: 8192 table partitions.
(1) set plan_cache_mode = 'force_generic_plan';

  [Without Patch]
prepare select_stmt(int) as select * from t where id = $1;
explain (timing off, analyze) execute select_stmt(8192);
[…]
Planning Time: 1678.680 ms
Execution Time: 643.495 ms

$ strace -p [pid] -e trace=lseek -c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.017247           1     16385           lseek

  [With Patch]
[…]
Planning Time: 1596.566 ms
Execution Time: 653.646 ms

$ strace -p [pid] -e trace=lseek -c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.009196           1      8192           lseek

It was mentioned in the other thread [1] that creating a generic plan for the first
time is very expensive.
Although this patch did not seem to reduce the planning time for force_generic_plan,
the number of lseek calls was cut in half during the first execution of the generic
plan.


(2) plan_cache_mode = 'auto'
reset plan_cache_mode; -- resets to auto / custom plan

  [Without Patch]
[…]
Planning Time: 768.669 ms
Execution Time: 0.776 ms

$ strace -p [pid] -e trace=lseek -c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.015117           2      8193           lseek

  [With Patch]
[…]
Planning Time: 181.690 ms
Execution Time: 0.615 ms

$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.

Without the patch, there were around 8193 lseek calls.
With the patch applied, there were no lseek calls when creating the custom plan.


(3) set plan_cache_mode = 'force_generic_plan';
-- force it to generic plan again to use the cached plan (no re-planning)

  [Without Patch]
[…]
Planning Time: 14.294 ms
Execution Time: 601.141 ms

$ strace -p [pid] -e trace=lseek -c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
0.000.0001  lseek

  [With Patch]
[…]
Planning Time: 13.976 ms
Execution Time: 570.518 ms

$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.


If I did the test correctly, I am not sure why the patch did not affect the
generic-planning performance for a table with many partitions.
However, the number of lseek calls was greatly reduced with Thomas' patch.
I also did not get a considerable speedup in terms of average latency using
pgbench -S (read-only, unprepared).
I am assuming this might be applicable to other use cases as well.
(I have only tested the patch; I haven't dug into the patch details yet.)

Would you like to submit this to the commitfest to get more reviews for 
possible idea/patch improvement?


[1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com


Regards,
Kirk Jamison


RE: pgbench - doCustom cleanup

2018-10-10 Thread Jamison, Kirk
Hello Fabien,

I have decided to take a look at this patch.
--
patching file src/bin/pgbench/pgbench.c
Hunk #1 succeeded at 296 (offset 29 lines).
[…Snip…]
Hunk #21 succeeded at 5750 (offset 108 lines).
patching file src/include/portability/instr_time.h
….
===
 All 189 tests passed. 
===
Build success

It applies cleanly, passes the tests, and builds successfully.

The simplification and refactoring of the code also seem logical to me.
The code looks clean and the comments are clear.
I agree that it makes sense to move all the state changes from threadRun() to
doCustom() for code consistency.

I noticed in your refactored code that CSTATE_CHOOSE_SCRIPT may now also change to
CSTATE_FINISHED when the timer is exceeded. (Before, it could only change to either
CSTATE_START_THROTTLE or CSTATE_START_TX.)
However, this change is not yet reflected in the typedef enum comment for
CSTATE_CHOOSE_SCRIPT:
[snip]
 /*
  * The client must first choose a script to execute.  Once chosen, it can
  * either be throttled (state CSTATE_START_THROTTLE under --rate) or start
  * right away (state CSTATE_START_TX).
  */
 CSTATE_CHOOSE_SCRIPT,

As for the behavioral change, it is assumed that there will be one script per
transaction run.
After reading the code, I interpreted the modified state changes as below (a toy
illustration follows the list):

1. CSTATE_CHOOSE_SCRIPT may now switch to CSTATE_FINISHED.
Previously, it could only switch to CSTATE_START_THROTTLE or CSTATE_START_TX.

2. CSTATE_START_THROTTLE (similar to #1) may now switch to CSTATE_FINISHED.
Previously, it could only switch to CSTATE_THROTTLE or CSTATE_START_TX.

3. CSTATE_START_TX: once started, a transaction cannot be interrupted while running
and now proceeds to the first command. Interruption is only allowed after a
transaction error or when the transaction run ends.

4. CSTATE_SKIP_COMMAND: no particular change in code logic, but the typedef now
clarifies that this state can move forward over several commands until it reaches the
next command to execute or finishes the script.

5. CSTATE_END_TX: either starts over with a new script (CSTATE_CHOOSE_SCRIPT) or
finishes (CSTATE_FINISHED).
Previously, it either defaulted to CSTATE_START_THROTTLE when choosing a new script,
or finished.
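
To double-check that I read the transitions correctly, here is a toy illustration
(plain C, not pgbench code; only the states mentioned above are modeled):

    #include <stdbool.h>

    typedef enum
    {
        CHOOSE_SCRIPT, START_THROTTLE, THROTTLE, START_TX, END_TX, FINISHED
    } ToyState;

    static ToyState
    toy_next(ToyState s, bool timer_exceeded, bool throttling)
    {
        switch (s)
        {
            case CHOOSE_SCRIPT:
                if (timer_exceeded)
                    return FINISHED;            /* #1: new in the patch */
                return throttling ? START_THROTTLE : START_TX;
            case START_THROTTLE:
                if (timer_exceeded)
                    return FINISHED;            /* #2: new in the patch */
                return throttling ? THROTTLE : START_TX;
            case END_TX:
                /* #5: start over with a new script, or finish */
                return timer_exceeded ? FINISHED : CHOOSE_SCRIPT;
            default:
                return s;                       /* other states unchanged here */
        }
    }

Please correct me if any of these transitions are wrong.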


Regards,
Kirk Jamison




-Original Message-
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr] 
Sent: Saturday, August 11, 2018 6:14 PM
To: PostgreSQL Developers 
Subject: pgbench - doCustom cleanup


Hello pgdev,

This patch rework & clarifies pgbench internal state machine.

It was indirectly triggered by Heikki review of pgbench tap tests 
(https://commitfest.postgresql.org/19/1306/), and by Marina's patch about tx 
retry on some errors (https://commitfest.postgresql.org/19/1645/) which I am 
reviewing.

- it adds more comments to the enum state definitions and to doCustom.

- there is some code cleanup/simplifications:
   . a few useless intermediate variables are removed,
   . a macro is added to avoid a repeated pattern to set the current time,
   . performance data are always collected instead of trying to be clever
 and not collect some data in some cases.

- more fundamentally, all state changes are performed within doCustom,
   prior to that there was one performed by threadRun, which made understanding
   the end of run harder. Now threadRun only looks at the current state
   to make decisions about a client.

- doCustom is made to always return at the end of a script to avoid
   an infinite loop on backslash-command only script, instead of hack
   with a variable to detect loops, which made it return every two
   script runs in such cases.

- there is a small behavioral change:

   prior to the patch, a script would always run to its end once started,
   with the exception of \sleep commands which could be interrupted by
   threadRun.

   Marina's patch should enforce somehow one script = one transaction so
   that a retry makes sense, so this exception is removed to avoid
   aborting a tx implicitly.

--
Fabien.




RE: Recovery performance of standby for multiple concurrent truncates on large tables

2018-08-02 Thread Jamison, Kirk
Hi, 
I appreciate the feedback and suggestions.

On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas  wrote:
>> How would this work if a relfilenode number that belonged to an old 
>> relation got recycled for a new relation?
>> ..
>> I think something like this could be made to work -- both on the 
>> master and the standby, and not just while waiting for a failover --
>> ...
>> It's not clear to me whether it would be worth the overhead of doing 
>> something like this.  Making relation drops faster at the cost of 
>> making buffer cleaning slower could be a loser.

The deferred list holds information such as the relfilenode and the head (first) page
number of the invalid pages. The standby, while promoting, only registers entries in
the deferred list when redoing the COMMIT record. It then scans the shared buffer
cache only once before the promote finishes and discards the pages belonging to the
tables included in the deferred list. After each removal, the head page number of the
invalid range mentioned above is incremented.
In this way, if an INSERT progresses while promoting (I'm not sure whether that is
possible), the discarding progresses too, and the application can still refer to the
relation. (A rough sketch of such an entry is below.)
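
To check my understanding, one deferred-list entry would look roughly like this
(hypothetical names; not a patch):

    /*
     * One deferred invalidation request: buffers of rnode starting at
     * invalid_head are no longer valid; the head advances as the single
     * pre-promote scan discards them.
     */
    typedef struct DeferredTruncEntry
    {
        RelFileNode rnode;          /* registered while redoing the COMMIT record */
        BlockNumber invalid_head;   /* head page number of the invalid range */
    } DeferredTruncEntry;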

As mentioned by Tsunakawa-san above, the purpose of the design is to make the failover
process as short as possible, not really to make the dropping of relations faster. My
initial thought was not to touch the regular buffer invalidation process, but I am
also not sure how much complexity the proposed design would add or how crucial the
requirement (shorter failover) is, so I asked for the community's feedback.


On Tue, July 31, 2018 8:27 AM, Thomas Munro wrote:
> Anyway, it's a lot of complexity, and it falls back to a worst cases 
> like today, and can also transfer work to innocent foreground processes.
> I see why Andres says we should just get a better data structure so we 
> can make the guy doing the dropping pay for it up front, but more 
> efficiently.  I suspect we may also want an ordered data structure for 
> write clustering purposes one day.

I also understand the value of solving the root cause of the problem; that's why I
asked Andres whether we could expect development from the community for V12 regarding
the radix tree approach for buffer management, or even a WIP patch from anyone,
regardless of whether it uses a radix tree or not.
And perhaps we'd like to get involved as well, since this is also our customer's
problem.

So I am just curious about the radix tree approach's design. Maybe we should
start discussing what kind of data structures, processing, etc. are involved?

I also read other design solutions from another thread [1].
a. Fujii-san's proposal
Add $SUBJECT for performance improvement; reloption to prevent vacuum 
from truncating empty pages
b. Pavan's comment
> What if we remember the buffers as seen by count_nondeletable_pages() and
> then just discard those specific buffers instead of scanning the entire
> shared_buffers again? Surely we revisit all to-be-truncated blocks before
> actual truncation. So we already know which buffers to discard. And we're
> holding exclusive lock at that point, so nothing can change underneath. Of
> course, we can't really remember a large number of buffers, so we can do
> this in small chunks. Scan last K blocks, remember those K buffers, discard
> those K buffers, truncate the relation and then try for next K blocks. If
> another backend requests lock on the table, we give up or retry after a
> while.

[1] 
https://www.postgresql.org/message-id/flat/20180419203802.hqrb4o2wexjnb2ft%40alvherre.pgsql#7d4a8c56a01392a3909b2150371e6495

Now, how do we move forward?


Thank you everyone.

Regards,
Kirk Jamison


RE: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-29 Thread Jamison, Kirk
Hi Andres,

> I think this is a case where the potential work arounds are complex enough to 
> use significant resources to get right, and are likely to make properly 
> fixing the issue harder. I'm willing to comment on proposals that claim not 
> to be problmatic in those regards, but I have *SERIOUS* doubts they exist.

Alright. But I'd still like to ask for your thoughts on it (below).
The proposed design touches the buffer invalidation during the recovery process of the
standby server.

The question was about "how" to remember those buffers that contain 
truncate/drop tables, right?

1. Because the multiple scans of the whole shared buffer pool, one per concurrent
TRUNCATE/DROP TABLE, were the cause of the time-consuming behavior, we temporarily
delay the scanning and invalidation of shared buffers DURING the failover process
while WAL is being applied. At the same time, we remember the relations/relfilenodes
(of dropped/truncated tables) by adding them to a hash table called the "skip list".
2. After WAL is applied, the checkpointer (or bgwriter) scans the shared buffer pool
only ONCE, compares the pages against the skip list, and invalidates the relevant
pages. Once the relevant pages have been deleted from shared memory, they will not be
written back to disk. (A minimal sketch of this flow follows.)
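
Here is a minimal sketch of step 2 (hypothetical names, locking omitted; not a patch).
It assumes the "skip list" is a hash table keyed by RelFileNode that WAL replay has
already filled:

    /* invoked once, e.g. by the checkpointer, after WAL replay */
    static void
    InvalidatePendingRelations(HTAB *skip_list)
    {
        int     buf_id;

        for (buf_id = 0; buf_id < NBuffers; buf_id++)
        {
            BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
            bool        found;

            hash_search(skip_list, &bufHdr->tag.rnode, HASH_FIND, &found);

            /*
             * Real code would need the buffer header lock and the usual
             * InvalidateBuffer() handshaking; omitted here for brevity.
             */
            if (found)
                InvalidateBuffer(bufHdr);
        }
    }

The point is that the cost becomes a single pass over shared_buffers, regardless of
how many relations were truncated or dropped.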

Assuming the theory works, this design only affects the behavior of the checkpointer
(or maybe the bgwriter) during the recovery/failover process. Any feedback or
thoughts?

BTW, is there any update on whether the community will move forward anytime soon with
the buffer mapping implementation you mentioned?


Regards,
Kirk Jamison




RE: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-18 Thread Jamison, Kirk
Hi,
Thank you for your replies.

On Tue, July 10, 2018 4:15 PM, Andres Freund wrote:
>I think you'd run into a lot of very hairy details with this approach. 
>Consider what happens if client processes need fresh buffers and need to write 
>out a victim buffer. You'll need to know that the relevant buffer is actually 
>invalid. Thus the knowledge about the "delayed" drops would need to be in 
>shared buffers and scanned on every dirty buffer writeout.

Yes, you're right. There are problems associated with checkpointing, as you pointed
out. I was thinking of touching the processes involved in writing dirty pages to
disk, which is a function of both the checkpointer and the background writer.
> How about using the background writer for this? ...
> ...
And now that I think about it, Ants' suggestion above to use the background writer
seems to work better, as the bgwriter is more active and writes dirty pages
continuously, while the checkpointer only does so at intervals. This seems to be a
more viable solution, and I'd appreciate advice. I'll explore this idea and, if
possible, develop a solution for the next commitfest.

> I personally think we should rather just work towards a ordered buffer 
> mapping implementation.

I understand that the main problem lies in the shared_buffers scanning and that an
ordered buffer mapping implementation is (possibly) underway for PG12. I am not sure
whether the community has reached a consensus on that long-term fix. Would the
community also welcome smaller-scale improvements (just for this particular problem of
WAL recovery for TRUNCATE TABLE)? If yes, then I'd like to work on a possible
solution.

Regards,
Kirk Jamison


Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-10 Thread Jamison, Kirk
Hello hackers,

Recently, the problem on improving performance of multiple drop/truncate tables 
in a single transaction with large shared_buffers (as shown below) was solved 
by commit b416691.
  BEGIN;
  truncate tbl001;
  ...
  truncate tbl050;
  COMMIT;

However, we have a customer that needs to execute multiple concurrent TRUNCATEs
(40~50) on different large tables (as shown below) within a short amount of time.
This case is not covered by the previous commit's improvement.
  BEGIN;
  truncate tbl001;
  COMMIT;
  ...
  BEGIN;
  truncate tbl050;
  COMMIT;

[Problem]
Currently, when the standby replays the WAL of TRUNCATE/DROP TABLE, each relation
leads to a separate scan of the whole shared buffer pool to check whether or not the
table being deleted is cached there. Moreover, if shared_buffers is large (e.g. 300GB)
and the primary server fails during the replay, it takes a long while for the standby
to complete recovery.

[Idea]
In the current implementation, the replay of each TRUNCATE/DROP TABLE scans the whole
shared buffer pool.
One approach (though the idea is not fully developed yet) is to improve recovery by
delaying the shared buffer scan and invalidation (DropRelFileNodeBuffers) until after
the next checkpoint (after failover completion). The replay of TRUNCATE/DROP TABLE
would just make the checkpointer process remember which relations should be
invalidated in the shared buffer pool during the subsequent checkpoint. The
checkpointer would then scan the shared buffer pool only once to invalidate the
buffers of the relations that were dropped or truncated.

However, this is still a rough idea, so I am not sure if it's feasible. I would like
to know if the community has advice or alternative solutions on how to work around
this.
Any insights, advice, feedback?

Thank you in advance.


Regards,
Kirk Jamison


RE: Recovery performance of DROP DATABASE with many tablespaces

2018-07-04 Thread Jamison, Kirk
Hi, Fujii-san

I came across this post and got interested in it, so I tried to apply and test the
patch, but I am not sure if I did it correctly.
I set up master-slave sync, 200GB shared_buffers, 2 max_locks_per_transaction, and
1 DB with 500 table partitions spread evenly across 5 tablespaces.

After dropping the DB, with or without the patch, there was no difference in recovery
performance, so maybe I made a mistake somewhere. But anyway, here are the results.

==WITHOUT PATCH===
[200GB shared buffers]
DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
2018/07/04_13:35:00.161
dropdb
2018/07/04_13:35:05.591 5.591 sec

[200GB shared_buffers]
DROPDB (including DROP TABLE and DROP TABLESPACE)
real    3m19.717s
user    0m0.001s
sys     0m0.001s

==WITH PATCH===
[200GB shared_buffers]
DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
2018/07/04_14:19:47.128
dropdb
2018/07/04_14:19:53.177 6.049 sec

[200GB shared_buffers]
DROPDB (included the DROP TABLE and DROP TABLESPACE commands)
real    3m51.834s
user    0m0.001s
sys     0m0.002s

Just in case, do you also have some performance test numbers/cases that show the
recovery performance improvement when dropping a database that contains multiple
tablespaces?

Regards,
Kirk Jamison