Re: Pluggable Storage - Andres's take

2019-06-06 Thread Alvaro Herrera
On 2019-Jun-06, Heikki Linnakangas wrote:

> Pushed the first patch now. Andres already fixed the second issue in commit
> b8b94ea129.

Please don't omit the "Discussion:" tag in commit messages.
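For reference, that's the trailer at the end of the commit message that
points back at the mailing list thread. A generic illustration (the
message-id below is a placeholder, not a real one):

	Discussion: https://postgr.es/m/some-message-id%40example.org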

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Pluggable Storage - Andres's take

2019-06-06 Thread Heikki Linnakangas

On 18/05/2019 01:19, Ashwin Agrawal wrote:
On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas  wrote:


On 08/04/2019 20:37, Andres Freund wrote:
 > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
 >> There's a little bug in index-only scan executor node, where it mixes
 >> up the slots to hold a tuple from the index, and from the table. That
 >> doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with
 >> my toy AM, which uses a virtual slot, it caused warnings like this
 >> from index-only scans:
 >
 > Hm. That's another one that I think I had fixed previously :(, and then
 > concluded that it's not actually necessary for some reason. Your fix
 > looks correct to me.  Do you want to commit it? Otherwise I'll look at
 > it after rebasing zheap, and checking it with that.

I found another slot type confusion bug, while playing with zedstore. In
an Index Scan, if you have an ORDER BY key that needs to be rechecked,
so that it uses the reorder queue, then it will sometimes use the
reorder queue slot, and sometimes the table AM's slot, for the scan
slot. If they're not of the same type, you get an assertion:

TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
"execExprInterp.c", Line: 1905)

Attached is a test for this, again using the toy table AM, extended to
be able to test this. And a fix.
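As a hedged illustration of how such mismatches are avoided elsewhere in
the executor (not necessarily what either fix here does), the scan slot
can be created through the table AM so its slot ops always match the
tuples the AM produces:

	/*
	 * Ask the table AM which slot implementation it wants, instead of
	 * hardcoding TTSOpsHeapTuple for the scan slot.
	 */
	ExecInitScanTupleSlot(estate, &scanstate->ss,
						  RelationGetDescr(scanstate->ss.ss_currentRelation),
						  table_slot_callbacks(scanstate->ss.ss_currentRelation));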


It seems the two patches from email [1] fixing the slot confusion in Index
Scans are still waiting to be committed.


[1]
https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi


Pushed the first patch now. Andres already fixed the second issue in 
commit b8b94ea129.


- Heikki




Re: Pluggable Storage - Andres's take

2019-05-17 Thread Andres Freund
Hi,

On 2019-05-17 14:49:19 -0700, Ashwin Agrawal wrote:
> On Fri, May 17, 2019 at 12:54 PM Andres Freund  wrote:
> 
> > Hi,
> >
> > On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> > > High level this looks good to me. Will look into full details tomorrow.
> >
> > Ping?
> >
> > I'll push the first of the patches soon, and unless you'll comment on
> > the second soon, I'll also push ahead. There's a beta upcoming...
> >
> 
> Sorry for the delay, didn't get to it yesterday. Looked into both the
> patches. They both look good to me, thank you.

Pushed both now.


> Relation size API still doesn't address the analyze case, as you mentioned,
> but sure, it's something we can improve on later.

I'm much less concerned about that. You can just return a reasonable
block size from the size callback, and it'll work for block sampling
(and you can just skip pages in the analyze callback if needed, e.g. for
zheap's tpd pages). And we assume that a reasonable block size is
returned by the size callback anyway, for planning purposes (both in
relpages and for estimate_rel_size).  We'll probably want to improve
that some day, but it doesn't strike me as hugely urgent.
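To make the "skip pages in the analyze callback" idea concrete, here is a
minimal sketch of such a callback; only its signature follows the tableam
interface, while the AM name, scan struct, and metadata test are made up:

	static bool
	myam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
								 BufferAccessStrategy bstrategy)
	{
		MyScanDesc	mscan = (MyScanDesc) scan;	/* hypothetical scan struct */

		/*
		 * Pretend this AM can tell metadata-only blocks (like zheap's TPD
		 * pages) apart from data blocks; returning false means the block
		 * contributes no sample rows.
		 */
		if (myam_block_is_metadata(mscan->rs_base.rs_rd, blockno))
			return false;

		mscan->analyze_blockno = blockno;	/* sample tuples from here */
		return true;
	}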

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-05-17 Thread Andres Freund
Hi,

On 2019-05-17 16:56:04 -0700, Ashwin Agrawal wrote:
> Question on the patch, if not too late
> Why call table_beginscan() in TidNext() and not in ExecInitTidScan() ?
> Seems cleaner to have it in ExecInitTidScan().

Largely because it's symmetrical to where most other scans are started
(cf. nodeSeqscan.c, nodeIndexscan.c). But also, there's no need to incur
the cost of a smgrnblocks() etc when the node might never actually be
reached during execution.
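For reference, the lazy pattern that implies in TidNext() looks roughly
like this (using the table_beginscan_tid() wrapper this patch introduces):

	/*
	 * Start the scan on first execution rather than in ExecInitTidScan(),
	 * so a node that is never reached pays no smgrnblocks() cost.
	 */
	if (node->ss.ss_currentScanDesc == NULL)
		node->ss.ss_currentScanDesc =
			table_beginscan_tid(node->ss.ss_currentRelation,
								node->ss.ps.state->es_snapshot);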

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-05-17 Thread Ashwin Agrawal
On Wed, May 15, 2019 at 11:54 AM Andres Freund  wrote:

> Attached is a prototype of a variation of this. I added a
> table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
> callback / wrapper. Currently it just takes a "plain" scan, but we could
> add a separate table_beginscan variant too.
>
> For heap that just means we can use HeapScanDesc's rs_nblocks to
> filter out invalid tids, and we only need to call
> RelationGetNumberOfBlocks() once, rather than on every
> table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
> improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
> CURRENT OF) - which previously computed the relation size once per
> tuple.
>

Question on the patch, if not too late
Why call table_beginscan() in TidNext() and not in ExecInitTidScan() ?
Seems cleaner to have it in ExecInitTidScan().


Re: Pluggable Storage - Andres's take

2019-05-17 Thread Ashwin Agrawal
On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas  wrote:

> On 08/04/2019 20:37, Andres Freund wrote:
> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> There's a little bug in index-only scan executor node, where it mixes
> >> up the slots to hold a tuple from the index, and from the table. That
> >> doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with
> >> my toy AM, which uses a virtual slot, it caused warnings like this
> >> from index-only scans:
> >
> > Hm. That's another one that I think I had fixed previously :(, and then
> > concluded that it's not actually necessary for some reason. Your fix
> > looks correct to me.  Do you want to commit it? Otherwise I'll look at
> > it after rebasing zheap, and checking it with that.
>
> I found another slot type confusion bug, while playing with zedstore. In
> an Index Scan, if you have an ORDER BY key that needs to be rechecked,
> so that it uses the reorder queue, then it will sometimes use the
> reorder queue slot, and sometimes the table AM's slot, for the scan
> slot. If they're not of the same type, you get an assertion:
>
> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> "execExprInterp.c", Line: 1905)
>
> Attached is a test for this, again using the toy table AM, extended to
> be able to test this. And a fix.
>

It seems the two patches from email [1] fixing the slot confusion in Index
Scans are still waiting to be committed.

[1]
https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi


Re: Pluggable Storage - Andres's take

2019-05-17 Thread Ashwin Agrawal
On Fri, May 17, 2019 at 12:54 PM Andres Freund  wrote:

> Hi,
>
> On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> > High level this looks good to me. Will look into full details tomorrow.
>
> Ping?
>
> I'll push the first of the patches soon, and unless you'll comment on
> the second soon, I'll also push ahead. There's a beta upcoming...
>

Sorry for the delay, didn't get to it yesterday. Looked into both the
patches. They both look good to me, thank you.

Relation size API still doesn't address the analyze case, as you mentioned,
but sure, it's something we can improve on later.


Re: Pluggable Storage - Andres's take

2019-05-17 Thread Andres Freund
Hi,

On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> High level this looks good to me. Will look into full details tomorrow.

Ping?

I'll push the first of the patches soon, and unless you'll comment on
the second soon, I'll also push ahead. There's a beta upcoming...

- Andres




Re: Pluggable Storage - Andres's take

2019-05-16 Thread Ashwin Agrawal
On Wed, May 15, 2019 at 11:54 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
>
> > > 3) nodeTidscan, skipping over too large tids
> > >I think this should just be moved into the AMs, there's no need to
> > >have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> >   /*
> >    * We silently discard any TIDs that are out of range at the time of
> >    * scan start.  (Since we hold at least AccessShareLock on the table,
> >    * it won't be possible for someone to truncate away the blocks we
> >    * intend to visit.)
> >    */
> >   nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
> >
> >
> > The obvious answer would be to just move that check into the
> > table_fetch_row_version implementation (currently just calling
> > heap_fetch()) - but that doesn't seem OK from a performance POV, because
> > we'd then determine the relation size once for each tid, rather than
> > once per tidscan.  And it'd also check in cases where we know the tid is
> > supposed to be valid (e.g. fetching trigger tuples and such).
> >
> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter.  But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
>
> Attached is a prototype of a variation of this. I added a
> table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
> callback / wrapper. Currently it just takes a "plain" scan, but we could
> add a separate table_beginscan variant too.
>
> For heap that just means we can use HeapScanDesc's rs_nblocks to
> filter out invalid tids, and we only need to call
> RelationGetNumberOfBlocks() once, rather than on every
> table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
> improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
> CURRENT OF) - which previously computed the relation size once per
> tuple.
>
> Needs a bit of polishing, but I think this is the right direction?
>

High level this looks good to me. Will look into full details tomorrow. This
aligns with the high-level thought I had, but implemented in a much better
way: consult the AM on whether to perform the optimization or not. So, with
the new table_tuple_tid_valid() callback, an AM can either implement some
way to validate a TID and optimize the scan, or, if it has no way to check
based on the scan descriptor, decide to always return true and let
table_fetch_row_version() handle things.
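To spell out the two options in code, here is a hedged sketch: a
block-based variant along the lines of heap's, and a hypothetical
logical-TID AM that defers all real checking to table_fetch_row_version():

	/*
	 * Block-based AM: bound-check the tid against the relation size
	 * snapshotted at scan start (cf. HeapScanDesc's rs_nblocks).
	 */
	static bool
	blockam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid)
	{
		HeapScanDesc hscan = (HeapScanDesc) scan;

		return ItemPointerIsValid(tid) &&
			ItemPointerGetBlockNumber(tid) < hscan->rs_nblocks;
	}

	/*
	 * AM with logical tids and no cheap scan-level check: accept
	 * everything and let the fetch path report missing tuples.
	 */
	static bool
	logicalam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid)
	{
		return ItemPointerIsValid(tid);
	}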


Re: Pluggable Storage - Andres's take

2019-05-15 Thread Andres Freund
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> though. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).

> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.
> 
> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size.  I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

I'm still reasonably happy with this.  I'll polish it a bit and push.
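For illustration, the conversion this implies on the caller side is
roughly the following (a sketch; the committed wording may differ):

	/*
	 * In RelationGetNumberOfBlocksForFork(), for relation kinds with a
	 * table AM: ask the AM for its size in bytes and round up to pages.
	 */
	uint64		szbytes = table_relation_size(relation, forkNum);

	return (szbytes + (BLCKSZ - 1)) / BLCKSZ;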


> > 3) nodeTidscan, skipping over too large tids
> >I think this should just be moved into the AMs, there's no need to
> >have this in nodeTidscan.c
> 
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
> 
>   /*
>    * We silently discard any TIDs that are out of range at the time of
>    * scan start.  (Since we hold at least AccessShareLock on the table,
>    * it won't be possible for someone to truncate away the blocks we
>    * intend to visit.)
>    */
>   nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> 
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
> 
> 
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
> 
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that just means we can use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?
Unless somebody protests, I'm going to push something along those lines
quite soon.

Greetings,

Andres Freund
>From 5c8edd1803619120c9ac856e7353943281f2d407 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 14 May 2019 23:50:41 -0700
Subject: [PATCH v2 1/2] tableam: Don't assume that an AM uses md.c style
 storage.

---
 src/backend/access/heap/heapam_handler.c | 27 +++
 src/backend/access/table/tableamapi.c|  3 ++
 src/backend/storage/buffer/bufmgr.c  | 43 ++--
 src/include/access/tableam.h | 40 ++
 4 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 00505ec3f4d..96211c673e1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1975,6 +1975,31 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
 }
 
 
+/* ------------------------------------------------------------------------
+ * Miscellaneous callbacks for the heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static uint64
+heapam_relation_size(Relation rel, ForkNumber forkNumber)
+{
+	uint64		nblocks = 0;
+
+	/* Open it at the smgr level if not already done */
+	RelationOpenSmgr(rel);
+
+	/* InvalidForkNumber indicates the size of all forks */
+	if (forkNumber == InvalidForkNumber)
+	{
+		for (int i = 0; i < MAX_FORKNUM; i++)
+			nblocks += smgrnblocks(rel->rd_smgr, i);
+	}
+	else
+	

Re: Pluggable Storage - Andres's take

2019-05-09 Thread Ashwin Agrawal
On Wed, May 8, 2019 at 2:46 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote:
> > On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal  wrote:
> > > Also wish to point out, while working on Zedstore, we realized that
> > > TupleDesc from Relation object can be trusted at AM layer for
> > > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> > > catalog is updated first and hence the relation object passed to AM
> > > layer reflects new TupleDesc. For heapam it's fine as it doesn't use
> > > the TupleDesc today during scans in AM layer for scan_getnextslot().
> > > The only TupleDesc which can be trusted and matches the on-disk layout
> > > of the tuple for scans hence is from TupleTableSlot. Which is a little
> > > unfortunate, as TupleTableSlot is only available in scan_getnextslot(),
> > > and not in scan_begin(). Means if an AM wishes to do some initialization
> > > based on TupleDesc for scans, it can't be done in scan_begin() and is
> > > forced to delay till it has access to TupleTableSlot. We should at least
> > > add a comment for scan_begin() to strongly clarify not to trust the
> > > Relation object's TupleDesc. Or maybe the other alternative would be to
> > > have a separate API for the rewrite case.
> >
> > Just to correct my typo, I wish to say, TupleDesc from Relation object can't
> > be trusted at AM layer for scan_begin() API.
> >
> > Andres, any thoughts on the above? I see you had proposed "change the
> > table_beginscan* API so it provides a slot" in [1], but it seems it
> > received no response/comments at that time.
> > [1]
> > https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de
>
> I don't think passing a slot at beginscan time is a good idea. There's
> several places that want to use different slots for the same scan, and
> we probably want to increase that over time (e.g. for batching), not
> decrease it.
>
> What kind of initialization do you want to do based on the tuple desc at
> beginscan time?

For Zedstore (a column store) we need to allocate a map (array or bitmask)
to mark which columns to project for the scan. We also need to allocate
AM-internal scan descriptors corresponding to the number of attributes in
the scan. Hence, we need access to the number of attributes involved in
the scan. Currently, since we are not able to trust the Relation's
TupleDesc, for Zedstore we worked around this by allocating these things
on the first call to getnextslot, when we have access to the slot (by
switching to the memory context used during scan_begin()).
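A hedged sketch of that workaround; every zedstore-side name below is
made up for illustration, only the callback shape follows tableam:

	typedef struct ZedstoreScanDescData	/* hypothetical */
	{
		TableScanDescData rs_base;
		bool		initialized;
		MemoryContext scan_cxt;		/* context remembered in scan_begin() */
		bool	   *proj;			/* per-column projection map */
	} ZedstoreScanDescData;

	static bool
	zedstoream_getnextslot(TableScanDesc sscan, ScanDirection direction,
						   TupleTableSlot *slot)
	{
		ZedstoreScanDescData *scan = (ZedstoreScanDescData *) sscan;

		if (!scan->initialized)
		{
			/*
			 * The slot's descriptor matches the on-disk layout, unlike
			 * RelationGetDescr(sscan->rs_rd) during an ALTER TABLE rewrite.
			 */
			MemoryContext oldcxt = MemoryContextSwitchTo(scan->scan_cxt);
			TupleDesc	tupdesc = slot->tts_tupleDescriptor;

			scan->proj = palloc0(tupdesc->natts * sizeof(bool));
			scan->initialized = true;
			MemoryContextSwitchTo(oldcxt);
		}

		/* ... actual per-column fetching elided ... */
		return false;
	}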




Re: Pluggable Storage - Andres's take

2019-05-08 Thread Andres Freund
Hi,

On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote:
> On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal  wrote:
> > Also wish to point out, while working on Zedstore, we realized that
> > TupleDesc from Relation object can be trusted at AM layer for
> > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> > catalog is updated first and hence the relation object passed to AM
> > layer reflects new TupleDesc. For heapam it's fine as it doesn't use
> > the TupleDesc today during scans in AM layer for scan_getnextslot().
> > The only TupleDesc which can be trusted and matches the on-disk layout
> > of the tuple for scans hence is from TupleTableSlot. Which is a little
> > unfortunate, as TupleTableSlot is only available in scan_getnextslot(),
> > and not in scan_begin(). Means if an AM wishes to do some initialization
> > based on TupleDesc for scans, it can't be done in scan_begin() and is
> > forced to delay till it has access to TupleTableSlot. We should at least
> > add a comment for scan_begin() to strongly clarify not to trust the
> > Relation object's TupleDesc. Or maybe the other alternative would be to
> > have a separate API for the rewrite case.
> 
> Just to correct my typo, I wish to say, TupleDesc from Relation object can't
> be trusted at AM layer for scan_begin() API.
> 
> Andres, any thoughts on the above? I see you had proposed "change the
> table_beginscan* API so it provides a slot" in [1], but it seems it
> received no response/comments at that time.
> [1]
> https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de

I don't think passing a slot at beginscan time is a good idea. There's
several places that want to use different slots for the same scan, and
we probably want to increase that over time (e.g. for batching), not
decrease it.

What kind of initialization do you want to do based on the tuple desc at
beginscan time?

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-05-08 Thread Andres Freund
Hi,

On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote:
> On Thu, Apr 25, 2019 at 3:43 PM Andres Freund  wrote:
> > Hm. I think some of those changes would be a bit bigger than I initially
> > though. Attached is a more minimal fix that'd route
> > RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> > think it's definitely the right answer for 1), probably the pragmatic
> > answer to 2), but certainly not for 3).
> >
> > I've for now made the AM return the size in bytes, and then convert that
> > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> > are going to continue to want it internally as pages (otherwise there's
> > going to be way too much churn, without a benefit I can see). So I
> > think that's OK.
> 
> I will provide my inputs, Heikki please correct me or add your inputs.
>
> I am not sure how much gain this practically provides, if the rest of the
> system continues to use the value returned in terms of blocks. I
> understand that things being block based (and not just block based, but
> all the blocks of a relation storing data and full tuples) is ingrained
> in the system. So, yes, breaking out of it is a much larger change and
> not just limited to the table AM API.

I don't think it's that ingrained in all that many parts of the
system. Outside of the places I listed upthread, and the one index case
that stashes extra info, which places are that "block based"?


> I feel most of the issues discussed here should be faced by zheap as
> well, as not all blocks/pages contain data; e.g. TPD pages should be
> excluded from sampling and TID scans, etc...

It's not a problem so far, and zheap works on tableam. You can just skip
such blocks during sampling / analyze, and return nothing for tidscans.


> > > 2) commands/analyze.c, computing pg_class.relpages
> > >
> > >This should imo be moved to the tableam callback. It's currently done
> > >a bit weirdly imo, with fdws computing relpages in the callback, but
> > >then also returning the acquirefunc. Seems like it should entirely be
> > >computed as part of calling acquirefunc.
> >
> > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> > tableam wouldn't be the right minimal approach too. It has the
> > disadvantage of implying certain values for the
> > RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
> > would be to return the desired sampling range in
> > table_beginscan_analyze() - but that'd require some duplication because
> > currently that just uses the generic scan_begin() callback.
> 
> Yes, just routing the relation size via the AM layer, still using its
> returned value in terms of blocks, and performing sampling based on it,
> doesn't feel like it resolves the issue. Maybe we need to delegate
> sampling completely to the AM layer. Code duplication can be avoided by
> similar AMs (heap and zheap) possibly using some common utility
> functions to achieve the intended result.

I don't know what this is actually proposing.



> > I suspect - as previously mentioned- that we're going to have to extend
> > statistics collection beyond the current approach at some point, but I
> > don't think that's now. At least to me it's not clear how to best
> > represent the stats, and how to best use them, if the underlying storage
> > is fundamentally not block based.  Nor how we'd avoid code duplication...
> 
> Yes, will have to give more thoughts into this.
> 
> >
> > > 3) nodeTidscan, skipping over too large tids
> > >I think this should just be moved into the AMs, there's no need to
> > >have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> > /*
> >  * We silently discard any TIDs that are out of range at the time of
> >  * scan start.  (Since we hold at least AccessShareLock on the table,
> >  * it won't be possible for someone to truncate away the blocks we
> >  * intend to visit.)
> >  */
> > nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
> 
> Agree, it's not nice to have that optimization performed based on the
> number of blocks in the generic layer. I feel it's not efficient for
> zheap either, due to TPD pages as mentioned above, as the number of
> blocks returned will be higher compared to the actual data blocks.

I don't think there's a problem for zheap. The blocks are just
interspersed.

Having pondered this a lot more, I think this is the way to go for
12. Then we can improve this for v13, to be nice.


> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a 

Re: Pluggable Storage - Andres's take

2019-05-08 Thread Ashwin Agrawal
On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal  wrote:
>
> Also wish to point out, while working on Zedstore, we realized that
> TupleDesc from Relation object can be trusted at AM layer for
> scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> catalog is updated first and hence the relation object passed to AM
> layer reflects new TupleDesc. For heapam it's fine as it doesn't use
> the TupleDesc today during scans in AM layer for scan_getnextslot().
> The only TupleDesc which can be trusted and matches the on-disk layout
> of the tuple for scans hence is from TupleTableSlot. Which is a little
> unfortunate, as TupleTableSlot is only available in scan_getnextslot(),
> and not in scan_begin(). Means if an AM wishes to do some initialization
> based on TupleDesc for scans, it can't be done in scan_begin() and is
> forced to delay till it has access to TupleTableSlot. We should at least
> add a comment for scan_begin() to strongly clarify not to trust the
> Relation object's TupleDesc. Or maybe the other alternative would be to
> have a separate API for the rewrite case.

Just to correct my typo, I wish to say, TupleDesc from Relation object can't
be trusted at AM layer for scan_begin() API.

Andres, any thoughts on the above? I see you had proposed "change the
table_beginscan* API so it provides a slot" in [1], but it seems it
received no response/comments at that time.

[1]
https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de


Re: Pluggable Storage - Andres's take

2019-05-07 Thread Rafia Sabih
On Mon, 6 May 2019 at 22:39, Ashwin Agrawal  wrote:
>
> On Mon, May 6, 2019 at 7:14 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On May 6, 2019 3:40:55 AM PDT, Rafia Sabih  wrote:
> > >I was trying the toyam patch and on make check it failed with
> > >segmentation fault at
> > >
> > >static void
> > >toyam_relation_set_new_filenode(Relation rel,
> > > char persistence,
> > > TransactionId *freezeXid,
> > > MultiXactId *minmulti)
> > >{
> > > *freezeXid = InvalidTransactionId;
> > >
> > >Basically, on running create table t (i int, j int) using toytable,
> > >leads to this segmentation fault.
> > >
> > >Am I missing something here?
> >
> > I assume you got compiler warnings compiling it? The API for some
> > callbacks changed a bit.
>
> Attached patch gets the toy table AM implementation to match the latest
> master API. The patch builds on top of the patch from Heikki in [1].
> It compiles and works, but the test still continues to fail with the
> WARNING for the issue mentioned in [1].
>
Thanks Ashwin, this works fine with the mentioned warnings of course.

-- 
Regards,
Rafia Sabih




Re: Pluggable Storage - Andres's take

2019-05-07 Thread Rafia Sabih
On Mon, 6 May 2019 at 16:14, Andres Freund  wrote:
>
> Hi,
>
> On May 6, 2019 3:40:55 AM PDT, Rafia Sabih  wrote:
> >On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas  wrote:
> >>
> >> On 08/04/2019 20:37, Andres Freund wrote:
> >> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> >> There's a little bug in index-only scan executor node, where it
> >> >> mixes up the slots to hold a tuple from the index, and from the
> >> >> table. That doesn't cause any ill effects if the AM uses
> >> >> TTSOpsHeapTuple, but with my toy AM, which uses a virtual slot, it
> >> >> caused warnings like this from index-only scans:
> >> >
> >> > Hm. That's another one that I think I had fixed previously :(, and
> >> > then concluded that it's not actually necessary for some reason.
> >> > Your fix looks correct to me.  Do you want to commit it? Otherwise
> >> > I'll look at it after rebasing zheap, and checking it with that.
> >>
> >> I found another slot type confusion bug, while playing with zedstore.
> >> In an Index Scan, if you have an ORDER BY key that needs to be
> >> rechecked, so that it uses the reorder queue, then it will sometimes
> >> use the reorder queue slot, and sometimes the table AM's slot, for the
> >> scan slot. If they're not of the same type, you get an assertion:
> >>
> >> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> >> "execExprInterp.c", Line: 1905)
> >>
> >> Attached is a test for this, again using the toy table AM, extended
> >> to be able to test this. And a fix.
> >>
> >> >> Attached is a patch with the toy implementation I used to test
> >> >> this. I'm not suggesting we should commit that - although feel
> >> >> free to do that if you think it's useful - but it shows how I
> >> >> bumped into these issues.
> >> >
> >> > Hm, probably not a bad idea to include something like it. It seems
> >> > like we kinda would need non-stub implementation of more functions
> >> > for it to test much / and to serve as an example.  I'm mildly
> >> > inclined to just do it via zheap / externally, but I'm not quite
> >> > sure that's good enough.
> >>
> >> Works for me.
> >>
> >> >> +static Size
> >> >> +toyam_parallelscan_estimate(Relation rel)
> >> >> +{
> >> >> +	ereport(ERROR,
> >> >> +			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> >> +			 errmsg("function %s not implemented yet",
> >> >> +					__func__)));
> >> >> +}
> >> >
> >> > The other stubbed functions seem like we should require them, but I
> >> > wonder if we should make the parallel stuff optional?
> >>
> >> Yeah, that would be good. I would assume it to be optional.
> >>
> >I was trying the toyam patch and on make check it failed with
> >segmentation fault at
> >
> >static void
> >toyam_relation_set_new_filenode(Relation rel,
> > char persistence,
> > TransactionId *freezeXid,
> > MultiXactId *minmulti)
> >{
> > *freezeXid = InvalidTransactionId;
> >
> >Basically, on running create table t (i int, j int) using toytable,
> >leads to this segmentation fault.
> >
> >Am I missing something here?
>
> I assume you got compiler warnings compiling it? The API for some
> callbacks changed a bit.
>
Oh yeah it does.


-- 
Regards,
Rafia Sabih




Re: Pluggable Storage - Andres's take

2019-05-06 Thread Ashwin Agrawal
On Mon, May 6, 2019 at 7:14 AM Andres Freund  wrote:
>
> Hi,
>
> On May 6, 2019 3:40:55 AM PDT, Rafia Sabih  wrote:
> >I was trying the toyam patch and on make check it failed with
> >segmentation fault at
> >
> >static void
> >toyam_relation_set_new_filenode(Relation rel,
> > char persistence,
> > TransactionId *freezeXid,
> > MultiXactId *minmulti)
> >{
> > *freezeXid = InvalidTransactionId;
> >
> >Basically, on running create table t (i int, j int) using toytable,
> >leads to this segmentation fault.
> >
> >Am I missing something here?
>
I assume you got compiler warnings compiling it? The API for some
callbacks changed a bit.

Attached patch gets the toy table AM implementation to match the latest
master API. The patch builds on top of the patch from Heikki in [1].
It compiles and works, but the test still continues to fail with the
WARNING for the issue mentioned in [1].


Noticed a typo in the recently added comment for relation_set_new_filenode().

 * Note that only the subset of the relcache filled by
 * RelationBuildLocalRelation() can be relied upon and that the relation's
 * catalog entries either will either not yet exist (new relation), or
 * will still reference the old relfilenode.

seems should be

 * Note that only the subset of the relcache filled by
 * RelationBuildLocalRelation() can be relied upon and that the relation's
 * catalog entries will either not yet exist (new relation), or still
 * reference the old relfilenode.

Also wish to point out, while working on Zedstore, we realized that
TupleDesc from Relation object can be trusted at AM layer for
scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
catalog is updated first and hence the relation object passed to AM
layer reflects new TupleDesc. For heapam it's fine as it doesn't use
the TupleDesc today during scans in AM layer for scan_getnextslot().
The only TupleDesc which can be trusted and matches the on-disk layout
of the tuple for scans hence is from TupleTableSlot. Which is a little
unfortunate, as TupleTableSlot is only available in scan_getnextslot(),
and not in scan_begin(). Means if an AM wishes to do some initialization
based on TupleDesc for scans, it can't be done in scan_begin() and is
forced to delay till it has access to TupleTableSlot. We should at least
add a comment for scan_begin() to strongly clarify not to trust the
Relation object's TupleDesc. Or maybe the other alternative would be to
have a separate API for the rewrite case.


[1]
https://www.postgresql.org/message-id/9a7fb9cc-2419-5db7-8840-ddc10c93f122%40iki.fi
diff --git a/src/test/modules/toytable/toytableam.c b/src/test/modules/toytable/toytableam.c
index 4cb2b5d75db..9f1ab534822 100644
--- a/src/test/modules/toytable/toytableam.c
+++ b/src/test/modules/toytable/toytableam.c
@@ -398,6 +398,7 @@ toyam_finish_bulk_insert(Relation rel, int options)
 
 static void
 toyam_relation_set_new_filenode(Relation rel,
+const RelFileNode *newrnode,
 char persistence,
 TransactionId *freezeXid,
 MultiXactId *minmulti)
@@ -410,7 +411,7 @@ toyam_relation_set_new_filenode(Relation rel,
 	 * RelationGetNumberOfBlocks, from index_update_stats(), and that
 	 * fails if the underlying file doesn't exist.
 	 */
-	RelationCreateStorage(rel->rd_node, persistence);
+	RelationCreateStorage(*newrnode, persistence);
 }
 
 static void
@@ -422,7 +423,7 @@ toyam_relation_nontransactional_truncate(Relation rel)
 }
 
 static void
-toyam_relation_copy_data(Relation rel, RelFileNode newrnode)
+toyam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 {
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -435,8 +436,8 @@ toyam_relation_copy_for_cluster(Relation NewHeap,
 Relation OldIndex,
 bool use_sort,
 TransactionId OldestXmin,
-TransactionId FreezeXid,
-MultiXactId MultiXactCutoff,
+TransactionId *xid_cutoff,
+MultiXactId *multi_cutoff,
 double *num_tuples,
 double *tups_vacuumed,
 double *tups_recently_dead)


Re: Pluggable Storage - Andres's take

2019-05-06 Thread Andres Freund
Hi,

On May 6, 2019 3:40:55 AM PDT, Rafia Sabih  wrote:
>On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas  wrote:
>>
>> On 08/04/2019 20:37, Andres Freund wrote:
>> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>> >> There's a little bug in index-only scan executor node, where it
>> >> mixes up the slots to hold a tuple from the index, and from the
>> >> table. That doesn't cause any ill effects if the AM uses
>> >> TTSOpsHeapTuple, but with my toy AM, which uses a virtual slot, it
>> >> caused warnings like this from index-only scans:
>> >
>> > Hm. That's another one that I think I had fixed previously :(, and
>> > then concluded that it's not actually necessary for some reason.
>> > Your fix looks correct to me.  Do you want to commit it? Otherwise
>> > I'll look at it after rebasing zheap, and checking it with that.
>>
>> I found another slot type confusion bug, while playing with zedstore.
>> In an Index Scan, if you have an ORDER BY key that needs to be
>> rechecked, so that it uses the reorder queue, then it will sometimes
>> use the reorder queue slot, and sometimes the table AM's slot, for the
>> scan slot. If they're not of the same type, you get an assertion:
>>
>> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
>> "execExprInterp.c", Line: 1905)
>>
>> Attached is a test for this, again using the toy table AM, extended
>> to be able to test this. And a fix.
>>
>> >> Attached is a patch with the toy implementation I used to test
>> >> this. I'm not suggesting we should commit that - although feel
>> >> free to do that if you think it's useful - but it shows how I
>> >> bumped into these issues.
>> >
>> > Hm, probably not a bad idea to include something like it. It seems
>> > like we kinda would need non-stub implementation of more functions
>> > for it to test much / and to serve as an example.  I'm mildly
>> > inclined to just do it via zheap / externally, but I'm not quite
>> > sure that's good enough.
>>
>> Works for me.
>>
>> >> +static Size
>> >> +toyam_parallelscan_estimate(Relation rel)
>> >> +{
>> >> +	ereport(ERROR,
>> >> +			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> >> +			 errmsg("function %s not implemented yet",
>> >> +					__func__)));
>> >> +}
>> >
>> > The other stubbed functions seem like we should require them, but I
>> > wonder if we should make the parallel stuff optional?
>>
>> Yeah, that would be good. I would assume it to be optional.
>>
>I was trying the toyam patch and on make check it failed with
>segmentation fault at
>
>static void
>toyam_relation_set_new_filenode(Relation rel,
> char persistence,
> TransactionId *freezeXid,
> MultiXactId *minmulti)
>{
> *freezeXid = InvalidTransactionId;
>
>Basically, on running create table t (i int, j int) using toytable,
>leads to this segmentation fault.
>
>Am I missing something here?

I assume you got compiler warnings compiling it? The API for some
callbacks changed a bit.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Pluggable Storage - Andres's take

2019-05-06 Thread Rafia Sabih
On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas  wrote:
>
> On 08/04/2019 20:37, Andres Freund wrote:
> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> There's a little bug in index-only scan executor node, where it mixes up 
> >> the
> >> slots to hold a tuple from the index, and from the table. That doesn't 
> >> cause
> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
> >> uses a virtual slot, it caused warnings like this from index-only scans:
> >
> > Hm. That's another one that I think I had fixed previously :(, and then
> > concluded that it's not actually necessary for some reason. Your fix
> > looks correct to me.  Do you want to commit it? Otherwise I'll look at
> > it after rebasing zheap, and checking it with that.
>
> I found another slot type confusion bug, while playing with zedstore. In
> an Index Scan, if you have an ORDER BY key that needs to be rechecked,
> so that it uses the reorder queue, then it will sometimes use the
> reorder queue slot, and sometimes the table AM's slot, for the scan
> slot. If they're not of the same type, you get an assertion:
>
> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> "execExprInterp.c", Line: 1905)
>
> Attached is a test for this, again using the toy table AM, extended to
> be able to test this. And a fix.
>
> >> Attached is a patch with the toy implementation I used to test this. I'm 
> >> not
> >> suggesting we should commit that - although feel free to do that if you
> >> think it's useful - but it shows how I bumped into these issues.
> >
> > Hm, probably not a bad idea to include something like it. It seems like
> > we kinda would need non-stub implementation of more functions for it to
> > test much / and to serve as an example.  I'm mildly inclined to just do
> > it via zheap / externally, but I'm not quite sure that's good enough.
>
> Works for me.
>
> >> +static Size
> >> +toyam_parallelscan_estimate(Relation rel)
> >> +{
> >> +ereport(ERROR,
> >> +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> + errmsg("function %s not implemented yet", 
> >> __func__)));
> >> +}
> >
> > The other stubbed functions seem like we should require them, but I
> > wonder if we should make the parallel stuff optional?
>
> Yeah, that would be good. I would assume it to be optional.
>
I was trying the toyam patch and on make check it failed with
segmentation fault at

static void
toyam_relation_set_new_filenode(Relation rel,
 char persistence,
 TransactionId *freezeXid,
 MultiXactId *minmulti)
{
 *freezeXid = InvalidTransactionId;

Basically, on running create table t (i int, j int) using toytable,
leads to this segmentation fault.

Am I missing something here?


-- 
Regards,
Rafia Sabih




Re: Pluggable Storage - Andres's take

2019-04-29 Thread Ashwin Agrawal
On Thu, Apr 25, 2019 at 3:43 PM Andres Freund  wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> though. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
>
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.

I will provide my inputs, Heikki please correct me or add your inputs.

I am not sure how much gain this practically provides, if the rest of the
system continues to use the value returned in terms of blocks. I
understand that things being block based (and not just block based, but
all the blocks of a relation storing data and full tuples) is ingrained
in the system. So, yes, breaking out of it is a much larger change and
not just limited to the table AM API.

I feel most of the issues discussed here should be faced by zheap as
well, as not all blocks/pages contain data; e.g. TPD pages should be
excluded from sampling and TID scans, etc...

> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size.  I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

Yes, I do think we should have a mechanism to get the total size and
also just the size for a specific purpose. Zedstore currently doesn't
use forks. Just a thought: instead of calling the argument forknum, call
it something like data and meta-data, or main-data and auxiliary-data
size. Though I don't know if a usage exists where one wishes to get the
size of just some non-MAIN fork for heap/zheap; those pieces of code
shouldn't be in generic areas but in AM-specific code only.

>
> > 2) commands/analyze.c, computing pg_class.relpages
> >
> >This should imo be moved to the tableam callback. It's currently done
> >a bit weirdly imo, with fdws computing relpages in the callback, but
> >then also returning the acquirefunc. Seems like it should entirely be
> >computed as part of calling acquirefunc.
>
> Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> tableam wouldn't be the right minimal approach too. It has the
> disadvantage of implying certain values for the
> RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
> would be to return the desired sampling range in
> table_beginscan_analyze() - but that'd require some duplication because
> currently that just uses the generic scan_begin() callback.

Yes, just routing the relation size via the AM layer, still using its
returned value in terms of blocks, and performing sampling based on it,
doesn't feel like it resolves the issue. Maybe we need to delegate
sampling completely to the AM layer. Code duplication can be avoided by
similar AMs (heap and zheap) possibly using some common utility
functions to achieve the intended result.

>
> I suspect - as previously mentioned- that we're going to have to extend
> statistics collection beyond the current approach at some point, but I
> don't think that's now. At least to me it's not clear how to best
> represent the stats, and how to best use them, if the underlying storage
> is fundamentally not block based.  Nor how we'd avoid code duplication...

Yes, will have to give more thoughts into this.

>
> > 3) nodeTidscan, skipping over too large tids
> >I think this should just be moved into the AMs, there's no need to
> >have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
> /*
>  * We silently discard any TIDs that are out of range at the time of
>  * scan start.  (Since we hold at least AccessShareLock on the table,
>  * it won't be possible for someone to truncate away the blocks we
>  * intend to visit.)
>  */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.

Agree, it's not nice to have that optimization performed based on the
number of blocks in the generic layer. I feel it's not efficient for
zheap either, due to TPD pages as mentioned above, as the number of
blocks returned will be higher compared to the actual data blocks.

>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> 

Re: Pluggable Storage - Andres's take

2019-04-25 Thread Andres Freund
Hi Heikki, Ashwin, Tom,

On 2019-04-23 15:52:01 -0700, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> > index_update_stats() calls RelationGetNumberOfBlocks(). If the AM
> > doesn't use normal data files, that won't work. I bumped into that with my
> > toy implementation, which wouldn't need to create any data files, if it
> > wasn't for this.
> 
> There are a few more of these:

> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Hm. I think some of those changes would be a bit bigger than I initially
though. Attached is a more minimal fix that'd route
RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
think it's definitely the right answer for 1), probably the pragmatic
answer to 2), but certainly not for 3).

I've for now made the AM return the size in bytes, and then convert that
into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
are going to continue to want it internally as pages (otherwise there's
going to be way too much churn, without a benefit I can see). So I
think that's OK.

There's also a somewhat weird bit of returning the total relation size
for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
postgres' current forks, but have equivalent concepts. And without that
there'd be no way to get that size.  I'm not sure I like this, input
welcome. But it seems good to offer the ability to get the entire size
somehow.

Btw, isn't RelationGetNumberOfBlocksForFork() currently weirdly placed?
I don't see why bufmgr.c would be appropriate? Although I don't think
it's particulary clear where it'd best reside - I'd tentatively say
storage.c.

Heikki, Ashwin, your inputs would be appreciated here, in particular the
tid fetch bit below.


The attached patch isn't intended to be applied as-is, just basis for
discussion.


> 1) index_update_stats(), computing pg_class.relpages
> 
>Feels like the number of both heap and index blocks should be
>computed by the index build and stored in IndexInfo. That'd also get
>a bit closer towards allowing indexams not going through smgr (useful
>e.g. for memory only ones).

Due to parallel index builds that'd actually be hard. Given the number
of places wanting to compute relpages for pg_class I think the above
patch routing RelationGetNumberOfBlocksForFork() through tableam is the
right fix.


> 2) commands/analyze.c, computing pg_class.relpages
> 
>This should imo be moved to the tableam callback. It's currently done
>a bit weirdly imo, with fdws computing relpages in the callback, but
>then also returning the acquirefunc. Seems like it should entirely be
>computed as part of calling acquirefunc.

Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
tableam wouldn't be the right minimal approach too. It has the
disadvantage of implying certain values for the
RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
would be to return the desired sampling range in
table_beginscan_analyze() - but that'd require some duplication because
currently that just uses the generic scan_begin() callback.

I suspect - as previously mentioned- that we're going to have to extend
statistics collection beyond the current approach at some point, but I
don't think that's now. At least to me it's not clear how to best
represent the stats, and how to best use them, if the underlying storage
is fundamentally not block based.  Nor how we'd avoid code duplication...


> 3) nodeTidscan, skipping over too large tids
>I think this should just be moved into the AMs, there's no need to
>have this in nodeTidscan.c

I think here it's *not* actually correct at all to use the relation
size. It's currently doing:

/*
 * We silently discard any TIDs that are out of range at the time of
 * scan start.  (Since we hold at least AccessShareLock on the table,
 * it won't be possible for someone to truncate away the blocks we
 * intend to visit.)
 */
nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);

which is fine (except for a certain abstraction leakage) for an AM like
heap or zheap, but I suspect strongly that that's not ok for Ashwin &
Heikki's approach where tid isn't tied to physical representation.


The obvious answer would be to just move that check into the
table_fetch_row_version implementation (currently just calling
heap_fetch()) - but that doesn't seem OK from a performance POV, because
we'd then determine the relation size once for each tid, rather than
once per tidscan.  And it'd also check in cases where we know the tid is
supposed to be valid (e.g. fetching trigger tuples and such).

The proper fix seems to be to introduce a new scan variant
(e.g. table_beginscan_tid()), and then 

Re: Pluggable Storage - Andres's take

2019-04-24 Thread Robert Haas
On Tue, Apr 23, 2019 at 6:55 PM Tom Lane  wrote:
> Andres Freund  writes:
> > ... I think none of these are critical issues for tableam, but we should fix
> > them.
>
> > I'm not sure about doing so for v12 though. 1) and 3) are fairly
> > trivial, but 2) would involve changing the FDW interface, by changing
> > the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> > we're not even in beta1.
>
> Probably better to fix those API issues now rather than later.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Pluggable Storage - Andres's take

2019-04-23 Thread Tom Lane
Andres Freund  writes:
> ... I think none of these are critical issues for tableam, but we should fix
> them.

> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Probably better to fix those API issues now rather than later.

regards, tom lane




Re: Pluggable Storage - Andres's take

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> index_update_stats() calls RelationGetNumberOfBlocks(). If the AM
> doesn't use normal data files, that won't work. I bumped into that with my
> toy implementation, which wouldn't need to create any data files, if it
> wasn't for this.

There are a few more of these:

1) index_update_stats(), computing pg_class.relpages

   Feels like the number of both heap and index blocks should be
   computed by the index build and stored in IndexInfo. That'd also get
   a bit closer towards allowing indexams not going through smgr (useful
   e.g. for memory only ones).

2) commands/analyze.c, computing pg_class.relpages

   This should imo be moved to the tableam callback. It's currently done
   a bit weirdly imo, with fdws computing relpages in the callback, but
   then also returning the acquirefunc. Seems like it should entirely be
   computed as part of calling acquirefunc.

3) nodeTidscan, skipping over too large tids
   I think this should just be moved into the AMs, there's no need to
   have this in nodeTidscan.c

4) freespace.c, used for the new small-rels-have-no-fsm paths.
   That's being revised currently anyway. But I'm not particularly
   concerned even if it stays as is - freespace use is optional
   anyway. And I can't quite see an AM that doesn't want to use
   postgres' storage mechanism wanting to use freespace.c

   Therefore I'm inclined not to touch this independent of fixing the
   others.

I think none of these are critical issues for tableam, but we should fix
them.

I'm not sure about doing so for v12 though. 1) and 3) are fairly
trivial, but 2) would involve changing the FDW interface, by changing
the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
we're not even in beta1.

Comments?

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-04-18 Thread Andres Freund
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> The comments for relation_set_new_relfilenode() callback say that the AM can
> set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits
> this assertion:
> 
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)

Hm, that necessary change unfortunately escaped into the zheap tree
(which indeed doesn't set relfrozenxid). That's why I'd not noticed
this.  How about something like the attached?



I found a related problem in VACUUM FULL / CLUSTER while working on the
above, not fixed in the attached yet. Namely even if a relation doesn't
yet have a valid relfrozenxid/relminmxid before a VACUUM FULL / CLUSTER,
we'll set one after that. That's not great.

I suspect the easiest fix would be to to make the relevant
relation_copy_for_cluster() FreezeXid, MultiXactCutoff arguments into
pointer, and allow the AM to reset them to an invalid value if that's
the appropriate one.
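For illustration, with pointer arguments an AM that keeps no horizons
could simply reset them; a sketch with a hypothetical AM name, the
signature following the adjusted callback:

	static void
	myam_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
								   Relation OldIndex, bool use_sort,
								   TransactionId OldestXmin,
								   TransactionId *xid_cutoff,
								   MultiXactId *multi_cutoff,
								   double *num_tuples,
								   double *tups_vacuumed,
								   double *tups_recently_dead)
	{
		/*
		 * This AM keeps no per-relation freeze horizons, so make sure
		 * VACUUM FULL / CLUSTER doesn't install any in pg_class.
		 */
		*xid_cutoff = InvalidTransactionId;
		*multi_cutoff = InvalidMultiXactId;

		/* ... tuple copying elided ... */
	}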

It'd probably be better if we just moved the entire xid limit
computation into the AM, but I'm worried that we actually need to move
it *further up* instead - independent of this change. I don't think it's
quite right to allow a table with a toast table to be independently
VACUUM FULL/CLUSTERed from the toast table. GetOldestXmin() can go
backwards for a myriad of reasons (or limited by
old_snapshot_threshold), and I'm fairly certain that e.g. VACUUM FULLing
the toast table, setting a lower old_snapshot_threshold, and VACUUM
FULLing the main table would result in failures.

I think we need to fix this for 12, rather than wait for 13. Does
anybody disagree?

Greetings,

Andres Freund
>From 5c84256ea5e41055b0cb9e0dc121a4daaca43336 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 18 Apr 2019 13:20:31 -0700
Subject: [PATCH] Allow pg_class xid & multixid horizons to not be set.

This allows table AMs that don't need these horizons. This was already
documented in the tableam relation_set_new_filenode callback, but an
assert prevented it from actually working (the test AM contained the
necessary change itself).

Reported-By: Heikki Linnakangas
Author: Andres Freund
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f...@iki.fi
---
 src/backend/access/heap/vacuumlazy.c |  4 +++
 src/backend/commands/vacuum.c| 53 
 src/backend/postmaster/autovacuum.c  |  4 +--
 3 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8dc76fa8583..9364cd4c33f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -213,6 +213,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 
+	/* not every AM requires these to be valid, but heap does */
+	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
+	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
+
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1a7291d94bc..94fb6f26063 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1313,36 +1313,61 @@ vac_update_datfrozenxid(void)
 
 		/*
 		 * Only consider relations able to hold unfrozen XIDs (anything else
-		 * should have InvalidTransactionId in relfrozenxid anyway.)
+		 * should have InvalidTransactionId in relfrozenxid anyway).
 		 */
 		if (classForm->relkind != RELKIND_RELATION &&
 			classForm->relkind != RELKIND_MATVIEW &&
 			classForm->relkind != RELKIND_TOASTVALUE)
+		{
+			Assert(!TransactionIdIsValid(classForm->relfrozenxid));
+			Assert(!MultiXactIdIsValid(classForm->relminmxid));
 			continue;
-
-		Assert(TransactionIdIsNormal(classForm->relfrozenxid));
-		Assert(MultiXactIdIsValid(classForm->relminmxid));
+		}
 
 		/*
+		 * Some table AMs might not need per-relation xid / multixid
+		 * horizons. It therefore seems reasonable to allow relfrozenxid and
+		 * relminmxid to not be set (i.e. set to their respective Invalid*Id)
+		 * independently. Thus validate and compute horizon for each only if
+		 * set.
+		 *
 		 * If things are working properly, no relation should have a
 		 * relfrozenxid or relminmxid that is "in the future".  However, such
 		 * cases have been known to arise due to bugs in pg_upgrade.  If we
 		 * see any entries that are "in the future", chicken out and don't do
-		 * anything.  This ensures we won't truncate clog before those
-		 * relations have been scanned and cleaned up.
+		 * anything.  This ensures we won't truncate clog & multixact SLRUs
+		 * before those relations have been scanned and cleaned up.
 		 */
-		if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
-			

Re: Pluggable Storage - Andres's take

2019-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
>> + * HEIKKI: A flags bitmask argument would be more readable than 6 booleans

> I honestly don't have strong feelings about it. Not sure that I buy that
> bitmasks would be much more readable

Sure they would be --- how's FLAG_FOR_FOO | FLAG_FOR_BAR not
better than unlabeled "true" and "false"?

> - but perhaps we could just use the
> struct trickery we started to use in

I find that rather ugly really.  If we're doing something other than a
dozen-or-so booleans, maybe it's the only viable option.  But for cases
where a flags argument will serve, that's our longstanding practice and
I don't see a reason to deviate.
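
To illustrate the point with made-up names (these flags and functions are
hypothetical, not an existing API):

#define SCAN_ALLOW_STRAT	(1 << 0)
#define SCAN_ALLOW_SYNC		(1 << 1)
#define SCAN_ALLOW_PAGEMODE	(1 << 2)

/* the flags version documents itself at the call site ... */
scan = table_beginscan_flags(rel, snapshot, 0, NULL,
							 SCAN_ALLOW_STRAT | SCAN_ALLOW_SYNC);

/* ... while the boolean version leaves the reader guessing */
scan = table_beginscan_bools(rel, snapshot, 0, NULL,
							 true, true, false, false, false, false);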

regards, tom lane




Re: Pluggable Storage - Andres's take

2019-04-11 Thread Robert Haas
On Thu, Apr 11, 2019 at 12:49 PM Andres Freund  wrote:
> > @@ -179,6 +184,12 @@ typedef struct TableAmRoutine
> >*
> >* if temp_snap is true, the snapshot will need to be deallocated at
> >* scan_end.
> > +  *
> > +  * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
> > +  * a bit surprising for the AM, no? Can it be called when a scan is
> > +  * already in progress?
>
> Yea, it can be called when the scan is in-progress. I think we probably
> should just fix calling code to not need that - it's imo weird that
> nodeBitmapHeapscan.c doesn't just delay starting the scan till it has
> the snapshot. This isn't new code, but it's now going to be exposed to
> more AMs, so I think there's a good argument to fix it now.
>
> Robert: You committed that addition, in
>
> commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9
> Author: Robert Haas 
> Date:   2017-03-08 12:05:43 -0500
>
> Support parallel bitmap heap scans.
>
> do you remember why that's done?

I don't think there was any brilliant idea behind it.  Delaying the
scan start until it has the snapshot seems like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Pluggable Storage - Andres's take

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
> Here is another iteration on the comments. The patch is a mix of
> copy-editing and questions. The questions are marked with "HEIKKI:". I can
> continue the copy-editing, if you can reply to the questions, clarifying the
> intention on some parts of the API. (Or feel free to pick and push any of
> these fixes immediately, if you prefer.)

Thanks!

> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
> index f7f726b5aec..bbcab9ce31a 100644
> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
>   {"default_table_access_method", PGC_USERSET, 
> CLIENT_CONN_STATEMENT,
>   gettext_noop("Sets the default table access method for 
> new tables."),
>   NULL,
> - GUC_IS_NAME
> + GUC_NOT_IN_SAMPLE | GUC_IS_NAME
>   },
>   _table_access_method,
>   DEFAULT_TABLE_ACCESS_METHOD,

Hm, I think we should rather add it to sample. That's an oversight, not
intentional.


> index 6fbfcb96c98..d4709563e7e 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -91,8 +91,9 @@ typedef enum TM_Result
>   * xmax is the outdating transaction's XID.  If the caller wants to visit the
>   * replacement tuple, it must check that this matches before believing the
>   * replacement is really a match.
> + * HEIKKI: matches what? xmin, but that's specific to the heapam.

It's basically just the old comment moved. I wonder if we can just get
rid of that field - because the logic to follow update chains correctly
is now inside the lock tuple callback. And as you say - it's not clear
what callers can do with it for the purpose of following chains.  The
counter-argument is that having it makes it a lot less annoying to adapt
external code that wants to adapt with the minimal set of changes, and
only is really interested in supporting heap for now.


>   * GetTableAmRoutine() asserts that required callbacks are filled in, remember
>   * to update when adding a callback.
> @@ -179,6 +184,12 @@ typedef struct TableAmRoutine
>*
>* if temp_snap is true, the snapshot will need to be deallocated at
>* scan_end.
> +  *
> +  * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
> +  * a bit surprising for the AM, no? Can it be called when a scan is
> +  * already in progress?

Yea, it can be called when the scan is in-progress. I think we probably
should just fix calling code to not need that - it's imo weird that
nodeBitmapHeapscan.c doesn't just delay starting the scan till it has
the snapshot. This isn't new code, but it's now going to be exposed to
more AMs, so I think there's a good argument to fix it now.

Robert: You committed that addition, in

commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9
Author: Robert Haas 
Date:   2017-03-08 12:05:43 -0500

Support parallel bitmap heap scans.

do you remember why that's done?



> +  * HEIKKI: A flags bitmask argument would be more readable than 6 booleans
>*/
>   TableScanDesc (*scan_begin) (Relation rel,
>Snapshot 
> snapshot,

I honestly don't have strong feelings about it. Not sure that I buy that
bitmasks would be much more readable - but perhaps we could just use the
struct trickery we started to use in

commit f831d4accda00b9144bc647ede2e2f848b59f39d
Author: Alvaro Herrera 
Date:   2019-02-01 11:29:42 -0300

Add ArchiveOpts to pass options to ArchiveEntry
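
For readers unfamiliar with that commit, here is a rough sketch of the
pattern, with hypothetical names: options are passed as designated
initializers in a compound literal, so unspecified fields default to
false/zero and call sites stay labeled.

typedef struct TableScanOpts
{
	bool		allow_strat;
	bool		allow_sync;
	bool		allow_pagemode;
	bool		is_bitmapscan;
} TableScanOpts;

#define TABLE_SCAN_OPTS(...) &(TableScanOpts){__VA_ARGS__}

/* hypothetical call site */
scan = table_beginscan_opts(rel, snapshot, nkeys, key,
							TABLE_SCAN_OPTS(.allow_strat = true,
											.allow_sync = true));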


> @@ -194,6 +205,9 @@ typedef struct TableAmRoutine
>   /*
>* Release resources and deallocate scan. If TableScanDesc.temp_snap,
>* TableScanDesc.rs_snapshot needs to be unregistered.
> +  *
> +  * HEIKKI: I find this 'temp_snap' thing pretty weird. Can't the caller
> +  * handle deregistering it?
>*/
>   void(*scan_end) (TableScanDesc scan);

It's old logic, just wrapped new.  I think there's some argument that
some of this should be moved to tableam.c rather than the individual
AMs.



> @@ -221,6 +235,11 @@ typedef struct TableAmRoutine
>   /*
>* Estimate the size of shared memory needed for a parallel scan of this
>* relation. The snapshot does not need to be accounted for.
> +  *
> +  * HEIKKI: If this returns X, then the parallelscan_initialize() call
> +  * mustn't use more than X. So this is not just for optimization purposes,
> +  * for example. Not sure how to phrase that, but could use some
> +  * clarification.
>*/
>   Size(*parallelscan_estimate) (Relation rel);

Hm. I thought I'd done that by adding the note about
parallelscan_initialize() getting memory sized by
parallelscan_estimate().
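
In other words, for a hypothetical AM "myam", the pair has to agree on the
size (a sketch, not an existing implementation):

typedef struct MyParallelScanDescData
{
	ParallelTableScanDescData base;	/* common header */
	BlockNumber cur_block;			/* AM-specific state */
} MyParallelScanDescData;

static Size
myam_parallelscan_estimate(Relation rel)
{
	/* the core code allocates exactly this many bytes */
	return sizeof(MyParallelScanDescData);
}

static Size
myam_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
{
	MyParallelScanDescData *mscan = (MyParallelScanDescData *) pscan;

	/* must not write past the size reported by ..._estimate() */
	mscan->cur_block = 0;
	return sizeof(MyParallelScanDescData);
}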



Re: Pluggable Storage - Andres's take

2019-04-11 Thread Heikki Linnakangas

On 08/04/2019 20:37, Andres Freund wrote:

Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:

There were a bunch of typos in the comments in tableam.h, see attached. Some
of the comments could use more copy-editing and clarification, I think, but
I stuck to fixing just typos and such for now.

I pushed these after adding three boring changes by pgindent. Thanks for
those!

I'd greatly welcome more feedback on the comments - I've been pretty
deep in this for so long that I don't see all of the issues anymore. And
a mild dyslexia doesn't help...


Here is another iteration on the comments. The patch is a mix of 
copy-editing and questions. The questions are marked with "HEIKKI:". I 
can continue the copy-editing, if you can reply to the questions, 
clarifying the intention on some parts of the API. (Or feel free to pick 
and push any of these fixes immediately, if you prefer.)


- Heikki
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f7f726b5aec..bbcab9ce31a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
 		{"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Sets the default table access method for new tables."),
 			NULL,
-			GUC_IS_NAME
+			GUC_NOT_IN_SAMPLE | GUC_IS_NAME
 		},
 		&default_table_access_method,
 		DEFAULT_TABLE_ACCESS_METHOD,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 82de4cdcf2c..8aeeba38ca2 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -57,6 +57,8 @@ typedef struct TableScanDescData *TableScanDesc;
  * pointer to this structure.  The information here must be sufficient to
  * properly initialize each new TableScanDesc as workers join the scan, and it
  * must act as a information what to scan for those workers.
+ *
+ * This is stored in dynamic shared memory, so you can't use pointers here.
  */
 typedef struct ParallelTableScanDescData
 {
@@ -64,6 +66,11 @@ typedef struct ParallelTableScanDescData
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
 	Size		phs_snapshot_off;	/* data for snapshot */
+
+	/*
+	 * Table AM specific data follows. After the table AM specific data
+	 * comes the snapshot data, at 'phs_snapshot_off'.
+	 */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6fbfcb96c98..d4709563e7e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -91,8 +91,9 @@ typedef enum TM_Result
  * xmax is the outdating transaction's XID.  If the caller wants to visit the
  * replacement tuple, it must check that this matches before believing the
  * replacement is really a match.
+ * HEIKKI: matches what? xmin, but that's specific to the heapam.
  *
- * cmax is the outdating command's CID, but only when the failure code is
+ * cmax is the outdating command's CID. Only set when the failure code is
  * TM_SelfModified (i.e., something in the current transaction outdated the
  * tuple); otherwise cmax is zero.  (We make this restriction because
  * HeapTupleHeaderGetCmax doesn't work for tuples outdated in other
@@ -133,7 +134,11 @@ typedef void (*IndexBuildCallback) (Relation index,
  * returned by FormData_pg_am.amhandler.
  *
  * In most cases it's not appropriate to call the callbacks directly, use the
- * table_* wrapper functions instead.
+ * table_* wrapper functions instead. The descriptions of the callbacks here
+ * are written from the AM implementor's point of view. For more information
+ * on how to call them, see the wrapper functions. (If you update the comments
+ * on either a callback or a wrapper function, remember to also update the other
+ * one!)
  *
  * GetTableAmRoutine() asserts that required callbacks are filled in, remember
  * to update when adding a callback.
@@ -179,6 +184,12 @@ typedef struct TableAmRoutine
 	 *
 	 * if temp_snap is true, the snapshot will need to be deallocated at
 	 * scan_end.
+	 *
+	 * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
+	 * a bit surprising for the AM, no? Can it be called when a scan is
+	 * already in progress?
+	 *
+	 * HEIKKI: A flags bitmask argument would be more readable than 6 booleans
 	 */
 	TableScanDesc (*scan_begin) (Relation rel,
  Snapshot snapshot,
@@ -194,6 +205,9 @@ typedef struct TableAmRoutine
 	/*
 	 * Release resources and deallocate scan. If TableScanDesc.temp_snap,
 	 * TableScanDesc.rs_snapshot needs to be unregistered.
+	 *
+	 * HEIKKI: I find this 'temp_snap' thing pretty weird. Can't the caller handle
+	 * deregistering it?
 	 */
 	void		(*scan_end) (TableScanDesc scan);
 
@@ -221,6 +235,11 @@ typedef struct TableAmRoutine
 	/*
 	 * Estimate the size of shared memory needed for a parallel scan of this
 	 

Re: Pluggable Storage - Andres's take

2019-04-09 Thread Heikki Linnakangas

On 08/04/2019 20:37, Andres Freund wrote:

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:

There's a little bug in index-only scan executor node, where it mixes up the
slots to hold a tuple from the index, and from the table. That doesn't cause
any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
uses a virtual slot, it caused warnings like this from index-only scans:


Hm. That's another one that I think I had fixed previously :(, and then
concluded that it's not actually necessary for some reason. Your fix
looks correct to me.  Do you want to commit it? Otherwise I'll look at
it after rebasing zheap, and checking it with that.


I found another slot type confusion bug, while playing with zedstore. In 
an Index Scan, if you have an ORDER BY key that needs to be rechecked, 
so that it uses the reorder queue, then it will sometimes use the 
reorder queue slot, and sometimes the table AM's slot, for the scan 
slot. If they're not of the same type, you get an assertion:


TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File: 
"execExprInterp.c", Line: 1905)


Attached is a test for this, again using the toy table AM, extended to 
be able to test this. And a fix.



Attached is a patch with the toy implementation I used to test this. I'm not
suggesting we should commit that - although feel free to do that if you
think it's useful - but it shows how I bumped into these issues.


Hm, probably not a bad idea to include something like it. It seems like
we kinda would need non-stub implementation of more functions for it to
test much / and to serve as an example.  I'm mildly inclined to just do
it via zheap / externally, but I'm not quite sure that's good enough.


Works for me.


+static Size
+toyam_parallelscan_estimate(Relation rel)
+{
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("function %s not implemented yet", __func__)));
+}


The other stubbed functions seem like we should require them, but I
wonder if we should make the parallel stuff optional?


Yeah, that would be good. I would assume it to be optional.

- Heikki
>From e8854c876927b32e21f485337dd2335f4bfebd32 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 8 Apr 2019 15:18:19 +0300
Subject: [PATCH 1/4] Fix confusion on different kinds of slots in
 IndexOnlyScans.

We used the same slot, to store a tuple from the index, and to store a
tuple from the table. That's not OK. It worked with the heap, because
heapam_getnextslot() stores a HeapTuple to the slot, and doesn't care how
large the tts_values/nulls arrays are. But when I played with a toy table
AM implementation that used a virtual tuple, it caused memory overruns.
---
 src/backend/executor/nodeIndexonlyscan.c | 16 +---
 src/include/nodes/execnodes.h|  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 7711728495c..5833d683b38 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -166,10 +166,10 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			 * Rats, we have to visit the heap to check visibility.
 			 */
 			InstrCountTuples2(node, 1);
-			if (!index_fetch_heap(scandesc, slot))
+			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
 continue;		/* no visible tuple, try next index entry */
 
-			ExecClearTuple(slot);
+			ExecClearTuple(node->ioss_TableSlot);
 
 			/*
 			 * Only MVCC snapshots are supported here, so there should be no
@@ -528,7 +528,17 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	 */
 	tupDesc = ExecTypeFromTL(node->indextlist);
	ExecInitScanTupleSlot(estate, &indexstate->ss, tupDesc,
-		  table_slot_callbacks(currentRelation));
+						  &TTSOpsVirtual);
+
+	/*
+	 * We need another slot, in a format that's suitable for the table AM,
+	 * for when we need to fetch a tuple from the table for rechecking
+	 * visibility.
+	 */
+	indexstate->ioss_TableSlot =
		ExecAllocTableSlot(&estate->es_tupleTable,
+		   RelationGetDescr(currentRelation),
+		   table_slot_callbacks(currentRelation));
 
 	/*
 	 * Initialize result type and projection info.  The node's targetlist will
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a5e4b7ef2e0..108dee61e24 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1424,6 +1424,7 @@ typedef struct IndexOnlyScanState
 	struct IndexScanDescData *ioss_ScanDesc;
 	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
+	TupleTableSlot *ioss_TableSlot;
 } IndexOnlyScanState;
 
 /* 
-- 
2.20.1

>From 309a773e09aa3d1256a431f2030f0f5819e6b32d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 8 Apr 2019 15:16:53 +0300
Subject: [PATCH 2/4] Add a toy table AM implementation to play with.

It returns a constant data set. No insert/update/delete. But you can
create 

Re: Pluggable Storage - Andres's take

2019-04-08 Thread Andres Freund
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> There were a bunch of typos in the comments in tableam.h, see attached. Some
> of the comments could use more copy-editing and clarification, I think, but
> I stuck to fixing just typos and such for now.

I pushed these after adding three boring changes by pgindent. Thanks for
those!

I'd greatly welcome more feedback on the comments - I've been pretty
deep in this for so long that I don't see all of the issues anymore. And
a mild dyslexia doesn't help...


> index_update_stats() calls RelationGetNumberOfBlocks(). If the AM
> doesn't use normal data files, that won't work. I bumped into that with my
> toy implementation, which wouldn't need to create any data files, if it
> wasn't for this.

Hm. That should be fixed. I've been burning the candle on both ends for
too long, so I'll not get to it today. But I think we should fix it
soon.  I'll create an open item for it.
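
One plausible shape for the fix, sketched here: route size lookups through
the AM instead of having index_update_stats() call
RelationGetNumberOfBlocks() directly (the wiring in index_update_stats()
is the hypothetical part). An AM without data files could then simply
report a zero size:

static uint64
toyam_relation_size(Relation rel, ForkNumber forkNumber)
{
	/* this AM keeps no on-disk storage at all */
	return 0;
}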


> The comments for relation_set_new_relfilenode() callback say that the AM can
> set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits
> this assertion:
> 
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)

Hm. That needs to be fixed - IIRC it previously worked, because zheap
doesn't have relfrozenxid either. Probably broke it when trying to
winnow down the tableam patches. I'm planning to rebase zheap onto the
newest version soon, so I'll re-encounter this.


> There's a little bug in index-only scan executor node, where it mixes up the
> slots to hold a tuple from the index, and from the table. That doesn't cause
> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
> uses a virtual slot, it caused warnings like this from index-only scans:

Hm. That's another one that I think I had fixed previously :(, and then
concluded that it's not actually necessary for some reason. Your fix
looks correct to me.  Do you want to commit it? Otherwise I'll look at
it after rebasing zheap, and checking it with that.


> Attached is a patch with the toy implementation I used to test this. I'm not
> suggesting we should commit that - although feel free to do that if you
> think it's useful - but it shows how I bumped into these issues.

Hm, probably not a bad idea to include something like it. It seems like
we kinda would need non-stub implementation of more functions for it to
test much / and to serve as an example.  I'm mildly inclined to just do
it via zheap / externally, but I'm not quite sure that's good enough.


> +static Size
> +toyam_parallelscan_estimate(Relation rel)
> +{
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +  errmsg("function %s not implemented yet", __func__)));
> +}

The other stubbed functions seem like we should require them, but I
wonder if we should make the parallel stuff optional?
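
A sketch of how GetTableAmRoutine() could treat them as optional - either
the whole parallel-scan group is provided, or none of it (member names as
in TableAmRoutine; the grouping logic is the assumption here):

Assert(routine->scan_begin != NULL);	/* still required */

if (routine->parallelscan_estimate == NULL)
{
	/* no parallel scan support: the group must be absent as a whole */
	Assert(routine->parallelscan_initialize == NULL &&
		   routine->parallelscan_reinitialize == NULL);
}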


Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-04-08 Thread Fabrízio de Royes Mello
On Mon, Apr 8, 2019 at 9:34 AM Heikki Linnakangas  wrote:
>
> I wrote a little toy implementation that just returns constant data to
> play with this a little. Looks good overall.
>
> There were a bunch of typos in the comments in tableam.h, see attached.
> Some of the comments could use more copy-editing and clarification, I
> think, but I stuck to fixing just typos and such for now.
>
> index_update_stats() calls RelationGetNumberOfBlocks(). If the AM
> doesn't use normal data files, that won't work. I bumped into that with
> my toy implementation, which wouldn't need to create any data files, if
> it wasn't for this.
>
> The comments for relation_set_new_relfilenode() callback say that the AM
> can set *freezeXid and *minmulti to invalid. But when I did that, VACUUM
> hits this assertion:
>
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)
>
> There's a little bug in index-only scan executor node, where it mixes up
> the slots to hold a tuple from the index, and from the table. That
> doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with
> my toy AM, which uses a virtual slot, it caused warnings like this from
> index-only scans:
>
> WARNING:  problem in alloc set ExecutorState: detected write past chunk
> end in block 0x56419b0f88e8, chunk 0x56419b0f8f90
>
> Attached is a patch with the toy implementation I used to test this.
> I'm not suggesting we should commit that - although feel free to do that
> if you think it's useful - but it shows how I bumped into these issues.
> The second patch fixes the index-only-scan slot confusion (untested,
> except with my toy AM).
>

Awesome... it built and ran the tests cleanly, but I got an assertion
failure running VACUUM:

fabrizio=# vacuum toytab ;
TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
3)))", File: "vacuum.c", Line: 1323)
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2019-04-08
12:29:16.204 -03 [20844] LOG:  server process (PID 24457) was terminated by
signal 6: Aborted
2019-04-08 12:29:16.204 -03 [20844] DETAIL:  Failed process was running:
vacuum toytab ;
2019-04-08 12:29:16.204 -03 [20844] LOG:  terminating any other active
server processes
2019-04-08 12:29:16.205 -03 [24458] WARNING:  terminating connection
because of crash of another server process

And backtrace is:

(gdb) bt
#0  0x7f813779f428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x7f81377a102a in __GI_abort () at abort.c:89
#2  0x00ec0de9 in ExceptionalCondition (conditionName=0x10e3bb8
"!(((classForm->relfrozenxid) >= ((TransactionId) 3)))",
errorType=0x10e33f3 "FailedAssertion", fileName=0x10e345a "vacuum.c",
lineNumber=1323) at assert.c:54
#3  0x00893646 in vac_update_datfrozenxid () at vacuum.c:1323
#4  0x0089127a in vacuum (relations=0x26c4390,
params=0x7ffeb1a3fb30, bstrategy=0x26c4218, isTopLevel=true) at vacuum.c:452
#5  0x008906ae in ExecVacuum (pstate=0x26145b8, vacstmt=0x25f46f0,
isTopLevel=true) at vacuum.c:196
#6  0x00c3a883 in standard_ProcessUtility (pstmt=0x25f4a50,
queryString=0x25f3be8 "vacuum toytab ;", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "")
at utility.c:670
#7  0x00c3977a in ProcessUtility (pstmt=0x25f4a50,
queryString=0x25f3be8 "vacuum toytab ;", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "")
at utility.c:360
#8  0x00c3793e in PortalRunUtility (portal=0x265ba28,
pstmt=0x25f4a50, isTopLevel=true, setHoldSnapshot=false, dest=0x25f4b48,
completionTag=0x7ffeb1a3ffb0 "") at pquery.c:1175
#9  0x00c37d7f in PortalRunMulti (portal=0x265ba28,
isTopLevel=true, setHoldSnapshot=false, dest=0x25f4b48, altdest=0x25f4b48,
completionTag=0x7ffeb1a3ffb0 "") at pquery.c:1321
#10 0x00c36899 in PortalRun (portal=0x265ba28,
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x25f4b48,
altdest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at pquery.c:796
#11 0x00c2a40e in exec_simple_query (query_string=0x25f3be8 "vacuum
toytab ;") at postgres.c:1215
#12 0x00c332a3 in PostgresMain (argc=1, argv=0x261fe68,
dbname=0x261fca8 "fabrizio", username=0x261fc80 "fabrizio") at
postgres.c:4249
#13 0x00b051fc in BackendRun (port=0x2616d20) at postmaster.c:4429
#14 0x00b042c3 in BackendStartup (port=0x2616d20) at
postmaster.c:4120
#15 0x00afc70a in ServerLoop () at postmaster.c:1703
#16 0x00afb94e in PostmasterMain (argc=3, argv=0x25ed850) at
postmaster.c:1376
#17 0x00977de8 in main (argc=3, argv=0x25ed850) at main.c:228


Isn't it better to raise an exception, as you did in other functions?

static void
toyam_relation_vacuum(Relation onerel,

Re: Pluggable Storage - Andres's take

2019-04-08 Thread Heikki Linnakangas
I wrote a little toy implementation that just returns constant data to 
play with this a little. Looks good overall.


There were a bunch of typos in the comments in tableam.h, see attached. 
Some of the comments could use more copy-editing and clarification, I 
think, but I stuck to fixing just typos and such for now.


index_update_stats() calls RelationGetNumberOfBlocks(). If the AM 
doesn't use normal data files, that won't work. I bumped into that with 
my toy implementation, which wouldn't need to create any data files, if 
it wasn't for this.


The comments for relation_set_new_relfilenode() callback say that the AM 
can set *freezeXid and *minmulti to invalid. But when I did that, VACUUM 
hits this assertion:


TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) 
3)))", File: "vacuum.c", Line: 1323)


There's a little bug in index-only scan executor node, where it mixes up 
the slots to hold a tuple from the index, and from the table. That 
doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with 
my toy AM, which uses a virtual slot, it caused warnings like this from 
index-only scans:


WARNING:  problem in alloc set ExecutorState: detected write past chunk 
end in block 0x56419b0f88e8, chunk 0x56419b0f8f90


Attached is a patch with the toy implementation I used to test this. 
I'm not suggesting we should commit that - although feel free to do that 
if you think it's useful - but it shows how I bumped into these issues. 
The second patch fixes the index-only-scan slot confusion (untested, 
except with my toy AM).


- Heikki
>From 97e0eea6a3fb123845ac5650f1aaa1802bf56694 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 8 Apr 2019 15:16:53 +0300
Subject: [PATCH 1/3] Add a toy table AM implementation to play with.

It returns a constant data set. No insert/update/delete. But you can
create indexes.
---
 src/test/modules/toytable/Makefile|  25 +
 .../modules/toytable/expected/toytable.out|  41 ++
 src/test/modules/toytable/sql/toytable.sql|  17 +
 src/test/modules/toytable/toytable--1.0.sql   |  12 +
 src/test/modules/toytable/toytable.control|   4 +
 src/test/modules/toytable/toytableam.c| 612 ++
 6 files changed, 711 insertions(+)
 create mode 100644 src/test/modules/toytable/Makefile
 create mode 100644 src/test/modules/toytable/expected/toytable.out
 create mode 100644 src/test/modules/toytable/sql/toytable.sql
 create mode 100644 src/test/modules/toytable/toytable--1.0.sql
 create mode 100644 src/test/modules/toytable/toytable.control
 create mode 100644 src/test/modules/toytable/toytableam.c

diff --git a/src/test/modules/toytable/Makefile b/src/test/modules/toytable/Makefile
new file mode 100644
index 000..142ef2d23e6
--- /dev/null
+++ b/src/test/modules/toytable/Makefile
@@ -0,0 +1,25 @@
+# src/test/modules/toytable/Makefile
+
+MODULE_big = toytable
+OBJS = toytableam.o $(WIN32RES)
+PGFILEDESC = "A dummy implementantation of the table AM API"
+
+EXTENSION = toytable
+DATA = toytable--1.0.sql
+
+REGRESS = toytable
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/toytable
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+OBJS = toytableam.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/test/modules/toytable/expected/toytable.out b/src/test/modules/toytable/expected/toytable.out
new file mode 100644
index 000..3e8598e284c
--- /dev/null
+++ b/src/test/modules/toytable/expected/toytable.out
@@ -0,0 +1,41 @@
+CREATE EXTENSION toytable;
+create table toytab (i int4, j int4, k int4) using toytable;
+select * from toytab;
+ i  | j  | k  
+----+----+----
+  1 |  1 |  1
+  2 |  2 |  2
+  3 |  3 |  3
+  4 |  4 |  4
+  5 |  5 |  5
+  6 |  6 |  6
+  7 |  7 |  7
+  8 |  8 |  8
+  9 |  9 |  9
+ 10 | 10 | 10
+(10 rows)
+
+create index toyidx on toytab(i);
+-- test index scan
+set enable_seqscan=off;
+set enable_indexscan=on;
+select i, j from toytab where i = 4;
+ i | j 
+---+---
+ 4 | 4
+(1 row)
+
+-- index only scan
+explain (costs off) select i from toytab where i = 4;
+   QUERY PLAN   
+-----------------------------------------
+ Index Only Scan using toyidx on toytab
+   Index Cond: (i = 4)
+(2 rows)
+
+select i from toytab where i = 4 ;
+ i 
+---
+ 4
+(1 row)
+
diff --git a/src/test/modules/toytable/sql/toytable.sql b/src/test/modules/toytable/sql/toytable.sql
new file mode 100644
index 000..8d9bac41bbf
--- /dev/null
+++ b/src/test/modules/toytable/sql/toytable.sql
@@ -0,0 +1,17 @@
+CREATE EXTENSION toytable;
+
+create table toytab (i int4, j int4, k int4) using toytable;
+
+select * from toytab;
+
+create index toyidx on toytab(i);
+
+-- test index scan
+set enable_seqscan=off;
+set enable_indexscan=on;
+
+select i, j from toytab where i = 4;
+
+-- index only scan
+explain (costs off) 

Re: Pluggable Storage - Andres's take

2019-04-05 Thread Andres Freund
Hi,

On 2019-04-04 00:51:38 -0500, Justin Pryzby wrote:
> I reviewed new docs for $SUBJECT.

> Find attached proposed changes.

> There's one XXX item I'm unsure what it's intended to say.

Thanks! I applied most of these, and filled in the XXX. I didn't like
the s/allow to specify properties/allow specification of properties/,
so I left those out. But I could be convinced otherwise...

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-04-03 Thread Justin Pryzby
I reviewed new docs for $SUBJECT.

Find attached proposed changes.

There's one XXX item I'm unsure what it's intended to say.

Justin

>From a3d290bf67af2a34e44cd6c160daf552b56a13b5 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 4 Apr 2019 00:48:09 -0500
Subject: [PATCH v1] Fine tune documentation for tableam

Added at commit b73c3a11963c8bb783993cfffabb09f558f86e37
---
 doc/src/sgml/catalogs.sgml|  2 +-
 doc/src/sgml/config.sgml  |  4 ++--
 doc/src/sgml/ref/select_into.sgml |  6 +++---
 doc/src/sgml/storage.sgml | 17 ---
 doc/src/sgml/tableam.sgml | 44 ---
 5 files changed, 38 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 58c8c96..40ddec4 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -587,7 +587,7 @@
The catalog pg_am stores information about
relation access methods.  There is one row for each access method supported
by the system.
-   Currently, only table and indexes have access methods. The requirements for table
+   Currently, only tables and indexes have access methods. The requirements for table
    and index access methods are discussed in detail in <xref linkend="tableam"/> and
    <xref linkend="indexam"/> respectively.
   
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4a9a1e8..90b478d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7306,8 +7306,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 This parameter specifies the default table access method to use when
 creating tables or materialized views if the CREATE
 command does not explicitly specify an access method, or when
-SELECT ... INTO is used, which does not allow to
-specify a table access method. The default is heap.
+SELECT ... INTO is used, which does not allow
+specification of a table access method. The default is heap.

   
  
diff --git a/doc/src/sgml/ref/select_into.sgml b/doc/src/sgml/ref/select_into.sgml
index 17bed24..1443d79 100644
--- a/doc/src/sgml/ref/select_into.sgml
+++ b/doc/src/sgml/ref/select_into.sgml
@@ -106,11 +106,11 @@ SELECT [ ALL | DISTINCT [ ON ( expression
 
   
-   In contrast to CREATE TABLE AS SELECT
-   INTO does not allow to specify properties like a table's access
+   In contrast to CREATE TABLE AS, SELECT
+   INTO does not allow specification of properties like a table's access
method with  or the table's
tablespace with . Use  if necessary.  Therefore the default table
+   linkend="sql-createtableas"/> if necessary.  Therefore, the default table
access method is chosen for the new table. See  for more information.
   
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 62333e3..5dfca1b 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -189,11 +189,11 @@ there.
 
 
 
- Note that the following sections describe the way the builtin
+ Note that the following sections describe the behavior of the builtin
  heap table access method,
- and the builtin index access methods work. Due
- to the extensible nature of PostgreSQL other types
- of access method might work similar or not.
+ and the builtin index access methods. Due
+ to the extensible nature of PostgreSQL, other
+ access methods might work differently.
 
 
 
@@ -703,11 +703,12 @@ erased (they will be recreated automatically as needed).
 This section provides an overview of the page format used within
 PostgreSQL tables and indexes.
   
-Actually, neither table nor index access methods need not use this page
-format.  All the existing index methods do use this basic format, but the
+Actually, use of this page format is not required for either table or index
+access methods.
+The heap table access method always uses this format.
+All the existing index methods also use the basic format, but the
 data kept on index metapages usually doesn't follow the item layout
-rules. The heap table access method also always uses
-this format.
+rules.
   
 
Sequences and TOAST tables are formatted just like a regular table.
diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
index 8d9bfd8..0a89935 100644
--- a/doc/src/sgml/tableam.sgml
+++ b/doc/src/sgml/tableam.sgml
@@ -48,54 +48,56 @@
   callbacks and their behavior is defined in the
   TableAmRoutine structure (with comments inside the
   struct defining the requirements for callbacks). Most callbacks have
-  wrapper functions, which are documented for the point of view of a user,
-  rather than an implementor, of the table access method.  For details,
+  wrapper functions, which are documented from the point of view of a user
+  (rather than an implementor) of the table access method.  For details,
   please refer to the 

Re: Pluggable Storage - Andres's take

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:
> From a72cfcd523887f1220473231d7982928acc23684 Mon Sep 17 00:00:00 2001
> From: Hari Babu 
> Date: Tue, 2 Apr 2019 15:41:17 +1100
> Subject: [PATCH 1/2] tableam : doc update of table access methods
> 
> Providing basic explanation of table access methods
> including their structure details and reference heap
> implementation files.
> ---
>  doc/src/sgml/catalogs.sgml |  5 ++--
>  doc/src/sgml/filelist.sgml |  1 +
>  doc/src/sgml/postgres.sgml |  1 +
>  doc/src/sgml/tableam.sgml  | 56 ++
>  4 files changed, 61 insertions(+), 2 deletions(-)
>  create mode 100644 doc/src/sgml/tableam.sgml
> 
> diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> index f4aabf5dc7..200708e121 100644
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -587,8 +587,9 @@
> The catalog pg_am stores information about
> relation access methods.  There is one row for each access method supported
> by the system.
> -   Currently, only indexes have access methods.  The requirements for index
> -   access methods are discussed in detail in .
> +   Currently, only table and indexes have access methods. The requirements for table
> +   access methods are discussed in detail in <xref linkend="tableam"/> and the
> +   requirements for index access methods are discussed in detail in <xref linkend="indexam"/>.
>

I also adapted pg_am.amtype.


> diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
> new file mode 100644
> index 00..9eca52ee70
> --- /dev/null
> +++ b/doc/src/sgml/tableam.sgml
> @@ -0,0 +1,56 @@
> +
> +
> +
> + Table Access Method Interface Definition
> + 
> +  
> +   This chapter defines the interface between the core PostgreSQL
> +   system and access methods, which manage TABLE
> +   types. The core system knows nothing about these access methods beyond
> +   what is specified here, so it is possible to develop entirely new access
> +   method types by writing add-on code.
> +  
> +  
> +  
> +   All Tables in PostgreSQL system are the primary
> +   data store. Each table is stored as its own physical 
> relation
> +   and so is described by an entry in the pg_class
> +   catalog. A table's content is entirely controlled by its access method.
> +   (All the access methods furthermore use the standard page layout described
> +   in .)
> +  

I don't think there's actually any sort of dependency on the page
layout. It's entirely conceivable to write an AM that doesn't use
postgres' shared buffers.

> +  
> +   A table access method handler function must be declared to accept a single
> +   argument of type internal and to return the pseudo-type
> +   table_am_handler.  The argument is a dummy value that simply
> +   serves to prevent handler functions from being called directly from SQL 
> commands.

> +   The result of the function must be a palloc'd struct of type 
> TableAmRoutine,
> +   which contains everything that the core code needs to know to make use of
> +   the table access method.

That's not been correct for a while...


> The TableAmRoutine struct,
> +   also called the access method's API struct, includes
> +   fields specifying assorted fixed properties of the access method, such as
> +   whether it can support bitmap scans.  More importantly, it contains pointers
> +   to support functions for the access method, which do all of the real work to
> +   access tables. These support functions are plain C functions and are not
> +   visible or callable at the SQL level.  The support functions are described
> +   in TableAmRoutine structure. For more details, please
> +   refer the file src/include/access/tableam.h.
> +  

This seems to not have been adapted after copying it from indexam?


I'm still working on this (in particular I think storage.sgml and
probably some other places need updates to make clear they apply to
heap, not generally; I think there need to be some references to generic
WAL records in tableam.sgml, ...), but I've got to run a few errands.

One thing I want to call out is that I made the reference to
src/include/access/tableam.h a link to gitweb. I think that makes it
much more useful to the casual reader. But it also means that, barring
some infrastructure / procedure we don't have, the link will just
continue to point to master. I'm not particularly concerned about that,
but it seems worth pointing out, given that we've only a single link to
gitweb in the sgml docs so far.

Greetings,

Andres Freund
>From 33826b211e3f2d5fe76024e2938ff2eb6aeb00da Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 2 Apr 2019 14:55:30 -0700
Subject: [PATCH] tableam: docs

---
 doc/src/sgml/catalogs.sgml|  9 +-
 doc/src/sgml/config.sgml  | 17 
 doc/src/sgml/filelist.sgml|  1 +
 doc/src/sgml/postgres.sgml|  1 +
 doc/src/sgml/ref/create_access_method.sgml| 

Re: Pluggable Storage - Andres's take

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:

> +  xreflabel="default_table_access_method">
> +  default_table_access_method (string)
> +  
> +   default_table_access_method configuration parameter
> +  
> +  
> +  
> +   
> +The value is either the name of a table access method, or an empty string
> +to specify using the default table access method of the current database.
> +If the value does not match the name of any existing table access method,
> +PostgreSQL will automatically use the default
> +table access method of the current database.
> +   

Hm, this doesn't strike me as right (there's no such thing as "default
table access method of the current database"). You just get an error in
that case. I think we should simply not allow setting to "" - what's the
point in that?
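
Enforcing that would be straightforward with a GUC check hook along these
lines (a sketch, assuming such a hook is wired up for the GUC):

static bool
check_default_table_access_method(char **newval, void **extra,
								  GucSource source)
{
	if (**newval == '\0')
	{
		GUC_check_errdetail("default_table_access_method cannot be empty.");
		return false;
	}
	return true;
}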

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-04-02 Thread Haribabu Kommi
On Tue, Apr 2, 2019 at 11:53 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-02 11:39:57 +1100, Haribabu Kommi wrote:
> > > What made you rename indexam.sgml to am.sgml, instead of creating a
> > > separate tableam.sgml?  Seems easier to just have a separate file?
> > >
> >
> > No specific reason, I just thought of adding all the access methods under
> > one file.
> > I can change it to tableam.sgml.
>
> I'd rather keep it separate. It seems likely that both table and indexam
> docs will grow further over time, and they're not that closely
> related. Additionally not moving sect1->sect2 etc will keep links stable
> (which we could also achieve with different sect1 names, I realize
> that).
>

OK.


> > I can understand your point of avoiding function-by-function API reference,
> > as the user can check directly the code comments, Still I feel some people
> > may refer the doc for API changes. I am fine to remove based on your
> > opinion.
>
> I think having to keep both tableam.h and the sgml file current is
> too much overhead - and anybody that's going to create a new tableam is
> going to be able to look into the source anyway.
>

Here I attached updated patches as per the discussion.
Is the description of the table access methods enough, or do you want me to
add some more details?

Regards,
Haribabu Kommi
Fujitsu Australia


0001-tableam-doc-update-of-table-access-methods.patch
Description: Binary data


0002-Doc-updates-for-pluggable-table-access-method-syntax.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-04-01 Thread Andres Freund
Hi,

On 2019-04-02 11:39:57 +1100, Haribabu Kommi wrote:
> > What made you rename indexam.sgml to am.sgml, instead of creating a
> > separate tableam.sgml?  Seems easier to just have a separate file?
> >
> 
> No specific reason, I just thought of adding all the access methods under
> one file.
> I can change it to tableam.sgml.

I'd rather keep it separate. It seems likely that both table and indexam
docs will grow further over time, and they're not that closely
related. Additionally not moving sect1->sect2 etc will keep links stable
(which we could also achieve with different sect1 names, I realize
that).

> I can understand your point of avoiding function-by-function API reference,
> as the user can check directly the code comments, Still I feel some people
> may refer the doc for API changes. I am fine to remove based on your
> opinion.

I think having to keep both tableam.h and the sgml file current is
too much overhead - and anybody that's going to create a new tableam is
going to be able to look into the source anyway.

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-04-01 Thread Haribabu Kommi
On Tue, Apr 2, 2019 at 10:18 AM Andres Freund  wrote:

> Hi,
>
> On 2019-03-16 23:21:31 +1100, Haribabu Kommi wrote:
> > updated patches are attached.
>
> Now that nearly all of the tableam patches are committed (with the
> exception of the copy.c changes which are for bad reasons discussed at
> [1]) I'm looking at the docs changes.
>

Thanks.


> What made you rename indexam.sgml to am.sgml, instead of creating a
> separate tableam.sgml?  Seems easier to just have a separate file?
>

No specific reason, I just thought of adding all the access methods under
one file.
I can change it to tableam.sgml.


> I'm currently not planning to include the function-by-function API
> reference you've in your patchset, as I think it's more reasonable to
> centralize all of it in tableam.h. I think I've included most of the
> information there - could you check whether you agree?
>

I checked all the comments, and the explanation provided in tableam.h is
good enough to understand. I also updated the docs section to reflect some
more details from the tableam.h comments.

I can understand your point about avoiding a function-by-function API
reference, as the user can check the code comments directly. Still, I feel
some people may refer to the docs for API changes. I am fine with removing
it, based on your opinion.

Added current set of doc patches for your reference.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Rename-indexam.sgml-to-am.sgml.patch
Description: Binary data


0002-Doc-updates-for-pluggable-table-access-method-syntax.patch
Description: Binary data


0003-Table-access-method-API-explanation.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-04-01 Thread Andres Freund
Hi,

On 2019-03-16 23:21:31 +1100, Haribabu Kommi wrote:
> updated patches are attached.

Now that nearly all of the tableam patches are committed (with the
exception of the copy.c changes which are for bad reasons discussed at
[1]) I'm looking at the docs changes.

What made you rename indexam.sgml to am.sgml, instead of creating a
separate tableam.sgml?  Seems easier to just have a separate file?

I'm currently not planning to include the function-by-function API
reference you've in your patchset, as I think it's more reasonable to
centralize all of it in tableam.h. I think I've included most of the
information there - could you check whether you agree?

[1] 
https://postgr.es/m/CAKJS1f98Fa%2BQRTGKwqbtz0M%3DCy1EHYR8Q-W08cpA78tOy4euKQ%40mail.gmail.com

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-03-29 Thread Andres Freund
On 2019-03-29 18:38:46 +1100, Haribabu Kommi wrote:
> As I see that you are fixing some typos of the code that is committed,
> I just want to share some more corrections that I found in the patches
> that are committed till now.

Pushed both, thanks!




Re: Pluggable Storage - Andres's take

2019-03-29 Thread Haribabu Kommi
On Wed, Mar 27, 2019 at 11:17 AM Andres Freund  wrote:

> Hi,
>
> On 2019-02-22 14:52:08 -0500, Robert Haas wrote:
> > On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar 
> wrote:
> > > Thanks for the review. Attached v2.
> >
> > Thanks.  I took this, combined it with Andres's
> > v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
> > some polishing of the code and comments, and pgindented.  Here's what
> > I ended up with; see what you think.
>
> I pushed this after some fairly minor changes, directly including the
> patch to route the horizon computation through tableam. The only real
> change is that I removed the table relfilenode from the nbtree/hash
> deletion WAL record - it was only required to access the heap without
> accessing the catalog and was unused now.  Also added a WAL version
> bump.
>
> It seems possible that some other AM might want to generalize the
> prefetch logic from heapam.c, but I think it's fair to defer that until
> such an AM wants to do so
>

As I see that you are fixing some typos of the code that is committed,
I just want to share some more corrections that I found in the patches
that are committed till now.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-dA-to-show-Table-type-access-method.patch
Description: Binary data


0002-Typos-and-comemnt-corrections.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-03-26 Thread Andres Freund
Hi,

On 2019-02-22 14:52:08 -0500, Robert Haas wrote:
> On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar  
> wrote:
> > Thanks for the review. Attached v2.
> 
> Thanks.  I took this, combined it with Andres's
> v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
> some polishing of the code and comments, and pgindented.  Here's what
> I ended up with; see what you think.

I pushed this after some fairly minor changes, directly including the
patch to route the horizon computation through tableam. The only real
change is that I removed the table relfilenode from the nbtree/hash
deletion WAL record - it was only required to access the heap without
accessing the catalog and was unused now.  Also added a WAL version
bump.

It seems possible that some other AM might want to generalize the
prefetch logic from heapam.c, but I think it's fair to defer that until
such an AM wants to do so.

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-03-23 Thread Andres Freund
Hi,

On 2019-03-23 20:16:30 -0700, Andres Freund wrote:
> I'm pretty happy with that last version (of the first patch). I'm
> planning to do one more pass, and then push.

And done, after a bunch of mostly cosmetic changes (renaming
ExecCheckHeapTupleVisible to ExecCheckTupleVisible, removing an
unnecessary change in heap_lock_tuple parameters, a bunch of comments,
stuff like that).  Let's see what the buildfarm says.

The remaining commits luckily all are a good bit smaller.

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-03-23 Thread Andres Freund
Hi,

On 2019-03-21 11:15:57 -0700, Andres Freund wrote:
> Pending work:
> - Wondering if table_insert/delete/update should rather be
>   table_tuple_insert etc. Would be a bit more consistent with the
>   callback names, but a bigger departure from existing code.

I've left this as is.


> - I'm not yet happy with TableTupleDeleted computation in heapam.c, I
>   want to revise that further

I changed that. Found a bunch of untested paths, I've pushed tests for
those already.

> - formatting

Done that.


> - commit message

Done that.


> - a few comments need a bit of polishing (ExecCheckTIDVisible, 
> heapam_tuple_lock)

Done that.


> - Rename TableTupleMayBeModified to TableTupleOk, but also probably a 
> s/TableTuple/TableMod/

It's now TM_*.

/*
 * Result codes for table_{update,delete,lock}_tuple, and for visibility
 * routines inside table AMs.
 */
typedef enum TM_Result
{
/*
 * Signals that the action succeeded (i.e. update/delete performed, lock
 * was acquired)
 */
TM_Ok,

/* The affected tuple wasn't visible to the relevant snapshot */
TM_Invisible,

/* The affected tuple was already modified by the calling backend */
TM_SelfModified,

/*
 * The affected tuple was updated by another transaction. This includes
 * the case where tuple was moved to another partition.
 */
TM_Updated,

/* The affected tuple was deleted by another transaction */
TM_Deleted,

/*
 * The affected tuple is currently being modified by another session. This
 * will only be returned if (update/delete/lock)_tuple are instructed not
 * to wait.
 */
TM_BeingModified,

/* lock couldn't be acquired, action skipped. Only used by lock_tuple */
TM_WouldBlock
} TM_Result;


> - I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h

Done.


> - two more passes through the patch

One of them completed. Which is good, because there was a subtle bug in
heapam_tuple_lock (*tid was adjusted to be the followup tuple after the
heap_fetch(), before going to heap_lock_tuple - but that's wrong, it
should only be adjusted when heap_fetch()ing the next version).

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-03-21 Thread Haribabu Kommi
On Fri, Mar 22, 2019 at 5:16 AM Andres Freund  wrote:

> Hi,
>
> Attached is a version of just the first patch. I'm still updating it,
> but it's getting closer to commit:
>
> - There were no tests testing EPQ interactions with DELETE, and only an
>   accidental test for EPQ in UPDATE with a concurrent DELETE. I've added
>   tests. Plan to commit that ahead of the big change.
>
> - I was pretty unhappy about how the EPQ integration looked before, I've
>   changed that now.
>
>   I still wonder if we should restore EvalPlanQualFetch and move the
>   table_lock_tuple() calls in ExecDelete/Update into it. But it seems
>   like it'd not gain that much, because there's custom surrounding code,
>   and it's not that much code.
>
> - I changed heapam_tuple_lock to return *WouldBlock rather than just
>   the last result. I think that's one of the reasons Haribabu had
>   neutered a few asserts.
>
> - I moved comments from heapam.h to tableam.h where appropriate
>
> - I updated the name of HeapUpdateFailureData to TM_FailureData,
>   HTSU_Result to TM_Result, TM_Results members now properly distinguish
>   between update vs modifications (delete & update).
>
> - I separated the HEAP_INSERT_ flags into TABLE_INSERT_* and
>   HEAP_INSERT_ with the latter being a copy of TABLE_INSERT_ with the
>   sole addition of _SPECULATIVE. table_insert_speculative callers now
>   don't specify that anymore.
>
>
> Pending work:
> - Wondering if table_insert/delete/update should rather be
>   table_tuple_insert etc. Would be a bit more consistent with the
>   callback names, but a bigger departure from existing code.
>
> - I'm not yet happy with TableTupleDeleted computation in heapam.c, I
>   want to revise that further
>
> - formatting
>
> - commit message
>
> - a few comments need a bit of polishing (ExecCheckTIDVisible,
> heapam_tuple_lock)
>
> - Rename TableTupleMayBeModified to TableTupleOk, but also probably a
> s/TableTuple/TableMod/
>
> - I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h
>
> - two more passes through the patch
>

Thanks for the corrections.



> On 2019-03-21 15:07:04 +1100, Haribabu Kommi wrote:
> > As you are modifying the 0003 patch for modify API's, I went and reviewed the
> > existing patch and found couple corrections that are needed, in case if you
> > are not taken care of them already.
>
> Some I have...
>
>
>
> > + /* Update the tuple with table oid */
> > + slot->tts_tableOid = RelationGetRelid(relation);
> > + if (slot->tts_tableOid != InvalidOid)
> > + tuple->t_tableOid = slot->tts_tableOid;
> >
> > The setting of slot->tts_tableOid is not required in this function,
> > After set the check is happening, the above code is present in both
> > heapam_heap_insert and heapam_heap_insert_speculative.
>
> I'm not following? Those functions are independent?
>

In those functions, slot->tts_tableOid is set, and the next statement then
checks whether it is invalid or not. Callers of table_insert should have
already set it, so is setting the value and checking it on the next line
really required? The value cannot be InvalidOid.


>
> > + slot->tts_tableOid = RelationGetRelid(relation);
> >
> > In heapam_heap_update, i don't think there is a need to update
> > slot->tts_tableOid.
>
> Why?
>

The slot->tts_tableOid should have been updated before the call to
heap_update; is setting it again after the heap_update required?

I also observed slot->tts_tableOid being set after the table_insert_XXX
calls in the ExecInsert function.

Is this to make sure that the AM hasn't modified that value?


> > + default:
> > + elog(ERROR, "unrecognized heap_update status: %u", result);
> >
> > heap_update --> table_update?
> >
> >
> > + default:
> > + elog(ERROR, "unrecognized heap_delete status: %u", result);
> >
> > same as above?
>
> I've fixed that in a number of places.
>
>
> > + /*hari FIXME*/
> > + /*Assert(result != HeapTupleUpdated && hufd.traversed);*/
> >
> > The commented-out code in both ExecDelete and ExecUpdate functions should be removed.
>
> I don't think that's the right fix. I've refactored that code
> significantly now, and restored the assert in a, imo, correct version.
>
>
OK.


> > + /**/
> > + if (epqreturnslot)
> > + {
> > + *epqreturnslot = epqslot;
> > + return NULL;
> > + }
> >
> > comment update missed?
>
> Well, you'd deleted a comment around there ;). I've added something back
> now...
>

This is not the only problem I could have introduced; all the comments
listed were introduced by me ;).

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Pluggable Storage - Andres's take

2019-03-20 Thread Haribabu Kommi
Hi,

The psql \dA command currently doesn't show the type for access methods
of type 'Table'.

postgres=# \dA heap
List of access methods
 Name | Type
--+---
 heap |
(1 row)

Attached a simple patch that fixes the problem and outputs as follows.

postgres=# \dA heap
List of access methods
 Name | Type
--+---
 heap | Table
(1 row)

The attached patch directly modifies the query that is sent to the server.
Servers < 12 don't support access methods of type 'Table', but the same
query still works there because of the added CASE expression:

SELECT amname AS "Name",
  CASE amtype WHEN 'i' THEN 'Index' WHEN 't' THEN 'Table' END AS
"Type"
FROM pg_catalog.pg_am ...

Does anyone feel that a separate query is required for servers < 12?

Regards,
Haribabu Kommi
Fujitsu Australia


0001-dA-to-show-Table-type-access-method.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-03-20 Thread Haribabu Kommi
On Sat, Mar 16, 2019 at 5:43 PM Haribabu Kommi 
wrote:

>
>
> On Sat, Mar 9, 2019 at 2:13 PM Andres Freund  wrote:
>
>> Hi,
>>
>> While 0001 is pretty bulky, the interesting bits concentrate on a
>> comparatively small area. I'd appreciate if somebody could give the
>> comments added in tableam.h a read (both on callbacks, and their
>> wrappers, as they have different audiences). It'd make sense to first
>> read the commit message, to understand the goal (and I'd obviously also
>> appreciate suggestions for improvements there as well).
>>
>> I'm pretty happy with the current state of the scan patch. I plan to do
>> two more passes through it (formatting, comment polishing, etc. I don't
>> know of any functional changes needed), and then commit it, lest
>> somebody objects.
>>
>
> I found a couple of typos in the committed patch; the attached patch fixes
> them. I am not sure about one typo, please check it.
>
> And I reviewed the 0002 patch, which is pretty simple and can be
> committed.
>

As you are modifying the 0003 patch for the modify APIs, I went and
reviewed the existing patch and found a couple of corrections that are
needed, in case you have not taken care of them already.


+ /* Update the tuple with table oid */
+ slot->tts_tableOid = RelationGetRelid(relation);
+ if (slot->tts_tableOid != InvalidOid)
+ tuple->t_tableOid = slot->tts_tableOid;

Setting slot->tts_tableOid is not required in this function; right
after it is set, it is checked for validity. The above code is present
in both heapam_heap_insert and heapam_heap_insert_speculative.


+ slot->tts_tableOid = RelationGetRelid(relation);

In heapam_heap_update, I don't think there is a need to update
slot->tts_tableOid.


+ default:
+ elog(ERROR, "unrecognized heap_update status: %u", result);

heap_update --> table_update?


+ default:
+ elog(ERROR, "unrecognized heap_delete status: %u", result);

same as above?


+ /*hari FIXME*/
+ /*Assert(result != HeapTupleUpdated && hufd.traversed);*/

The commented-out code in both ExecDelete and ExecUpdate functions should be removed.


+ /**/
+ if (epqreturnslot)
+ {
+ *epqreturnslot = epqslot;
+ return NULL;
+ }

comment update missed?


Regards,
Haribabu Kommi
Fujitsu Australia


Re: Pluggable Storage - Andres's take

2019-03-16 Thread Haribabu Kommi
On Sat, Mar 9, 2019 at 2:13 PM Andres Freund  wrote:

> Hi,
>
> While 0001 is pretty bulky, the interesting bits concentrate on a
> comparatively small area. I'd appreciate if somebody could give the
> comments added in tableam.h a read (both on callbacks, and their
> wrappers, as they have different audiences). It'd make sense to first
> read the commit message, to understand the goal (and I'd obviously also
> appreciate suggestions for improvements there as well).
>
> I'm pretty happy with the current state of the scan patch. I plan to do
> two more passes through it (formatting, comment polishing, etc. I don't
> know of any functional changes needed), and then commit it, lest
> somebody objects.
>

I found a couple of typos in the committed patch; the attached patch fixes
them. I am not sure about one typo, please check it.

And I reviewed the 0002 patch, which is pretty simple and can be
committed.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-table-access-methods-typos-correction.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-03-12 Thread Kyotaro HORIGUCHI
Hello.

I had a look at the patch set. I cannot see the thread structure
due to its depth and don't have the full picture of all the patches,
but I have some comments. I apologize in advance for possible
duplicates of upthread comments.


0001-Reduce-the...

This doesn't apply to master.

>  TupleTableSlot *
>  ExecStoreHeapTuple(HeapTuple tuple,
>  TupleTableSlot *slot,
> +Oid relid,
>  bool shouldFree)

  The comment for ExecStoreHeapTuple is missing the description
  of the "relid" parameter.


> -if (HeapTupleSatisfiesVisibility(tuple, , 
> hscan->rs_cbuf))
> +if (HeapTupleSatisfiesVisibility(tuple, 
> RelationGetRelid(hscan->rs_scan.rs_rd),
> +, hscan->rs_cbuf))

The second parameter always seems to be
RelationGetRelid(...). Actually only the relid is required, but isn't
it better to take a Relation instead of an Oid as the second
parameter?



0005-Reorganize-am-as...

> +   catalog. The contents of an table are entirely under the control of its

 "of an table" => "of a table"

0006-Doc-update-of-Create-access..

> +  be index_am_handler and for TABLE
> +  access methods, it must be table_am_handler.
> +  The C-level API that the handler function must implement varies
> +  depending on the type of access method. The index access method API
> +  is described in  and the table 
> access method
> +  API is described in .

If the table is the primary object, shouldn't table-am precede index-am?


0007-Doc-update-of-create-materi..

> +   This clause specifies optional access method for the new materialize view;

 "materialize view" => "materialized view"?

> +   If this option is not specified, then the default table access method

 I'm not sure the 'then' is needed.

> +   is chosen for the new materialized view. see  linkend="guc-default-table-access-method"/>

 "see" => "See"


0008-Doc-update-of-CREATE_TABLE..

> +[ USING method ]

 *I* prefer access_method to just method.

> +  If this option is not specified, then the default table access method

 Same as 0007: "I'm not sure the 'then' is needed."

> +  is chosen for the new table. see  linkend="guc-default-table-access-method"/>

 Same as 0007: "see" => "See".

0009-Doc-of-CREATE-TABLE-AS

 Same as 0008.


0010-Table-access-method-API-

> +Any new TABLE ACCSESS METHOD developers can refer the 
> exisitng HEAP


> +There are differnt type of API's that are defined and those details are 
> below.

  "differnt" => "different"

> + by the AM, in case if it support parallel scan.

  "support" => "supports"

> + This API to return the total size that is required for the AM to perform

  Total size of what? Shared memory chunk? Or parallel scan descriptor?


> + the parallel table scan. The minimum size that is required is 
> + ParallelBlockTableScanDescData.

  I don't get what "The minimum size" means. Just reading
  this, I would always return the minimum size...


> + This API to perform the initialization of the 
> parallel_scan
> + that is required for the parallel scan to be performed by the AM and 
> also return
> + the total size that is required for the AM to perform the parallel 
> table scan.

 (Note: I'm not good at English..) Similar to the above. I cannot
 tell what the "size" is for.

 In the code it is used as:

  > Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);

  (The variable name should be snapshot_offset.) It is the offset
  from the beginning of the parallel scan descriptor, but it should be
  described in some other way, which I'm not sure of..

  Something like this?

  > This API to initialize a parallel scan by the AM and also
  > return the consumed size so far of parallel scan descriptor.
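For illustration, here is a sketch of how that return value gets used,
pieced together from the quoted patch code (the table_parallelscan_estimate
call and the SerializeSnapshot placement are assumptions, not quotes from
the patch): the AM's parallel scan state sits at the start of the shared
chunk, and the caller stores the serialized snapshot right after it.

    ParallelTableScanDesc pscan;
    Size        snapshot_off;

    pscan = (ParallelTableScanDesc)
        shm_toc_allocate(pcxt->toc,
                         table_parallelscan_estimate(rel, snapshot));

    /* fills in the AM-specific state, reports how many bytes it used */
    snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);

    /* the snapshot therefore lives immediately after the AM's data */
    SerializeSnapshot(snapshot, (char *) pscan + snapshot_off);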

(Sorry for not finishing. Time's up today.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Pluggable Storage - Andres's take

2019-03-11 Thread Andres Freund
On 2019-03-11 13:31:26 -0700, Andres Freund wrote:
> On 2019-03-11 12:37:46 -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> > > Changes:
> > > - I've added comments to all the callbacks in the first commit / the
> > >   scan commit
> > > - I've renamed table_gimmegimmeslot to table_slot_create
> > > - I've made the callbacks and their wrappers more consistently named
> > > - I've added asserts for necessary callbacks in scan commit
> > > - Lots of smaller cleanup
> > > - Added a commit message
> > > 
> > > While 0001 is pretty bulky, the interesting bits concentrate on a
> > > comparatively small area. I'd appreciate if somebody could give the
> > > comments added in tableam.h a read (both on callbacks, and their
> > > wrappers, as they have different audiences). It'd make sense to first
> > > read the commit message, to understand the goal (and I'd obviously also
> > > appreciate suggestions for improvements there as well).
> > > 
> > > I'm pretty happy with the current state of the scan patch. I plan to do
> > > two more passes through it (formatting, comment polishing, etc. I don't
> > > know of any functional changes needed), and then commit it, lest
> > > somebody objects.
> > 
> > Here's a further polished version. Pretty boring changes:
> > - newlines
> > - put tableam.h into the correct place
> > - a few comment improvements, including typos
> > - changed reorderqueue_push() to accept the slot. That avoids an
> >   unnecessary heap_copytuple() in some cases
> > 
> > No meaningful changes in later patches.
> 
> I pushed this.  There's a failure on 32bit machines, unfortunately. The
> problem comes from the ParallelTableScanDescData embedded in BTShared -
> after the change the compiler can't see that that actually needs more
> alignment, because ParallelTableScanDescData doesn't have any 8byte
> members. That's a problem for just about all such "struct inheritance"
> type tricks postgres, but we normally just allocate them separately,
> guaranteeing maxalign. Given that we here already allocate enough space
> after the BTShared struct, it's probably easiest to just not embed the
> struct anymore.

I've pushed an attempt to fix this, which locally fixes 32bit
builds. It's copying the alignment logic for shm_toc_allocate, namely
using BUFFERALIGN for alignment.  We should probably invent a more
appropriate define for this at some point...
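To illustrate the underlying issue, a minimal sketch with made-up struct
names (not the actual definitions):

    /* "Base class" with only 4-byte members: when embedded, the compiler
     * only guarantees 4-byte alignment for it on 32-bit builds. */
    typedef struct ScanBase
    {
        uint32      relid;
        uint32      flags;
    } ScanBase;

    /* AM-specific "derived" descriptor that does contain an 8-byte member
     * and is accessed by casting a ScanBase pointer. */
    typedef struct HeapScanShared
    {
        ScanBase    base;       /* must be first so the cast works */
        uint64      nallocated; /* needs 8-byte alignment */
    } HeapScanShared;

    /* Embedding the base struct loses the derived struct's requirement:
     * offsetof(SharedLike, ps) may be merely 4-byte aligned, so casting
     * &shared->ps to HeapScanShared * misaligns nallocated. Allocating
     * the descriptor separately, or BUFFERALIGN-ing its offset as in the
     * fix, restores the guarantee. */
    typedef struct SharedLike
    {
        uint32      nparticipants;
        ScanBase    ps;
    } SharedLike;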

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-03-11 Thread Andres Freund
On 2019-03-11 12:37:46 -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> > Changes:
> > - I've added comments to all the callbacks in the first commit / the
> >   scan commit
> > - I've renamed table_gimmegimmeslot to table_slot_create
> > - I've made the callbacks and their wrappers more consistently named
> > - I've added asserts for necessary callbacks in scan commit
> > - Lots of smaller cleanup
> > - Added a commit message
> > 
> > While 0001 is pretty bulky, the interesting bits concentrate on a
> > comparatively small area. I'd appreciate if somebody could give the
> > comments added in tableam.h a read (both on callbacks, and their
> > wrappers, as they have different audiences). It'd make sense to first
> > read the commit message, to understand the goal (and I'd obviously also
> > appreciate suggestions for improvements there as well).
> > 
> > I'm pretty happy with the current state of the scan patch. I plan to do
> > two more passes through it (formatting, comment polishing, etc. I don't
> > know of any functional changes needed), and then commit it, lest
> > somebody objects.
> 
> Here's a further polished version. Pretty boring changes:
> - newlines
> - put tableam.h into the correct place
> - a few comment improvements, including typos
> - changed reorderqueue_push() to accept the slot. That avoids an
>   unnecessary heap_copytuple() in some cases
> 
> No meaningful changes in later patches.

I pushed this.  There's a failure on 32bit machines, unfortunately. The
problem comes from the ParallelTableScanDescData embedded in BTShared -
after the change the compiler can't see that that actually needs more
alignment, because ParallelTableScanDescData doesn't have any 8byte
members. That's a problem for just about all such "struct inheritance"
type tricks postgres, but we normally just allocate them separately,
guaranteeing maxalign. Given that we here already allocate enough space
after the BTShared struct, it's probably easiest to just not embed the
struct anymore.

- Andres



Re: Pluggable Storage - Andres's take

2019-03-10 Thread Haribabu Kommi
On Sat, Mar 9, 2019 at 2:18 PM Andres Freund  wrote:

> Hi,
>
> On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote:
> > Here I attached the rebased patches that I shared earlier. I am adding
> the
> > comments to explain the API's in the code, will share those patches
> later.
>
> I've started to add those for the callbacks in the first commit. I'd
> appreciate a look!
>

Thanks for the updated patches.

+ /*
+  * Callbacks for hon-modifying operations on individual tuples
+  * ------------------------------------------------------------------

Typo in tableam.h file: hon -> non.



> I think I'll include the docs patches, sans the callback documentation,
> in the next version. I'll probably merge them into one commit, if that's
> OK with you?
>

OK.
For ease of review, I will still maintain 3 or 4 patches instead of the
current patch series.


> > I observed a crash with the latest patch series in the COPY command.
>
> Hm, which version was this? I'd at some point accidentally posted a
> 'tmp' commit that was just a performance hack
>

Yes, the version I checked did have that commit.
Maybe that is the reason for the failure.

> Btw, your patches always are attached out of order:
>
> https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com
> 10, 1, 3, 4, 2 ...
>

Sorry about that.
I always wondered why they were ordered that way when I attached the patch
files to the mail.
I thought it might be Gmail behavior, but by experimenting I found that,
when attaching multiple patches, the last selected patch is given
preference and is listed as the first attachment.

I will take care that this problem doesn't happen again.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Pluggable Storage - Andres's take

2019-03-09 Thread Andres Freund
Hi,

On 2019-03-10 05:49:26 +0100, Dmitry Dolgov wrote:
> > On Sat, Mar 9, 2019 at 4:13 AM Andres Freund  wrote:
> >
> > While 0001 is pretty bulky, the interesting bits concentrate on a
> > comparatively small area. I'd appreciate if somebody could give the
> > comments added in tableam.h a read (both on callbacks, and their
> > wrappers, as they have different audiences).
> 
> Potentially stupid question, but I'm curious about this one (couldn't find any
> discussion about it):

Not stupid...


> +/*
> + * Generic descriptor for table scans. This is the base-class for
> table scans,
> + * which needs to be embedded in the scans of individual AMs.
> + */
> +typedef struct TableScanDescData
> // ...
> bool rs_pageatatime; /* verify visibility page-at-a-time? */
> bool rs_allow_strat; /* allow or disallow use of access strategy */
> bool rs_allow_sync; /* allow or disallow use of syncscan */
> 
> + * allow_{strat, sync, pagemode} specify whether a scan strategy,
> + * synchronized scans, or page mode may be used (although not every AM
> + * will support those).
> // ...
> + TableScanDesc (*scan_begin) (Relation rel,
> 
> The last comment makes me think that those flags (allow_strat / allow_sync /
> pageatatime) are more AM-specific; shouldn't they live in HeapScanDescData?

They're common enough across AMs, but more importantly calling code
currently specifies them in several places. As they're thus essentially
generic, rather than AM specific, I think it makes sense to have them in
the general scan struct.
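A sketch of the embedding in question (field lists trimmed; rs_scan is
the member name used in the quoted patch code):

    /* generic descriptor: relation, snapshot and the generic flags */
    typedef struct TableScanDescData
    {
        Relation    rs_rd;
        Snapshot    rs_snapshot;
        bool        rs_allow_strat;
        bool        rs_allow_sync;
        bool        rs_pageatatime;
    } TableScanDescData;

    /* heap embeds the generic part first, so a HeapScanDesc can be
     * passed around as a TableScanDesc; heap honors the flags, while
     * other AMs may ignore the ones they don't support */
    typedef struct HeapScanDescData
    {
        TableScanDescData rs_scan;  /* must be first */
        BlockNumber rs_nblocks;     /* heap-specific state follows */
        Buffer      rs_cbuf;
    } HeapScanDescData;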

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-03-09 Thread Dmitry Dolgov
> On Sat, Mar 9, 2019 at 4:13 AM Andres Freund  wrote:
>
> While 0001 is pretty bulky, the interesting bits concentrate on a
> comparatively small area. I'd appreciate if somebody could give the
> comments added in tableam.h a read (both on callbacks, and their
> wrappers, as they have different audiences).

Potentially stupid question, but I'm curious about this one (couldn't find any
discussion about it):

+/*
+ * Generic descriptor for table scans. This is the base-class for
table scans,
+ * which needs to be embedded in the scans of individual AMs.
+ */
+typedef struct TableScanDescData
// ...
bool rs_pageatatime; /* verify visibility page-at-a-time? */
bool rs_allow_strat; /* allow or disallow use of access strategy */
bool rs_allow_sync; /* allow or disallow use of syncscan */

+ * allow_{strat, sync, pagemode} specify whether a scan strategy,
+ * synchronized scans, or page mode may be used (although not every AM
+ * will support those).
// ...
+ TableScanDesc (*scan_begin) (Relation rel,

The last comment makes me think that those flags (allow_strat / allow_sync /
pageatatime) are more AM-specific; shouldn't they live in HeapScanDescData?



Re: Pluggable Storage - Andres's take

2019-03-08 Thread Andres Freund
Hi,

On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote:
> Here I attached the rebased patches that I shared earlier. I am adding the
> comments to explain the API's in the code, will share those patches later.

I've started to add those for the callbacks in the first commit. I'd
appreciate a look!

I think I'll include the docs patches, sans the callback documentation,
in the next version. I'll probably merge them into one commit, if that's
OK with you?


> I observed a crash with the latest patch series in the COPY command.

Hm, which version was this? I'd at some point accidentally posted a
'tmp' commit that was just a performance hack.


Btw, your patches always are attached out of order:
https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com
10, 1, 3, 4, 2 ...

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-03-08 Thread Haribabu Kommi
On Thu, Mar 7, 2019 at 6:33 AM Andres Freund  wrote:

> Hi,
>
> On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> > My next steps are:
> > - final polish & push the basic DDL and pg_dump patches
>
> Done and pushed. Some collation dependent fallout, I'm hoping I've just
> fixed that.
>

Thanks for the corrections that I missed, also for the extra changes.

Here I attached the rebased patches that I shared earlier. I am adding the
comments to explain the API's in the code, will share those patches later.

I observed a crash with the latest patch series in the COPY command.
I am not sure whether the problem is with the reduce-the-use-of-tableOid
patch; I will check and correct it.

Regards,
Haribabu Kommi
Fujitsu Australia


0010-Table-access-method-API-explanation.patch
Description: Binary data


0001-Reduce-the-use-of-HeapTuple-t_tableOid.patch
Description: Binary data


0003-Docs-of-default_table_access_method-GUC.patch
Description: Binary data


0004-Rename-indexam.sgml-to-am.sgml.patch
Description: Binary data


0002-Removal-of-scan_update_snapshot-callback.patch
Description: Binary data


0005-Reorganize-am-as-both-table-and-index.patch
Description: Binary data


0006-Doc-update-of-Create-access-method-type-table.patch
Description: Binary data


0009-Doc-of-CREATE-TABLE-AS-.-USING-syntax.patch
Description: Binary data


0007-Doc-update-of-create-materialized-view-.-USING-synta.patch
Description: Binary data


0008-Doc-update-of-CREATE-TABLE-.-USING-syntax.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-03-08 Thread Dagfinn Ilmari Mannsåker
Andres Freund  writes:

> On 2019-03-07 08:52:21 -0500, Robert Haas wrote:
>> On Wed, Mar 6, 2019 at 6:11 PM Andres Freund  wrote:
>> > slot that's compatible with the "target" table. You can get compatible
>> > slot callbakcs by calling table_slot_callbacks(), or directly create one
>> > by calling table_gimmegimmeslot() (likely to be renamed :)).
>>
>> Hmm.  I assume the issue is that table_createslot() was already taken
>> for another purpose, so then when you needed another callback you went
>> with table_givemeslot(), and then when you needed a third API to do
>> something in the same area the best thing available was
>> table_gimmeslot(), which meant that the fourth API could only be
>> table_gimmegimmeslot().
>>
>> Does that sound about right?
>
> It was 3 AM, and I thought it was hilarious...

♫ Gimme! Gimme! Gimme! A slot after midnight ♫

- ilmari (SCNR) 
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen



Re: Pluggable Storage - Andres's take

2019-03-07 Thread Robert Haas
On Thu, Mar 7, 2019 at 12:49 PM Andres Freund  wrote:
> On 2019-03-07 08:52:21 -0500, Robert Haas wrote:
> > On Wed, Mar 6, 2019 at 6:11 PM Andres Freund  wrote:
> > > slot that's compatible with the "target" table. You can get compatible
> > slot callbacks by calling table_slot_callbacks(), or directly create one
> > > by calling table_gimmegimmeslot() (likely to be renamed :)).
> >
> > Hmm.  I assume the issue is that table_createslot() was already taken
> > for another purpose, so then when you needed another callback you went
> > with table_givemeslot(), and then when you needed a third API to do
> > something in the same area the best thing available was
> > table_gimmeslot(), which meant that the fourth API could only be
> > table_gimmegimmeslot().
> >
> > Does that sound about right?
>
> It was 3 AM, and I thought it was hilarious...

It is.  Just like me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable Storage - Andres's take

2019-03-07 Thread Andres Freund
On 2019-03-07 08:52:21 -0500, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 6:11 PM Andres Freund  wrote:
> > slot that's compatible with the "target" table. You can get compatible
> > slot callbacks by calling table_slot_callbacks(), or directly create one
> > by calling table_gimmegimmeslot() (likely to be renamed :)).
>
> Hmm.  I assume the issue is that table_createslot() was already taken
> for another purpose, so then when you needed another callback you went
> with table_givemeslot(), and then when you needed a third API to do
> something in the same area the best thing available was
> table_gimmeslot(), which meant that the fourth API could only be
> table_gimmegimmeslot().
>
> Does that sound about right?

It was 3 AM, and I thought it was hilarious...



Re: Pluggable Storage - Andres's take

2019-03-07 Thread Robert Haas
On Wed, Mar 6, 2019 at 6:11 PM Andres Freund  wrote:
> slot that's compatible with the "target" table. You can get compatible
> slot callbacks by calling table_slot_callbacks(), or directly create one
> by calling table_gimmegimmeslot() (likely to be renamed :)).

Hmm.  I assume the issue is that table_createslot() was already taken
for another purpose, so then when you needed another callback you went
with table_givemeslot(), and then when you needed a third API to do
something in the same area the best thing available was
table_gimmeslot(), which meant that the fourth API could only be
table_gimmegimmeslot().

Does that sound about right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable Storage - Andres's take

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-07 11:56:57 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 08:33, Andres Freund  wrote:
> > Here's a cleaned up version of that patch.  David, Alvaro, you also
> > played in that area, any objections? I think this makes that part of the
> > code easier to read actually. Robert, thanks for looking at that patch
> > already.
> 
> I only had a quick look and don't have a grasp of what the patch
> series is doing to tuple slots, but I didn't see anything I found
> alarming during the read.

Thanks for looking.

Re slots - the deal basically is that going forward, low-level
operations, like fetching a row from a table, have to be done through a
slot that's compatible with the "target" table. You can get compatible
slot callbacks by calling table_slot_callbacks(), or directly create one
by calling table_gimmegimmeslot() (likely to be renamed :)).

The problem here was that the partition root's slot was used to fetch /
store rows from a child partition. By moving mt_existing into
ResultRelInfo that's not the case anymore.
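In code form, the idea is roughly this (a sketch using the names the API
eventually settled on, with table_slot_create standing in for the
to-be-renamed function):

    TupleTableSlot *slot;

    /* create a slot whose callbacks match the child partition's AM,
     * instead of reusing the root's slot and swapping its descriptor */
    slot = table_slot_create(partrel, &estate->es_tupleTable);

    /* low-level row fetches then go through this AM-compatible slot */
    if (table_tuple_fetch_row_version(partrel, tid, snapshot, slot))
        econtext->ecxt_scantuple = slot;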

- Andres



Re: Pluggable Storage - Andres's take

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 08:33, Andres Freund  wrote:
> Here's a cleaned up version of that patch.  David, Alvaro, you also
> played in that area, any objections? I think this makes that part of the
> code easier to read actually. Robert, thanks for looking at that patch
> already.

I only had a quick look and don't have a grasp of what the patch
series is doing to tuple slots, but I didn't see anything I found
alarming during the read.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Pluggable Storage - Andres's take

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches

Done and pushed. Some collation dependent fallout, I'm hoping I've just
fixed that.

> - cleanup & polish the ON CONFLICT refactoring

Here's a cleaned up version of that patch.  David, Alvaro, you also
played in that area, any objections? I think this makes that part of the
code easier to read actually. Robert, thanks for looking at that patch
already.

Greetings,

Andres Freund
>From e4caa7de3006370f52b4dafe204d45f9d99fa5a4 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 6 Mar 2019 11:30:33 -0800
Subject: [PATCH v16] Don't reuse slots between root and partition in ON
 CONFLICT ... UPDATE.

Until now the slot to store the conflicting tuple, and the result
of the ON CONFLICT SET, were reused between partitions. That
necessitated changing a slot's descriptor when switching partitions.

Besides the overhead of switching descriptors on a slot (which
requires memory allocations and prevents JITing), that's importantly
also problematic for tableam. There individual partitions might belong
to different tableams, needing different kinds of slots.

In passing also fix ExecOnConflictUpdate to clear the existing slot at
exit. Otherwise that slot could continue to hold a pin till the query
ends, which could be far too long if the input data set is large, and
there's no further conflicts. While previously also problematic, it's
now more important as there will be more such slots when partitioned.

Author: Andres Freund
Reviewed-By: Robert Haas
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n...@alap3.anarazel.de
---
 src/backend/executor/execPartition.c   | 52 +--
 src/backend/executor/nodeModifyTable.c | 70 +-
 src/include/nodes/execnodes.h  |  5 +-
 3 files changed, 64 insertions(+), 63 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e41801662b3..4491ee69912 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -723,28 +723,55 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
 			TupleConversionMap *map;
+			TupleDesc	leaf_desc;
 
 			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+			leaf_desc = RelationGetDescr(leaf_part_rri->ri_RelationDesc);
 
 			Assert(node->onConflictSet != NIL);
 			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
+			leaf_part_rri->ri_onConflict = makeNode(OnConflictSetState);
+
+			/*
+			 * Need a separate existing slot for each partition, as the
+			 * partition could be of a different AM, even if the tuple
+			 * descriptors match.
+			 */
+			leaf_part_rri->ri_onConflict->oc_Existing =
+ExecInitExtraTupleSlot(mtstate->ps.state,
+	   leaf_desc,
+	   );
+
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
-			 * parent (the common case), we can simply re-use the parent's ON
+			 * parent (the common case), we can re-use most of the parent's ON
 			 * CONFLICT SET state, skipping a bunch of work.  Otherwise, we
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
+			{
+/*
+ * It's safe to reuse these from the partition root, as we
+ * only process one tuple at a time (therefore we won't
+ * overwrite needed data in slots), and the results of
+ * projections are independent of the underlying
+ * storage. Projections and where clauses themselves don't
+ * store state / are independent of the underlying storage.
+ */
+leaf_part_rri->ri_onConflict->oc_ProjSlot =
+	rootResultRelInfo->ri_onConflict->oc_ProjSlot;
+leaf_part_rri->ri_onConflict->oc_ProjInfo =
+	rootResultRelInfo->ri_onConflict->oc_ProjInfo;
+leaf_part_rri->ri_onConflict->oc_WhereClause =
+	rootResultRelInfo->ri_onConflict->oc_WhereClause;
+			}
 			else
 			{
 List	   *onconflset;
 TupleDesc	tupDesc;
 bool		found_whole_row;
 
-leaf_part_rri->ri_onConflict = makeNode(OnConflictSetState);
-
 /*
  * Translate expressions in onConflictSet to account for
  * different attribute numbers.  For that, map partition
@@ -778,20 +805,17 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 /* Finally, adjust this tlist to match the partition. */
 onconflset = adjust_partition_tlist(onconflset, map);
 
-/*
- * Build UPDATE SET's projection info.  The user of this
- * projection is responsible for setting the slot's tupdesc!
- * We set aside a tupdesc that's good for the common case of a
- * partition that's tupdesc-equal to the partitioned table;
- * partitions of different tupdescs must generate their own.
- */
+/* create the tuple slot for the UPDATE SET projection 

Re: Pluggable Storage - Andres's take

2019-03-05 Thread Andres Freund
Hi,

Thanks for looking!

On 2019-03-05 18:27:45 -0800, Ashwin Agrawal wrote:
> While playing with the tableam, usage of which starts with commit
> v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch, should we
> check for a NULL function pointer before calling it, and ERROR out as
> NOT_SUPPORTED or something along those lines?

Scans seem like an absolutely required part of the functionality, so I
don't think there's much point in that. It'd just bloat the code and
runtime.


> I understand it's the kind of thing that should get caught during development.
> But currently it still segfaults if some AM function is left undefined,

The segfault itself doesn't bother me at all, it's just a NULL pointer
dereference. If we were to put Asserts somewhere it'd crash very
similarly. I think you have a point in that:

> it might be easier for iterative development to error out in a common place instead.

It would make it a tiny bit easier to implement a new AM.  We could
probably add a few asserts to GetTableAmRoutine(), to check that
required functions are implemented.  I don't think that'd make a meaningful
difference for something like the scan functions, but it'd probably make
it easier to forward-port AMs to the next release - I'm pretty sure
we're going to add required callbacks in the next few releases.
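Something along these lines, perhaps (a sketch, not the actual committed
function):

    static const TableAmRoutine *
    GetTableAmRoutine(Oid amhandler)
    {
        const TableAmRoutine *routine;

        routine = (const TableAmRoutine *)
            DatumGetPointer(OidFunctionCall0(amhandler));

        if (routine == NULL || !IsA(routine, TableAmRoutine))
            elog(ERROR, "handler %u did not return a TableAmRoutine struct",
                 amhandler);

        /* scans are mandatory: fail here rather than at the call site */
        Assert(routine->scan_begin != NULL);
        Assert(routine->scan_end != NULL);
        Assert(routine->scan_getnextslot != NULL);

        return routine;
    }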

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-03-05 Thread Ashwin Agrawal
Hi,

While playing with the tableam, usage of which starts with commit
v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch, should we
check for a NULL function pointer before calling it, and ERROR out as
NOT_SUPPORTED or something along those lines?

I understand it's the kind of thing that should get caught during development.
But currently it still segfaults if some AM function is left undefined;
it might be easier for iterative development to error out in a common place instead.

Or should there be an upfront check for NULL somewhere, if all the AM
functions are mandatory and must not be NULL?


Re: Pluggable Storage - Andres's take

2019-02-27 Thread Andres Freund
Hi,

On 2019-02-27 18:00:12 +0800, Heikki Linnakangas wrote:
> I haven't been following this thread closely, but I looked briefly at some
> of the patches posted here:

Thanks!


> On 21/01/2019 11:01, Andres Freund wrote:
> > The patchset is now pretty granularly split into individual pieces.
> 
> Wow, 42 patches, very granular indeed! That's nice for reviewing, but are
> you planning to squash them before committing? Seems a bit excessive for the
> git history.

I've pushed a number of the preliminary patches since you replied. We're
down to 23 in my local count...

I do plan / did squash some, but not actually that many. I find that
patches after a certain size are just too hard to do the necessary final
polish to, especially if they do several independent things. Keeping
things granular also allows to push incrementally, even when later
patches aren't quite ready - imo pretty important for a project this
size.


> Patches 1-4:
> 
> * v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch
> * v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch
> * v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch
> * v12-0004-Remove-superfluous-tqual.h-includes.patch
> 
> These look good to me. I think it would make sense to squash these together,
> and commit now.

I've pushed these a while ago.


> Patches 20 and 21:
> * v12-0020-WIP-Slotified-triggers.patch
> * v12-0021-WIP-Slotify-EPQ.patch
> 
> I like this slotification of trigger and EPQ code. It seems like a nice
> thing to do, independently of the other patches. You said you wanted to
> polish that up to committable state, and commit separately: +1 on
> that.

I pushed the trigger patch yesterday evening. Working to finalize the
EPQ patch now - I've polished it a fair bit since the version posted on
the list, but it still needs a bit more.

Once the EPQ patch (and two other simple preliminary ones) is pushed, I
plan to post a new rebased version to this thread. That's then really
only the core table AM work.


> > --- a/src/include/commands/trigger.h
> > +++ b/src/include/commands/trigger.h
> > @@ -35,8 +35,8 @@ typedef struct TriggerData
> > HeapTuple   tg_trigtuple;
> > HeapTuple   tg_newtuple;
> > Trigger*tg_trigger;
> > -   Buffer  tg_trigtuplebuf;
> > -   Buffer  tg_newtuplebuf;
> > +   TupleTableSlot *tg_trigslot;
> > +   TupleTableSlot *tg_newslot;
> > Tuplestorestate *tg_oldtable;
> > Tuplestorestate *tg_newtable;
> >  } TriggerData;
> 
> Do we still need tg_trigtuple and tg_newtuple? Can't we always use the
> corresponding slots instead? Is it for backwards-compatibility with
> user-defined triggers written in C?

Yes, the external trigger interface currently relies on those being
there. I think we probably ought to revise that, at the very least
because it'll otherwise be noticably less efficient to have triggers on
!heap tables, but also because it's just cleaner.  But I feel like I
don't want more significantly sized patches on my plate right now, so my
current goal is to just put that on the todo after the pluggable storage
work.  Kinda wonder if we don't want to do that earlier in a release
cycle too, so we can do other breaking changes to the trigger interface
without breaking external code multiple times.  There's probably also an
argument for just not breaking the interface.
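For illustration, the kind of change a C trigger would see if it moved
from the tuple fields to the slots (a hypothetical trigger body, not a
proposal for the final interface; the usual CALLED_AS_TRIGGER checks are
omitted):

    Datum
    my_trigger(PG_FUNCTION_ARGS)
    {
        TriggerData *trigdata = (TriggerData *) fcinfo->context;
        bool        isnull;
        Datum       val;

        /* today: assumes a materialized HeapTuple is available */
        val = heap_getattr(trigdata->tg_trigtuple, 1,
                           trigdata->tg_relation->rd_att, &isnull);

        /* slotified: AM-agnostic, no HeapTuple required */
        val = slot_getattr(trigdata->tg_trigslot, 1, &isnull);

        PG_RETURN_POINTER(trigdata->tg_trigtuple);
    }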


> (Documentation also needs to be updated for the changes in this
> struct)

Ah, nice catch, will do that next.

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-02-27 Thread Heikki Linnakangas
I haven't been following this thread closely, but I looked briefly at 
some of the patches posted here:


On 21/01/2019 11:01, Andres Freund wrote:

The patchset is now pretty granularly split into individual pieces.


Wow, 42 patches, very granular indeed! That's nice for reviewing, but 
are you planning to squash them before committing? Seems a bit excessive 
for the git history.


Patches 1-4:

* v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch
* v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch
* v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch
* v12-0004-Remove-superfluous-tqual.h-includes.patch

These look good to me. I think it would make sense to squash these 
together, and commit now.



Patches 20 and 21:
* v12-0020-WIP-Slotified-triggers.patch
* v12-0021-WIP-Slotify-EPQ.patch

I like this slotification of trigger and EPQ code. It seems like a nice 
thing to do, independently of the other patches. You said you wanted to 
polish that up to committable state, and commit separately: +1 on that. 
Perhaps do that even before patches 1-4.



--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -35,8 +35,8 @@ typedef struct TriggerData
HeapTuple   tg_trigtuple;
HeapTuple   tg_newtuple;
Trigger*tg_trigger;
-   Buffer  tg_trigtuplebuf;
-   Buffer  tg_newtuplebuf;
+   TupleTableSlot *tg_trigslot;
+   TupleTableSlot *tg_newslot;
Tuplestorestate *tg_oldtable;
Tuplestorestate *tg_newtable;
 } TriggerData;


Do we still need tg_trigtuple and tg_newtuple? Can't we always use the 
corresponding slots instead? Is it for backwards-compatibility with 
user-defined triggers written in C? (Documentation also needs to be 
updated for the changes in this struct)



I didn't look a the rest of the patches yet...

- Heikki



Re: Pluggable Storage - Andres's take

2019-02-26 Thread Haribabu Kommi
On Wed, Feb 27, 2019 at 11:10 AM Andres Freund  wrote:

> Hi,
>
> On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote:
> > I am not able to completely remove t_tableOid from HeapTuple
> > because of its use in triggers: the slot is not available in triggers,
> > and I need to store the tableOid as part of the tuple.
>
> What precisely do you man by "use in triggers"? You mean that a trigger
> might access a HeapTuple's t_tableOid directly, even though all of the
> information is available in the trigger context?
>

I forgot the exact scenario, but during trigger function execution, the
pl/pgsql executor accesses TableOidAttributeNumber from the stored tuple
using the heap_get* functions. Because of the lack of slot support in
triggers, we still need to maintain t_tableOid with the proper OID. The
heap tuple's t_tableOid member is updated whenever the heap tuple is
generated from a slot.
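Concretely, system-attribute access bottoms out in heap_getsysattr(),
which answers tableoid straight from the tuple header field (a trimmed
sketch of that function; the real one also handles xmin/xmax etc.):

    Datum
    heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
                    bool *isnull)
    {
        *isnull = false;
        switch (attnum)
        {
            case SelfItemPointerAttributeNumber:
                return PointerGetDatum(&tup->t_self);
            case TableOidAttributeNumber:
                /* whatever is stored in t_tableOid is what the user sees */
                return ObjectIdGetDatum(tup->t_tableOid);
            default:
                elog(ERROR, "invalid attnum: %d", attnum);
        }
        return (Datum) 0;       /* keep compiler quiet */
    }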

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Pluggable Storage - Andres's take

2019-02-26 Thread Andres Freund
Hi,

On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote:
> I am not able to completely remove t_tableOid from HeapTuple
> because of its use in triggers: the slot is not available in triggers,
> and I need to store the tableOid as part of the tuple.

What precisely do you mean by "use in triggers"? You mean that a trigger
might access a HeapTuple's t_tableOid directly, even though all of the
information is available in the trigger context?

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-02-25 Thread Amit Khandekar
On Sat, 23 Feb 2019 at 01:22, Robert Haas  wrote:
>
> On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar  
> wrote:
> > Thanks for the review. Attached v2.
>
> Thanks.  I took this, combined it with Andres's
> v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
> some polishing of the code and comments, and pgindented.  Here's what
> I ended up with; see what you think.

Thanks Robert ! The changes look good.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-02-22 Thread Robert Haas
On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar  wrote:
> Thanks for the review. Attached v2.

Thanks.  I took this, combined it with Andres's
v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
some polishing of the code and comments, and pgindented.  Here's what
I ended up with; see what you think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-Updated-lastRemovedXid-to-primary-patch.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-02-22 Thread Amit Khandekar
On Thu, 21 Feb 2019 at 18:06, Robert Haas  wrote:
>
> On Thu, Feb 21, 2019 at 6:44 AM Amit Khandekar  wrote:
> > Ok, so something like XidHorizonPrefetchState? On similar lines, does
> > the prefetch_buffer() function name sound too generic as well?
>
> Yeah, that sounds good.

> And, yeah, then maybe rename the function too.

Renamed the function to xid_horizon_prefetch_buffer().

>
> > > +/*
> > > + * An arbitrary way to come up with a pre-fetch distance that grows with 
> > > io
> > > + * concurrency, but is at least 10 and not more than the max effective io
> > > + * concurrency.
> > > + */
> > >
> > > This comment is kinda useless, because it only tells us what the code
> > > does (which is obvious anyway) and not why it does that.  Saying that
> > > your formula is arbitrary may not be the best way to attract support
> > > for it.
> >
> > Well, I had checked the way the number of drive spindles
> > (effective_io_concurrency) is used to calculate the prefetch distance
> > for bitmap heap scans (ComputeIoConcurrency). Basically I think the
> > intention behind that method is to come up with a number that makes it
> > highly likely that we pre-fetch a block of each of the drive spindles.
> > But I didn't get how that exactly works, much less for non-parallel
> > bitmap scans. The same is the case for the pre-fetching that we do here
> > for xid-horizon stuff, where we do the block reads sequentially. Andres
> > and I discussed this offline, and he was of the opinion that this
> > formula won't help here, and instead we should just keep a constant distance
> > that is some number greater than effective_io_concurrency. I agree
> > that instead of saying "arbitrary" we should explain why we have done
> > that, and before that, come up with an agreed-upon formula.

>
> Maybe something like: We don't use the regular formula to determine
> how much to prefetch here, but instead just add a constant to
> effective_io_concurrency.  That's because it seems best to do some
> prefetching here even when effective_io_concurrency is set to 0, but
> if the DBA thinks it's OK to do more prefetching for other operations,
> then it's probably OK to do more prefetching in this case, too.  It
> may be that this formula is too simplistic, but at the moment we have
> no evidence of that or any idea about what would work better.

Thanks for writing it down for me. I think this is good-to-go as a
comment; so I put this as-is into the patch.

>
> > > + for (i = prefetch_state->next_item; i < nitems && count < 
> > > prefetch_count; i++)
> > >
> > > It looks strange to me that next_item is stored in prefetch_state and
> > > nitems is passed around as an argument.  Is there some reason why it's
> > > like that?
> >
> > We could keep the max count in the structure itself as well. There
> > isn't any specific reason for not keeping it there. It's just that
> > this function prefetch_state () is not a general function for
> > maintaining a prefetch state that spans across function calls; so we
> > might as well just pass the max count to that function instead of
> > having another field in that structure. I am not inclined specifically
> > towards either of the approaches.
>
> All right, count me as +0.5 for putting a copy in the structure.

Have put the nitems into the structure.

Thanks for the review. Attached v2.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


prefetch_xid_horizon_scan_v2.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-02-21 Thread Robert Haas
On Thu, Feb 21, 2019 at 6:44 AM Amit Khandekar  wrote:
> Ok, so something like XidHorizonPrefetchState? On similar lines, does
> the prefetch_buffer() function name sound too generic as well?

Yeah, that sounds good.  And, yeah, then maybe rename the function too.

> > +/*
> > + * An arbitrary way to come up with a pre-fetch distance that grows with io
> > + * concurrency, but is at least 10 and not more than the max effective io
> > + * concurrency.
> > + */
> >
> > This comment is kinda useless, because it only tells us what the code
> > does (which is obvious anyway) and not why it does that.  Saying that
> > your formula is arbitrary may not be the best way to attract support
> > for it.
>
> Well, I had checked the way the number of drive spindles
> (effective_io_concurrency) is used to calculate the prefetch distance
> for bitmap heap scans (ComputeIoConcurrency). Basically I think the
> intention behind that method is to come up with a number that makes it
> highly likely that we pre-fetch a block of each of the drive spindles.
> But I didn't get how that exactly works, much less for non-parallel
> bitmap scans. The same is the case for the pre-fetching that we do here
> for xid-horizon stuff, where we do the block reads sequentially. Andres
> and I discussed this offline, and he was of the opinion that this
> formula won't help here, and instead we should just keep a constant distance
> that is some number greater than effective_io_concurrency. I agree
> that instead of saying "arbitrary" we should explain why we have done
> that, and before that, come up with an agreed-upon formula.

Maybe something like: We don't use the regular formula to determine
how much to prefetch here, but instead just add a constant to
effective_io_concurrency.  That's because it seems best to do some
prefetching here even when effective_io_concurrency is set to 0, but
if the DBA thinks it's OK to do more prefetching for other operations,
then it's probably OK to do more prefetching in this case, too.  It
may be that this formula is too simplistic, but at the moment we have
no evidence of that or any idea about what would work better.

> > + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; 
> > i++)
> >
> > It looks strange to me that next_item is stored in prefetch_state and
> > nitems is passed around as an argument.  Is there some reason why it's
> > like that?
>
> We could keep the max count in the structure itself as well. There
> isn't any specific reason for not keeping it there. It's just that
> this function prefetch_state () is not a general function for
> maintaining a prefetch state that spans across function calls; so we
> might as well just pass the max count to that function instead of
> having another field in that structure. I am not inclined specifically
> towards either of the approaches.

All right, count me as +0.5 for putting a copy in the structure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable Storage - Andres's take

2019-02-21 Thread Amit Khandekar
On Thu, 21 Feb 2019 at 04:17, Robert Haas  wrote:
>
> On Fri, Feb 8, 2019 at 5:18 AM Amit Khandekar  wrote:
> > In the attached v1 patch, the prefetch_distance is calculated as
> > effective_io_concurrency + 10. Also it has some cosmetic changes.
>
> I did a little brief review of this patch and noticed the following things.
>
> +} PrefetchState;
> That name seems too generic.

Ok, so something like XidHorizonPrefetchState? On similar lines, does
the prefetch_buffer() function name sound too generic as well?

>
> +/*
> + * An arbitrary way to come up with a pre-fetch distance that grows with io
> + * concurrency, but is at least 10 and not more than the max effective io
> + * concurrency.
> + */
>
> This comment is kinda useless, because it only tells us what the code
> does (which is obvious anyway) and not why it does that.  Saying that
> your formula is arbitrary may not be the best way to attract support
> for it.

Well, I had checked the way the number of drive spindles
(effective_io_concurrency) is used to calculate the prefetch distance
for bitmap heap scans (ComputeIoConcurrency). Basically I think the
intention behind that method is to come up with a number that makes it
highly likely that we pre-fetch a block of each of the drive spindles.
But I didn't get how that exactly works, much less for non-parallel
bitmap scans. The same is the case for the pre-fetching that we do here
for xid-horizon stuff, where we do the block reads sequentially. Andres
and I discussed this offline, and he was of the opinion that this
formula won't help here, and instead we should just keep a constant distance
that is some number greater than effective_io_concurrency. I agree
that instead of saying "arbitrary" we should explain why we have done
that, and before that, come up with an agreed-upon formula.
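For reference, the bitmap-scan formula under discussion is the loop in
bufmgr.c's ComputeIoConcurrency (paraphrased): with n spindles it
prefetches roughly n * H(n) pages, the coupon-collector expectation for
hitting every spindle at least once.

    double      new_prefetch_pages = 0.0;
    int         i;

    /* prefetch distance = n * (1 + 1/2 + 1/3 + ... + 1/n) */
    for (i = 1; i <= io_concurrency; i++)
        new_prefetch_pages += (double) io_concurrency / (double) i;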


>
> + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; 
> i++)
>
> It looks strange to me that next_item is stored in prefetch_state and
> nitems is passed around as an argument.  Is there some reason why it's
> like that?

We could keep the max count in the structure itself as well. There
isn't any specific reason for not keeping it there. It's just that
this function prefetch_state () is not a general function for
maintaining a prefetch state that spans across function calls; so we
might as well just pass the max count to that function instead of
having another field in that structure. I am not inclined specifically
towards either of the approaches.

>
> + /* prefetch a fixed number of pages beforehand. */
>
> Not accurate -- the number of pages we prefetch isn't fixed.  It
> depends on effective_io_concurrency.

Yeah, will change that in the next patch version, according to what we
conclude about the prefetch distance calculation.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-02-20 Thread Robert Haas
On Fri, Feb 8, 2019 at 5:18 AM Amit Khandekar  wrote:
> In the attached v1 patch, the prefetch_distance is calculated as
> effective_io_concurrency + 10. Also it has some cosmetic changes.

I did a little brief review of this patch and noticed the following things.

+} PrefetchState;

That name seems too generic.

+/*
+ * An arbitrary way to come up with a pre-fetch distance that grows with io
+ * concurrency, but is at least 10 and not more than the max effective io
+ * concurrency.
+ */

This comment is kinda useless, because it only tells us what the code
does (which is obvious anyway) and not why it does that.  Saying that
your formula is arbitrary may not be the best way to attract support
for it.

+ for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)

It looks strange to me that next_item is stored in prefetch_state and
nitems is passed around as an argument.  Is there some reason why it's
like that?

+ /* prefetch a fixed number of pages beforehand. */

Not accurate -- the number of pages we prefetch isn't fixed.  It
depends on effective_io_concurrency.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable Storage - Andres's take

2019-02-19 Thread Haribabu Kommi
On Tue, Nov 27, 2018 at 4:59 PM Amit Langote 
wrote:

> Hi,
>
> On 2018/11/02 9:17, Haribabu Kommi wrote:
> > Here I attached the cumulative fixes of the patches, new API additions
> for
> > zheap and
> > basic outline of the documentation.
>
> I've read the documentation patch while also looking at the code and here
> are some comments.
>

Thanks for the review and apologies for the delay.
I have taken care of most of your comments in the latest version of the
doc patches.


>
> +  
> +
> +TupleTableSlotOps *
> +slot_callbacks (Relation relation);
> +
> +   API to access the slot specific methods;
> +   Following methods are available;
> +   TTSOpsVirtual,
> +   TTSOpsHeapTuple,
> +   TTSOpsMinimalTuple,
> +   TTSOpsBufferTuple,
> +  
>
> Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
> its relations to the TableAmRoutine abstraction, I think the text
> description could better be written as:
>
> "API to get the slot operations struct for a given table access method"
>
> It's not clear to me why various TTSOps* structs are listed here?  Is the
> point that different AMs may choose one of the listed alternatives?  For
> example, I see that the heap AM implementation returns TTSOpsBufferTuple, so it
> manipulates slots containing buffered tuples, right?  Other AMs are free
> to return any one of these?  For example, some AMs may never use buffer
> manager and hence not use TTSOpsBufferTuple.  Is that understanding correct?
>

Yes, the AM can decide which type of slot it wants to use.
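For example, heap's implementation in the patch series is essentially a
one-liner (a sketch; the exact TTSOps struct name has varied between
patch versions):

    static const TupleTableSlotOps *
    heapam_slot_callbacks(Relation relation)
    {
        /* heap tuples live in shared buffers: use the buffer-tuple slot */
        return &TTSOpsBufferHeapTuple;
    }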

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Pluggable Storage - Andres's take

2019-02-19 Thread Haribabu Kommi
On Mon, Feb 4, 2019 at 2:31 PM Haribabu Kommi 
wrote:

>
> On Tue, Jan 22, 2019 at 1:43 PM Haribabu Kommi 
> wrote:
>
>>
>>
>> OK. I will work on the doc changes.
>>
>
> Sorry for the delay.
>
> Attached is a draft patch of the doc and comment changes that I worked on.
> Currently I added comments to the callbacks that are present in the
> TableAmRoutine structure and copied them into the docs. I am not sure
> whether that is a good approach or not.
> I have yet to add a description of each parameter of the callbacks for
> easier understanding.
>
> Or should I describe each callback in the docs, dividing the callbacks
> the same way they are divided in the TableAmRoutine structure? Currently
> the following divisions are available:
> 1. Table scan
> 2. Parallel table scan
> 3. Index scan
> 4. Manipulation of physical tuples
> 5. Non-modifying operations on individual tuples
> 6. DDL
> 7. Planner
> 8. Executor
>
> Suggestions?
>

Here I have attached the doc patches for pluggable storage. I divided the
APIs into the groups specified above and explained them in the docs. I can
add more details if the approach seems fine.

Regards,
Haribabu Kommi
Fujitsu Australia


0008-Table-access-method-API-explanation.patch
Description: Binary data


0001-Docs-of-default_table_access_method-GUC.patch
Description: Binary data


0002-Rename-indexam.sgml-to-am.sgml.patch
Description: Binary data


0003-Reorganize-am-as-both-table-and-index.patch
Description: Binary data


0004-Doc-update-of-Create-access-method-type-table.patch
Description: Binary data


0005-Doc-update-of-create-materialized-view-.-USING-synta.patch
Description: Binary data


0006-Doc-update-of-CREATE-TABLE-.-USING-syntax.patch
Description: Binary data


0007-Doc-of-CREATE-TABLE-AS-.-USING-syntax.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-02-08 Thread Amit Khandekar
On Wed, 6 Feb 2019 at 18:30, Amit Khandekar  wrote:
>
> On Mon, 21 Jan 2019 at 08:31, Andres Freund  wrote:
> >
> > Hi,
> >
> > (resending with compressed attachements, perhaps that'll go through)
> >
> > On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> > > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > > FWIW, now that oids are removed, and the tuple table slot abstraction
> > > > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > > > that.
> > >
> > > I've pushed a version to that to the git tree, including a rebased
> > > version of zheap:
> > > https://github.com/anarazel/postgres-pluggable-storage
>
> I worked on a slight improvement on the
> 0040-WIP-Move-xid-horizon-computation-for-page-level patch. Instead
> of pre-fetching all the required buffers beforehand, the attached WIP
> patch pre-fetches the buffers while keeping a constant distance ahead
> of the buffer reads. It's a WIP patch because right now it just uses a
> hard-coded 5 buffers ahead. I haven't used effective_io_concurrency the
> way it is done in nodeBitmapHeapscan.c; I will do that next. But before
> that, any comments on the way I did the improvements would be nice.
>
> Note that for now, the patch is based on the latest pluggable-storage
> commit; it does not replace the 0040 patch in the patch series.

In the attached v1 patch, the prefetch_distance is calculated as
effective_io_concurrency + 10. It also includes some cosmetic changes.
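
To make the shape of that concrete, here is a rough sketch of the
sliding-window idea (struct and function names are invented for
illustration; this is not the patch's actual code):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/* illustrative prefetch-window state */
typedef struct PrefetchState
{
    int         next_item;   /* next tid whose block we will prefetch */
    int         nitems;      /* total number of tids */
    ItemPointer tids;        /* tid array, sorted by block number */
    BlockNumber cur_blkno;   /* last block prefetched; starts out as
                              * InvalidBlockNumber */
} PrefetchState;

/* issue up to 'count' new prefetch requests, skipping repeated blocks */
static void
prefetch_buffers(Relation rel, PrefetchState *ps, int count)
{
    while (ps->next_item < ps->nitems && count > 0)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&ps->tids[ps->next_item]);

        ps->next_item++;
        if (blkno != ps->cur_blkno)
        {
            PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
            ps->cur_blkno = blkno;
            count--;
        }
    }
}

The read loop would first prime the window with prefetch_distance
(effective_io_concurrency + 10) requests, and then call
prefetch_buffers(rel, &ps, 1) after each buffer read, so the window stays
a constant distance ahead of the reads.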


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


prefetch_xid_horizon_scan_v1.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-02-06 Thread Amit Khandekar
On Mon, 21 Jan 2019 at 08:31, Andres Freund  wrote:
>
> Hi,
>
> (resending with compressed attachements, perhaps that'll go through)
>
> On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > FWIW, now that oids are removed, and the tuple table slot abstraction
> > > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > > that.
> >
> > I've pushed a version to that to the git tree, including a rebased
> > version of zheap:
> > https://github.com/anarazel/postgres-pluggable-storage

I worked on a slight improvement on the
0040-WIP-Move-xid-horizon-computation-for-page-level patch. Instead
of pre-fetching all the required buffers beforehand, the attached WIP
patch pre-fetches the buffers while keeping a constant distance ahead
of the buffer reads. It's a WIP patch because right now it just uses a
hard-coded 5 buffers ahead. I haven't used effective_io_concurrency the
way it is done in nodeBitmapHeapscan.c; I will do that next. But before
that, any comments on the way I did the improvements would be nice.

Note that for now, the patch is based on the latest pluggable-storage
commit; it does not replace the 0040 patch in the patch series.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


prefetch_xid_horizon_scan_WIP.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-02-03 Thread Haribabu Kommi
On Tue, Jan 22, 2019 at 1:43 PM Haribabu Kommi 
wrote:

>
>
> OK. I will work on the doc changes.
>

Sorry for the delay.

Attached is a draft patch of the doc and comment changes that I worked on.
Currently I added comments to the callbacks that are present in the
TableAmRoutine structure and copied them into the docs. I am not sure
whether that is a good approach or not.
I have yet to add a description of each parameter of the callbacks for
easier understanding.

Or should I describe each callback in the docs, dividing the callbacks
the way they are grouped in the TableAmRoutine structure? Currently the
following divisions are available:
1. Table scan
2. Parallel table scan
3. Index scan
4. Manipulation of physical tuples
5. Non-modifying operations on individual tuples
6. DDL
7. Planner
8. Executor

Suggestions?

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Doc-and-comments-update.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-31 Thread Amit Khandekar
Hi,

Attached is a patch that adds some test scenarios for testing the
dependency of various object types on the table am. Besides simple tables,
it considers materialized views, partitioned tables, foreign tables, and
composite types, and verifies that the dependency is created only for
those object types that support a table access method.

This patch is based on commit 1bc7e6a4838 in
https://github.com/anarazel/postgres-pluggable-storage
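
For reference, the flavor of such scenarios is roughly the following
(object names invented; the exact statements in the patch may differ):

CREATE ACCESS METHOD toyam TYPE TABLE HANDLER toy_tableam_handler;

CREATE TABLE t_toy (a int) USING toyam;
CREATE MATERIALIZED VIEW mv_toy USING toyam AS SELECT 1 AS x;

-- both objects depend on the AM, so this must fail:
DROP ACCESS METHOD toyam;

-- partitioned tables, foreign tables and composite types have no
-- storage of their own, so no dependency on a table AM is recorded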

Thanks
-Amit Khandekar


test_tableam_dependency.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-28 Thread Dmitry Dolgov
> On Sun, Jan 20, 2019 at 6:17 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar  
> > wrote:
> >
> > I believe you are going to add a new regression test case for the change?
>
> Yep.

So, here are these two patches for pg_dump/psql with a few regression tests.


psql_describe_am_v3.patch
Description: Binary data


pg_dump_access_method_v5.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-22 Thread Amit Khandekar
On Tue, 22 Jan 2019 at 15:29, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Mon, Jan 21, 2019 at 9:33 AM Amit Khandekar  
> > wrote:
> >
> > Regression tests that use \d+ to show the table details might
> > not be interested specifically in the table access method. But these will
> > fail if run with a modified default access method.
>
> I see your point, but if a test is not interested specifically in a table am,
> then I guess it wouldn't use a custom table am in the first place, right?

Right. It wouldn't use a custom table am. But I mean, despite not
using a custom table am, the test would fail if the regression suite runs
with a changed default access method, because the regression output
file records only one particular am value.

> Anyway, I don't have a strong opinion here, so if everyone agrees that
> HIDE_TABLEAM will show/hide the access method unconditionally, I'm fine
> with that.

Yeah, I agree it's subjective.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-01-22 Thread Dmitry Dolgov
> On Mon, Jan 21, 2019 at 3:01 AM Andres Freund  wrote:
>
> The patchset is now pretty granularly split into individual pieces.

Wow, thanks!

> On Mon, Jan 21, 2019 at 9:33 AM Amit Khandekar  wrote:
>
> Regression tests that use \d+ to show the table details might
> not be interested specifically in the table access method. But these will
> fail if run with a modified default access method.

I see your point, but if a test is not interested specifically in a table am,
then I guess it wouldn't use a custom table am in the first place, right?
Anyway, I don't have a strong opinion here, so if everyone agrees that
HIDE_TABLEAM will show/hide the access method unconditionally, I'm fine with
that.



Re: Pluggable Storage - Andres's take

2019-01-21 Thread Haribabu Kommi
On Tue, Jan 22, 2019 at 12:15 PM Andres Freund  wrote:

> Hi,
>
> Thanks!
>
> On 2019-01-22 11:51:57 +1100, Haribabu Kommi wrote:
> > Attached the patch for removal of scan_update_snapshot
> > and also the rebased patch of reduction in use of t_tableOid.
>
> I'll soon look at the latter.
>

Thanks.


>
> > > - consider removing table_gimmegimmeslot()
> > > - add substantial docs for every callback
> > >
> >
> > Will work on the above two.
>
> I think it's easier if I do the first, because I can just do it while
> rebasing, reducing unnecessary conflicts.
>
>
OK. I will work on the doc changes.


> > > While I saw an initial attempt at writing sgml docs for the table AM
> > > API, I'm not convinced that's the best approach.  I think it might make
> > > more sense to have high-level docs in sgml, but then do all the
> > > per-callback docs in tableam.h.
> > >
> >
> > OK, I will update the sgml docs accordingly.
> > Index AM has per callback docs in the sgml, refactor them also?
>
> I don't think it's a good idea to tackle the index docs at the same time
> - this patchset is already humongously large...
>

OK.


>
> > diff --git a/src/backend/access/heap/heapam_handler.c
> b/src/backend/access/heap/heapam_handler.c
> > index 62c5f9fa9f..3dc1444739 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -2308,7 +2308,6 @@ static const TableAmRoutine heapam_methods = {
> >   .scan_begin = heap_beginscan,
> >   .scan_end = heap_endscan,
> >   .scan_rescan = heap_rescan,
> > - .scan_update_snapshot = heap_update_snapshot,
> >   .scan_getnextslot = heap_getnextslot,
> >
> >   .parallelscan_estimate = table_block_parallelscan_estimate,
> > diff --git a/src/backend/executor/nodeBitmapHeapscan.c
> b/src/backend/executor/nodeBitmapHeapscan.c
> > index 59061c746b..b48ab5036c 100644
> > --- a/src/backend/executor/nodeBitmapHeapscan.c
> > +++ b/src/backend/executor/nodeBitmapHeapscan.c
> > @@ -954,5 +954,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState
> *node,
> >   node->pstate = pstate;
> >
> >   snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
> > - table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
> > + Assert(IsMVCCSnapshot(snapshot));
> > +
> > + RegisterSnapshot(snapshot);
> > + node->ss.ss_currentScanDesc->rs_snapshot = snapshot;
> > + node->ss.ss_currentScanDesc->rs_temp_snap = true;
> >  }
>
> I was rather thinking that we'd just move this logic into
> table_scan_update_snapshot(), without it invoking a callback.
>

OK, changed accordingly.
But I moved the table_scan_update_snapshot() function into tableam.c,
to avoid having to include the additional header snapmgr.h in tableam.h.
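
For reference, a sketch of the resulting helper, assuming it keeps the
quoted logic (the committed version may differ in detail):

/* in tableam.c: a plain helper now, no per-AM callback involved */
void
table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
{
    Assert(IsMVCCSnapshot(snapshot));

    RegisterSnapshot(snapshot);
    scan->rs_snapshot = snapshot;
    scan->rs_temp_snap = true;
}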

Regards,
Haribabu Kommi
Fujitsu Australia


0002-Removal-of-scan_update_snapshot-callback.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-21 Thread Andres Freund
Hi,

Thanks!

On 2019-01-22 11:51:57 +1100, Haribabu Kommi wrote:
> Attached the patch for removal of scan_update_snapshot
> and also the rebased patch of reduction in use of t_tableOid.

I'll soon look at the latter.


> > - consider removing table_gimmegimmeslot()
> > - add substantial docs for every callback
> >
> 
> Will work on the above two.

I think it's easier if I do the first, because I can just do it while
rebasing, reducing unnecessary conflicts.


> > While I saw an initial attempt at writing sgml docs for the table AM
> > API, I'm not convinced that's the best approach.  I think it might make
> > more sense to have high-level docs in sgml, but then do all the
> > per-callback docs in tableam.h.
> >
> 
> OK, I will update the sgml docs accordingly.
> Index AM has per callback docs in the sgml, refactor them also?

I don't think it's a good idea to tackle the index docs at the same time
- this patchset is already humongously large...


> diff --git a/src/backend/access/heap/heapam_handler.c 
> b/src/backend/access/heap/heapam_handler.c
> index 62c5f9fa9f..3dc1444739 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2308,7 +2308,6 @@ static const TableAmRoutine heapam_methods = {
>   .scan_begin = heap_beginscan,
>   .scan_end = heap_endscan,
>   .scan_rescan = heap_rescan,
> - .scan_update_snapshot = heap_update_snapshot,
>   .scan_getnextslot = heap_getnextslot,
>  
>   .parallelscan_estimate = table_block_parallelscan_estimate,
> diff --git a/src/backend/executor/nodeBitmapHeapscan.c 
> b/src/backend/executor/nodeBitmapHeapscan.c
> index 59061c746b..b48ab5036c 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -954,5 +954,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
>   node->pstate = pstate;
>  
>   snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
> - table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
> + Assert(IsMVCCSnapshot(snapshot));
> +
> + RegisterSnapshot(snapshot);
> + node->ss.ss_currentScanDesc->rs_snapshot = snapshot;
> + node->ss.ss_currentScanDesc->rs_temp_snap = true;
>  }

I was rather thinking that we'd just move this logic into
table_scan_update_snapshot(), without it invoking a callback.


Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-01-21 Thread Haribabu Kommi
On Mon, Jan 21, 2019 at 1:01 PM Andres Freund  wrote:

> Hi,
>
> On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > FWIW, now that oids are removed, and the tuple table slot abstraction
> > > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > > that.
> >
> > I've pushed a version to that to the git tree, including a rebased
> > version of zheap:
> > https://github.com/anarazel/postgres-pluggable-storage
> > https://github.com/anarazel/postgres-pluggable-zheap
>
> I've pushed the newest, substantially revised, version to the same
> repository. Note, that while the newest pluggable-zheap version is newer
> than my last email, it's not based on the latest version, and the
> pluggable-zheap development is now happening in the main zheap
> repository.
>

Thanks for the new version of patches and changes.


> Todo:
> - consider removing scan_update_snapshot
>

Attached the patch for removal of scan_update_snapshot
and also the rebased patch of reduction in use of t_tableOid.


> - consider removing table_gimmegimmeslot()
> - add substantial docs for every callback
>

Will work on the above two.

> While I saw an initial attempt at writing sgml docs for the table AM
> API, I'm not convinced that's the best approach.  I think it might make
> more sense to have high-level docs in sgml, but then do all the
> per-callback docs in tableam.h.
>

OK, I will update the sgml docs accordingly.
Index AM has per callback docs in the sgml, refactor them also?

Regards,
Haribabu Kommi
Fujitsu Australia


0002-Removal-of-scan_update_snapshot.patch
Description: Binary data


0001-Reduce-the-use-of-HeapTuple-t_tableOid.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-21 Thread Amit Khandekar
On Sun, 20 Jan 2019 at 22:46, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar  
> > wrote:
> >
> > --- a/src/test/regress/expected/copy2.out
> > +++ b/src/test/regress/expected/copy2.out
> > @@ -1,3 +1,4 @@
> > +\set HIDE_TABLEAM on
> >
> > I thought we wanted to avoid having to add this setting in individual
> > regression tests. Can't we do this in pg_regress as a common setting?
>
> Yeah, you're probably right. Actually, I couldn't find anything that looks 
> like
> "common settings", and so far I've placed it into psql_start_test as a psql
> argument. But I'm not sure; maybe there is a better place.

Yeah, psql_start_test() looks good to me. pg_regress does not seem to
have its own psqlrc file where we could have put this variable. Maybe
later on, if we want to have more such variables, we could devise
this infrastructure.

>
> > + /* Access method info */
> + if (pset.sversion >= 12 && verbose && tableinfo.relam != NULL &&
> + !(pset.hide_tableam && tableinfo.relam_is_default))
> + {
> + printfPQExpBuffer(&buf, _("Access method: %s"),
> fmtId(tableinfo.relam));
> >
> So this will make psql hide the access method if it's the same as the
> > default. I understand that this was kind of concluded in the other
> > thread "Displaying and dumping of table access methods". But IMHO, if
> > the hide_tableam is false, we should *always* show the access method,
> > regardless of the default value. I mean, we can make it simple : off
> > means never show table-access, on means always show table-access,
> > regardless of the default access method. And this also will work with
> > regression tests. If some regression test wants specifically to output
> > the access method, it can have a "\SET HIDE_TABLEAM off" command.
> >
> > If we hide the method when it's the default, then for a regression test
> > that wants to forcibly show the table access method of all tables, it
> > won't show up for tables that have the default access method.
>
> I can't imagine what kind of test would need to forcibly show the table
> access method of all the tables. Even if you need to verify the tableam
> for something, maybe it's even easier just to select it from pg_am?

Actually my statement is wrong, sorry. For a regression test that
wants to forcibly show the table access method for all tables, it just
needs to SET HIDE_TABLEAM to OFF. With your patch, if we set
HIDE_TABLEAM to OFF, it will *always* show the table access method
regardless of the default access method.

It is with HIDE_TABLEAM=ON that your patch hides the table access
method conditionally (i.e. it shows the method when it does not match
the default). It's in this case that I feel we should *unconditionally*
hide the table access method. Regression tests that use \d+ to show the
table details might not be interested specifically in the table access
method. But these will fail if run with a modified default access
method.

Besides, my general inclination is to keep the GUC behaviour simple;
also, it looks like we can keep the regression test output consistent
without this conditional behaviour.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-01-20 Thread Haribabu Kommi
On Tue, Jan 15, 2019 at 6:05 PM Andres Freund  wrote:

> Hi,
>
> On 2019-01-15 18:02:38 +1100, Haribabu Kommi wrote:
> > On Tue, Dec 11, 2018 at 1:13 PM Andres Freund 
> wrote:
> >
> > > Hi,
> > >
> > > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > > - pg_upgrade testing
> > >
> >
> > I did the pg_upgrade testing from an older version with some tables and
> > views existing, and all of them were properly transformed into the new
> > server with heap as the default access method.
> >
> > I will add Dmitry's pg_dump patch and test pg_upgrade to confirm
> > that the proper access method is retained in the upgraded database.
> >
> >
> >
> > > - I think we should consider removing HeapTuple->t_tableOid, it should
> > >   imo live entirely in the slot
> > >
> >
> > I removed t_tableOid from HeapTuple, and during testing I found some
> > problems with triggers; I will post the patch once it is fixed.
>
>
> Please note that I'm working on a heavily revised version of the patch
> right now, trying to clean up a lot of things (you might have seen some
> of the threads I started). I hope to post it ~Thursday.  Local-ish
> patches shouldn't be a problem though.
>

Yes, I am following your other refactoring and cleanup threads.
I will rebase this patch once the revised code is available.

I am not able to remove t_tableOid from HeapTuple completely, because
of its use in triggers: the slot is not available in triggers, so I
need to store the tableOid as part of the tuple as well.

Currently t_tableOid is set only when the tuple is formed from the
slot, and its other uses are replaced with the slot member.

Comments?
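
To illustrate the approach, a sketch (the helper name is invented;
ExecCopySlotHeapTuple and tts_tableOid are the slot-side pieces it
relies on):

/*
 * Hypothetical helper: materialize a HeapTuple from a slot just before
 * handing it to trigger code, which still expects t_tableOid to be set.
 */
static HeapTuple
materialize_tuple_for_trigger(TupleTableSlot *slot)
{
    HeapTuple   tuple = ExecCopySlotHeapTuple(slot);

    /* triggers have no slot, so carry the table OID in the tuple itself */
    tuple->t_tableOid = slot->tts_tableOid;

    return tuple;
}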

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Reduce-the-use-of-HeapTuple-t_tableOid.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-20 Thread Dmitry Dolgov
> On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar  
> wrote:
>
> --- a/src/test/regress/expected/copy2.out
> +++ b/src/test/regress/expected/copy2.out
> @@ -1,3 +1,4 @@
> +\set HIDE_TABLEAM on
>
> I thought we wanted to avoid having to add this setting in individual
> regression tests. Can't we do this in pg_regress as a common setting?

Yeah, you're probably right. Actually, I couldn't find anything that looks like
"common settings", and so far I've placed it into psql_start_test as a psql
argument. But I'm not sure; maybe there is a better place.

> + /* Access method info */
> + if (pset.sversion >= 12 && verbose && tableinfo.relam != NULL &&
> + !(pset.hide_tableam && tableinfo.relam_is_default))
> + {
> + printfPQExpBuffer(&buf, _("Access method: %s"),
> fmtId(tableinfo.relam));
>
> So this will make psql hide the access method if it's the same as the
> default. I understand that this was kind of concluded in the other
> thread "Displaying and dumping of table access methods". But IMHO, if
> the hide_tableam is false, we should *always* show the access method,
> regardless of the default value. I mean, we can make it simple : off
> means never show table-access, on means always show table-access,
> regardless of the default access method. And this also will work with
> regression tests. If some regression test wants specifically to output
> the access method, it can have a "\SET HIDE_TABLEAM off" command.
>
> If we hide the method when it's the default, then for a regression test
> that wants to forcibly show the table access method of all tables, it
> won't show up for tables that have the default access method.

I can't imagine what kind of test would need to forcibly show the table access
method of all the tables. Even if you need to verify the tableam for something,
maybe it's even easier just to select it from pg_am?

> + if (pset.sversion >= 12 && verbose && tableinfo.relam != NULL &&
>
> If the server does not support relam, tableinfo.relam will be NULL
> anyway, so I think the sversion check is not needed.
> 
>
> + printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
> fmtId is not required.
> ---
>
> +  printfPQExpBuffer(&buf, _("Access method: %s"),
> fmtId(tableinfo.relam));
> +  printTableAddFooter(&cont, buf.data);
> +   }
> +
> +
>  }
>
> Last two blank lines are not needed.

Right, fixed.

> + bool hide_tableam;
>  } PsqlSettings;
>
> These variables, it seems, are supposed to be grouped together by type.

Well, this grouping looks strange to me. But since I don't have a strong
opinion, I moved the variable.

> I believe you are going to add a new regression test case for the change?

Yep.


psql_describe_am_v2.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-18 Thread Amit Khandekar
On Fri, 18 Jan 2019 at 10:13, Amit Khandekar  wrote:
> On Tue, 15 Jan 2019 at 17:58, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > Also I guess another attached patch should address the psql part, namely
> > displaying a table access method with \d+ and the possibility to hide it
> > with a psql variable (HIDE_TABLEAM, but I'm open to suggestions about the
> > name).

I am ok with the name.

>
> Will have a look at this one.

--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -1,3 +1,4 @@
+\set HIDE_TABLEAM on
 CREATE TEMP TABLE x (

I thought we wanted to avoid having to add this setting in individual
regression tests. Can't we do this in pg_regress as a common setting?

+ /* Access method info */
+ if (pset.sversion >= 12 && verbose && tableinfo.relam != NULL &&
+ !(pset.hide_tableam && tableinfo.relam_is_default))
+ {
+ printfPQExpBuffer(&buf, _("Access method: %s"),
fmtId(tableinfo.relam));

So this will make psql hide the access method if it's the same as the
default. I understand that this was kind of concluded in the other
thread "Displaying and dumping of table access methods". But IMHO, if
the hide_tableam is false, we should *always* show the access method,
regardless of the default value. I mean, we can make it simple: off
means never show table-access, on means always show table-access,
regardless of the default access method. And this also will work with
regression tests. If some regression test wants specifically to output
the access method, it can have a "\SET HIDE_TABLEAM off" command.

If we hide the method when it's the default, then for a regression test
that wants to forcibly show the table access method of all tables, it
won't show up for tables that have the default access method.
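
Under those simple semantics, an illustrative psql session (table name
invented) would behave like:

    \set HIDE_TABLEAM off
    \d+ some_table      -- footer always shows "Access method: ..."
    \set HIDE_TABLEAM on
    \d+ some_table      -- footer never shows the access method

i.e. the variable alone decides, independent of
default_table_access_method.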



+ if (pset.sversion >= 12 && verbose && tableinfo.relam != NULL &&

If the server does not support relam, tableinfo.relam will be NULL
anyway, so I think the sversion check is not needed.



+ printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
fmtId is not required. In fact, we should display the access method
name as-is. fmtId is required only for identifiers present in SQL
queries.

---

+  printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
+  printTableAddFooter(&cont, buf.data);
+   }
+
+
 }

Last two blank lines are not needed.

---

+ bool hide_tableam;
 } PsqlSettings;

These variables, it seems, are supposed to be grouped together by type.

---

I believe you are going to add a new regression test case for the change?


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-01-17 Thread Amit Khandekar
On Tue, 15 Jan 2019 at 17:58, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar  
> > wrote:
> >
> > Need to bump K_VERS_MINOR as well.
>
> I've bumped it up, but somehow this change escaped the previous version.
> Now it should be there, thanks!
>
> > On Mon, 14 Jan 2019 at 18:36, Amit Khandekar  wrote:
> > > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char
> > > *tablespace);
> > > tablespace => tableam
> >
> > This is yet to be addressed.
>
> Fixed.

Thanks, the patch looks good to me. Of course there's the other thread
about ArchiveEntry arguments which may alter this patch, but
otherwise, I have no more comments on this patch.

>
> Also I guess another attached patch should address the psql part, namely
> displaying a table access method with \d+ and the possibility to hide it
> with a psql variable (HIDE_TABLEAM, but I'm open to suggestions about the
> name).

Will have a look at this one.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-01-15 Thread Andres Freund
Hi,

On 2019-01-15 14:37:36 +0530, Amit Khandekar wrote:
> Then for each of the calls, we would need to declare that structure
> variable (with = {0}) and assign the required fields in that structure
> before passing it to ArchiveEntry(). But a major point of
> ArchiveEntry() is to avoid doing this and instead conveniently pass
> those fields as parameters. This would cause unnecessary extra lines of
> code. I think a better way is to have an ArchiveEntry() function with a
> limited number of parameters, and an ArchiveEntryEx() with those
> extra parameters that are not needed in the usual cases.

I don't think that'll really solve the problem. I think it might be more
reasonable to rely on structs. Now that we can rely on designated
initializers for structs we can do something like

ArchiveEntry((ArchiveArgs){.tablespace = 3,
   .dumpFn = somefunc,
   ...});

and unused arguments will automatically initialized to zero.  Or we
could pass the struct as a pointer, might be more efficient (although I
doubt it matters here):

ArchiveEntry(&(ArchiveArgs){.tablespace = 3,
.dumpFn = somefunc,
...});

What do others think?  It'd probably be a good idea to start a new
thread about this.

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

2019-01-15 Thread Dmitry Dolgov
> On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar  
> wrote:
>
> Need to bump K_VERS_MINOR as well.

I've bumped it up, but somehow this change escaped the previous version. Now
it should be there, thanks!

> On Mon, 14 Jan 2019 at 18:36, Amit Khandekar  wrote:
> > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char
> > *tablespace);
> > tablespace => tableam
>
> This is yet to be addressed.

Fixed.

Also I guess another attached patch should address the psql part, namely
displaying a table access method with \d+ and the possibility to hide it with
a psql variable (HIDE_TABLEAM, but I'm open to suggestions about the name).


pg_dump_access_method_v4.patch
Description: Binary data


psql_describe_am.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-01-15 Thread Amit Khandekar
On Tue, 15 Jan 2019 at 12:27, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Mon, Jan 14, 2019 at 2:07 PM Amit Khandekar  
> > wrote:
> >
> > createPQExpBuffer() should be moved after the below statement, so that
> > it does not leak memory
>
> Thanks for noticing, fixed.

Looks good.

>
> > So how about bumping up the archive version and doing these checks?
>
> Yeah, you're right, I've added this check.

Need to bump K_VERS_MINOR as well.

On Mon, 14 Jan 2019 at 18:36, Amit Khandekar  wrote:
> +static void _selectTableAccessMethod(ArchiveHandle *AH, const char
> *tablespace);
> tablespace => tableam

This is yet to be addressed.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-01-15 Thread Amit Khandekar
On Sat, 12 Jan 2019 at 18:11, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Sat, Jan 12, 2019 at 1:44 AM Andres Freund  wrote:
> > >   /* other fields were zeroed above */
> > >
> > > @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const 
> > > char *name,
> > >* post-data.
> > >*/
> > >   ArchiveEntry(fout, nilCatalogId, createDumpId(),
> > > -  tag->data, namespace, NULL, owner,
> > > +  tag->data, namespace, NULL, owner, 
> > > NULL,
> > >"COMMENT", SECTION_NONE,
> > >query->data, "", NULL,
> > >&(dumpId), 1,
> >
> > We really ought to move the arguments to a struct, so we don't generate
> > quite as much useless diffs whenever we do a change around one of
> > these...
>
> That's what I thought too. Maybe then I'll suggest a mini-patch to the master 
> to
> refactor these arguments out into a separate struct, so we can leverage it 
> here.

Then for each of the calls, we would need to declare that structure
variable (with = {0}) and assign the required fields in that structure
before passing it to ArchiveEntry(). But a major point of
ArchiveEntry() is to avoid doing this and instead conveniently pass
those fields as parameters. This would cause unnecessary extra lines of
code. I think a better way is to have an ArchiveEntry() function with a
limited number of parameters, and an ArchiveEntryEx() with those
extra parameters that are not needed in the usual cases. E.g. we can
have tablespace, tableam, dumpFn and dumpArg as the extra arguments of
ArchiveEntryEx(), because in most places these are passed as NULL.
All future arguments would go in ArchiveEntryEx(). A rough sketch of
the split follows below.
Comments?
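
(Argument lists abbreviated and hypothetical, just to show the shape;
these are not pg_dump's actual signatures.)

/* common case: the rarely-used arguments default to NULL */
TocEntry *
ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
             const char *tag, const char *namespace, const char *owner,
             const char *desc, teSection section,
             const char *defn, const char *dropStmt,
             const DumpId *deps, int nDeps)
{
    return ArchiveEntryEx(AHX, catalogId, dumpId, tag, namespace, owner,
                          desc, section, defn, dropStmt,
                          NULL, NULL,   /* tablespace, tableam */
                          NULL, NULL,   /* dumpFn, dumpArg */
                          deps, nDeps);
}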



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2019-01-14 Thread Andres Freund
Hi,

On 2019-01-15 18:02:38 +1100, Haribabu Kommi wrote:
> On Tue, Dec 11, 2018 at 1:13 PM Andres Freund  wrote:
> 
> > Hi,
> >
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_upgrade testing
> >
> 
> I did the pg_upgrade testing from an older version with some tables and
> views existing, and all of them were properly transformed into the new
> server with heap as the default access method.
> 
> I will add Dmitry's pg_dump patch and test pg_upgrade to confirm
> that the proper access method is retained in the upgraded database.
> 
> 
> 
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> >   imo live entirely in the slot
> >
> 
> I removed t_tableOid from HeapTuple, and during testing I found some
> problems with triggers; I will post the patch once it is fixed.


Please note that I'm working on a heavily revised version of the patch
right now, trying to clean up a lot of things (you might have seen some
of the threads I started). I hope to post it ~Thursday.  Local-ish
patches shouldn't be a problem though.

Greetings,

Andres Freund


