Re: brininsert optimization opportunity

2023-11-25 Thread Ashwin Agrawal
On Sat, Nov 25, 2023 at 12:06 PM Tomas Vondra 
wrote:

> I've done a bit more cleanup on the last version of the patch (renamed
> the fields to start with bis_ as agreed, rephrased the comments / docs /
> commit message a bit) and pushed.


Thanks a lot Tomas for helping to drive the patch to completion iteratively
and realizing the benefits.

- Ashwin


Re: Backends stalled in 'startup' state

2023-01-17 Thread Ashwin Agrawal
On Tue, Jan 17, 2023 at 4:52 PM Ashwin Agrawal  wrote:

>
> We recently saw many backends (close to the max_connections limit) get stalled
> in the 'startup' state in one of the production environments for Greenplum (a
> fork of PostgreSQL). Tracing the reason, it was found that all the tuples
> created by bootstrap (xmin=1) in pg_attribute were at very high block numbers
> (for example, beyond 30,000). The reason for the backend startup stall exactly
> matched Tom's reasoning in [1]. Stalls became much longer in the presence of
> sub-transaction overflow or long-running transactions, as tuple visibility
> took longer. That thread ruled out the possibility of system catalog rows
> being present in higher block numbers instead of at the front of pg_attribute.
>
> This thread provides a simple reproduction on the latest version of
> PostgreSQL and an RCA for how bootstrap catalog entries can move to higher
> blocks and, as a result, cause stalls for backend starts. A simple fix to
> avoid the issue is provided at the end.
>
> The cause is syncscan triggering during VACUUM FULL. VACUUM FULL rewrites
> the table by performing a seqscan as well, and
> heapam_relation_copy_for_cluster() tells it that using syncscan is fine.
> Hence the scan may start not from block 0 but from some other block already
> in cache, which opens the possibility of moving the bootstrap tuples
> anywhere else in the table.
>
> --
> Repro
> --
> -- create database to play
> drop database if exists test;
> create database test;
> \c test
>
> -- function just to create many tables to increase pg_attribute size
> -- (ideally, a table with many columns might do the job more easily)
> CREATE OR REPLACE FUNCTION public.f(id integer)
>  RETURNS void
>  LANGUAGE plpgsql
>  STRICT
> AS $function$
> declare
>   sql text;
>   i int;
> begin
>   for i in id..id+ loop
> sql='create table if not exists tbl'||i||' (id int)';
> execute sql;
>   end loop;
> end;
> $function$;
>
> select f(1);
> select f(2);
> select f(3);
> select f(4);
>
> -- validate pg_attribute size is greater than 1/4 of shared_buffers
> -- for syncscan to trigger
> show shared_buffers;
> select pg_size_pretty(pg_relation_size('pg_attribute'));
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>
> -- perform seq scan of pg_attribute to page past bootstrapped tuples
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
>
> -- this will internally use syncscan, starting with a block after the
> -- bootstrap tuples, and hence move the bootstrap tuples to higher block
> -- numbers
> vacuum full pg_attribute;
>
> --
> Sample run
> --
> show shared_buffers;
>  shared_buffers
> 
>  128MB
> (1 row)
>
> select pg_size_pretty(pg_relation_size('pg_attribute'));
>  pg_size_pretty
> 
>  40 MB
> (1 row)
>
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>  ctid  | xmin | attrelid |   attname
> ---+--+--+--
>  (0,1) |1 | 1255 | oid
>  (0,2) |1 | 1255 | proname
>  (0,3) |1 | 1255 | pronamespace
>  (0,4) |1 | 1255 | proowner
>  (0,5) |1 | 1255 | prolang
> (5 rows)
>
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
> COPY 2000
> vacuum full pg_attribute;
> VACUUM
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>ctid| xmin | attrelid |   attname
> ---+--+--+--
>  (5115,14) |1 | 1255 | oid
>  (5115,15) |1 | 1255 | proname
>  (5115,16) |1 | 1255 | pronamespace
>  (5115,17) |1 | 1255 | proowner
>  (5115,18) |1 | 1255 | prolang
> (5 rows)
>
>
> Note:
> -- we used the logic causing the problem to fix it on the system as well :-)
> -- scan till the block where the bootstrap tuples are located
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
> -- now, due to syncscan triggering, VACUUM FULL will pick the blocks with the
> -- bootstrap tuples first and bring them back to the front
> vacuum full pg_attribute;
>
> --
> Patch to avoid the problem:
> --
> diff --git a/src/backend/access/heap/heapam_handler.c
> b/src/backend/access/heap/heapam_handler.c
> index a34
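The mechanism described above can be illustrated with a small simulation (a hypothetical sketch, not PostgreSQL code): a synchronized scan starts at whatever block a concurrent scan last reported and wraps around, and a rewrite that copies tuples in scan order therefore places the blocks scanned first at the front of the new heap.

```python
# Hypothetical simulation of how a synchronized scan changes tuple order
# during a table rewrite (VACUUM FULL). Block 0 holds the bootstrap
# tuples; if the rewrite's seqscan starts elsewhere, they end up last.

def rewrite_with_syncscan(nblocks, start_block):
    """Return the new block order after a wrap-around scan from start_block."""
    return list(range(start_block, nblocks)) + list(range(0, start_block))

# A scan starting at block 0 keeps the bootstrap tuples in front.
assert rewrite_with_syncscan(8, 0)[0] == 0

# A syncscan picking up at block 3 (left there by an earlier seqscan)
# moves blocks 0-2, including the bootstrap tuples, to the end.
order = rewrite_with_syncscan(8, 3)
assert order == [3, 4, 5, 6, 7, 0, 1, 2]
assert order.index(0) == 5  # bootstrap block now near the tail
```

This is also why the "Note" above works: scanning up to the bootstrap tuples' block before the second VACUUM FULL makes the syncscan start there, rotating them back to the front.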

Re: Backends stalled in 'startup' state

2022-09-27 Thread Ashwin Agrawal
On Thu, Sep 15, 2022 at 4:42 PM Ashwin Agrawal  wrote:


Re: pg_basebackup --create-slot-if-not-exists?

2022-09-22 Thread Ashwin Agrawal
On Wed, Sep 21, 2022 at 5:34 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wednesday, September 21, 2022, Ashwin Agrawal 
> wrote:
>
>> Currently, pg_basebackup has a
>> --create-slot option to create a new slot, or a
>> --slot option to use an existing slot.
>>
>> This means the caller needs to know whether a slot with the given name
>> already exists before invoking the command. If a pg_basebackup --create-slot
>> <> command fails for some reason after creating the slot, re-triggering the
>> same command fails with an ERROR that the slot already exists. One must then
>> either delete the slot and retrigger, or check whether the slot exists
>> before retriggering and, based on that, alter the command to avoid passing
>> the --create-slot option. This poses an inconvenience when automating on top
>> of pg_basebackup, as checking for slot presence before invoking
>> pg_basebackup requires issuing separate SQL commands. It would be really
>> helpful for such scenarios if, similar to CREATE TABLE, pg_basebackup could
>> have IF NOT EXISTS semantics. (The limitation most likely comes from the
>> CREATE REPLICATION SLOT protocol itself.) Thoughts?
>>
>
> What’s the use case for automating pg_basebackup with a named replication
> slot created by the pg_basebackup command?
>

Greenplum runs N (some hundreds of) PostgreSQL instances to form a sharded
database cluster. Hence, automation/scripts are in place to create replicas,
handle failover/failback, and so on for these N instances. As Michael said,
for predictable management and monitoring of the slots across these many
instances, specific named replication slots are used on all of them. These
named replication slots are used both for pg_basebackup and for the streaming
replication that follows it.

> Why can you not leverage a temporary replication slot (i.e., omit --slot)?
> ISTM the create option is basically obsolete now.
>

We would be more than happy to use a temporary replication slot if it
provided the full functionality. It might be a gap in my understanding, but I
feel a temporary replication slot only protects against WAL deletion for the
duration of pg_basebackup. It doesn't protect the window between
pg_basebackup completing and streaming replication starting.
With the --write-recovery-conf option, "primary_slot_name" only gets written
to postgresql.auto.conf if a named replication slot is provided. That makes
sure the same slot is used for pg_basebackup and streaming replication, and
hence keeps the WAL around till the streaming replica connects after
pg_basebackup. How can this window be avoided with a temporary slot?

-- 
*Ashwin Agrawal (VMware)*


pg_basebackup --create-slot-if-not-exists?

2022-09-21 Thread Ashwin Agrawal
Currently, pg_basebackup has a
--create-slot option to create a new slot, or a
--slot option to use an existing slot.

This means the caller needs to know whether a slot with the given name
already exists before invoking the command. If a pg_basebackup --create-slot
<> command fails for some reason after creating the slot, re-triggering the
same command fails with an ERROR that the slot already exists. One must then
either delete the slot and retrigger, or check whether the slot exists
before retriggering and, based on that, alter the command to avoid passing
the --create-slot option. This poses an inconvenience when automating on top
of pg_basebackup, as checking for slot presence before invoking
pg_basebackup requires issuing separate SQL commands. It would be really
helpful for such scenarios if, similar to CREATE TABLE, pg_basebackup could
have IF NOT EXISTS semantics. (The limitation most likely comes from the
CREATE REPLICATION SLOT protocol itself.) Thoughts?
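The probe-then-decide dance described above can be sketched as follows. This is a hypothetical wrapper; `slot_exists` stands in for a separate SQL query the automation must issue against pg_replication_slots, and the data directory path is illustrative.

```python
# Hypothetical sketch of the automation burden described above: the
# caller must probe for the slot before deciding whether to pass
# --create-slot. An IF NOT EXISTS semantic would remove this step.

def build_basebackup_cmd(slot, slot_exists):
    """slot_exists is the result of a separate SQL probe, e.g.
    SELECT 1 FROM pg_replication_slots WHERE slot_name = <slot>."""
    cmd = ["pg_basebackup", "-D", "/path/to/datadir", "--slot", slot]
    if not slot_exists:
        cmd.append("--create-slot")  # only safe when the slot is absent
    return cmd

# First attempt: slot absent, so we ask pg_basebackup to create it.
assert "--create-slot" in build_basebackup_cmd("rep_slot_1", False)

# Retry after a failure that left the slot behind: must drop the flag,
# otherwise pg_basebackup errors out because the slot already exists.
assert "--create-slot" not in build_basebackup_cmd("rep_slot_1", True)
```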

-- 
*Ashwin Agrawal (VMware)*


Backends stalled in 'startup' state

2022-09-15 Thread Ashwin Agrawal
+        * tuples may move to higher blocks, which will lead to degraded
+        * performance for relcache initialization during connection starts.
+        */
+       if (is_system_catalog)
+               tableScan = table_beginscan_strat(OldHeap, SnapshotAny, 0, (ScanKey) NULL, true, false);
+       else
+               tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
        heapScan = (HeapScanDesc) tableScan;
        indexScan = NULL;
--


[1] https://www.postgresql.org/message-id/27844.1338148415%40sss.pgh.pa.us

-- 
*Ashwin Agrawal (VMware)*


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2022-01-03 Thread Ashwin Agrawal
On Wed, Dec 22, 2021 at 4:23 PM SATYANARAYANA NARLAPURAM <
satyanarlapu...@gmail.com> wrote:

> Hi Hackers,
>
> I am considering implementing RPO (recovery point objective) enforcement
> feature for Postgres where the WAL writes on the primary are stalled when
> the WAL distance between the primary and standby exceeds the configured
> (replica_lag_in_bytes) threshold. This feature is useful particularly in
> the disaster recovery setups where primary and standby are in different
> regions and synchronous replication can't be set up for latency and
> performance reasons yet requires some level of RPO enforcement.
>
> The idea here is to calculate the lag between the primary and the standby
> (Async?) server during XLogInsert and block the caller until the lag is
> less than the threshold value. We can calculate the max lag by iterating
> over ReplicationSlotCtl->replication_slots. If this is not something we
> don't want to do in the core, at least adding a hook for XlogInsert is of
> great value.
>
> A few other scenarios I can think of with the hook are:
>
>1. Enforcing RPO as described above
>2. Enforcing rate limit and slow throttling when sync standby is
>falling behind (could be flush lag or replay lag)
>3. Transactional log rate governance - useful for cloud providers to
>provide SKU sizes based on allowed WAL writes.
>
> Thoughts?
>

A very similar requirement/need was discussed in the past in [1]: not exactly
RPO enforcement, but a large bulk operation/transaction negatively impacting
concurrent transactions due to replication lag.
It would be good to refer to that thread, as it explains the challenges of
implementing the functionality mentioned in this thread, the main challenge
being that there is no common place to put the throttling logic, which instead
requires calls to be sprinkled around various parts of the code.

[1]
https://www.postgresql.org/message-id/flat/CA%2BU5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg%40mail.gmail.com
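The lag check in the quoted proposal could be sketched roughly as below. This is a hypothetical simulation in plain arithmetic; in the real proposal the flushed positions would come from iterating over ReplicationSlotCtl->replication_slots, and the wait would happen in (or via a hook in) XLogInsert. All names and units here are illustrative.

```python
# Hypothetical sketch of the proposed RPO enforcement: compute the max
# WAL lag across replica slots and stall the writer while it exceeds the
# configured replica_lag_in_bytes threshold.

REPLICA_LAG_IN_BYTES = 16 * 1024 * 1024  # illustrative threshold

def max_lag(primary_insert_lsn, slot_flush_lsns):
    """Max distance in bytes between the primary's insert position and
    the positions flushed by each standby's replication slot."""
    return max(primary_insert_lsn - flushed for flushed in slot_flush_lsns)

def can_insert_wal(primary_insert_lsn, slot_flush_lsns,
                   threshold=REPLICA_LAG_IN_BYTES):
    """XLogInsert would block (not proceed) while this returns False."""
    return max_lag(primary_insert_lsn, slot_flush_lsns) < threshold

# Standby only 1 MiB behind: writes proceed.
assert can_insert_wal(100 * 2**20, [99 * 2**20])

# Standby 32 MiB behind: writes would be stalled until it catches up.
assert not can_insert_wal(132 * 2**20, [100 * 2**20])
```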


Re: storing an explicit nonce

2021-10-07 Thread Ashwin Agrawal
On Thu, Oct 7, 2021 at 12:12 PM Robert Haas  wrote:

> On Thu, Oct 7, 2021 at 2:52 PM Stephen Frost  wrote:
> > Assuming that's correct, and I don't see any reason to doubt it, then
> > perhaps it would make sense to have the LSN be unencrypted and include
> > it in the tweak as that would limit the risk from re-use of the same
> > tweak over time.
>
> Talking about things like "limiting the risk" makes me super-nervous.
>
> Maybe we're all on the same page here, but just to make my assumptions
> explicit: I think we have to approach this feature with the idea in
> mind that there are going to be very smart people actively attacking
> any TDE implementation we ship. I expect that if you are lucky enough
> to get your hands on a PostgreSQL cluster's data files and they happen
> to be encrypted, your best option for handling that situation is not
> going to be attacking the encryption, but rather something like
> calling the person who has the password and pretending to be someone
> to whom they ought to disclose it. However, I also believe that
> PostgreSQL is a sufficiently high-profile project that security
> researchers will find it a tempting target. And if they manage to
> write a shell script or tool that breaks our encryption without too
> much difficulty, it will generate a ton of negative PR for the
> project. This will be especially true if the problem can't be fixed
> without re-engineering the whole thing, because we're not
> realistically going to be able to re-engineer the whole thing in a
> minor release, and thus will be saddled with the defective
> implementation for many years.
>
> Now none of that is to say that we shouldn't limit risk - I mean less
> risk is always better than more. But we need to be sure this is not
> like a 90% thing, where we're pretty sure it works. We can get by with
> that for a lot of things, but I think here we had better try
> extra-hard to make sure that we don't have any exposures. We probably
> will anyway, but at least if they're just bugs and not architectural
> deficiencies, we can hope to be able to patch them as they are
> discovered.
>

Not at all knowledgeable on security topics (bravely using the terms and
making a recommendation anyway): can we approach decisions like AES-XTS vs
AES-GCM (which in turn decides whether we need to store a nonce or not) based
on which compliance targets each can or cannot achieve? For example, can using
AES-XTS make it FIPS 140-2 compliant or not?


Re: vac_update_datfrozenxid will raise "wrong tuple length" if pg_database tuple contains toast attribute.

2020-11-30 Thread Ashwin Agrawal
On Sun, Nov 29, 2020 at 5:27 PM Michael Paquier  wrote:

> > One thing that strikes me as unwise is that we could run into similar
> > problems with vac_update_relstats() in the future, and there have been
> > recent talks about having more toast tables like for pg_class.  That
> > is not worth caring about on stable branches because it is not an
> > active problem there, but we could do something better on HEAD.
>
> For now, I have added just a comment at the top of
> heap_inplace_update() to warn callers.
>

I was trying to think of a way to assert this aspect, but there seems to be
none. So, yes, having at least a comment is good for now.

Junfeng and Ashwin also mentioned to me off-list that their patch used
> a second copy for performance reasons, but I don't see why this could
> become an issue as we update each pg_database row only once a job is
> done.  So I'd like to apply something like the attached on HEAD,
> comments are welcome.


Yes, the attached patch looks good to me for PostgreSQL. Thanks, Michael.
(In Greenplum, due to per-table dispatch to segments, this function gets
called per table during a database-wide vacuum instead of only at the end,
hence we were trying to be conservative.)


Re: [PoC] Non-volatile WAL buffer

2020-11-24 Thread Ashwin Agrawal
On Sun, Nov 22, 2020 at 5:23 PM Tomas Vondra 
wrote:

> I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance
> with WAL on PMEM DAX device) is actually safe, but I included it anyway
> to see what difference is.


I am curious to learn more about this aspect. Kernels provide support for the
"pmemdax" mode, so which part of the stack is unsafe?

Reading the numbers, it seems that only at smaller scales does the modified
PostgreSQL give an enhanced benefit over unmodified PostgreSQL with "pmemdax".
For most other cases the numbers are pretty close between these two setups, so
I am curious to learn why we would even modify PostgreSQL if unmodified
PostgreSQL can provide a similar benefit with just DAX mode.


CREATE RULE may generate duplicate entries in pg_depend

2020-08-26 Thread Ashwin Agrawal
If the action and qual of a CREATE RULE reference the same object, duplicate
entries are created for it in pg_depend. This doesn't pose any harm, it just
unnecessarily bloats pg_depend; see InsertRule(). I think we should be able
to avoid adding the duplicate entries.

I don't know if this behaviour was discussed earlier; I didn't find it in a
search. We accidentally encountered it while enhancing a catalog check tool
for Greenplum Database.

For example (from rules test):
create table rtest_t5 (a int4, b text);
create table rtest_t7 (a int4, b text);

create rule rtest_t5_ins as on insert to rtest_t5
where new.a > 15 do
insert into rtest_t7 values (new.a, new.b);

# select classid::regclass, refobjid::regclass,* from pg_depend where
refobjid='rtest_t5'::regclass and deptype = 'n';
  classid   | refobjid | classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
------------+----------+---------+-------+----------+------------+----------+-------------+---------
 pg_rewrite | rtest_t5 |    2618 | 16457 |        0 |       1259 |    16445 |           1 | n
 pg_rewrite | rtest_t5 |    2618 | 16457 |        0 |       1259 |    16445 |           1 | n
 pg_rewrite | rtest_t5 |    2618 | 16457 |        0 |       1259 |    16445 |           2 | n
(3 rows)
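The fix hinted at above is straightforward in principle: collect the objects referenced by the action and the qual, and de-duplicate before recording dependencies. A hypothetical sketch of that idea (not the actual InsertRule() code; the OID values mirror the example output above):

```python
# Hypothetical sketch of de-duplicating dependency references before
# inserting them into pg_depend, as InsertRule() could do when a rule's
# action and qual reference the same object.

def dedupe_refs(action_refs, qual_refs):
    """Each ref is (refclassid, refobjid, refobjsubid); order is kept."""
    seen, out = set(), []
    for ref in action_refs + qual_refs:
        if ref not in seen:
            seen.add(ref)
            out.append(ref)
    return out

# Both the action and the qual reference column 1 of relation 16445:
action = [(1259, 16445, 1), (1259, 16445, 2)]
qual = [(1259, 16445, 1)]

# Without de-duplication three pg_depend rows would be recorded (as in
# the example output above); with it, only two distinct ones.
assert dedupe_refs(action, qual) == [(1259, 16445, 1), (1259, 16445, 2)]
```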


-- 
*Ashwin Agrawal (VMware)*


Re: language cleanups in code and docs

2020-08-20 Thread Ashwin Agrawal
On Wed, Jun 17, 2020 at 9:27 AM David Steele  wrote:

> On 6/17/20 12:08 PM, Magnus Hagander wrote:
> > On Wed, Jun 17, 2020 at 4:15 PM Andrew Dunstan
> > mailto:andrew.duns...@2ndquadrant.com>>
>
> >
> > I'm not sure I like doing s/Black/Block/ here. It reads oddly. There
> are
> > too many other uses of Block in the sources. Forbidden might be a
> better
> > substitution, or Banned maybe. BanList is even less characters than
> > BlackList.
> >
> > I'd be OK with either of those really -- I went with block because it
> > was the easiest one :)
> >
> > Not sure the number of characters is the important part :) Banlist does
> > make sense to me for other reasons though -- it's what it is, isn't it?
> > It bans those oids from being used in the current session -- I don't
> > think there's any struggle to "make that sentence work", which means
> > that seems like the relevant term.
>
> I've also seen allowList/denyList as an alternative. I do agree
> that blockList is a bit confusing since we often use block in a very
> different context.
>

+1 for allowList/denyList as alternative

> I do think it's worth doing -- it's a small round of changes, and it
> > doesn't change anything user-exposed, so the cost for us is basically
> zero.
>
> +1


Agreed, the number of occurrences of whitelist and blacklist is not large, so
cleaning these up would be helpful, and patches have already been proposed for
it:

git grep whitelist | wc -l
10
git grep blacklist | wc -l
40

Thanks a lot for the language cleanups. Greenplum, a fork of PostgreSQL,
wishes to perform similar cleanups, and upstream doing it really helps us
downstream.


For standby pg_ctl doesn't wait for PM_STATUS_READY in presence of promote_trigger_file

2020-08-04 Thread Ashwin Agrawal
If a standby (without hot standby enabled) is shut down and then promoted
using promote_trigger_file via pg_ctl start with -w (wait), pg_ctl currently
returns as soon as recovery has started. Instead, it would be helpful if
pg_ctl could wait until PM_STATUS_READY in this case, given that promotion
was requested.

pg_ctl -w returns as soon as recovery is started for a standby without hot
standby enabled because PM_STATUS_STANDBY is written on
PMSIGNAL_RECOVERY_STARTED. Given the intent to promote the standby using
promote_trigger_file, it would be better to not write PM_STATUS_STANDBY, and
instead let promotion complete and return only after connections can actually
be accepted.

This seems like helpful behavior for users, though I am not sure how much
promote_trigger_file is used with standbys without hot standby enabled. This
is something which will help to solidify some of the tests in Greenplum,
hence checking for interest in the same here.

It's doable via below patch:

diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 5b5fc97c72..c49010aa5a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5197,6 +5197,7 @@ sigusr1_handler(SIGNAL_ARGS)
        if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
                pmState == PM_STARTUP && Shutdown == NoShutdown)
        {
+               bool promote_trigger_file_exist = false;
                /* WAL redo has started. We're out of reinitialization. */
                FatalError = false;
                AbortStartTime = 0;
@@ -5218,12 +5219,25 @@ sigusr1_handler(SIGNAL_ARGS)
                if (XLogArchivingAlways())
                        PgArchPID = pgarch_start();

+               {
+                       /*
+                        * if the promote trigger file exists we don't wish to
+                        * convey PM_STATUS_STANDBY; instead we wish pg_ctl -w
+                        * to wait till connections can actually be accepted by
+                        * the database.
+                        */
+                       struct stat stat_buf;
+                       if (PromoteTriggerFile != NULL &&
+                               strcmp(PromoteTriggerFile, "") != 0 &&
+                               stat(PromoteTriggerFile, &stat_buf) == 0)
+                               promote_trigger_file_exist = true;
+               }
+
                /*
                 * If we aren't planning to enter hot standby mode later,
                 * treat RECOVERY_STARTED as meaning we're out of startup, and
                 * report status accordingly.
                 */
-               if (!EnableHotStandby)
+               if (!EnableHotStandby && !promote_trigger_file_exist)
                {
                        AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
 #ifdef USE_SYSTEMD


-- 
*Ashwin Agrawal (VMware)*


Re: SyncRepLock acquired exclusively in default configuration

2020-05-19 Thread Ashwin Agrawal
On Mon, May 18, 2020 at 7:41 PM Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:

> This item is for PG14, right? If so I'd like to add this item to the
> next commit fest.
>

Sure, please add it to the commitfest.
It seems though that it should be backpatched to the relevant branches as well.


Re: pg_basebackup misses to report checksum error

2020-05-07 Thread Ashwin Agrawal
On Thu, May 7, 2020 at 10:25 AM Stephen Frost  wrote:

> Greetings,
>
> * Ashwin Agrawal (aagra...@pivotal.io) wrote:
> > On Wed, May 6, 2020 at 3:02 PM Robert Haas 
> wrote:
> > > On Wed, May 6, 2020 at 5:48 PM Ashwin Agrawal 
> wrote:
> > > > If pg_basebackup is not able to read BLCKSZ content from file, then
> it
> > > > just emits a warning "could not verify checksum in file "" block
> > > > X: read buffer size X and page size 8192 differ" currently but misses
> > > > to error with "checksum error occurred". Only if it can read 8192 and
> > > > checksum mismatch happens will it error in the end.
> > >
> > > I don't think it's a good idea to conflate "hey, we can't checksum
> > > this because the size is strange" with "hey, the checksum didn't
> > > match". Suppose a file has 1000 full blocks and a partial block.
> > > All 1000 blocks have good checksums. With your change, ISTM that we'd
> > > first emit a warning saying that the checksum couldn't be verified,
> > > and then we'd emit a second warning saying that there was 1 checksum
> > > verification failure, which would also be reported to the stats
> > > system. I don't think that's what we want.
> >
> > I feel the intent of reporting the "total checksum verification failures"
> > is to report corruption; which kind of failure it is, is the secondary
> > piece of the puzzle. Not being able to read the checksum itself to verify
> > is also corruption, and I think counts as a checksum verification failure.
> > The WARNINGs provide fine-grained clarity on which type of checksum
> > verification failure it is, so I am not sure we really need fine-grained
> > clarity in the "total numbers" to differentiate these two types.
>
> Are we absolutely sure that there's no way for a partial block to end up
> being seen by pg_basebackup, which is just doing routine filesystem
> read() calls, during normal operation though..?  Across all platforms?
>

Okay, that's a good point, I didn't think about it. This comment about
skipping checksum verification convinces me, I suppose: we can't be sure, and
hence can't report partial blocks as corruption.

/*
 * Only check pages which have not been modified since the
 * start of the base backup. Otherwise, they might have been
 * written only halfway and the checksum would not be valid.
 * However, replaying WAL would reinstate the correct page in
 * this case. We also skip completely new pages, since they
 * don't have a checksum yet.
 */

It might be nice to have a similar comment for the partial-block case to
document why we can't report it as corruption. Thanks.
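The skip rule that comment describes can be sketched roughly as follows. This is a hypothetical simplification; the real code in basebackup.c also handles re-reads and other cases, and the LSN values here are purely illustrative.

```python
# Rough sketch of the page-skipping rule quoted above: a page whose LSN
# is newer than the backup start may have been written only halfway, so
# its checksum is not checked (WAL replay reinstates it anyway), and
# completely new pages have no checksum yet.

def should_verify_checksum(page_lsn, backup_start_lsn, is_new_page):
    if is_new_page:
        return False  # no checksum to verify yet
    # Pages modified since the backup started may be torn writes.
    return page_lsn < backup_start_lsn

# Old page, untouched since the backup started: verify it.
assert should_verify_checksum(1000, 5000, is_new_page=False)

# Page written after the backup started: skip verification.
assert not should_verify_checksum(9000, 5000, is_new_page=False)

# Brand-new page: skip verification.
assert not should_verify_checksum(0, 5000, is_new_page=True)
```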


Re: pg_basebackup misses to report checksum error

2020-05-07 Thread Ashwin Agrawal
On Wed, May 6, 2020 at 3:02 PM Robert Haas  wrote:

> On Wed, May 6, 2020 at 5:48 PM Ashwin Agrawal  wrote:
> > If pg_basebackup is not able to read BLCKSZ content from file, then it
> > just emits a warning "could not verify checksum in file "" block
> > X: read buffer size X and page size 8192 differ" currently but misses
> > to error with "checksum error occurred". Only if it can read 8192 and
> > checksum mismatch happens will it error in the end.
>
> I don't think it's a good idea to conflate "hey, we can't checksum
> this because the size is strange" with "hey, the checksum didn't
> match". Suppose a file has 1000 full blocks and a partial block.
> All 1000 blocks have good checksums. With your change, ISTM that we'd
> first emit a warning saying that the checksum couldn't be verified,
> and then we'd emit a second warning saying that there was 1 checksum
> verification failure, which would also be reported to the stats
> system. I don't think that's what we want.


I feel the intent of reporting the "total checksum verification failures" is
to report corruption; which kind of failure it is, is the secondary piece of
the puzzle. Not being able to read the checksum itself to verify is also
corruption, and I think counts as a checksum verification failure. The
WARNINGs provide fine-grained clarity on which type of checksum verification
failure it is, so I am not sure we really need fine-grained clarity in the
"total numbers" to differentiate these two types.

Not reporting anything to the stats system and then reporting at the end that
there was a checksum failure would be more weird, right? Or we would have to
say ERRCODE_DATA_CORRUPTED with some other message, and not call it a
checksum verification failure.

> There might be an argument
> for making this code trigger...
>
> ereport(ERROR,
> (errcode(ERRCODE_DATA_CORRUPTED),
>  errmsg("checksum verification failure during base
> backup")));
>
> ...but I wouldn't for that reason inflate the number of blocks that
> are reported as having failures.
>

When checksum verification is turned on and the issue is detected, I strongly
feel an ERROR must be triggered, as silently reporting success doesn't seem
right. I can introduce one more variable just to capture and track files with
such cases. But will we report them separately to stats? How? Also, do we
want a separate WARNING for the total number of files in this category? Those
all seem like slight complications, but if the wind is blowing in that
direction, I am ready to fly that way.


pg_basebackup misses to report checksum error

2020-05-06 Thread Ashwin Agrawal
If pg_basebackup is not able to read BLCKSZ bytes of content from a file, it
currently just emits a warning "could not verify checksum in file "" block
X: read buffer size X and page size 8192 differ" but fails to error with
"checksum error occurred". Only if it can read 8192 bytes and a checksum
mismatch happens will it error in the end.

Repro is pretty simple:
/usr/local/pgsql/bin/initdb -k /tmp/postgres
/usr/local/pgsql/bin/pg_ctl -D /tmp/postgres -l /tmp/logfile start
# just create random file of size not in multiple of 8192
echo "corruption" > /tmp/postgres/base/12696/4

Without the fix:
$ /usr/local/pgsql/bin/pg_basebackup -D /tmp/dummy
WARNING:  could not verify checksum in file "./base/12696/4", block 0:
read buffer size 11 and page size 8192 differ
$ echo $?
0

With the fix:
$ /usr/local/pgsql/bin/pg_basebackup -D /tmp/dummy
WARNING:  could not verify checksum in file "./base/12696/4", block 0:
read buffer size 11 and page size 8192 differ
pg_basebackup: error: checksum error occurred
$ echo $?
1


I think it's an important case to be handled and should not be silently
skipped, unless I am missing something. This one-liner should fix it:

diff --git a/src/backend/replication/basebackup.c
b/src/backend/replication/basebackup.c
index fbdc28ec39..68febbedf0 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1641,6 +1641,7 @@ sendFile(const char *readfilename, const char *tarfilename,
                                        "differ",
                                        readfilename,
                                        blkno, (int) cnt, BLCKSZ)));
                        verify_checksum = false;
+                       checksum_failures++;
                }

                if (verify_checksum)

Re: SyncRepLock acquired exclusively in default configuration

2020-04-07 Thread Ashwin Agrawal
On Mon, Apr 6, 2020 at 2:14 PM Andres Freund  wrote:

> > How about we change it to this ?
>
> Hm. Better. But I think it might need at least a compiler barrier /
> volatile memory load?  Unlikely here, but otherwise the compiler could
> theoretically just stash the variable somewhere locally (it's not likely
> to be a problem because it'd not be long ago that we acquired an lwlock,
> which is a full barrier).
>

That's the part I am not fully sure about. But reading the comment above
SyncRepUpdateSyncStandbysDefined(), it seems fine.

> Bring back the check which existed based on GUC but instead of just
> blindly
> > returning based on just GUC not being set, check
> > WalSndCtl->sync_standbys_defined. Thoughts?
>
> Hm. Is there any reason not to just check
> WalSndCtl->sync_standbys_defined? rather than both !SyncStandbysDefined()
> and WalSndCtl->sync_standbys_defined?
>

Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.

I wasn't thinking fully there, as I got distracted by whether a lock would
be required for reading it. If a lock were required, then checking the GUC
first would have been better, but it seems a lock is not required.


Re: SyncRepLock acquired exclusively in default configuration

2020-04-06 Thread Ashwin Agrawal
On Mon, Apr 6, 2020 at 1:52 AM Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:

> On Mon, 6 Apr 2020 at 14:04, Andres Freund  wrote:
> >
> > commit 48c9f4926562278a2fd2b85e7486c6d11705f177
> > Author: Simon Riggs 
> > Date:   2017-12-29 14:30:33 +
> >
> > Fix race condition when changing synchronous_standby_names
> >
> > A momentary window exists when synchronous_standby_names
> > changes that allows commands issued after the change to
> > continue to act as async until the change becomes visible.
> > Remove the race by using more appropriate test in syncrep.c
> >
> >     Author: Asim Rama Praveen and Ashwin Agrawal
> > Reported-by: Xin Zhang, Ashwin Agrawal, and Asim Rama Praveen
> > Reviewed-by: Michael Paquier, Masahiko Sawada
> >
> > As far as I can tell there was no discussion about the added contention
> > due this change in the relevant thread [1].
> >
> > The default configuration has an empty synchronous_standby_names. Before
> > this change we'd fall out of SyncRepWaitForLSN() before acquiring
> > SyncRepLock in exlusive mode. Now we don't anymore.
> >
> >
> > I'm really not ok with unneccessarily adding an exclusive lock
> > acquisition to such a crucial path.
> >
>
> I think we can acquire SyncRepLock in share mode once to check
> WalSndCtl->sync_standbys_defined and if it's true then check it again
> after acquiring it in exclusive mode. But it in turn ends up with
> adding one extra LWLockAcquire and LWLockRelease in sync rep path.
>

How about we change it to this ?

diff --git a/src/backend/replication/syncrep.c
b/src/backend/replication/syncrep.c
index ffd5b31eb2..cdb82a8b28 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -165,8 +165,11 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
/*
 * Fast exit if user has not requested sync replication.
 */
-   if (!SyncRepRequested())
-   return;
+   if (!SyncRepRequested() || !SyncStandbysDefined())
+   {
+   if (!WalSndCtl->sync_standbys_defined)
+   return;
+   }

Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);

This brings back the check which existed based on the GUC, but instead of
blindly returning just because the GUC is not set, it also checks
WalSndCtl->sync_standbys_defined. Thoughts?


Re: Make mesage at end-of-recovery less scary.

2020-03-23 Thread Ashwin Agrawal
On Mon, Mar 23, 2020 at 2:37 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > | [20866] LOG:  replication terminated by primary server
> > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > | [20866] FATAL:  could not send end-of-streaming message to primary: no
> COPY in progress
> > | [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive
> during standby mode
> > | [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0
> >
> > I changed the above to the below, which looks more adequate.
> >
> > | [24271]  LOG:  replication terminated by primary server on timeline 1
> at 0/3000240.
> > | [24271]  FATAL:  could not send end-of-streaming message to primary:
> no COPY in progress
> > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in
> archive during standby mode
> > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
>
> Is this the before and after?  That doesn't seem like a substantial
> improvement to me.  You still get the "scary" message at the end.
>

+1. I agree it still reads scary and doesn't seem like an improvement.

Plus, I am hoping the message will improve for pg_waldump as well?
It currently reads confusingly, and having to explain to every new
developer that it's expected behavior is annoying.

pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length
at 0/1553FA8: wanted 24, got 0


Re: Use LN_S instead of "ln -s" in Makefile

2020-02-14 Thread Ashwin Agrawal
On Fri, Feb 14, 2020 at 4:57 PM Tom Lane  wrote:

> Ashwin Agrawal  writes:
> > In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp
> > -pR' if configure finds the file system does not support symbolic
> > links.
> > I was playing with 'ln' for some other reason where I symbolic linked
> > it to '/bin/false'. That revealed the failure for
> > src/backend/Makefile. Greping for 'ln -s' revealed three places it's
> > used. Attaching the patch to fix the same.
>
> Hm.  So, these oversights are certainly bugs in narrow terms.  However,
> they've all been like that for a *long* time --- the three instances
> you found date to 2005, 2006, and 2008 according to "git blame".
> The complete lack of complaints shows that nobody cares.  So I think
> a fairly strong case could be made for going the other way, ie
> s/$(LN_S)/ln -s/g and get rid of the configure-time cycles wasted in
> setting up that variable.
>

I accede to it.


Use LN_S instead of "ln -s" in Makefile

2020-02-14 Thread Ashwin Agrawal
In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp
-pR' if configure finds the file system does not support symbolic
links.

I was playing with 'ln' for some other reason, where I symlinked
it to '/bin/false'. That revealed the failure for
src/backend/Makefile. Grepping for 'ln -s' revealed three places it's
used. Attaching the patch to fix the same.
From 60b3374de1e4d0fd143c2a7b3bf78893e32579af Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 14 Feb 2020 16:16:37 -0800
Subject: [PATCH v1] Use LN_S instead of "ln -s" in Makefile

---
 src/backend/Makefile  | 2 +-
 src/interfaces/ecpg/test/Makefile | 2 +-
 src/makefiles/pgxs.mk | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 9706a95848..f21a454286 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -224,7 +224,7 @@ install-bin: postgres $(POSTGRES_IMP) installdirs
 	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postgres$(X)'
 ifneq ($(PORTNAME), win32)
 	@rm -f '$(DESTDIR)$(bindir)/postmaster$(X)'
-	ln -s postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
+	$(LN_S) postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
 else
 	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
 endif
diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index be53b7b94d..11d55bde3e 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -73,7 +73,7 @@ remaining_files_build := $(patsubst $(srcdir)/%, $(abs_builddir)/%, $(remaining_
 
 all: $(remaining_files_build)
 $(remaining_files_build): $(abs_builddir)/%: $(srcdir)/%
-	ln -s $< $@
+	$(LN_S) $< $@
 endif
 
 # Common options for tests. Also pick up anything passed in EXTRA_REGRESS_OPTS
diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index 271e7eaba8..8c7d126593 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -397,7 +397,7 @@ test_files_build := $(patsubst $(srcdir)/%, $(abs_builddir)/%, $(test_files_src)
 all: $(test_files_build)
 $(test_files_build): $(abs_builddir)/%: $(srcdir)/%
 	$(MKDIR_P) $(dir $@)
-	ln -s $< $@
+	$(LN_S) $< $@
 endif # VPATH
 endif # REGRESS
 
-- 
2.25.0



Re: Extracting only the columns needed for a query

2020-01-31 Thread Ashwin Agrawal
On Wed, Jan 15, 2020 at 7:54 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > For UPDATE, we need all of the columns in the table because of the
> > table_lock() API's current expectation that the slot has all of the
> > columns populated. If we want UPDATE to only need to insert the column
> > values which have changed, table_tuple_lock() will have to change.
>
> Can you please elaborate on this part? I'm probably missing something,
> since I don't see immediately where are those expectations expressed.
>

table_tuple_lock() has a TupleTableSlot output argument. The comment for
the function states "*slot: contains the target tuple". Callers of
table_tuple_lock(), such as ExecUpdate() and GetTupleForTrigger(), use the
returned tuple for EvalPlanQual. Also, within table_tuple_lock() it is
unknown what the caller's expectation is and what will be performed with
the returned slot/tuple. Hence, it becomes tricky for any AM to guess, and
it currently needs to return the full tuple, as the API doesn't state which
columns the caller is interested in, or whether it just wishes to take the
lock and needs nothing back in the slot.


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2020-01-27 Thread Ashwin Agrawal
On Mon, Jan 27, 2020 at 4:47 PM Thomas Munro  wrote:

> OK, I kept only the small comment change from that little fixup patch,
> and pushed this.
>
> > I had proposed as alternative way in initial email and also later,
> > didn't receive comment on that, so re-posting.
>
> > typedef bool (*AMCheckForSerializableConflictOutCallback) (void *arg);
> ...
> > Just due to void* callback argument aspect I didn't prefer that
> > solution and felt AM performing checks and calling
> > CheckForSerializableConflictOut() seems better.  Please let me know
> > how you feel about this.
>
> Yeah.  We could always come back to this idea if it looks better once
> we have more experience with new table AMs.
>

Sounds good. Thank You!


Re: Should we rename amapi.h and amapi.c?

2019-12-23 Thread Ashwin Agrawal
On Sun, Dec 22, 2019 at 9:34 PM Michael Paquier  wrote:

> Hi all,
>
> I was working on some stuff for table AMs, and I got to wonder it we
> had better rename amapi.h to indexam.h and amapi.c to indexam.c, so as
> things are more consistent with table AM.  It is a bit annoying to
> name the files dedicated to index AMs with what looks like now a too
> generic name.  That would require switching a couple of header files
> for existing module developers, which is always annoying, but the move
> makes sense thinking long-term?
>
> Any thoughts?
>

I had raised the same question earlier, and [1] has the response from Andres, which was
"We probably should rename it, but not in 12..."

[1]
https://www.postgresql.org/message-id/20190508215135.4eljnhnle5xp3jwb%40alap3.anarazel.de


Re: Start Walreceiver completely before shut down it on standby server.

2019-12-10 Thread Ashwin Agrawal
On Tue, Dec 10, 2019 at 3:06 AM jiankang liu  wrote:

> Start Walreceiver completely before shut down it on standby server.
>
> The walreceiver will be shut down, when read an invalid record in the
> WAL streaming from master.And then, we retry from archive/pg_wal again.
>
> After that, we start walreceiver in RequestXLogStreaming(), and read
> record from the WAL streaming. But before walreceiver starts, we read
> data from file which be streamed over and present in pg_wal by last
> time, because of walrcv->receivedUpto > RecPtr and the wal is actually
> flush on disk. Now, we read the invalid record again, what the next to
> do? Shut down the walreceiver and do it again.
>

I am missing something here: if walrcv->receivedUpto > RecPtr, why are we
reading an invalid record?


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-11-12 Thread Ashwin Agrawal
On Sun, Nov 10, 2019 at 8:21 PM Thomas Munro  wrote:

> I pushed the first two,


Thank You!

> but on another read-through of the main patch
> I didn't like the comments for CheckForSerializableConflictOut() or
> the fact that it checks SerializationNeededForRead() again, after I
> thought a bit about what the contract for this API is now.  Here's a
> version with small fixup that I'd like to squash into the patch.
> Please let me know what you think,


The thought or reasoning behind having SerializationNeededForRead()
inside CheckForSerializableConflictOut() is to keep that API clean and
complete by itself. An AM like heap that needs to perform some
AM-specific checking only in the serialization-needed case can do so,
but is not forced to. So, if an AM, for example Zedstore, doesn't need
to do any specific checking, it can directly call
CheckForSerializableConflictOut(). With the modified fixup patch, the
responsibility is unnecessarily forced onto the caller even though
CheckForSerializableConflictOut() can handle it. I understand the intent
is to avoid the duplicate check for heap.

>
> or if you see how to improve it
> further.
>

I had proposed this as an alternative way in the initial email and also
later, but didn't receive comments on it, so re-posting.

An alternative way to refactor: CheckForSerializableConflictOut() can
take a callback function and a (void *) callback argument for the
AM-specific check. The flow would then be the AM calling
CheckForSerializableConflictOut() as it does today, which, only if
SerializationNeededForRead() passes, will invoke the callback to check
with the AM whether more work should be performed. Essentially,
HeapCheckForSerializableConflictOut() would become the callback function
instead. Roughly, it would look like:

typedef bool (*AMCheckForSerializableConflictOutCallback) (void *arg);

void CheckForSerializableConflictOut(Relation relation, TransactionId xid,
Snapshot snapshot, AMCheckForSerializableConflictOutCallback callback, void
*callback_arg)
{
if (!SerializationNeededForRead(relation, snapshot))
   return;
if (callback != NULL && !callback(callback_args))
   return;

.
}

With this, AMs which don't have any specific checks to perform can pass
NULL for the callback. The function call is then involved only if
SerializationNeededForRead() passes, and only for AMs which need it.

Just due to the void * callback argument aspect, I didn't prefer that
solution, and felt that the AM performing the checks and then calling
CheckForSerializableConflictOut() seems better. Please let me know
how you feel about this.


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-11-08 Thread Ashwin Agrawal
On Thu, Nov 7, 2019 at 8:44 PM Thomas Munro  wrote:

> On Thu, Aug 8, 2019 at 6:53 AM Ashwin Agrawal  wrote:
> >>> - I wonder if CheckForSerializableConflictOutNeeded() shouldn't have a
> >>>   portion of it's code as a static inline. In particular, it's a shame
> >>>   that we currently perform external function calls at quite the
> >>>   frequency when serializable isn't even in use.
> >>
> >> I am not sure on portion of the code part? SerializationNeededForRead()
> is static inline function in C file. Can't inline
> CheckForSerializableConflictOutNeeded() without moving
> SerializationNeededForRead() and some other variables to header file.
> CheckForSerializableConflictOut() wasn't inline either, so a function call
> was performed earlier as well when serializable isn't even in use.
>
> I agree that it's strange that we do these high frequency function
> calls just to figure out that we're not even using this stuff, which
> ultimately comes down to the static global variable MySerializableXact
> being not reachable from (say) an inline function defined in a header.
> That's something to look into in another thread.
>
> > Attaching re-based version of the patches on top of current master,
> which has the fix for HOT serializable predicate locking bug spotted by
> Andres committed now.
>
> I'm planning to commit these three patches on Monday.  I've attached
> versions with whitespace-only changes from pgindent, and commit
> messages lightly massaged and updated to point to this discussion and
> reviewers.
>

Thanks a lot, sounds good.


Missing error_context_stack = NULL in AutoVacWorkerMain()

2019-10-18 Thread Ashwin Agrawal
I am not sure whether this causes any actual problems, but for code
consistency it seems we are missing the change below. All other places in
the code where sigsetjmp() exists for top-level error handling set
error_context_stack to NULL.

diff --git a/src/backend/postmaster/autovacuum.c
b/src/backend/postmaster/autovacuum.c
index 073f313337..b06d0ad058 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1558,6 +1558,9 @@ AutoVacWorkerMain(int argc, char *argv[])
 */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
+   /* Since not using PG_TRY, must reset error stack by hand */
+   error_context_stack = NULL;
+
/* Prevents interrupts while cleaning up */
HOLD_INTERRUPTS();

This was spotted by Paul during code inspection.


Re: Proposal: Make use of C99 designated initialisers for nulls/values arrays

2019-10-04 Thread Ashwin Agrawal
On Fri, Oct 4, 2019 at 8:49 AM Tom Lane  wrote:

> Jacob Champion  writes:
> > On Fri, Oct 4, 2019 at 7:51 AM Tom Lane  wrote:
> >> I concur with Joe here.  The reason why some of the existing
> >> memset's use "false" is for symmetry with other places where we use
> >> "memset(p, true, n)" to set an array of bools to all-true.
>
> > Why introduce a macro at all for the universal zero initializer, if it
> > seems to encourage the construction of other (incorrect) macros?
>
> Well, the argument is that some people might think that if {0} is enough
> to set all array elements to 0, then maybe {1} sets them all to ones
> (as, indeed, one could argue would be a far better specification than
> what the C committee actually wrote).  Using a separate macro and then
> discouraging direct use of the incomplete-initializer syntax should help
> to avoid that error.
>

Reminding folks of the macro's existence seems like avoidable overhead.
Plus, it will be hard to remember that a macro exists for such a thing in
the first place. So, in the long run {0} might get used in the code
regardless, and hence it seems better to just use {0} from the start
instead of a macro/wrapper on top.

Plus, even if someone starts out thinking {1} sets all elements to one, I
feel they will soon realize by exercising the code that this isn't the
reality. If such code is written and nothing fails, that itself seems like
a bigger issue.


Re: heapam_index_build_range_scan's anyvisible

2019-09-25 Thread Ashwin Agrawal
On Wed, Sep 25, 2019 at 1:52 PM Alvaro Herrera 
wrote:

> Sounds OK ... except that Travis points out that Ashwin forgot to patch
> contrib:
>
> make[4]: Entering directory
> '/home/travis/build/postgresql-cfbot/postgresql/contrib/amcheck'
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -g -O2 -Wall -Werror -fPIC -I. -I.
> -I../../src/include -I/usr/include/x86_64-linux-gnu -D_GNU_SOURCE   -c -o
> verify_nbtree.o verify_nbtree.c
> verify_nbtree.c: In function ‘bt_check_every_level’:
> verify_nbtree.c:614:11: error: passing argument 6 of
> ‘table_index_build_scan’ from incompatible pointer type
> [-Werror=incompatible-pointer-types]
>bt_tuple_present_callback, (void *) state, scan);
>^
> In file included from verify_nbtree.c:29:0:
> ../../src/include/access/tableam.h:1499:1: note: expected
> ‘IndexBuildCallback {aka void (*)(struct RelationData *, struct
> ItemPointerData *, long unsigned int *, _Bool *, _Bool,  void *)}’ but
> argument is of type ‘void (*)(struct RelationData *, HeapTupleData *, Datum
> *, _Bool *, _Bool,  void *) {aka void (*)(struct RelationData *, struct
> HeapTupleData *, long unsigned int *, _Bool *, _Bool,  void *)}’
>  table_index_build_scan(Relation table_rel,
>  ^
> cc1: all warnings being treated as errors
> : recipe for target 'verify_nbtree.o' failed
> make[4]: *** [verify_nbtree.o] Error 1
>

Thanks for reporting, I did indeed miss contrib. Please find attached v2
of the patch, which includes the change required in contrib as well.
From 873711e0601d5035a302c9f2a4147250e902a28f Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Wed, 25 Sep 2019 22:19:43 -0700
Subject: [PATCH v2] Remove IndexBuildCallback's dependency on HeapTuple

With IndexBuildCallback taking input as HeapTuple, all table AMs
irrespective of storing the tuples in HeapTuple form or not, are
forced to construct HeapTuple, to insert the tuple in Index. Since,
only thing required by the index callbacks is TID and not really the
full tuple, modify callback to only take ItemPointer.
---
 contrib/amcheck/verify_nbtree.c  |  6 +++---
 contrib/bloom/blinsert.c |  4 ++--
 src/backend/access/brin/brin.c   |  4 ++--
 src/backend/access/gin/gininsert.c   |  4 ++--
 src/backend/access/gist/gistbuild.c  |  6 +++---
 src/backend/access/hash/hash.c   |  8 
 src/backend/access/heap/heapam_handler.c | 11 +--
 src/backend/access/nbtree/nbtsort.c  |  8 
 src/backend/access/spgist/spginsert.c|  4 ++--
 src/include/access/tableam.h |  2 +-
 10 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..3542545de5 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -140,7 +140,7 @@ static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 			  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
-static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 	  Datum *values, bool *isnull,
 	  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
@@ -1890,7 +1890,7 @@ bt_downlink_missing_check(BtreeCheckState *state)
  * also allows us to detect the corruption in many cases.
  */
 static void
-bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
 		  bool *isnull, bool tupleIsAlive, void *checkstate)
 {
 	BtreeCheckState *state = (BtreeCheckState *) checkstate;
@@ -1901,7 +1901,7 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
 
 	/* Generate a normalized index tuple for fingerprinting */
 	itup = index_form_tuple(RelationGetDescr(index), values, isnull);
-	itup->t_tid = htup->t_self;
+	itup->t_tid = *tid;
 	norm = bt_normalize_tuple(state, itup);
 
 	/* Probe Bloom filter -- tuple should be present */
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 4b2186b8dd..66c6c1858d 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -73,7 +73,7 @@ initCachedPage(BloomBuildState *buildstate)
  * Per-tuple callback for table_index_build_scan.
  */
 static void
-bloomBuildCallback(Relation index, HeapTuple htup, Datum *values,
+bloomBuildCallback(Relation index, ItemPointer tid, Datum *values,
    bool *isnull, bool tupleIsAlive, void *state)
 {
 	BloomBuildState *buildstate = (BloomBuildState *) st

Re: Zedstore - compressed in-core columnar storage

2019-08-26 Thread Ashwin Agrawal
On Mon, Aug 26, 2019 at 5:36 AM Ashutosh Sharma 
wrote:

> Thanks Ashwin and Heikki for your responses. I've one more query here,
>
> If BTree index is created on a zedstore table, the t_tid field of
> Index tuple contains the physical tid that is not actually pointing to
> the data block instead it contains something from which the logical
> tid can be derived. So, when IndexScan is performed on a zedstore
> table, it fetches the physical tid from the index page and derives the
> logical tid out of it and then retrieves the data corresponding to
> this logical tid from the zedstore table. For that, it kind of
> performs SeqScan on the zedstore table for the given tid.


Nope, it won't perform a seqscan, as zedstore is itself laid out as a btree
with the logical TID as its key. It can quickly find which page a logical
TID belongs to and access only that page; it doesn't need to perform a
seqscan for that. That's one of the rationales for laying things out in
btree fashion: to easily connect the logical to the physical world without
keeping any external mapping.

> AFAIU, the following user level query on zedstore table
>
> select * from zed_tab where a = 3;
>
> gets internally converted to
>
> select * from zed_tab where tid = 3; -- assuming that index is created
> on column 'a' and the logical tid associated with a = 3 is 3.
>

So, for this it will first access only the TID btree and find the leaf page
with tid=3. It performs the visibility checks for the tuple, and only if
the tuple is visible will it fetch all the columns for that TID, again
using the per-column btrees to fetch only the leaf page for that logical
TID.

Hope that helps to clarify the confusion.


Re: Comment in ginpostinglist.c doesn't match code

2019-08-22 Thread Ashwin Agrawal
On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas  wrote:

>
> The patch also includes a little unit test module to test this without
> creating a 16 TB table. A whole new test module seems a bit like
> overkill just for this, but clearly we were missing test coverage here.
> And it will come handy, if we want to invent a new better posting list
> format in the future. Thoughts on whether to include the test module or
> not?
>

I like the test, as it importantly adds missing coverage. It also really
simplifies the validation effort if a change is required in this area
anytime in the future. So, +1 for keeping it.


Re: Zedstore - compressed in-core columnar storage

2019-08-14 Thread Ashwin Agrawal
 On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma 
wrote:

> Hi Ashwin,
>
> I tried playing around with the zedstore code a bit today and there
> are couple questions that came into my mind.
>

Great! Thank You.


>
> 1) Can zedstore tables be vacuumed? If yes, does VACUUM on zedstore
> table set the VM bits associated with it.
>

Zedstore tables can be vacuumed. On vacuum, though, minimal work is
performed compared to heap: the full table is not scanned. Only the UNDO
log is truncated/discarded based on RecentGlobalXmin. Plus, only the
TidTree or meta column is scanned to find dead tuples, and index entries
are cleaned for them based on the same.

Currently, zedstore does not use the VM at all, so it doesn't touch it
during any operation.

> 2) Is there a chance that IndexOnlyScan would ever be required for
> zedstore tables considering the design approach taken for it?
>

We have not given much thought to IndexOnlyScans so far, but I think
IndexOnlyScan definitely would be beneficial for zedstore as well. Even
for normal index scans, fetching as many columns as possible from the
index itself and getting only the rest of the required columns from the
table would be good for zedstore; it would help to further cut down I/O.
Ideally, only the TidTree needs to be scanned for visibility checking, so
the cost of checking is much lower compared to heap (if the VM can't be
consulted), but it is still a cost. Also, with vacuum, if the UNDO log
gets trimmed, the visibility checks are pretty cheap. Still, given all
that, having a VM-type thing to optimize this further would help.


> Further, I tried creating a zedstore table with btree index on one of
> it's column and loaded around 50 lacs record into the table. When the
> indexed column was scanned (with enable_seqscan flag set to off), it
> went for IndexOnlyScan and that took around 15-20 times more than it
> would take for IndexOnly Scan on heap table just because IndexOnlyScan
> in zedstore always goes to heap as the visibility check fails.
> However, the seqscan on zedstore table is quite faster than seqscan on
> heap table because the time taken for I/O is quite less in case for
> zedstore.
>

Thanks for reporting, we will look into it. We should be able to optimize
it. Given that no VM exists, IndexOnlyScans for zedstore currently behave
more or less like IndexScans. The planner picks IndexOnlyScans for
zedstore mostly due to off values for reltuples, relpages, and
relallvisible.

We have been focused on implementing and optimizing the AM pieces, so not
much work has been done on planner estimates and tuning yet. The first
step toward that, getting the needed columns in the planner instead of the
executor, is proposed in [1]. Once that bakes, we will use it to perform
better planner estimates. Also, analyze needs work to properly reflect
reltuples and relpages so as to influence the planner correctly.


[1]
https://www.postgresql.org/message-id/CAAKRu_ZQ0Jy7LfZDCY0JdxChdpja9rf-S8Y5%2BU4vX7cYJd62dA%40mail.gmail.com


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-07 Thread Ashwin Agrawal
On Fri, Aug 2, 2019 at 4:56 PM Ashwin Agrawal  wrote:

>
> On Wed, Jul 31, 2019 at 2:06 PM Andres Freund  wrote:
>
>> Looking at the code as of master, we currently have:
>>
>
> Super awesome feedback and insights, thank you!
>
> - PredicateLockTuple() calls SubTransGetTopmostTransaction() to figure
>>   out a whether the tuple has been locked by the current
>>   transaction. That check afaict just should be
>>   TransactionIdIsCurrentTransactionId(), without all the other
>>   stuff that's done today.
>>
>
> Agree. v1-0002 patch attached does that now. Please let me know if that's
> what you meant.
>
>   TransactionIdIsCurrentTransactionId() imo ought to be optimized to
>>   always check for the top level transactionid first - that's a good bet
>>   today, but even moreso for the upcoming AMs that won't have separate
>>   xids for subtransactions.  Alternatively we shouldn't make that a
>>   binary search for each subtrans level, but just have a small
>>   simplehash hashtable for xids.
>>
>
> v1-0001 patch checks for GetTopTransactionIdIfAny() first in
> TransactionIdIsCurrentTransactionId() which seems yes better in general and
> more for future. That mostly meets the needs for current discussion.
>
> The alternative of not using binary search seems bigger refactoring and
> should be handled as separate optimization exercise outside of this thread.
>
>
>> - CheckForSerializableConflictOut() wants to get the toplevel xid for
>>   the tuple, because that's the one the predicate hashtable stores.
>>
>>   In your patch you've already moved the HTSV() call etc out of
>>   CheckForSerializableConflictOut(). I'm somewhat inclined to think that
>>   the SubTransGetTopmostTransaction() call ought to go along with that.
>>   I don't really think that belongs in predicate.c, especially if
>>   most/all new AMs don't use subtransaction ids.
>>
>>   The only downside is that currently the
>>   TransactionIdEquals(xid, GetTopTransactionIdIfAny()) check
>>   avoids the SubTransGetTopmostTransaction() check.
>>
>>   But again, the better fix for that seems to be to improve the generic
>>   code. As written the check won't prevent a subtrans lookup for heap
>>   when subtransactions are in use, and it's IME pretty common for tuples
>>   to get looked at again in the transaction that has created them. So
>>   I'm somewhat inclined to think that SubTransGetTopmostTransaction()
>>   should have a fast-path for the current transaction - probably just
>>   employing TransactionIdIsCurrentTransactionId().
>>
>
That optimization, as Kuntal also mentioned, seems like something which can
be done afterwards on top of the current patch.
>
>
>> I don't really see what we gain by having the subtrans handling in the
>> predicate code. Especially given that we've already moved the HTSV()
>> handling out, it seems architecturally the wrong place to me - but I
>> admit that that's a fuzzy argument.  The relevant mapping should be one
>> line in the caller.
>>
>
> Okay, I moved the subtransaction handling out of
> CheckForSerializableConflictOut() and have it alongside HTSV() now.
>
> The reason I felt like leaving the subtransaction handling in a generic
> place was that it might be premature to think no future AM will need
> it. Plus, it is easier if all the serializable function APIs have the same
> expectations: PredicateLockTuple() can be passed a top or a subtransaction
> id and handles either, whereas with the change
> CheckForSerializableConflictOut() can only be fed a top transaction
> ID. But it's fine, and I can see the point that an AM needing it can
> easily get the top transaction ID and feed it in, as heap does.
>
>
>> I wonder if it'd be worth to combine the
>> TransactionIdIsCurrentTransactionId() calls in the heap cases that
>> currently do both, PredicateLockTuple() and
>> HeapCheckForSerializableConflictOut(). The heap_fetch() case probably
>> isn't commonly that hot a path, but heap_hot_search_buffer() is.
>>
>
Maybe; I will give it some thought separately from the current patch.
>
>
>> Minor notes:
>> - I don't think 'insert_xid' is necessarily great - it could also be the
>>   updating xid etc. And while you can argue that an update is an insert
>>   in the current heap, that's not the case for future AMs.
>> - to me
>> @@ -1621,7 +1622,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation
>> relation, Buffer buffer,
>> if (valid)
>> {
>> ItemPointerSetOffsetNumber(tid, offnum);
>> -   PredicateL

Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-02 Thread Ashwin Agrawal
ple->t_self)? Even that is pretty superfluous,
>  but it at least clarifies the precedence.
>

Fixed. No idea what I was thinking there; I mostly feel I intended to have
it as &(heapTuple->t_self).

  I'm also a bit confused why we don't need to pass in the offset of the
>  current tuple, rather than the HOT root tuple here. That's not related
>  to this patch. But aren't we locking the wrong tuple here, in case of
>  HOT?
>

Yes, the root is being locked here instead of the HOT tuple. But I don't
have full context on that. If we wish to fix it though, it can easily be
done now with the patch by passing "tid" instead of &(heapTuple->t_self).

- I wonder if CheckForSerializableConflictOutNeeded() shouldn't have a
>   portion of it's code as a static inline. In particular, it's a shame
>   that we currently perform external function calls at quite the
>   frequency when serializable isn't even in use.
>

I am not sure which portion of the code you mean?
SerializationNeededForRead() is a static inline function in the C file. We
can't inline CheckForSerializableConflictOutNeeded() without moving
SerializationNeededForRead() and some other variables to a header file.
CheckForSerializableConflictOut() wasn't inline either, so a function call
was performed earlier as well when serializable isn't even in use.

I understand that with the refactor, HeapCheckForSerializableConflictOut()
is called, which calls CheckForSerializableConflictOutNeeded(). If that's
the problem, I had proposed an alternative way to refactor to address it:
CheckForSerializableConflictOut() could instead take a callback function
and a void* callback argument for the AM-specific check. The flow would
then be the AM calling CheckForSerializableConflictOut() as today, and only
if serializable is in use would it invoke the callback to check with the AM
whether more work should be performed. Essentially,
HeapCheckForSerializableConflictOut() would become the callback function
instead. Due to the void* callback argument I didn't like that solution,
and felt the AM performing its checks and then calling
CheckForSerializableConflictOut() is more straightforward.
From 2516c592db97f285925667aafb34fc2de4286282 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 2 Aug 2019 15:02:28 -0700
Subject: [PATCH v1 1/3] Optimize TransactionIdIsCurrentTransactionId()

If xid is the current top transaction, we can cheaply check for that
and exit early. This should optimize for the current heap, but it also
works very well for future AMs, which don't have separate xids for
subtransactions. This optimization was proposed by Andres.
---
 src/backend/access/transam/xact.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1bbaeeebf4d..41952bc4d26 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -871,6 +871,9 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
 	if (!TransactionIdIsNormal(xid))
 		return false;
 
+	if (TransactionIdEquals(xid, GetTopTransactionIdIfAny()))
+		return true;
+
 	/*
 	 * In parallel workers, the XIDs we must consider as current are stored in
 	 * ParallelCurrentXids rather than the transaction-state stack.  Note that
-- 
2.19.1

From 45ba97a206a8c8f04188fe10b9b73c0b79f5d263 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 2 Aug 2019 15:21:16 -0700
Subject: [PATCH v1 2/3] Optimize PredicateLockTuple().

PredicateLockTuple() exits fast if the tuple was written by the current
transaction, as in that case it already has the lock. This check can
easily be performed using TransactionIdIsCurrentTransactionId()
instead of SubTransGetTopmostTransaction(). Since
TransactionIdIsCurrentTransactionId() is an all in-memory operation,
this is efficient compared to SubTransGetTopmostTransaction(),
which can hit the disk.

This simplification was proposed by Andres.
---
 src/backend/storage/lmgr/predicate.c | 22 --
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 2d709420c3d..1417ba37441 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2559,24 +2559,10 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot)
 	 */
 	if (relation->rd_index == NULL)
 	{
-		TransactionId myxid;
-
-		targetxmin = HeapTupleHeaderGetXmin(tuple->t_data);
-
-		myxid = GetTopTransactionIdIfAny();
-		if (TransactionIdIsValid(myxid))
-		{
-			if (TransactionIdFollowsOrEquals(targetxmin, TransactionXmin))
-			{
-				TransactionId xid = SubTransGetTopmostTransaction(targetxmin);
-
-				if (TransactionIdEquals(xid, myxid))
-				{
-					/* We wrote it; we already have a write lock. */
-					return;
-				}
-			}
-		}
+		/* If we wrote it; we already have a write lock. */
+		if (TransactionIdIsCurrentTransactionId(
+			HeapTupleHeaderGetXmin(tuple->t_data)))
+			return;
 	}
 
 	/*
-- 

Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Ashwin Agrawal
On Wed, Jul 31, 2019 at 12:37 PM Ashwin Agrawal  wrote:

>
> On Wed, Jul 31, 2019 at 10:55 AM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2019-07-31 10:42:50 -0700, Ashwin Agrawal wrote:
>> > On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro 
>> wrote:
>> >
>> > > On Tue, Jun 25, 2019 at 6:02 AM Andres Freund 
>> wrote:
>> > > > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
>> > > > >   buffer, instead just takes xid. Push heap specific parts from
>> > > > >   CheckForSerializableConflictOut() into its own function
>> > > > >   HeapCheckForSerializableConflictOut() which calls
>> > > > >   CheckForSerializableConflictOut(). The alternative option could
>> be
>> > > > >   CheckForSerializableConflictOut() take callback function and
>> > > > >   callback arguments, which gets called if required after
>> performing
>> > > > >   prechecks. Though currently I fell AM having its own wrapper to
>> > > > >   perform AM specific task and then calling
>> > > > >   CheckForSerializableConflictOut() is fine.
>> > > >
>> > > > I think it's right to move the xid handling out of
>> > > > CheckForSerializableConflictOut(). But I think we also ought to
>> move the
>> > > > subtransaction handling out of the function - e.g. zheap doesn't
>> > > > want/need that.
>> > >
>> > > Thoughts on this Ashwin?
>> > >
>> >
>> > I think the only part its doing for sub-transaction is
>> > SubTransGetTopmostTransaction(xid). If xid passed to this function is
>> > already top most transaction which is case for zheap and zedstore, then
>> > there is no downside to keeping that code here in common place.
>>
>> Well, it's far from a cheap function. It'll do unnecessary on-disk
>> lookups in many cases. I'd call that quite a downside.
>>
>
> Okay, agree, it's a costly function and better to avoid the call if possible.
>
> Instead of moving the handling out of the function, how do you feel about
> adding a boolean isTopTransactionId argument to
> CheckForSerializableConflictOut()? The AMs which implicitly know they only
> pass a top transaction Id to this function can pass true and avoid the
> call to SubTransGetTopmostTransaction(xid). With this, the subtransaction
> code remains in a generic place and AMs intending to use it continue to
> leverage the common code, plus it explicitly clarifies the behavior as
> well.
>

Added an argument to the function to make the subtransaction handling
optional in the attached v2 of the patch.
From 4ca67592f34b63cf80247068a128407d800c62a6 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Wed, 31 Jul 2019 13:53:52 -0700
Subject: [PATCH v2] Remove HeapTuple dependency for predicate locking
 functions.

The following changes help make the predicate locking functions generic
and not tied to a particular AM.

- PredicateLockTuple() is renamed to PredicateLockTID(). It now takes
  an ItemPointer and the inserting transaction id instead of a HeapTuple.

- CheckForSerializableConflictIn() takes a blocknum instead of a buffer.

- CheckForSerializableConflictOut() no longer takes a HeapTuple or a buffer.
---
 src/backend/access/gin/ginbtree.c|   2 +-
 src/backend/access/gin/ginfast.c |   2 +-
 src/backend/access/gin/gininsert.c   |   4 +-
 src/backend/access/gist/gist.c   |   2 +-
 src/backend/access/hash/hashinsert.c |   2 +-
 src/backend/access/heap/heapam.c | 104 --
 src/backend/access/heap/heapam_handler.c |   7 +-
 src/backend/access/index/indexam.c   |   4 +-
 src/backend/access/nbtree/nbtinsert.c|   4 +-
 src/backend/storage/lmgr/predicate.c | 134 ---
 src/include/access/heapam.h  |   2 +
 src/include/storage/predicate.h  |  10 +-
 12 files changed, 162 insertions(+), 115 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 11a8ed7bbc2..e795375495b 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -89,7 +89,7 @@ ginFindLeafPage(GinBtree btree, bool searchMode,
 	stack->predictNumber = 1;
 
 	if (rootConflictCheck)
-		CheckForSerializableConflictIn(btree->index, NULL, stack->buffer);
+		CheckForSerializableConflictIn(btree->index, NULL, btree->rootBlkno);
 
 	for (;;)
 	{
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 439a91b3e61..d7b52476817 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -246,7 +246,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)

Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Ashwin Agrawal
On Wed, Jul 31, 2019 at 10:55 AM Andres Freund  wrote:

> Hi,
>
> On 2019-07-31 10:42:50 -0700, Ashwin Agrawal wrote:
> > On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro 
> wrote:
> >
> > > On Tue, Jun 25, 2019 at 6:02 AM Andres Freund 
> wrote:
> > > > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
> > > > >   buffer, instead just takes xid. Push heap specific parts from
> > > > >   CheckForSerializableConflictOut() into its own function
> > > > >   HeapCheckForSerializableConflictOut() which calls
> > > > >   CheckForSerializableConflictOut(). The alternative option could
> be
> > > > >   CheckForSerializableConflictOut() take callback function and
> > > > >   callback arguments, which gets called if required after
> performing
> > > > >   prechecks. Though currently I fell AM having its own wrapper to
> > > > >   perform AM specific task and then calling
> > > > >   CheckForSerializableConflictOut() is fine.
> > > >
> > > > I think it's right to move the xid handling out of
> > > > CheckForSerializableConflictOut(). But I think we also ought to move
> the
> > > > subtransaction handling out of the function - e.g. zheap doesn't
> > > > want/need that.
> > >
> > > Thoughts on this Ashwin?
> > >
> >
> > I think the only part its doing for sub-transaction is
> > SubTransGetTopmostTransaction(xid). If xid passed to this function is
> > already top most transaction which is case for zheap and zedstore, then
> > there is no downside to keeping that code here in common place.
>
> Well, it's far from a cheap function. It'll do unnecessary on-disk
> lookups in many cases. I'd call that quite a downside.
>

Okay, agree, it's a costly function and better to avoid the call if possible.

Instead of moving the handling out of the function, how do you feel about
adding a boolean isTopTransactionId argument to
CheckForSerializableConflictOut()? The AMs which implicitly know they only
pass a top transaction Id to this function can pass true and avoid the
call to SubTransGetTopmostTransaction(xid). With this, the subtransaction
code remains in a generic place and AMs intending to use it continue to
leverage the common code, plus it explicitly clarifies the behavior as
well.
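As a standalone sketch of that proposal (stand-in types and a counter in place of the real pg_subtrans machinery — not the actual PostgreSQL code), only the xid-normalization step is modeled, not the predicate-lock work itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy parent map standing in for pg_subtrans; entry 0 means "no parent".
 * The counter models the (possibly on-disk) lookups we want to avoid. */
static TransactionId parent_of[64];
static int  subtrans_lookups;

static TransactionId
subtrans_topmost(TransactionId xid)
{
    subtrans_lookups++;
    while (parent_of[xid] != 0)
        xid = parent_of[xid];
    return xid;
}

/* Sketch of the proposed signature: callers that by construction only
 * ever pass top-level xids (zheap/zedstore style) set isTopTransactionId
 * and skip the subtrans mapping; heap-style callers pass false. */
static TransactionId
check_conflict_out_xid(TransactionId xid, bool isTopTransactionId)
{
    if (!isTopTransactionId)
        xid = subtrans_topmost(xid);
    return xid;   /* ...the predicate-hashtable work would then use this */
}
```

The counter is only there to make the saved lookup observable; in PostgreSQL the saving would be the avoided pg_subtrans access, which can hit the disk.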


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Ashwin Agrawal
On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro  wrote:

> On Tue, Jun 25, 2019 at 6:02 AM Andres Freund  wrote:
> > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
> > >   buffer, instead just takes xid. Push heap specific parts from
> > >   CheckForSerializableConflictOut() into its own function
> > >   HeapCheckForSerializableConflictOut() which calls
> > >   CheckForSerializableConflictOut(). The alternative option could be
> > >   CheckForSerializableConflictOut() take callback function and
> > >   callback arguments, which gets called if required after performing
> > >   prechecks. Though currently I feel AM having its own wrapper to
> > >   perform AM specific task and then calling
> > >   CheckForSerializableConflictOut() is fine.
> >
> > I think it's right to move the xid handling out of
> > CheckForSerializableConflictOut(). But I think we also ought to move the
> > subtransaction handling out of the function - e.g. zheap doesn't
> > want/need that.
>
> Thoughts on this Ashwin?
>

I think the only part it's doing for sub-transactions is
SubTransGetTopmostTransaction(xid). If the xid passed to this function is
already the topmost transaction, which is the case for zheap and zedstore,
then there is no downside to keeping that code here in a common place.


Re: heapam_index_build_range_scan's anyvisible

2019-07-30 Thread Ashwin Agrawal
On Tue, Jul 16, 2019 at 10:22 AM Andres Freund  wrote:

> Hi,
>
> On 2019-07-11 17:27:46 -0700, Ashwin Agrawal wrote:
> > Please find attached the patch to remove IndexBuildCallback's dependency
> on
> > HeapTuple, as discussed. Changed to have the argument as ItemPointer
> > instead of HeapTuple. Other larger refactoring if feasible for
> > index_build_range_scan API itself can be performed as follow-up changes.
>
> > From f73b0061795f0c320f96ecfed0c0602ae318d73e Mon Sep 17 00:00:00 2001
> > From: Ashwin Agrawal 
> > Date: Thu, 11 Jul 2019 16:58:50 -0700
> > Subject: [PATCH v1] Remove IndexBuildCallback's dependency on HeapTuple.
> >
> > With IndexBuildCallback taking input as HeapTuple, all table AMs
> > irrespective of storing the tuples in HeapTuple form or not, are
> > forced to construct HeapTuple, to insert the tuple in Index. Since,
> > only thing required by the index callbacks is TID and not really the
> > full tuple, modify callback to only take ItemPointer.
>
> Looks good to me. Planning to apply this unless somebody wants to argue
> against it soon?
>

Andres, I didn't yet register this for the next commitfest. If it's going
in soon anyway I will not do it; otherwise let me know and I will add it to
the list.


Re: heapam_index_build_range_scan's anyvisible

2019-07-11 Thread Ashwin Agrawal
On Mon, Jun 10, 2019 at 5:38 PM Ashwin Agrawal  wrote:

>
> On Mon, Jun 10, 2019 at 2:56 PM Andres Freund  wrote:
>
>> > Currently, all AMs need to build a HeapTuple in the
>> > index_build_range_scan function. I looked into all the callback
>> > functions and only htup->t_self is used from the heap tuple in all of
>> > them (unless I missed something). So, if that seems fine, I will be
>> > happy to write a patch to make that argument an ItemPointer instead of
>> > a HeapTuple?
>>
>> I wonder if it should better be the slot? It's not inconceivable that
>> some AMs could benefit from that. Although it'd add some complication
>> to the heap HeapTupleIsHeapOnly case.
>>
>
> I like the slot suggestion, but only if FormIndexDatum() can be pushed out
> of AM code as a result and the slot passed to the callback. Not sure what
> else it would help with. HeapTupleIsHeapOnly can possibly be made to work
> with a slot similar to the current hack of updating the t_self and the
> slot's tid field, maybe.
>
> An index callback using the slot could form the index datum itself. Though
> that would mean every index AM callback function needs to do it; not sure
> which way is better. Plus, an ExecutorState needs to be created for the
> same. With the current setup every AM needs to do that. It feels better if
> that belongs to the indexing code instead of the AM, though.
>
> Currently, an index build needing to create an ExecutorState and all at
> the AM layer doesn't seem nice either. Maybe there is quite generic logic
> here that can possibly be extracted into common code which most AMs
> leverage. Or possibly the API itself can be simplified to take minimal
> input from the AM and keep the rest of the flow/machinery at the upper
> layer. As Robert pointed out at the start of the thread, at heart it's a
> pretty simple flow and will possibly remain the same for any AM.
>
>
Please find attached the patch to remove IndexBuildCallback's dependency on
HeapTuple, as discussed. Changed to have the argument as ItemPointer
instead of HeapTuple. Other larger refactoring if feasible for
index_build_range_scan API itself can be performed as follow-up changes.
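The callback reshaping can be modeled standalone (hypothetical stand-in types, not the real PostgreSQL definitions): the build scan hands the callback just a TID, and a BRIN-style consumer extracts the block number without any HeapTuple being built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the on-disk tuple identifier. */
typedef struct ItemPointerData
{
    uint32_t block;
    uint16_t offset;
} ItemPointerData;

/* The reshaped callback: only a TID, no HeapTuple, so a non-heap AM need
 * not materialize one just to feed the index build. */
typedef void (*IndexBuildCallback) (const ItemPointerData *tid,
                                    bool tupleIsAlive, void *state);

/* A brin-style consumer that only ever needed the block number: it
 * counts runs of consecutive TIDs that share a block. */
typedef struct BrinState
{
    uint32_t lastBlock;
    int      nBlockRuns;
} BrinState;

static void
brin_like_callback(const ItemPointerData *tid, bool tupleIsAlive, void *state)
{
    BrinState *bs = (BrinState *) state;

    if (!tupleIsAlive)
        return;
    if (bs->nBlockRuns == 0 || tid->block != bs->lastBlock)
        bs->nBlockRuns++;
    bs->lastBlock = tid->block;
}

/* Fake table-AM range scan: hands out TIDs without building tuples. */
static void
fake_range_scan(const ItemPointerData *tids, int ntids,
                IndexBuildCallback cb, void *state)
{
    for (int i = 0; i < ntids; i++)
        cb(&tids[i], true, state);
}
```

In the actual patch the heap AM simply passes the tuple's t_self at the existing call sites, so the heap path is unchanged while other AMs avoid constructing a HeapTuple.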
From f73b0061795f0c320f96ecfed0c0602ae318d73e Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Thu, 11 Jul 2019 16:58:50 -0700
Subject: [PATCH v1] Remove IndexBuildCallback's dependency on HeapTuple.

With IndexBuildCallback taking a HeapTuple as input, all table AMs,
irrespective of whether they store tuples in HeapTuple form or not, are
forced to construct a HeapTuple just to insert the tuple into an
index. Since the only thing required by the index callbacks is the TID
and not the full tuple, modify the callback to take only an ItemPointer.
---
 src/backend/access/brin/brin.c   |  4 ++--
 src/backend/access/gin/gininsert.c   |  4 ++--
 src/backend/access/gist/gistbuild.c  |  6 +++---
 src/backend/access/hash/hash.c   |  8 
 src/backend/access/heap/heapam_handler.c | 11 +--
 src/backend/access/nbtree/nbtsort.c  |  8 
 src/backend/access/spgist/spginsert.c|  4 ++--
 src/include/access/tableam.h |  2 +-
 8 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd9..d27d503c7f1 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -597,7 +597,7 @@ brinendscan(IndexScanDesc scan)
  */
 static void
 brinbuildCallback(Relation index,
-  HeapTuple htup,
+  ItemPointer tid,
   Datum *values,
   bool *isnull,
   bool tupleIsAlive,
@@ -607,7 +607,7 @@ brinbuildCallback(Relation index,
 	BlockNumber thisblock;
 	int			i;
 
-	thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+	thisblock = ItemPointerGetBlockNumber(tid);
 
 	/*
 	 * If we're in a block that belongs to a future range, summarize what
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab146173..d19cbcb2f90 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -273,7 +273,7 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 }
 
 static void
-ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
+ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  bool *isnull, bool tupleIsAlive, void *state)
 {
 	GinBuildState *buildstate = (GinBuildState *) state;
@@ -285,7 +285,7 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 			   values[i], isnull[i],
-			   &htup->t_self);
+			   tid);
 
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size) maintenance_work_mem * 1024L)
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuil

Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11

2019-07-10 Thread Ashwin Agrawal
On Wed, Jul 10, 2019 at 6:46 AM Tom Lane  wrote:

> Michael Paquier  writes:
> > On Wed, Jul 10, 2019 at 12:51:28PM +0530, Amit Kapila wrote:
> >> It would be good if we can come up with something like that.  It will
> >> be helpful for zheap, where in some cases we get different row
> >> ordering due to in-place updates.  As of now, we try to add Order By
> >> or do some extra magic to get consistent row ordering.
>
> > That was an issue for me as well when working with Postgres-XC when
> > the row ordering was not guaranteed depending on the number of nodes
> > (speaking of which Greenplum has the same issues, no?).  Adding ORDER
> > BY clauses to a set of tests may make sense, but then this may impact
> > the plans generated for some of them..
>
> Yeah, I do not want to get into a situation where we can't test
> queries that lack ORDER BY.  Also, the fact that tableam X doesn't
> reproduce heap's row ordering is not a good reason to relax the
> strength of the tests for heap.  So I'm wondering about some
> postprocessing that we could optionally apply.  Perhaps the tools
> Melanie mentions could help.
>

Surprisingly, I have been working for a couple of days on using those
Perl tools from Greenplum for Zedstore, since for Zedstore the plans
differ for many regress tests (the relation size is not the same as
heap) and, for similar reasons, row orders change as well. The aim is
to effectively use the tests untouched to validate Zedstore, and yes,
I was thinking it will help Zheap testing as well. I also tested the
same for regressplans.sh, and it will lift a lot of the manual burden
of investigating the results, as one can specify to completely ignore
explain plan outputs in the comparison between results and
expected. Will post a patch for the tool once I get it in a little
more decent shape.


Re: Declared but no defined functions

2019-07-08 Thread Ashwin Agrawal
On Sat, Jul 6, 2019 at 4:32 PM Masahiko Sawada 
wrote:

> Indeed. I've tried to search again with the following script and got
> more such functions.
>
> for func in `git ls-files | egrep "\w+\.h$" | xargs cat | egrep -v
> "(^typedef)|(DECLARE)|(BKI)" | egrep "^(extern )*[\_0-9A-Za-z]+
> [\_\*0-9a-zA-Z]+ ?\(.+\);$" | sed -e "s/\(^extern \)*[\_0-9A-Za-z]\+
> \([\_0-9A-Za-z\*]\+\) \{0,1\}(.*);$/\2(/g" | sed -e "s/\*//g"`
> do
> if [ "`git grep "$func" -- "*.c" | wc -l`" -lt 1 ];then
> echo $func
> fi
> done
>
>
Do we wish to make this a tool and have it in src/tools, either as part of
the find_static tool (after renaming that one to a more generic name) or as
an independent script?
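If it does become a tool, the quoted one-liner could be wrapped into a reusable function roughly like this — a heuristic sketch in the same spirit as the original (regexes over GNU grep/sed, not a real C parser), with hypothetical naming:

```shell
# List functions that are declared in headers under a directory but never
# mentioned in any .c file there. Like the quoted script, this is purely
# textual: functions referenced or defined only via macros will show up
# as false positives, so the output is a starting point for review.
find_undefined_decls() {
    dir=${1:-.}

    # Pull "type name(" declaration heads out of headers, keep the name.
    grep -rhoE '^(extern )?[_A-Za-z][_0-9A-Za-z]*[ *]+[_A-Za-z][_0-9A-Za-z]*\(' \
         --include='*.h' "$dir" |
    sed -E 's/.*[ *]//; s/\($//' |
    sort -u |
    while read -r func; do
        # Report names with no whole-word match in any .c file.
        grep -rqE "\b$func\b" --include='*.c' "$dir" || echo "$func"
    done
}
```

Usage would be along the lines of `find_undefined_decls src/include`, scoping the search to whatever tree is of interest.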


Re: Comment typo in tableam.h

2019-07-08 Thread Ashwin Agrawal
On Sat, Jul 6, 2019 at 12:05 AM Amit Kapila  wrote:

> On Tue, Jul 2, 2019 at 1:00 AM Ashwin Agrawal  wrote:
> > Please find attached v2 of patch 1 without objectionable comment change.
> v1 of patch 2 attaching here just for convenience, no modifications made to
> it.
> >
>
> 0001*
>   * See table_index_fetch_tuple's comment about what the difference between
> - * these functions is. This function is the correct to use outside of
> - * index entry->table tuple lookups.
> + * these functions is. This function is correct to use outside of index
> + * entry->table tuple lookups.
>
> How about if we write the last line of comment as "It is correct to
> use this function outside of index entry->table tuple lookups."?  I am
> not an expert on this matter, but I find the way I am suggesting
> easier to read.
>

I am fine with the way you have suggested.


Re: C testing for Postgres

2019-07-02 Thread Ashwin Agrawal
On Mon, Jul 1, 2019 at 11:26 PM Michael Paquier  wrote:

> On Fri, Jun 28, 2019 at 09:42:54AM -0400, Adam Berlin wrote:
> > If we were to use this tool, would the community want to vendor the
> > framework in the Postgres repository, or keep it in a separate repository
> > that produces a versioned shared library?
>
> Well, my take is that having a base infrastructure for a fault
> injection framework is something that would prove to be helpful, and
> that I am not against having something in core.  While working on
> various issues, I have found myself doing many times crazy stat()
> calls on an on-disk file to enforce an elog(ERROR) or elog(FATAL), and
> by experience fault points are things very *hard* to place correctly
> because they should not be single-purpose things.
>
> Now, we don't want to finish with an infinity of fault points in the
> tree, but being able to enforce a failure in a point added for a patch
> using a SQL command can make the integration of tests in a patch
> easier for reviewers, for example isolation tests with elog(ERROR)
> (like what has been discussed for b4721f3).
>

Just to clarify, what Adam is proposing in this thread is *not* a fault
injection framework.


Re: Comment typo in tableam.h

2019-07-01 Thread Ashwin Agrawal
On Fri, Jun 28, 2019 at 1:47 PM Amit Kapila  wrote:

> On Mon, Jun 24, 2019 at 11:26 PM Ashwin Agrawal 
> wrote:
> >
> > On Mon, Jun 3, 2019 at 5:24 PM Ashwin Agrawal 
> wrote:
> >>
> >> There were few more minor typos I had collected for table am, passing
> them along here.
> >>
> >> Some of the required callback functions are missing Assert checking
> (minor thing), adding them in separate patch.
> >
> >
> > Curious to know if we need to register such small typo-fixing and
> > assertion-adding patches to the commitfest as well?
> >
>
> Normally, such things are handled out of CF, but to avoid forgetting,
> we can register it.  However, this particular item suits more to 'Open
> Items'[1].  You can remove the objectionable part of the comment,
> other things in your patch look good to me.  If nobody else picks it
> up, I will take care of it.
>

Thank you. I thought the committer would remove the objectionable part of
the comment change and commit the patch if it seems fine. I don't mind
changing it; it just feels like it adds an extra back-and-forth cycle.

Please find attached v2 of patch 1 without objectionable comment change. v1
of patch 2 attaching here just for convenience, no modifications made to it.
From 5c75807a56101a07685ed1f435eabcc43cd22fb7 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 24 May 2019 16:30:38 -0700
Subject: [PATCH v2 1/2] Fix typos in few tableam comments.

---
 src/include/access/tableam.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fd07773b74f..1e45908c78a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -434,8 +434,8 @@ typedef struct TableAmRoutine
 	 *
 	 * Note that only the subset of the relcache filled by
 	 * RelationBuildLocalRelation() can be relied upon and that the relation's
-	 * catalog entries either will either not yet exist (new relation), or
-	 * will still reference the old relfilenode.
+	 * catalog entries will either not yet exist (new relation), or will still
+	 * reference the old relfilenode.
 	 *
 	 * As output *freezeXid, *minmulti must be set to the values appropriate
 	 * for pg_class.{relfrozenxid, relminmxid}. For AMs that don't need those
@@ -591,7 +591,7 @@ typedef struct TableAmRoutine
 	 * See table_relation_estimate_size().
 	 *
 	 * While block oriented, it shouldn't be too hard for an AM that doesn't
-	 * doesn't internally use blocks to convert into a usable representation.
+	 * internally use blocks to convert into a usable representation.
 	 *
 	 * This differs from the relation_size callback by returning size
 	 * estimates (both relation size and tuple count) for planning purposes,
@@ -967,7 +967,7 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
  *
  * *all_dead, if all_dead is not NULL, will be set to true by
  * table_index_fetch_tuple() iff it is guaranteed that no backend needs to see
- * that tuple. Index AMs can use that do avoid returning that tid in future
+ * that tuple. Index AMs can use that to avoid returning that tid in future
  * searches.
  *
  * The difference between this function and table_fetch_row_version is that
@@ -1014,8 +1014,8 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  * true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
- * these functions is. This function is the correct to use outside of
- * index entry->table tuple lookups.
+ * these functions is. This function is correct to use outside of index
+ * entry->table tuple lookups.
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-- 
2.19.1

From 4ed947590f2f61182356a7fa4bc429c679e7619f Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Mon, 3 Jun 2019 17:07:05 -0700
Subject: [PATCH v2 2/2] Add assertions for required table am callbacks.

---
 src/backend/access/table/tableamapi.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index b2f58768107..98b8ea559d8 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -51,6 +51,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->scan_begin != NULL);
 	Assert(routine->scan_end != NULL);
 	Assert(routine->scan_rescan != NULL);
+	Assert(routine->scan_getnextslot != NULL);
 
 	Assert(routine->parallelscan_estimate != NULL);
 	Assert(routine->parallelscan_initialize != NULL);
@@ -62,8 +63,12 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->index_fetch_tuple != NULL);
 
 	Assert(routine->tuple_fetch_row_version != NULL);
+	Assert(routine->tuple_tid_valid != NULL);
+	Assert(routine->tuple_get_latest_tid != NULL);
 	Assert(routine->tuple_satisfies_snapshot != NULL);
 
+	Assert(routine->compute_xid_horizon_for_tuples != 

Re: Zedstore - compressed in-core columnar storage

2019-07-01 Thread Ashwin Agrawal
On Thu, May 30, 2019 at 8:07 AM DEV_OPS  wrote:

>
> it's really cool and very good progress,
>
> I'm interested in whether SIMD/JIT will be supported
>

That's something outside of the Zedstore work directly, at least for
now. The intent is to work with the current executor code, or enhance it
only wherever needed; whatever the current executor supports will work for
Zedstore. Any other enhancements to the executor will be a separate
undertaking.


Re: Zedstore - compressed in-core columnar storage

2019-07-01 Thread Ashwin Agrawal
On Sun, Jun 30, 2019 at 7:59 PM Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Ashwin Agrawal [mailto:aagra...@pivotal.io]
> > The objective is to gather feedback on design and approach to the same.
> > The implementation has core basic pieces working but not close to
> complete.
>
> Thank you for proposing a very interesting topic.  Are you thinking of
> including this in PostgreSQL 13 if possible?
>
>
> > * All Indexes supported
> ...
> > work. Btree indexes can be created. Btree and bitmap index scans work.
>
> Does Zedstore allow to create indexes of existing types on the table
> (btree, GIN, BRIN, etc.) and perform index scans (point query, range query,
> etc.)?
>

Yes, all index types work for zedstore and allow point or range queries.


> > * Hybrid row-column store, where some columns are stored together, and
> >   others separately. Provide flexibility of granularity on how to
> >   divide the columns. Columns accessed together can be stored
> >   together.
> ...
> > This way of laying out the data also easily allows for hybrid row-column
> > store, where some columns are stored together, and others have a
> dedicated
> > B-tree. Need to have user facing syntax to allow specifying how to group
> > the columns.
> ...
> > Zedstore Table can be
> > created using command:
> >
> > CREATE TABLE  (column listing) USING zedstore;
>
> Are you aiming to enable Zedstore to be used for HTAP, i.e. the same table
> can be accessed simultaneously for both OLTP and analytics with the minimal
> performance impact on OLTP?  (I got that impression from the word "hybrid".)
>

Well, "hybrid" is more to convey that compressed row and column stores can be
supported with the same design. It really wasn't referring to HTAP. In
general, the goal we are moving towards is a column store that is extremely
efficient at analytics but can still support all the OLTP operations (with
minimal performance or storage size impact). When making trade-offs between
different design choices, if both goals can't be met, the preference is
towards analytics.

> If yes, is the assumption that only a limited number of columns are to be
> stored in columnar format (for efficient scanning), and many other columns
> are to be stored in row format for efficient tuple access?
>

Yes; if it's known that certain columns are always accessed together, it is
better to store them together and avoid the tuple formation cost. Though it's
still to be seen what role compression plays: storing and compressing each
individual column separately may still be a winner compared to compressing
different columns together as a blob, depending on whether the savings in IO
cost offset the tuple formation cost or not.

> Are those row-formatted columns stored in the same file as the
> column-formatted columns, or in a separate file?
>

Currently, we are focused on just getting the pure column store working and
hence have not coded anything for a hybrid layout yet. But at least right now
the thought is it would be in the same file.

> Regarding the column grouping, can I imagine HBase and Cassandra?
> How could the current CREATE TABLE syntax support column grouping?  (I
> guess CREATE TABLE needs a syntax for columnar store, and Zedstore need to
> be incorporated in core, not as an extension...)
>

When column grouping comes up, yes, we will need to modify the CREATE TABLE
syntax; we have yet to reach that point in development.


> > A column store uses the same structure but we have *multiple* B-trees,
> one
> > for each column, all indexed by TID. The B-trees for all columns are
> stored
> > in the same physical file.
>
> Did you think that it's not a good idea to have a different file for each
> group of columns?  Is that because we can't expect physical adjacency of
> data blocks on disk even if we separate a column in a separate file?
>
> I thought a separate file for each group of columns would be easier and
> less error-prone to implement and debug.  Adding and dropping the column
> group would also be very easy and fast.
>

Currently, each group is a single column (until we have column families), and
having a file for each column definitely seems like a bad idea, as it just
explodes the number of files. A separate file may have its advantages from a
pre-fetching point of view, but we can't expect physical adjacency of data
blocks, and the access pattern will involve reading multiple files anyway (if
each column is stored in a separate file).

I doubt storing each group separately makes it any easier to implement or
debug; I feel it's actually the reverse. Storing everything in a single file,
but in separate blocks, keeps the logic contained inside the AM layer, and we
don't have to write special code, for example, for DROP TABLE to delete the
files for all the groups, or for moving a table to a different tablespace,
and other such complications.

Adding and dropping a column group can be made easy and fast irrespective of
file layout, with blocks for that group added or marked for reuse within the
same file.

Thank you for the questions.


Re: An out-of-date comment in nodeIndexonlyscan.c

2019-06-27 Thread Ashwin Agrawal
On Thu, Jun 27, 2019 at 4:33 PM Thomas Munro  wrote:

> Here's a patch I'd like to commit to fix the comment.  We could look
> at improving the actual code after
> https://commitfest.postgresql.org/23/2169/ is done.
>

Change looks good to me.


> I wonder if it might be possible to avoid page locks on both the heap
> *and* index in some limited cases, as I mentioned over here (just an
> idea, could be way off base):
>
>
> https://www.postgresql.org/message-id/CA%2BhUKGJGDVfhHmoUGzi-_J%2BN8FmRjmWTY0MCd1ZV5Fj9qdyb1w%40mail.gmail.com


I am in the same boat as you. One can get a serializable conflict error today
based on whether a tuple gets HOT updated or not. HOT is logically an
internal code optimization and not so much user-facing functionality, so
ideally it shouldn't affect serializable behavior. But it does currently,
again due to index locking: disable HOT updates and 4 isolation tests fail
with "could not serialize access due to read/write dependencies among
transactions" that otherwise pass. If the tuple happens to fit on the same
page, the user will not get the error; if the tuple gets inserted on a
different page, the error can happen, due to index page locking. I had
discussed this with Heikki, and the thinking is we shouldn't need to take the
lock on the index page if the index key was not changed, even if it was a
non-HOT update. Not sure of the complications and implications, but just a
thought.


Re: errbacktrace

2019-06-25 Thread Ashwin Agrawal
On Tue, Jun 25, 2019 at 4:08 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> New thread continuing from
> <
> https://www.postgresql.org/message-id/d4903af2-e7b7-b551-71f8-3e4a6bdc2...@2ndquadrant.com
> >.
>
> Here is a extended version of Álvaro's patch that adds an errbacktrace()
> function.  You can do two things with this:
>
> - Manually attach it to an ereport() call site that you want to debug.
>
> - Set a configuration parameter like backtrace_function = 'int8in' to
> debug ereport()/elog() calls in a specific function.
>

Thank you. This is very helpful; surprised it has been missing for so long.
We have backtrace printing in Greenplum and it's extremely helpful during
development and in production.

> There was also mention of settings that would automatically produce
> backtraces for PANICs etc.  Those could surely be added if there is
> enough interest.
>

In Greenplum, we have backtraces enabled for PANICs, SEGV/BUS/ILL, and
internal ERRORs, which proves very helpful.

> For the implementation, I support both backtrace() provided by the OS as
> well as using libunwind.  The former seems to be supported by a number
> of platforms, including glibc, macOS, and FreeBSD, so maybe we don't
> need the libunwind suport.  I haven't found any difference in quality in
> the backtraces between the two approaches, but surely that is highly
> dependent on the exact configuration.
>

We have implemented it using backtrace(). Also, addr2line (or atos on macOS),
if available, can convert the addresses to file and line numbers before
printing, to take it a step further.


Re: Comment typo in tableam.h

2019-06-24 Thread Ashwin Agrawal
On Mon, Jun 3, 2019 at 5:24 PM Ashwin Agrawal  wrote:

> There were few more minor typos I had collected for table am, passing them
> along here.
>
> Some of the required callback functions are missing Assert checking (minor
> thing), adding them in separate patch.
>

Curious to know if such small typo-fixing and assertion-adding patches need
to be registered in the commitfest as well?


Remove HeapTuple and Buffer dependency for predicate locking functions

2019-06-24 Thread Ashwin Agrawal
Proposing the following changes to make the predicate locking and checking
functions generic and remove the dependency on HeapTuple and the heap AM. We
made these changes to help with Zedstore; I think they should help Zheap and
other AMs in general.

- Change PredicateLockTuple() to PredicateLockTID(), so instead of
  passing a HeapTuple to it, just pass the ItemPointer and the tuple's
  insert transaction id if known. This was also discussed earlier in
  [1]; I don't think it was done in that context, but it would be
  helpful in the future if such a requirement comes up as well.

- CheckForSerializableConflictIn() takes a blocknum instead of a
  buffer. Currently the function does nothing with the buffer anyway;
  it just needs the blocknum. This also helps decouple the dependency
  on buffers, as not all AMs may have a one-to-one mapping between a
  blocknum and a single buffer. In zedstore, for example, a tuple is
  stored across individual column buffers, so we wish to have a way to
  lock not a physical buffer but a logical blocknum.

- CheckForSerializableConflictOut() no longer takes a HeapTuple or a
  buffer; instead it just takes an xid. The heap-specific parts are
  pushed out of CheckForSerializableConflictOut() into a new function
  HeapCheckForSerializableConflictOut() which calls
  CheckForSerializableConflictOut(). The alternative option would be
  for CheckForSerializableConflictOut() to take a callback function
  and callback arguments, which get called if required after
  performing the prechecks. Though currently I feel the AM having its
  own wrapper to perform the AM-specific work and then calling
  CheckForSerializableConflictOut() is fine.

Attaching a patch which makes these changes.

This way PredicateLockTID(), CheckForSerializableConflictIn() and
CheckForSerializableConflictOut() functions become usable by any AM.


1]
https://www.postgresql.org/message-id/CAEepm%3D2QbqQ_%2BKQQCnhKukF6NEAeq4SqiO3Qxe%2BfHza5-H-jKA%40mail.gmail.com
From aac57c17f078f869bf360677556842d58d5d33ea Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 21 Jun 2019 13:42:00 -0700
Subject: [PATCH v1] Remove HeapTuple dependency for predicate locking
 functions.

Following changes help to make predicate locking functions generic and
not tied to particular AM.

- PredicateLockTuple() is renamed to PredicateLockTID(). It takes
  ItemPointer and inserting transaction id now instead of HeapTuple.

- CheckForSerializableConflictIn() takes blocknum instead of buffer

- CheckForSerializableConflictOut() no more takes HeapTuple nor buffer
---
 src/backend/access/gin/ginbtree.c|   2 +-
 src/backend/access/gin/ginfast.c |   2 +-
 src/backend/access/gin/gininsert.c   |   4 +-
 src/backend/access/gist/gist.c   |   2 +-
 src/backend/access/hash/hashinsert.c |   2 +-
 src/backend/access/heap/heapam.c | 103 +---
 src/backend/access/heap/heapam_handler.c |   7 +-
 src/backend/access/index/indexam.c   |   4 +-
 src/backend/access/nbtree/nbtinsert.c|   4 +-
 src/backend/storage/lmgr/predicate.c | 119 ---
 src/include/access/heapam.h  |   2 +
 src/include/storage/predicate.h  |   9 +-
 12 files changed, 150 insertions(+), 110 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 11a8ed7bbc2..e795375495b 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -89,7 +89,7 @@ ginFindLeafPage(GinBtree btree, bool searchMode,
 	stack->predictNumber = 1;
 
 	if (rootConflictCheck)
-		CheckForSerializableConflictIn(btree->index, NULL, stack->buffer);
+		CheckForSerializableConflictIn(btree->index, NULL, btree->rootBlkno);
 
 	for (;;)
 	{
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2b3dd1c677f..f8ffeb06f8a 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -246,7 +246,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 	 * tree, so it conflicts with all serializable scans.  All scans acquire a
 	 * predicate lock on the metabuffer to represent that.
 	 */
-	CheckForSerializableConflictIn(index, NULL, metabuffer);
+	CheckForSerializableConflictIn(index, NULL, GIN_METAPAGE_BLKNO);
 
 	if (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize)
 	{
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab146173..046a20a3d41 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -221,7 +221,7 @@ ginEntryInsert(GinState *ginstate,
 			return;
 		}
 
-		CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
+		CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer));
 		/* modify an existing leaf entry */
 		itup = addItemPointersToLeafTuple(ginstate, itup,
 		  items, nitem, buildStats, stack->buffer);
@@ -230,7 +230,7 @@ ginEnt

Re: Extracting only the columns needed for a query

2019-06-16 Thread Ashwin Agrawal
On Sat, Jun 15, 2019 at 10:02 AM Tom Lane  wrote:

> > Approach B: after parsing and/or after planning
>
> If we wanted to do something about this, making the planner record
> the set of used columns seems like the thing to do.  We could avoid
> the expense of doing it when it's not needed by setting up an AM/FDW/
> etc property or callback to request it.
>

Sounds good. In the Zedstore patch, we have added an AM property to convey
that the AM leverages column projection, and we currently skip the physical
tlist optimization based on it. So yes, it can similarly be leveraged for
other planning needs.


> Another reason for having the planner do this is that presumably, in
> an AM that's excited about this, the set of fetched columns should
> play into the cost estimates for the scan.  I've not been paying
> enough attention to the tableam work to know if we've got hooks for
> the AM to affect scan costing ... but if we don't, that seems like
> a hole that needs plugged.
>

The AM callback relation_estimate_size exists currently, which the planner
leverages; via this callback it fetches tuples, pages, etc. So our thought is
to extend this API if possible to pass down the needed columns and help
perform better costing for the query. Though if we wish to leverage this
function, we need to know the list of columns before planning, and hence
might need to use the query tree.


> > Approach B, however, does not work for utility statements which do
> > not go through planning.
>
> I'm not sure why you're excited about that case?  Utility statements
> tend to be pretty much all-or-nothing as far as data access goes.
>

Statements like COPY, CREATE INDEX, CREATE CONSTRAINT, etc. can benefit from
scanning only a subset of columns. For example, in Zedstore currently for
CREATE INDEX we extract the needed columns by walking
indexInfo->ii_Predicate and indexInfo->ii_Expressions. For COPY, we currently
use cstate->attnumlist to know which columns need to be scanned.


Re: heapam_index_build_range_scan's anyvisible

2019-06-10 Thread Ashwin Agrawal
On Mon, Jun 10, 2019 at 2:56 PM Andres Freund  wrote:

> > Currently, all AM needs to build HeapTuple in
> > index_build_range_scan function. I looked into all the callback functions
> > and only htup->t_self is used from heaptuple in all the functions
> (unless I
> > missed something). So, if seems fine will be happy to write patch to make
> > that argument ItemPointer instead of HeapTuple?
>
> I wonder if it should better be the slot? It's not inconceivable that
> some AMs could benefit from that. Although it'd add some complication
> to the heap HeapTupleIsHeapOnly case.
>

I like the slot suggestion, but only if FormIndexDatum() can be pushed out of
the AM code as a result and the slot passed to the callback; not sure what
else it would help. HeapTupleIsHeapOnly could possibly be made to work with a
slot similar to the current hack of updating the t_self and slot's tid
fields, maybe.

An index callback using the slot could form the index datum, though that
would mean every index AM callback function needs to do it; not sure which
way is better. Plus, an ExecutorState needs to be created for the same, which
with the current setup every AM must do. It feels better if that belongs to
the indexing code instead of the AM, though.

Currently, index build needing to create an ExecutorState and the rest at the
AM layer doesn't seem nice either. Maybe there is quite generic logic here
that could be extracted into common code which most AMs leverage. Or possibly
the API itself can be simplified to take minimal input from the AM and keep
the rest of the flow/machinery at the upper layer. As Robert pointed out at
the start of the thread, at heart it's a pretty simple flow and will possibly
remain the same for any AM.


Re: heapam_index_build_range_scan's anyvisible

2019-06-10 Thread Ashwin Agrawal
On Fri, Jun 7, 2019 at 1:19 PM Robert Haas  wrote:

> I spent some time today studying heapam_index_build_range_scan and
> quickly reached the conclusion that it's kind of a mess.
>

While at it, it might be helpful to also decouple the HeapTuple dependency
for IndexBuildCallback. Currently, every AM needs to build a HeapTuple in its
index_build_range_scan function. I looked into all the callback functions and
only htup->t_self is used from the HeapTuple in all of them (unless I missed
something). So, if that seems fine, I will be happy to write a patch to make
that argument an ItemPointer instead of a HeapTuple.


Re: Removing a few more lseek() calls

2019-06-06 Thread Ashwin Agrawal
On Sat, Mar 30, 2019 at 2:14 AM Thomas Munro  wrote:

> Hello,
>
> Patch 0001 gets rid of the unconditional lseek() calls for SLRU I/O,
> as a small follow-up to commit c24dcd0c.  Patch 0002 gets rid of a few
> places that usually do a good job of avoiding lseek() calls while
> reading and writing WAL, but it seems better to have no code at all.
>

I reviewed the changes and they look good to me. The code looks much cleaner
after the 2nd patch. After these changes, only one usage of SLRU_SEEK_FAILED
remains, in SimpleLruDoesPhysicalPageExist().


Re: Confusing error message for REINDEX TABLE CONCURRENTLY

2019-06-04 Thread Ashwin Agrawal
On Mon, Jun 3, 2019 at 6:27 PM Michael Paquier  wrote:

> On Mon, Jun 03, 2019 at 04:53:48PM -0700, Ashwin Agrawal wrote:
> > Please check if the attached patch addresses and satisfies all the points
> > discussed so far in this thread.
>
> It looks to be so, please see below for some comments.
>
> > +{
> >  result = ReindexRelationConcurrently(heapOid, options);
> > +
> > +if (!result)
> > +ereport(NOTICE,
> > +(errmsg("table \"%s\" has no indexes that can be
> concurrently reindexed",
> > +relation->relname)));
>
> "concurrently" should be at the end of this string.  I have had the
> exact same argument with Tom for 508300e.
>

Sure, modified the same; find it attached.


> > @@ -2630,7 +2638,6 @@ ReindexMultipleTables(const char *objectName,
> ReindexObjectType objectKind,
> >  foreach(l, relids)
> >  {
> >  Oidrelid = lfirst_oid(l);
> > -boolresult;
> >
> >  StartTransactionCommand();
> >  /* functions in indexes may want a snapshot set */
> > @@ -2638,11 +2645,12 @@ ReindexMultipleTables(const char *objectName,
> ReindexObjectType objectKind,
> >
> >  if (concurrent)
> >  {
> > -result = ReindexRelationConcurrently(relid, options);
> > +ReindexRelationConcurrently(relid, options);
> >  /* ReindexRelationConcurrently() does the verbose output */
>
> Indeed this variable is not used.  So we could just get rid of it
> completely.
>

The variable is used in the else scope, hence I moved it there. But yes, it's
removed completely from this scope.

> +bool result;
> >  result = reindex_relation(relid,
> >REINDEX_REL_PROCESS_TOAST |
> >REINDEX_REL_CHECK_CONSTRAINTS,
> > @@ -2656,7 +2664,6 @@ ReindexMultipleTables(const char *objectName,
> ReindexObjectType objectKind,
> >
> >  PopActiveSnapshot();
> >  }
>
> The table has been considered for reindexing even if nothing has been
> reindexed, so perhaps we'd want to keep this part as-is?  We have the
> same level of reporting for a couple of releases for this part.
>

I don't understand the review comment. I functionally didn't change anything
in that part of the code; I just confined the result variable to that scope.


> > -
> >  CommitTransactionCommand();
>
> Useless noise diff.
>

Okay, removed it.
From 4486f9d114084e3b484be8d2ec0eb95b8f47 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Mon, 3 Jun 2019 16:30:39 -0700
Subject: [PATCH v3] Avoid confusing error message for REINDEX.

---
 src/backend/commands/indexcmds.c   | 25 --
 src/test/regress/expected/create_index.out | 10 -
 2 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 40ea629ffe7..a09ee059a30 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2438,17 +2438,25 @@ ReindexTable(RangeVar *relation, int options, bool concurrent)
 	   RangeVarCallbackOwnsTable, NULL);
 
 	if (concurrent)
+	{
 		result = ReindexRelationConcurrently(heapOid, options);
+
+		if (!result)
+			ereport(NOTICE,
+	(errmsg("table \"%s\" has no indexes that can be reindexed concurrently",
+			relation->relname)));
+	}
 	else
+	{
 		result = reindex_relation(heapOid,
   REINDEX_REL_PROCESS_TOAST |
   REINDEX_REL_CHECK_CONSTRAINTS,
   options);
-
-	if (!result)
-		ereport(NOTICE,
-(errmsg("table \"%s\" has no indexes",
-		relation->relname)));
+		if (!result)
+			ereport(NOTICE,
+	(errmsg("table \"%s\" has no indexes to reindex",
+			relation->relname)));
+	}
 
 	return heapOid;
 }
@@ -2630,7 +2638,6 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 	foreach(l, relids)
 	{
 		Oid			relid = lfirst_oid(l);
-		bool		result;
 
 		StartTransactionCommand();
 		/* functions in indexes may want a snapshot set */
@@ -2638,11 +2645,12 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 
 		if (concurrent)
 		{
-			result = ReindexRelationConcurrently(relid, options);
+			ReindexRelationConcurrently(relid, options);
 			/* ReindexRelationConcurrently() does the verbose output */
 		}
 		else
 		{
+			bool result;
 			result = reindex_relation(relid,
 	  REINDEX_REL_PROCESS_TOAST |
 	  REINDEX_REL_CHECK_CONSTRAINTS,
@@ -2676,6 +2684,9 @@ ReindexMultipleTables(const cha

Re: Comment typo in tableam.h

2019-06-03 Thread Ashwin Agrawal
On Mon, Jun 3, 2019 at 6:24 PM Andres Freund  wrote:

> Hi,
>
> On 2019-06-03 18:21:56 -0700, Ashwin Agrawal wrote:
> > On Mon, Jun 3, 2019 at 5:26 PM Andres Freund  wrote:
> >
> > > Hi,
> > >
> > > Thanks for these!
> > >
> > > On 2019-06-03 17:24:15 -0700, Ashwin Agrawal wrote:
> > > >   /*
> > > >* Estimate the size of shared memory needed for a parallel
> scan
> > > of this
> > > > -  * relation. The snapshot does not need to be accounted for.
> > > > +  * relation.
> > > >*/
> > > >   Size(*parallelscan_estimate) (Relation rel);
> > >
> > > That's not a typo?
> > >
> >
> > The snapshot is not passed as argument to that function hence seems weird
> > to refer to snapshot in the comment, as anyways callback function can't
> > account for it.
>
> It's part of the parallel scan struct, and it used to be accounted for
> by pre tableam function...
>

Reads like the comment was written from past context then, and doesn't have
much value now. It's more confusing than helpful to state not to account for
the snapshot and not any other field. table_parallelscan_estimate() has a
snapshot argument and accounts for it, but the callback doesn't. I am not
sure how a callback would explicitly use that comment and avoid accounting
for the snapshot if it's using the generic ParallelTableScanDescData. But if
you feel it's helpful, please feel free to keep that text.


Re: Comment typo in tableam.h

2019-06-03 Thread Ashwin Agrawal
On Mon, Jun 3, 2019 at 5:26 PM Andres Freund  wrote:

> Hi,
>
> Thanks for these!
>
> On 2019-06-03 17:24:15 -0700, Ashwin Agrawal wrote:
> >   /*
> >* Estimate the size of shared memory needed for a parallel scan
> of this
> > -  * relation. The snapshot does not need to be accounted for.
> > +  * relation.
> >*/
> >   Size(*parallelscan_estimate) (Relation rel);
>
> That's not a typo?
>

The snapshot is not passed as an argument to that function, hence it seems
weird to refer to the snapshot in the comment, as the callback function can't
account for it anyway. It seems like a stale piece of comment and hence that
piece of text should be removed. I should have referred to the changes as
general comment fixes instead of explicit typo fixes :-)


Re: Comment typo in tableam.h

2019-06-03 Thread Ashwin Agrawal
There were a few more minor typos I had collected for table AM; passing them
along here.

Some of the required callback functions are missing Assert checks (a minor
thing); adding them in a separate patch.
From f32bdf5d0d3af5fd6ee6bf6430905f9c4bf5fefa Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 24 May 2019 16:30:38 -0700
Subject: [PATCH v1 1/2] Fix typos in few tableam comments.

---
 src/backend/access/table/tableamapi.c |  2 +-
 src/include/access/tableam.h  | 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 32877e7674f..0df5ba35803 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -91,7 +91,7 @@ GetTableAmRoutine(Oid amhandler)
 
 	Assert(routine->relation_estimate_size != NULL);
 
-	/* optional, but one callback implies presence of hte other */
+	/* optional, but one callback implies presence of the other */
 	Assert((routine->scan_bitmap_next_block == NULL) ==
 		   (routine->scan_bitmap_next_tuple == NULL));
 	Assert(routine->scan_sample_next_block != NULL);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0b6ac15d316..2d06d52d71f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -231,7 +231,7 @@ typedef struct TableAmRoutine
 
 	/*
 	 * Estimate the size of shared memory needed for a parallel scan of this
-	 * relation. The snapshot does not need to be accounted for.
+	 * relation.
 	 */
 	Size		(*parallelscan_estimate) (Relation rel);
 
@@ -434,8 +434,8 @@ typedef struct TableAmRoutine
 	 *
 	 * Note that only the subset of the relcache filled by
 	 * RelationBuildLocalRelation() can be relied upon and that the relation's
-	 * catalog entries either will either not yet exist (new relation), or
-	 * will still reference the old relfilenode.
+	 * catalog entries will either not yet exist (new relation), or will still
+	 * reference the old relfilenode.
 	 *
 	 * As output *freezeXid, *minmulti must be set to the values appropriate
 	 * for pg_class.{relfrozenxid, relminmxid}. For AMs that don't need those
@@ -591,7 +591,7 @@ typedef struct TableAmRoutine
 	 * See table_relation_estimate_size().
 	 *
 	 * While block oriented, it shouldn't be too hard for an AM that doesn't
-	 * doesn't internally use blocks to convert into a usable representation.
+	 * internally use blocks to convert into a usable representation.
 	 *
 	 * This differs from the relation_size callback by returning size
 	 * estimates (both relation size and tuple count) for planning purposes,
@@ -967,7 +967,7 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
  *
  * *all_dead, if all_dead is not NULL, will be set to true by
  * table_index_fetch_tuple() iff it is guaranteed that no backend needs to see
- * that tuple. Index AMs can use that do avoid returning that tid in future
+ * that tuple. Index AMs can use that to avoid returning that tid in future
  * searches.
  *
  * The difference between this function and table_fetch_row_version is that
@@ -1014,8 +1014,8 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  * true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
- * these functions is. This function is the correct to use outside of
- * index entry->table tuple lookups.
+ * these functions is. This function is correct to use outside of index
+ * entry->table tuple lookups.
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-- 
2.19.1

From 022569d249918da60d145d7877dc0f8df4ccc6cd Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Mon, 3 Jun 2019 17:07:05 -0700
Subject: [PATCH v1 2/2] Add assertions for required table am callbacks.

---
 src/backend/access/table/tableamapi.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 0df5ba35803..55bd1eea3db 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -50,6 +50,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->scan_begin != NULL);
 	Assert(routine->scan_end != NULL);
 	Assert(routine->scan_rescan != NULL);
+	Assert(routine->scan_getnextslot != NULL);
 
 	Assert(routine->parallelscan_estimate != NULL);
 	Assert(routine->parallelscan_initialize != NULL);
@@ -61,8 +62,12 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->index_fetch_tuple != NULL);
 
 	Assert(routine->tuple_fetch_row_version != NULL);
+	Assert(routine->tuple_tid_valid != NULL);
+	Assert(routine->tuple_get_latest_tid != NULL);
 	Assert(routine->tuple_satisfies_snapshot != NULL);
 
+	Assert(routine->compute_xid_horizon_for_tuples != NULL);
+
 	Assert(routine->tuple_insert != NULL);
 
 	/*
@@ -88,6 +93,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->index_validate_scan != NULL);

Re: Confusing error message for REINDEX TABLE CONCURRENTLY

2019-06-03 Thread Ashwin Agrawal
On Tue, May 28, 2019 at 12:05 PM David Rowley 
wrote:

> On Tue, 28 May 2019 at 01:23, Ashwin Agrawal  wrote:
> > I think we will need to separate out the NOTICE message for concurrent
> and regular case.
> >
> > For example this doesn't sound correct
> > WARNING:  cannot reindex exclusion constraint index
> "public.circles_c_excl" concurrently, skipping
> > NOTICE:  table "circles" has no indexes to reindex
> >
> > As no indexes can't be reindexed *concurrently* but there are still
> indexes which can be reindexed, invalid indexes I think fall in same
> category.
>
> Swap "can't" for "can" and, yeah. I think it would be good to make the
> error messages differ for these two cases. This would serve as a hint
> to the user that they might have better luck trying without the
> "concurrently" option.
>

Please check if the attached patch addresses and satisfies all the points
discussed so far in this thread.

I was thinking of adding an explicit errhint for the concurrent-case NOTICE
to convey that either the table has no indexes or it can only be reindexed
without CONCURRENTLY. But I thought maybe it's obvious; feel free to add one
if it would be helpful.
From 2e9eceff07d9bdadef82a54c17f399263436bf9d Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Mon, 3 Jun 2019 16:30:39 -0700
Subject: [PATCH v2] Avoid confusing error message for REINDEX.

---
 src/backend/commands/indexcmds.c   | 26 +++---
 src/test/regress/expected/create_index.out | 10 -
 2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 40ea629ffe7..bab20028090 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2438,17 +2438,25 @@ ReindexTable(RangeVar *relation, int options, bool concurrent)
 	   RangeVarCallbackOwnsTable, NULL);
 
 	if (concurrent)
+	{
 		result = ReindexRelationConcurrently(heapOid, options);
+
+		if (!result)
+			ereport(NOTICE,
+	(errmsg("table \"%s\" has no indexes that can be concurrently reindexed",
+			relation->relname)));
+	}
 	else
+	{
 		result = reindex_relation(heapOid,
   REINDEX_REL_PROCESS_TOAST |
   REINDEX_REL_CHECK_CONSTRAINTS,
   options);
-
-	if (!result)
-		ereport(NOTICE,
-(errmsg("table \"%s\" has no indexes",
-		relation->relname)));
+		if (!result)
+			ereport(NOTICE,
+	(errmsg("table \"%s\" has no indexes to reindex",
+			relation->relname)));
+	}
 
 	return heapOid;
 }
@@ -2630,7 +2638,6 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 	foreach(l, relids)
 	{
 		Oid			relid = lfirst_oid(l);
-		bool		result;
 
 		StartTransactionCommand();
 		/* functions in indexes may want a snapshot set */
@@ -2638,11 +2645,12 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 
 		if (concurrent)
 		{
-			result = ReindexRelationConcurrently(relid, options);
+			ReindexRelationConcurrently(relid, options);
 			/* ReindexRelationConcurrently() does the verbose output */
 		}
 		else
 		{
+			bool result;
 			result = reindex_relation(relid,
 	  REINDEX_REL_PROCESS_TOAST |
 	  REINDEX_REL_CHECK_CONSTRAINTS,
@@ -2656,7 +2664,6 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 
 			PopActiveSnapshot();
 		}
-
 		CommitTransactionCommand();
 	}
 	StartTransactionCommand();
@@ -2676,6 +2683,9 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
  * The locks taken on parent tables and involved indexes are kept until the
  * transaction is committed, at which point a session lock is taken on each
  * relation.  Both of these protect against concurrent schema changes.
+ *
+ * Returns true if any indexes have been rebuilt (including toast table's
+ * indexes when relevant).
  */
 static bool
 ReindexRelationConcurrently(Oid relationOid, int options)
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c8bc6be0613..72a8fca48e4 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1944,9 +1944,9 @@ DROP TABLE reindex_verbose;
 CREATE TABLE concur_reindex_tab (c1 int);
 -- REINDEX
 REINDEX TABLE concur_reindex_tab; -- notice
-NOTICE:  table "concur_reindex_tab" has no indexes
+NOTICE:  table "concur_reindex_tab" has no indexes to reindex
 REINDEX TABLE CONCURRENTLY concur_reindex_tab; -- notice
-NOTICE:  table "concur_reindex_tab" has no indexes
+NOTICE:  table "concur_reindex_tab" has no indexes that can be concurrently reindexed
 ALTER TABLE concur_reindex_tab ADD COLUMN c2 text; -- add toast index
 -- Normal index with integer column
 CREATE UNIQUE INDEX con

Re: Confusing error message for REINDEX TABLE CONCURRENTLY

2019-05-27 Thread Ashwin Agrawal
On Sun, May 26, 2019 at 6:43 PM Michael Paquier  wrote:

> As you mention for reindex_relation() no indexes <=> nothing to do,
> still let's not rely on that.  Instead of making the error message
> specific to concurrent operations, I would suggest to change it to
> "table foo has no indexes to reindex".  What do you think about the
> attached?
>

I think we will need separate NOTICE messages for the concurrent and regular
cases.

For example, this doesn't sound correct:
WARNING:  cannot reindex exclusion constraint index "public.circles_c_excl"
concurrently, skipping
NOTICE:  table "circles" has no indexes to reindex

As no indexes can't be reindexed *concurrently* but there are still indexes
which can be reindexed, invalid indexes I think fall in same category.


Confusing error message for REINDEX TABLE CONCURRENTLY

2019-05-24 Thread Ashwin Agrawal
CREATE TABLE circles (c circle, EXCLUDE USING gist (c WITH &&));

REINDEX TABLE CONCURRENTLY circles;
WARNING:  cannot reindex exclusion constraint index "public.circles_c_excl"
concurrently, skipping
NOTICE:  table "circles" has no indexes
REINDEX

The message "table has no indexes" is confusing, as the warning above it
states that the table has an index, which was merely skipped by reindex.

So, currently whenever REINDEX TABLE CONCURRENTLY skips all indexes for any
reason (exclusion constraint or invalid index), it reports that the table
has no indexes. By contrast, non-concurrent REINDEX emits the NOTICE only if
the table really has no indexes (since it has no skip cases).

We need to decide what we really wish to communicate here: that the table
has no indexes, that no reindex was performed, or keep it simple and avoid
emitting anything at all. If we skip any indexes we already emit a WARNING,
so that should be sufficient and nothing more needs to be conveyed.

In case we wish to communicate that no reindex was performed, what do we
notify for empty tables?

Perhaps we should just emit the NOTICE "table xxx has no indexes" when there
really is no index, for both the concurrent and non-concurrent cases, to
make them consistent and less confusing, and leave it at that. Attaching a
patch to do just that. Thoughts?
From dc8119abe180cc14b0720c4de79495de4c62bf12 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 24 May 2019 16:34:48 -0700
Subject: [PATCH v1] Avoid confusing error message for REINDEX CONCURRENTLY.

REINDEX CONCURRENTLY skips reindexing exclusion constraint or invalid
indexes, but in such cases it incorrectly reported that the table has
no indexes. Make the behavior consistent with the non-concurrent
REINDEX command and emit the notice only if the table has no indexes.
---
 src/backend/commands/indexcmds.c   | 13 -
 src/test/regress/expected/create_index.out |  1 -
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 40ea629ffe7..402440ebad0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2630,7 +2630,6 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 	foreach(l, relids)
 	{
 		Oid			relid = lfirst_oid(l);
-		bool		result;
 
 		StartTransactionCommand();
 		/* functions in indexes may want a snapshot set */
@@ -2638,11 +2637,12 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 
 		if (concurrent)
 		{
-			result = ReindexRelationConcurrently(relid, options);
+			ReindexRelationConcurrently(relid, options);
 			/* ReindexRelationConcurrently() does the verbose output */
 		}
 		else
 		{
+			bool		result;
 			result = reindex_relation(relid,
 	  REINDEX_REL_PROCESS_TOAST |
 	  REINDEX_REL_CHECK_CONSTRAINTS,
@@ -2656,7 +2656,6 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 
 			PopActiveSnapshot();
 		}
-
 		CommitTransactionCommand();
 	}
 	StartTransactionCommand();
@@ -2693,6 +2692,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	char	   *relationName = NULL;
 	char	   *relationNamespace = NULL;
 	PGRUsage	ru0;
+	bool rel_has_index = false;
 
 	/*
 	 * Create a memory context that will survive forced transaction commits we
@@ -2759,6 +2759,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	Relation	indexRelation = index_open(cellOid,
 		   ShareUpdateExclusiveLock);
 
+	rel_has_index = true;
 	if (!indexRelation->rd_index->indisvalid)
 		ereport(WARNING,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -2805,6 +2806,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		Relation	indexRelation = index_open(cellOid,
 			   ShareUpdateExclusiveLock);
 
+		rel_has_index = true;
 		if (!indexRelation->rd_index->indisvalid)
 			ereport(WARNING,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -2843,6 +2845,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 			 errmsg("cannot reindex a system catalog concurrently")));
 
+rel_has_index = true;
 /* Save the list of relation OIDs in private context */
 oldcontext = MemoryContextSwitchTo(private_context);
 
@@ -2873,11 +2876,11 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 			break;
 	}
 
-	/* Definitely no indexes, so leave */
+	/* Definitely no indexes to rebuild, so leave */
 	if (indexIds == NIL)
 	{
 		PopActiveSnapshot();
-		return false;
+		return rel_has_index;
 	}
 
 	Assert(heapRelationIds != NIL);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c8bc6be0613..9a9401c280e 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2150,7 +2150,6 @@ DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
 -- The invalid index is

Re: Zedstore - compressed in-core columnar storage

2019-05-24 Thread Ashwin Agrawal
On Thu, May 23, 2019 at 7:30 PM Ajin Cherian  wrote:

> Hi Ashwin,
>
> - how to pass the "column projection list" to table AM? (as stated in
>   initial email, currently we have modified table am API to pass the
>   projection to AM)
>
> We were working on a similar columnar storage using pluggable APIs; one
> idea that we thought of was to modify the scan slot based on the targetlist
> to have only the relevant columns in the scan descriptor. This way the
> table AMs are passed a slot with only relevant columns in the descriptor.
> Today we do something similar to the result slot using
> ExecInitResultTypeTL(), now do it to the scan tuple slot as well. So
> somewhere after creating the scan slot using ExecInitScanTupleSlot(), call
> a table am handler API to modify the scan tuple slot based on the
> targetlist, a probable name for the new table am handler would be:
> exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
>

Interesting.

Though this reads as hacky and not a clean approach to me, for these reasons:

- The memory allocation and initialization for the slot descriptor is
  done in ExecInitScanTupleSlot(), and exec_init_scan_slot_tl() would
  redo a lot of that work. ExecInitScanTupleSlot() ideally just points
  to the tupleDesc from the Relation object, but exec_init_scan_slot_tl()
  would have to free the existing tupleDesc and reallocate a fresh one.
  Plus, it can't point to the Relation's tuple desc and essentially
  needs to craft one from scratch.

- As discussed in thread [1], several places want to use different
  slots for the same scan, which means we would have to modify the
  descriptor every time on such occasions even if it remains the same
  throughout the scan. Extra code could keep the old tuple descriptor
  around and reuse it for the next slot, but that again adds code
  complexity.

- The AM needs to know which of the relation's attribute numbers to
  scan. How would the tupledesc convey that? TupleDescData's attrs
  currently carries info for attnum at attrs[attnum - 1]. If the
  TupleDesc needs to convey an arbitrary subset of attributes to scan,
  that relationship has to be broken: attrs[offset] would provide info
  for some attribute in the relation, meaning offset != (attnum - 1).
  I am not sure how many places in the code rely on that relationship
  to get information.

- The tupledesc provides a lot of information, not just attribute
  numbers to scan. For example, it provides information in TupleConstr
  about default values for columns. If the AM layer has to modify the
  existing slot's tupledesc, it would have to copy over such information
  as well. This information is fetched today using attnum as the offset
  into the constr->missing array. If this information is retained, how
  will the constr array be constructed? Will the array contain only
  values for the columns to scan, or the constr array as-is from the
  Relation's tuple descriptor, as it does today? It seems it would be
  overhead to construct the constr array fresh, and if it is not
  constructed fresh there will be a mismatch between natts and the
  array elements.

It seems that with the proposed exec_init_scan_slot_tl() API, one would
have to call it after beginscan and before calling getnextslot, to
provide the column projection list to the AM. The dedicated API we have
for Zedstore to pass down the column projection list needs the same
calling convention, which is the reason I don't like it and am trying
to find an alternative. But at least the API we added for Zedstore
seems much simpler, more generic, and more flexible in comparison, as
it lets the AM decide what it wishes to do with the list. The AM can
fiddle with the slot's TupleDescriptor if it wishes, or handle the
column projection some other way.

 So this way the scan am handlers like getnextslot is passed a slot only
> having the relevant columns in the scan descriptor. One issue though is
> that the beginscan is not passed the slot, so if some memory allocation
> needs to be done based on the column list, it can't be done in beginscan.
> Let me know what you think.
>

Yes, ideally we would like this information to be available at beginscan.
But if it can't be, it seems fine to delay such allocations until the first
calls to getnextslot and friends; that's how we do it today in Zedstore.

1]
https://www.postgresql.org/message-id/20190508214627.hw7wuqwawunhynj6%40alap3.anarazel.de


Re: Inconsistency between table am callback and table function names

2019-05-24 Thread Ashwin Agrawal
On Thu, May 23, 2019 at 4:32 PM Andres Freund  wrote:

> Hi,
>
> On 2019-05-14 12:11:46 -0700, Ashwin Agrawal wrote:
> > Thank you. Please find the patch to rename the agreed functions. It would
> > be good to make all consistent instead of applying exception to three
> > functions but seems no consensus on it. Given table_ api are new, we
> could
> > modify them leaving systable_ ones as is, but as objections left that as
> is.
>
> I've pushed a slightly modified version (rebase, some additional
> newlines due to the longer function names) now. Thanks for the patch!
>

Thanks a lot Andres. With pgindent run on master before the patch, I can
imagine this possibly generated additional work for you.


Re: Pluggable Storage - Andres's take

2019-05-17 Thread Ashwin Agrawal
On Wed, May 15, 2019 at 11:54 AM Andres Freund  wrote:

> Attached is a prototype of a variation of this. I added a
> table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
> callback / wrapper. Currently it just takes a "plain" scan, but we could
> add a separate table_beginscan variant too.
>
> For heap that just means we can just use HeapScanDesc's rs_nblock to
> filter out invalid tids, and we only need to call
> RelationGetNumberOfBlocks() once, rather than every
> table_tuple_tid_valid(0 / table_get_latest_tid() call. Which is a good
> improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
> CURRENT OF) - which previously computed the relation size once per
> tuple.
>

A question on the patch, if it's not too late: why call table_beginscan() in
TidNext() and not in ExecInitTidScan()? It seems cleaner to have it in
ExecInitTidScan().


Re: Pluggable Storage - Andres's take

2019-05-17 Thread Ashwin Agrawal
On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas  wrote:

> On 08/04/2019 20:37, Andres Freund wrote:
> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> There's a little bug in index-only scan executor node, where it mixes
> up the
> >> slots to hold a tuple from the index, and from the table. That doesn't
> cause
> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM,
> which
> >> uses a virtual slot, it caused warnings like this from index-only scans:
> >
> > Hm. That's another one that I think I had fixed previously :(, and then
> > concluded that it's not actually necessary for some reason. Your fix
> > looks correct to me.  Do you want to commit it? Otherwise I'll look at
> > it after rebasing zheap, and checking it with that.
>
> I found another slot type confusion bug, while playing with zedstore. In
> an Index Scan, if you have an ORDER BY key that needs to be rechecked,
> so that it uses the reorder queue, then it will sometimes use the
> reorder queue slot, and sometimes the table AM's slot, for the scan
> slot. If they're not of the same type, you get an assertion:
>
> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> "execExprInterp.c", Line: 1905)
>
> Attached is a test for this, again using the toy table AM, extended to
> be able to test this. And a fix.
>

It seems the two patches from email [1] fixing slot confusion in Index
Scans are still pending commit.

1]
https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi


Re: Create TOAST table only if AM needs

2019-05-17 Thread Ashwin Agrawal
On Fri, May 17, 2019 at 2:42 PM Robert Haas  wrote:

> In any case, it bears saying that
> tableam is a remarkable accomplishment regardless of whatever
> shortcomings it has in this area or elsewhere.
>

Big +1 to this.


Re: Pluggable Storage - Andres's take

2019-05-17 Thread Ashwin Agrawal
On Fri, May 17, 2019 at 12:54 PM Andres Freund  wrote:

> Hi,
>
> On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> > Highlevel this looks good to me. Will look into full details tomorrow.
>
> Ping?
>
> I'll push the first of the patches soon, and unless you'll comment on
> the second soon, I'll also push ahead. There's a beta upcoming...
>

Sorry for the delay; I didn't get to it yesterday. I looked into both
patches, and they both look good to me, thank you.

The relation size API still doesn't address the analyze case, as you
mentioned, but that is something we can improve later.


Re: Create TOAST table only if AM needs

2019-05-17 Thread Ashwin Agrawal
On Fri, May 17, 2019 at 11:34 AM Andres Freund  wrote:

> Hi,
>
> On 2019-05-17 11:26:29 -0700, Ashwin Agrawal wrote:
> > Currently TOAST table is always created (if needed based on data type
> > properties) independent of table AM. How toasting is handled seems
> > should be AM responsibility. Generic code shouldn't force the use of
> > the separate table for the same. Like for Zedstore we store toasted
> > chunks in separate blocks but within the table file itself and don't
> > need separate toast table. Some other AM may implement the
> > functionality differently. So, similar to relation forks, usage of
> > toast table should be optional and left to AM to handle.
>
> Yea, Robert is also working on this. In fact, we were literally chatting
> about it a few minutes ago. He'll probably chime in too.
>

Thank You.


> My inclination is that it's too late for 12 to do anything about
> this. There are many known limitations, and we'll discover many more, of
> the current tableam interface. If we try to fix them for 12, we'll never
> get anywhere.  It'll take a while to iron out all those wrinkles...
>

Agreed; most of this would be enhancements, and enhancements can and will be
made as we find them, landing in whatever version is in active development
at the time. The intent is to start the discussion, not to suggest this is a
bug or that it has to be fixed in v12.


Re: Create TOAST table only if AM needs

2019-05-17 Thread Ashwin Agrawal
On Fri, May 17, 2019 at 1:51 PM Andres Freund  wrote:

>
> >  /*
> 
> >   * Planner related callbacks for the heap AM
> > @@ -2558,6 +2615,8 @@ static const TableAmRoutine heapam_methods = {
> >
> >   .relation_estimate_size = heapam_estimate_rel_size,
> >
> > + .needs_toast_table = heapam_needs_toast_table,
> > +
>
> I'd rather see this have a relation_ prefix.
>

+1 to the overall patch with that comment incorporated. This seems simple
enough to include in v12, though I say that blindfolded with respect to what
else remains to be done for v12.


Create TOAST table only if AM needs

2019-05-17 Thread Ashwin Agrawal
Currently, a TOAST table is always created (if needed based on data type
properties) independent of the table AM. How toasting is handled seems
like it should be the AM's responsibility; generic code shouldn't force
the use of a separate table for it. For Zedstore, for example, we store
toasted chunks in separate blocks within the table file itself and don't
need a separate toast table. Some other AM may implement the
functionality differently. So, similar to relation forks, usage of a
toast table should be optional and left to the AM to handle.

I wish to discuss how best to achieve this. Attaching a patch just to
showcase one way it could be done. The patch adds a property to
TableAmRoutine to convey whether the AM uses a separate TOAST table.

Another possibility would be, with some refactoring, to move toast table
creation inside the relation_set_new_filenode callback, or to provide
the AM a separate callback for TOAST table creation.
From fcbf6b8187ef8389cf38cd185b709bcb6ef1152a Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 10 May 2019 17:42:06 -0700
Subject: [PATCH 1/1] Create toast table only if AM needs it.

Not all AMs need a toast table, hence add a property to TableAmRoutine
defining whether the AM uses a toast table. Create the toast table only
for AMs that use one, and skip it otherwise. The heap AM sets the
property to true.
---
 src/backend/access/heap/heapam_handler.c | 1 +
 src/backend/catalog/toasting.c   | 5 +
 src/include/access/tableam.h | 7 +++
 3 files changed, 13 insertions(+)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 00505ec3f4d..243a221c923 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2515,6 +2515,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 
 static const TableAmRoutine heapam_methods = {
 	.type = T_TableAmRoutine,
+	.uses_toast_table = true,
 
 	.slot_callbacks = heapam_slot_callbacks,
 
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 2276d3c5d36..2345ef8ee84 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -407,6 +407,11 @@ needs_toast_table(Relation rel)
 	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		return false;
 
+	Assert(rel->rd_tableam != NULL);
+	/* No need to create TOAST table if AM doesn't need one */
+	if (!table_uses_toast_table(rel))
+		return false;
+
 	/*
 	 * We cannot allow toasting a shared relation after initdb (because
 	 * there's no way to mark it toasted in other databases' pg_class).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ebfa0d51855..ed116e18f10 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -142,6 +142,8 @@ typedef struct TableAmRoutine
 {
 	/* this must be set to T_TableAmRoutine */
 	NodeTag		type;
+	/* does AM need separate TOAST table */
+	bool uses_toast_table;
 
 
 	/* 
@@ -657,6 +659,11 @@ typedef struct TableAmRoutine
 
 } TableAmRoutine;
 
+static inline bool
+table_uses_toast_table(Relation relation)
+{
+	return relation->rd_tableam->uses_toast_table;
+}
 
 /* 
  * Slot functions.
-- 
2.19.1



Re: Pluggable Storage - Andres's take

2019-05-16 Thread Ashwin Agrawal
On Wed, May 15, 2019 at 11:54 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
>
> > > 3) nodeTidscan, skipping over too large tids
> > >I think this should just be moved into the AMs, there's no need to
> > >have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> >   /*
> >* We silently discard any TIDs that are out of range at the time
> of scan
> >* start.  (Since we hold at least AccessShareLock on the table,
> it won't
> >* be possible for someone to truncate away the blocks we intend to
> >* visit.)
> >*/
> >   nblocks =
> RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
> >
> >
> > The obvious answer would be to just move that check into the
> > table_fetch_row_version implementation (currently just calling
> > heap_fetch()) - but that doesn't seem OK from a performance POV, because
> > we'd then determine the relation size once for each tid, rather than
> > once per tidscan.  And it'd also check in cases where we know the tid is
> > supposed to be valid (e.g. fetching trigger tuples and such).
> >
> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter.  But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
>
> Attached is a prototype of a variation of this. I added a
> table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
> callback / wrapper. Currently it just takes a "plain" scan, but we could
> add a separate table_beginscan variant too.
>
> For heap that just means we can just use HeapScanDesc's rs_nblock to
> filter out invalid tids, and we only need to call
> RelationGetNumberOfBlocks() once, rather than every
> table_tuple_tid_valid(0 / table_get_latest_tid() call. Which is a good
> improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
> CURRENT OF) - which previously computed the relation size once per
> tuple.
>
> Needs a bit of polishing, but I think this is the right direction?
>

At a high level this looks good to me; I will look into the full details
tomorrow. This aligns with the high-level thought I raised, but implemented
in a much better way: consult the AM on whether to perform the optimization.
So now, using the new table_tuple_tid_valid() callback, the AM can either
implement some way to validate the TID and optimize the scan, or, if it has
no way to check based on the scan descriptor, simply always return true and
let table_fetch_row_version() handle things.


Re: Adding a test for speculative insert abort case

2019-05-15 Thread Ashwin Agrawal
On Wed, May 15, 2019 at 8:36 PM Melanie Plageman 
wrote:

>
> On Wed, May 15, 2019 at 6:50 PM Andres Freund  wrote:
>
>>
>> > I noticed that there is not a test case which would cover the
>> speculative
>> > wait
>> > codepath. This seems much more challenging, however, it does seem like a
>> > worthwhile test to have.
>>
>> Shouldn't be that hard to create, I think. I think acquiring another
>> lock in a second, non-unique, expression index, ought to do the trick?
>> It probably has to be created after the unique index (so it's later in
>> the
>>
>> I would think that the sequence would be s1 and s2 probe the index, s1
> and s2
> insert into the table, s1 updates the index but does not complete the
> speculative insert and clear the token (pause before
> table_complete_speculative). s2 is in speculative wait when attempting to
> update
> the index.
>
> Something like
>
> permutation
>"controller_locks"
>"controller_show"
>"s1_upsert" "s2_upsert"
>"controller_show"
>"controller_unlock_1_1" "controller_unlock_2_1"
>"controller_unlock_1_3" "controller_unlock_2_3"
>"controller_unlock_1_2"
>"s1_magically_pause_before_complete_speculative"
># put s2 in speculative wait
>"controller_unlock_2_2"
>"s1_magically_unpause_before_complete_speculative"
>
> So, how would another lock on another index keep s1 from clearing the
> speculative token after it has updated the index?
>

The second index helps hold the session after it has inserted the tuple into
the unique index but before it completes the speculative insert, which makes
the condition easy to create. I believe the order of index insertion helps
here: the unique index is inserted first, and then the non-unique index.

Attaching a patch with the test using the idea Andres mentioned; it works to
exercise the speculative wait.
From 24e44f1a971e8957503205376d655e22b0e5fbdc Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Wed, 15 May 2019 22:21:48 -0700
Subject: [PATCH] Add test to validate speculative wait is performed.

---
 .../expected/insert-conflict-specconflict.out | 69 +++
 .../specs/insert-conflict-specconflict.spec   | 44 
 2 files changed, 113 insertions(+)

diff --git a/src/test/isolation/expected/insert-conflict-specconflict.out b/src/test/isolation/expected/insert-conflict-specconflict.out
index 5726bdb8e8..ca2b65ca56 100644
--- a/src/test/isolation/expected/insert-conflict-specconflict.out
+++ b/src/test/isolation/expected/insert-conflict-specconflict.out
@@ -177,3 +177,72 @@ step controller_show: SELECT * FROM upserttest;
 keydata   
 
 k1 inserted s1 with conflict update s2
+
+starting permutation: s1_create_non_unique_index controller_locks controller_show s1_upsert s2_upsert controller_show controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show controller_lock_2_4 controller_unlock_2_2 controller_show controller_unlock_1_2 controller_print_speculative_locks controller_unlock_2_4 controller_show
+step s1_create_non_unique_index: CREATE INDEX ON upserttest((blurt_and_lock2(key)));
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess   lock   
+
+   1  1  
+   1  2  
+   1  3  
+   2  1  
+   2  2  
+   2  3  
+step controller_show: SELECT * FROM upserttest;
+keydata   
+
+step s1_upsert: INSERT INTO upserttest(key, data) VALUES('k1', 'inserted s1') ON CONFLICT (blurt_and_lock(key)) DO UPDATE SET data = upserttest.data || ' with conflict update s1'; 
+step s2_upsert: INSERT INTO upserttest(key, data) VALUES('k1', 'inserted s2') ON CONFLICT (blurt_and_lock(key)) DO UPDATE SET data = upserttest.data || ' with conflict update s2'; 
+step controller_show: SELECT * FROM upserttest;
+keydata   
+
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t  
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t  
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t  
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t  
+step controller_show: SELECT * FROM upserttest;
+keydata   
+
+step controller_lock_2_4: SELECT 

Re: Inconsistency between table am callback and table function names

2019-05-14 Thread Ashwin Agrawal
On Wed, May 8, 2019 at 5:05 PM Ashwin Agrawal  wrote:

> Not having consistency is the main aspect I wish to bring to
> attention. Like for some callback functions the comment is
>
> /* see table_insert() for reference about parameters */
> void(*tuple_insert) (Relation rel, TupleTableSlot *slot,
>  CommandId cid, int options,
>  struct BulkInsertStateData *bistate);
>
> /* see table_insert_speculative() for reference about parameters
> */
> void(*tuple_insert_speculative) (Relation rel,
>  TupleTableSlot *slot,
>  CommandId cid,
>  int options,
>  struct
> BulkInsertStateData *bistate,
>  uint32 specToken);
>
> Whereas for some other callbacks the parameter explanation exist in
> both the places. Seems we should be consistent.
> I feel in long run becomes pain to keep them in sync as comments
> evolve. Like for example
>
> /*
>  * Estimate the size of shared memory needed for a parallel scan
> of this
>  * relation. The snapshot does not need to be accounted for.
>  */
> Size(*parallelscan_estimate) (Relation rel);
>
> parallescan_estimate is not having snapshot argument passed to it, but
> table_parallescan_estimate does. So, this way chances are high they
> going out of sync and missing to modify in both the places. Agree
> though on audience being different for both. So, seems going with the
> refer XXX for parameters seems fine approach for all the callbacks and
> then only specific things to flag out for the AM layer to be aware can
> live above the callback function.
>

The topic of a consistent comment style for all the wrappers seems to have
gotten lost amid the other discussion; I would like to seek opinions on it.


Re: Inconsistency between table am callback and table function names

2019-05-14 Thread Ashwin Agrawal
On Mon, May 13, 2019 at 12:51 PM Robert Haas  wrote:

> On Fri, May 10, 2019 at 3:43 PM Ashwin Agrawal 
> wrote:
> > Meant to stick the question mark in that email, somehow missed. Yes
> > not planning to spend any time on it if objections. Here is the list
> > of renames I wish to perform.
> >
> > Lets start with low hanging ones.
> >
> > table_rescan -> table_scan_rescan
> > table_insert -> table_tuple_insert
> > table_insert_speculative -> table_tuple_insert_speculative
> > table_delete -> table_tuple_delete
> > table_update -> table_tuple_update
> > table_lock_tuple -> table_tuple_lock
> >
> > Below two you already mentioned no objections to rename
> > table_fetch_row_version -> table_tuple_fetch_row_version
> > table_get_latest_tid -> table_tuple_get_latest_tid
> >
> > Now, table_beginscan and table_endscan are the ones which are
> > wide-spread.
>
> I vote to rename all the ones where the new name would contain "tuple"
> and to leave the others alone.  i.e. leave table_beginscan,
> table_endscan, and table_rescan as they are.  I think that there's
> little benefit in standardizing table_rescan but not the other two,
> and we seem to agree that tinkering with the other two gets into a
> painful amount of churn.
>

Thank you. Please find attached the patch to rename the agreed-upon
functions. It would be better to make all of them consistent rather than
exempt three functions, but there seems to be no consensus on that. Given
that the table_ APIs are new, we could have modified all of them while
leaving the systable_ ones as is, but given the objections those were left alone.
From 39fcc223a0aaeacf545e09cfe29b8d565d970234 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Tue, 14 May 2019 10:10:21 -0700
Subject: [PATCH] Rename table am wrappers to match table am callback names.

Now all the table wrapper functions match the
"table_<callback name>" theme, except table_beginscan(),
table_endscan() and table_rescan(). Based on the discussion, the
decision is to leave those three as is to avoid code churn.
---
 src/backend/access/heap/heapam.c   | 10 ++--
 src/backend/access/table/tableam.c | 46 +++
 src/backend/commands/copy.c| 12 ++--
 src/backend/commands/createas.c| 14 ++---
 src/backend/commands/matview.c | 14 ++---
 src/backend/commands/tablecmds.c   |  4 +-
 src/backend/commands/trigger.c | 12 ++--
 src/backend/executor/execMain.c|  8 +--
 src/backend/executor/execReplication.c | 16 +++---
 src/backend/executor/nodeLockRows.c|  8 +--
 src/backend/executor/nodeModifyTable.c | 78 -
 src/backend/executor/nodeTidscan.c |  4 +-
 src/backend/utils/adt/tid.c|  4 +-
 src/include/access/tableam.h   | 80 +-
 14 files changed, 155 insertions(+), 155 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ec9853603fd..86a964c4f48 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1856,12 +1856,12 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * See table_insert for comments about most of the input flags, except that
+ * See table_tuple_insert for comments about most of the input flags, except that
  * this routine directly takes a tuple rather than a slot.
  *
  * There's corresponding HEAP_INSERT_ options to all the TABLE_INSERT_
  * options, and there additionally is HEAP_INSERT_SPECULATIVE which is used to
- * implement table_insert_speculative().
+ * implement table_tuple_insert_speculative().
  *
  * On return the header fields of *tup are updated to match the stored tuple;
  * in particular tup->t_self receives the actual TID where the tuple was
@@ -2434,7 +2434,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 /*
  *	heap_delete - delete a tuple
  *
- * See table_delete() for an explanation of the parameters, except that this
+ * See table_tuple_delete() for an explanation of the parameters, except that this
  * routine directly takes a tuple rather than a slot.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
@@ -2880,7 +2880,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 /*
  *	heap_update - replace a tuple
  *
- * See table_update() for an explanation of the parameters, except that this
+ * See table_tuple_update() for an explanation of the parameters, except that this
  * routine directly takes a tuple rather than a slot.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
@@ -3951,7 +3951,7 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  *	*buffer: set to buffer holding tuple (pinned but not locked at exit)
  *	*tmfd: filled in failure

Re: Inconsistency between table am callback and table function names

2019-05-10 Thread Ashwin Agrawal
On Fri, May 10, 2019 at 10:51 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-05-10 10:43:44 -0700, Ashwin Agrawal wrote:
> > On Thu, May 9, 2019 at 8:52 AM Andres Freund  wrote:
> > > The changes necessary for tableam were already huge. Changing naming
> > > schemes for functions that are used all over the backend (e.g. ~80 calls
> > > to table_beginscan), and where there's other wrapper functions that also
> > > widely used (237 calls to systable_beginscan) which didn't have to be
> > > touched, at the same time would have made it even harder to review.
> >
> > If there are no objections to renaming now, as separate independent
> > patch, I am happy to do the same and send it across. Will rename to
> > make it consistent as mentioned at start of the thread leaving
> > table_relation_xxx() ones as is today.
>
> What would you want to rename precisely? Don't think it's useful to
> start sending patches before we agree on something concrete.  I'm not on
> board with patching hundreds systable_beginscan calls (that'll break a
> lot of external code, besides the churn of in-core code), nor with the
> APIs around that having a diverging name scheme.

I meant to include a question mark in that email but somehow missed it.
Yes, I am not planning to spend any time on this if there are objections.
Here is the list of renames I wish to perform.

Lets start with low hanging ones.

table_rescan -> table_scan_rescan
git grep table_rescan | wc -l
6

table_insert -> table_tuple_insert
git grep tuple_insert | wc -l
13

table_insert_speculative -> table_tuple_insert_speculative
git grep tuple_insert_speculative | wc -l
5

table_delete -> table_tuple_delete (table_delete reads incorrectly, as
it does not delete the table)
git grep tuple_delete | wc -l
8

table_update -> table_tuple_update
git grep tuple_update | wc -l
5

table_lock_tuple -> table_tuple_lock
git grep tuple_lock | wc -l
26


Below two you already mentioned no objections to rename
table_fetch_row_version -> table_tuple_fetch_row_version
table_get_latest_tid -> table_tuple_get_latest_tid


Now, table_beginscan and table_endscan are the widespread ones. The
desire seems to be to keep them consistent with systable_beginscan. I
understand the constraints and the churn aspect, but some divergence in
naming scheme is unavoidable anyway. Why not leave systable_beginscan
as it is and at least rename table_beginscan and its counterparts
table_beginscan_xxx()?

Index and table interfaces already have somewhat divergent naming
schemes, e.g. index_getnext_slot and table_scan_getnextslot. I am not
proposing to change those, but reducing the divergence wherever
possible, sooner rather than later, would be helpful.




Re: Inconsistency between table am callback and table function names

2019-05-10 Thread Ashwin Agrawal
On Thu, May 9, 2019 at 8:52 AM Andres Freund  wrote:
> The changes necessary for tableam were already huge. Changing naming
> schemes for functions that are used all over the backend (e.g. ~80 calls
> to table_beginscan), and where there's other wrapper functions that also
> widely used (237 calls to systable_beginscan) which didn't have to be
> touched, at the same time would have made it even harder to review.

If there are no objections to renaming now, as a separate independent
patch, I am happy to do so and send it across. I will rename to make
things consistent as mentioned at the start of the thread, leaving the
table_relation_xxx() ones as they are today.




Re: Pluggable Storage - Andres's take

2019-05-09 Thread Ashwin Agrawal
On Wed, May 8, 2019 at 2:46 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote:
> > On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal  wrote:
> > > Also wish to point out, while working on Zedstore, we realized that
> > > TupleDesc from Relation object can be trusted at AM layer for
> > > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> > > catalog is updated first and hence the relation object passed to AM
> > > layer reflects new TupleDesc. For heapam its fine as it doesn't use
> > > the TupleDesc today during scans in AM layer for scan_getnextslot().
> > > Only TupleDesc which can trusted and matches the on-disk layout of the
> > > tuple for scans hence is from TupleTableSlot. Which is little
> > > unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> > > and not in scan_begin(). Means if AM wishes to do some initialization
> > > based on TupleDesc for scans can't be done in scan_begin() and forced
> > > to delay till has access to TupleTableSlot. We should at least add
> > > comment for scan_begin() to strongly clarify not to trust Relation
> > > object TupleDesc. Or maybe other alternative would be have separate
> > > API for rewrite case.
> >
> > Just to correct my typo, I wish to say, TupleDesc from Relation object can't
> > be trusted at AM layer for scan_begin() API.
> >
> > Andres, any thoughts on above. I see you had proposed "change the
> > table_beginscan* API so it
> > provides a slot" in [1], but seems received no response/comments that time.
> > [1]
> > https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de
>
> I don't think passing a slot at beginscan time is a good idea. There's
> several places that want to use different slots for the same scan, and
> we probably want to increase that over time (e.g. for batching), not
> decrease it.
>
> What kind of initialization do you want to do based on the tuple desc at
> beginscan time?

For Zedstore (a column store) we need to allocate a map (array or
bitmask) marking which columns to project for the scan. We also need to
allocate the AM-internal scan descriptors corresponding to the number of
attributes in the scan, and hence need access to that attribute count.
Currently, since the Relation's TupleDesc cannot be trusted, Zedstore
works around this by allocating these things on the first call to
getnextslot, when the slot is available (switching to the memory context
used during scan_begin()).
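The lazy-initialization workaround described above can be sketched as follows. This is a minimal standalone mock; MockScanDesc, mock_scan_begin, and mock_scan_getnextslot are hypothetical names for illustration, not the Zedstore code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical mock of the slot's descriptor: only the attribute count matters here. */
typedef struct MockTupleDesc { int natts; } MockTupleDesc;
typedef struct MockSlot { MockTupleDesc *tts_tupleDescriptor; } MockSlot;

/* Hypothetical AM-private scan state mirroring the workaround described above. */
typedef struct MockScanDesc
{
    bool   *proj_map;       /* which columns to project; allocated lazily */
    int     natts;          /* filled in on the first getnextslot call */
    bool    initialized;
} MockScanDesc;

/* scan_begin(): the relation's TupleDesc cannot be trusted, so allocate nothing yet. */
static MockScanDesc *
mock_scan_begin(void)
{
    return (MockScanDesc *) calloc(1, sizeof(MockScanDesc));
}

/* scan_getnextslot(): the first call sees a trustworthy TupleDesc via the slot. */
static void
mock_scan_getnextslot(MockScanDesc *scan, MockSlot *slot)
{
    if (!scan->initialized)
    {
        /* the deferred initialization that would ideally live in scan_begin() */
        scan->natts = slot->tts_tupleDescriptor->natts;
        scan->proj_map = (bool *) calloc(scan->natts, sizeof(bool));
        scan->initialized = true;
    }
    /* ... fetch the next tuple into the slot ... */
}
```

The drawback this illustrates is exactly the one raised above: every getnextslot call pays for the "already initialized?" check because the needed information arrives one call too late.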




Re: Inconsistency between table am callback and table function names

2019-05-08 Thread Ashwin Agrawal
On Wed, May 8, 2019 at 2:51 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-05-08 00:32:22 -0700, Ashwin Agrawal wrote:
> > The general theme for the table function names seems to be
> > "table_<callback name>". For example table_scan_getnextslot() and its
> > corresponding callback scan_getnextslot(). Most of the table functions and
> > callbacks follow the mentioned convention, except the following ones:
> >
> >  table_beginscan
> >  table_endscan
> >  table_rescan
> >  table_fetch_row_version
> >  table_get_latest_tid
> >  table_insert
> >  table_insert_speculative
> >  table_complete_speculative
> >  table_delete
> >  table_update
> >  table_lock_tuple
> >
> > the corresponding callback names for them are
> >
> >  scan_begin
> >  scan_end
> >  scan_rescan
>
> The mismatch here is just due of backward compat with the existing
> function names.

I am missing something here and would like to know more. The table_
functions are all fresh, new names, so IMO consistency with surrounding
and related code carries more weight; I don't see what purpose backward
compatibility serves here. The heap functions can keep their old names
for backward compatibility if required.


> > Also, a question about comments. Currently, redundant comments are written
> > above the callback functions as well as above the table functions. The
> > descriptions sometimes differ a little, but the majority of the content is
> > the same. Should we settle on a single place for the comments, to keep
> > them in sync and to make it clear which one to refer to?
>
> Note that the non-differing comments usually just refer to the other
> place. And there's legitimate differences in most of the ones that are
> both at the callback and the external functions - since the audience of
> both are difference, that IMO makes sense.
>

The lack of consistency is the main aspect I wish to bring to
attention. For example, for some callback functions the comment is

/* see table_insert() for reference about parameters */
void        (*tuple_insert) (Relation rel, TupleTableSlot *slot,
                             CommandId cid, int options,
                             struct BulkInsertStateData *bistate);

/* see table_insert_speculative() for reference about parameters */
void        (*tuple_insert_speculative) (Relation rel,
                                         TupleTableSlot *slot,
                                         CommandId cid,
                                         int options,
                                         struct BulkInsertStateData *bistate,
                                         uint32 specToken);

Whereas for some other callbacks the parameter explanation exists in
both places. It seems we should be consistent. I feel that in the long
run it becomes a pain to keep them in sync as the comments evolve. For
example:

/*
 * Estimate the size of shared memory needed for a parallel scan of this
 * relation. The snapshot does not need to be accounted for.
 */
Size        (*parallelscan_estimate) (Relation rel);

parallelscan_estimate does not take a snapshot argument, but
table_parallelscan_estimate does. So chances are high that the two
comments go out of sync, with one place missed when modifying the
other. I do agree the audience differs for the two, though. So the
"see XXX for reference about parameters" approach seems fine for all
the callbacks, with only AM-layer-specific caveats documented above
each callback function.

> > Plus, file name amapi.h now seems very broad, if possible should be renamed
> > to indexamapi.h or indexam.h to follow tableam.h. No idea what's our policy
> > around header file renames.
>
> We probably should rename it, but not in 12...

Okay good to know.




Inconsistency between table am callback and table function names

2019-05-08 Thread Ashwin Agrawal
The general theme for the table function names seems to be
"table_<callback name>". For example table_scan_getnextslot() and its
corresponding callback scan_getnextslot(). Most of the table functions and
callbacks follow the mentioned convention, except the following ones:

 table_beginscan
 table_endscan
 table_rescan
 table_fetch_row_version
 table_get_latest_tid
 table_insert
 table_insert_speculative
 table_complete_speculative
 table_delete
 table_update
 table_lock_tuple

the corresponding callback names for them are

 scan_begin
 scan_end
 scan_rescan
 tuple_fetch_row_version
 tuple_get_latest_tid
 tuple_insert
 tuple_insert_speculative
 tuple_delete
 tuple_update
 tuple_lock

This is confusing while browsing through the code, and hence I would like
to propose making them consistent. Either fix the callback names or the
table function names, but all should follow the same convention, which
makes it easier to browse around and refer to as well. Personally, I
would fix the table function names, as the callback names seem fine. So,
for example, make it table_scan_begin().
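The wrapper/callback pairing under discussion can be sketched with a minimal standalone mock of the dispatch pattern; the Mock* names and mock_am_scan_begin are hypothetical, not the real tableam API:

```c
#include <assert.h>

/* Hypothetical AM callback table, following the tableam function-pointer style.
 * The callback is named scan_begin. */
typedef struct MockTableAmRoutine
{
    int (*scan_begin) (int relid);
} MockTableAmRoutine;

typedef struct MockRelation
{
    const MockTableAmRoutine *rd_tableam;
    int rel_id;
} MockRelation;

/* Under the proposed convention, the wrapper name is table_ + the callback
 * name, so a reader can jump between the two mechanically. */
static inline int
table_scan_begin(MockRelation *rel)
{
    return rel->rd_tableam->scan_begin(rel->rel_id);
}

/* A toy AM implementation of the callback. */
static int
mock_am_scan_begin(int relid)
{
    return relid * 10;
}
```

The point of the convention is purely navigational: grep for the callback name and the wrapper name falls out by prepending `table_`.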

Also, some of these table function names read a little odd:

table_relation_set_new_filenode
table_relation_nontransactional_truncate
table_relation_copy_data
table_relation_copy_for_cluster
table_relation_vacuum
table_relation_estimate_size

Can we drop the word "relation" from the callback names, and as a result
from these function names as well? E.g. just have the callback names be
set_new_filenode, copy_data, estimate_size?

Also, a question about comments. Currently, redundant comments are written
above the callback functions as well as above the table functions. The
descriptions sometimes differ a little, but the majority of the content is
the same. Should we settle on a single place for the comments, to keep
them in sync and to make it clear which one to refer to?

Plus, the file name amapi.h now seems very broad; if possible it should be
renamed to indexamapi.h or indexam.h to follow tableam.h. I have no idea
what our policy is around header file renames.


Re: Pluggable Storage - Andres's take

2019-05-08 Thread Ashwin Agrawal
On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal  wrote:
>
> Also wish to point out, while working on Zedstore, we realized that
> TupleDesc from Relation object can be trusted at AM layer for
> scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> catalog is updated first and hence the relation object passed to AM
> layer reflects new TupleDesc. For heapam its fine as it doesn't use
> the TupleDesc today during scans in AM layer for scan_getnextslot().
> Only TupleDesc which can trusted and matches the on-disk layout of the
> tuple for scans hence is from TupleTableSlot. Which is little
> unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> and not in scan_begin(). Means if AM wishes to do some initialization
> based on TupleDesc for scans can't be done in scan_begin() and forced
> to delay till has access to TupleTableSlot. We should at least add
> comment for scan_begin() to strongly clarify not to trust Relation
> object TupleDesc. Or maybe other alternative would be have separate
> API for rewrite case.

Just to correct my typo, I wish to say, TupleDesc from Relation object can't
be trusted at AM layer for scan_begin() API.

Andres, any thoughts on the above? I see you had proposed to "change the
table_beginscan* API so it provides a slot" in [1], but it seems that
received no response/comments at the time.

[1]
https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de


Re: Pluggable Storage - Andres's take

2019-05-06 Thread Ashwin Agrawal
On Mon, May 6, 2019 at 7:14 AM Andres Freund  wrote:
>
> Hi,
>
> On May 6, 2019 3:40:55 AM PDT, Rafia Sabih  wrote:
> >I was trying the toyam patch and on make check it failed with
> >segmentation fault at
> >
> >static void
> >toyam_relation_set_new_filenode(Relation rel,
> > char persistence,
> > TransactionId *freezeXid,
> > MultiXactId *minmulti)
> >{
> > *freezeXid = InvalidTransactionId;
> >
> >Basically, on running create table t (i int, j int) using toytable,
> >leads to this segmentation fault.
> >
> >Am I missing something here?
>
> I assume you got compiler warmings compiling it? The API for some callbacks 
> changed a bit.

The attached patch gets the toy table AM implementation to match the
latest master API. The patch builds on top of the patch from Heikki in [1].
It compiles and works, but the test still fails with a WARNING for the
issue mentioned in [1].


I noticed a typo in the recently added comment for relation_set_new_filenode().

 * Note that only the subset of the relcache filled by
 * RelationBuildLocalRelation() can be relied upon and that the relation's
 * catalog entries either will either not yet exist (new relation), or
 * will still reference the old relfilenode.

seems should be

 * Note that only the subset of the relcache filled by
 * RelationBuildLocalRelation() can be relied upon and that the relation's
 * catalog entries will either not yet exist (new relation), or still
 * reference the old relfilenode.

Also wish to point out, while working on Zedstore, we realized that
TupleDesc from Relation object can be trusted at AM layer for
scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
catalog is updated first and hence the relation object passed to AM
layer reflects new TupleDesc. For heapam its fine as it doesn't use
the TupleDesc today during scans in AM layer for scan_getnextslot().
Only TupleDesc which can trusted and matches the on-disk layout of the
tuple for scans hence is from TupleTableSlot. Which is little
unfortunate as TupleTableSlot is only available in scan_getnextslot(),
and not in scan_begin(). Means if AM wishes to do some initialization
based on TupleDesc for scans can't be done in scan_begin() and forced
to delay till has access to TupleTableSlot. We should at least add
comment for scan_begin() to strongly clarify not to trust Relation
object TupleDesc. Or maybe other alternative would be have separate
API for rewrite case.


1] 
https://www.postgresql.org/message-id/9a7fb9cc-2419-5db7-8840-ddc10c93f122%40iki.fi
diff --git a/src/test/modules/toytable/toytableam.c b/src/test/modules/toytable/toytableam.c
index 4cb2b5d75db..9f1ab534822 100644
--- a/src/test/modules/toytable/toytableam.c
+++ b/src/test/modules/toytable/toytableam.c
@@ -398,6 +398,7 @@ toyam_finish_bulk_insert(Relation rel, int options)
 
 static void
 toyam_relation_set_new_filenode(Relation rel,
+const RelFileNode *newrnode,
 char persistence,
 TransactionId *freezeXid,
 MultiXactId *minmulti)
@@ -410,7 +411,7 @@ toyam_relation_set_new_filenode(Relation rel,
 	 * RelationGetNumberOfBlocks, from index_update_stats(), and that
 	 * fails if the underlying file doesn't exist.
 	 */
-	RelationCreateStorage(rel->rd_node, persistence);
+	RelationCreateStorage(*newrnode, persistence);
 }
 
 static void
@@ -422,7 +423,7 @@ toyam_relation_nontransactional_truncate(Relation rel)
 }
 
 static void
-toyam_relation_copy_data(Relation rel, RelFileNode newrnode)
+toyam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 {
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -435,8 +436,8 @@ toyam_relation_copy_for_cluster(Relation NewHeap,
 Relation OldIndex,
 bool use_sort,
 TransactionId OldestXmin,
-TransactionId FreezeXid,
-MultiXactId MultiXactCutoff,
+TransactionId *xid_cutoff,
+MultiXactId *multi_cutoff,
 double *num_tuples,
 double *tups_vacuumed,
 double *tups_recently_dead)


Match table_complete_speculative() code to comment

2019-04-30 Thread Ashwin Agrawal
Comment for table_complete_speculative() says

/*
 * Complete "speculative insertion" started in the same transaction. If
 * succeeded is true, the tuple is fully inserted, if false, it's removed.
 */
static inline void
table_complete_speculative(Relation rel, TupleTableSlot *slot,
                           uint32 specToken, bool succeeded)
{
    rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
                                                succeeded);
}

but the code really treats succeeded as failure. Since that argument is
passed as specConflict, a true value means a conflict happened and hence
the tuple should be removed. It would be better to fix the code to match
the comment, as in the AM layer it is better to deal with succeeded
meaning finish the insertion, not the other way round.
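The mismatch reduces to a flag-polarity sketch like the following (hypothetical mock names; the real functions live in heapam_handler.c and nodeModifyTable.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* AM-layer sketch: defined in terms of success, matching the comment quoted above. */
static const char *
mock_complete_speculative(bool succeeded)
{
    /* succeeded == true: finish the insertion; false: remove the tuple */
    return succeeded ? "finish" : "abort";
}

/* Executor-side sketch: it tracks whether a conflict occurred, so it must
 * negate the flag when calling through -- the essence of the fix above. */
static const char *
mock_exec_insert(bool specConflict)
{
    return mock_complete_speculative(!specConflict);
}
```

Before the fix, the negation effectively lived inside the AM callback instead (`if (!succeeded)` finishing the insert), which contradicted the callback's own documentation even though the end-to-end behavior was the same.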

diff --git a/src/backend/access/heap/heapam_handler.c
b/src/backend/access/heap/heapam_handler.c
index 4d179881f27..241639cfc20 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -282,7 +282,7 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
 
 	/* adjust the tuple's state accordingly */
-	if (!succeeded)
+	if (succeeded)
 		heap_finish_speculative(relation, &slot->tts_tid);
 	else
 		heap_abort_speculative(relation, &slot->tts_tid);
diff --git a/src/backend/executor/nodeModifyTable.c
b/src/backend/executor/nodeModifyTable.c
index 444c0c05746..d545bbce8a2 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -556,7 +556,7 @@ ExecInsert(ModifyTableState *mtstate,

 		/* adjust the tuple's state accordingly */
 		table_complete_speculative(resultRelationDesc, slot,
-								   specToken, specConflict);
+								   specToken, !specConflict);
 
 		/*
 		 * Wake up anyone waiting for our decision.  They will re-check

- Ashwin and Melanie




Re: Pluggable Storage - Andres's take

2019-04-29 Thread Ashwin Agrawal
On Thu, Apr 25, 2019 at 3:43 PM Andres Freund  wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> though. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
>
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.

I will provide my inputs, Heikki please correct me or add your inputs.

I am not sure how much gain this provides in practice if the rest of the
system continues to use the returned value in terms of blocks. I
understand that being block based (and not just block based, but with all
blocks of the relation storing data and full tuples) is engraved in the
system, so breaking out of that is indeed a much larger change, not
limited to the table AM API.

I feel most of the issues discussed here will be faced by zheap as well,
as not all of its blocks/pages contain data; for example, TPD pages
should be excluded from sampling, TID scans, etc.

> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size.  I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

Yes, I do think we should have a mechanism to get both the total size and
the size for a specific purpose. Zedstore currently doesn't use forks.
Just a thought: instead of calling the argument forknum, call it
something like data vs. meta-data, or main-data vs. auxiliary-data,
size. Though I don't know whether any usage exists that wants the size
of just some non-MAIN fork for heap/zheap; such pieces of code shouldn't
be in generic areas but in AM-specific code only.

>
> > 2) commands/analyze.c, computing pg_class.relpages
> >
> >This should imo be moved to the tableam callback. It's currently done
> >a bit weirdly imo, with fdws computing relpages the callback, but
> >then also returning the acquirefunc. Seems like it should entirely be
> >computed as part of calling acquirefunc.
>
> Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> tableam wouldn't be the right minimal approach too. It has the
> disadvantage of implying certain values for the
> RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
> would be to return the desire sampling range in
> table_beginscan_analyze() - but that'd require some duplication because
> currently that just uses the generic scan_begin() callback.

Yes, just routing the relation size via the AM layer, while still using
its returned value in terms of blocks and sampling based on those blocks,
doesn't feel like it resolves the issue. Maybe sampling needs to be
delegated completely to the AM layer. Code duplication can be avoided by
similar AMs (heap and zheap) using common utility functions to achieve
the intended result.

>
> I suspect - as previously mentioned- that we're going to have to extend
> statistics collection beyond the current approach at some point, but I
> don't think that's now. At least to me it's not clear how to best
> represent the stats, and how to best use them, if the underlying storage
> is fundamentally not block best.  Nor how we'd avoid code duplication...

Yes, will have to give more thoughts into this.

>
> > 3) nodeTidscan, skipping over too large tids
> >I think this should just be moved into the AMs, there's no need to
> >have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
> /*
>  * We silently discard any TIDs that are out of range at the time of 
> scan
>  * start.  (Since we hold at least AccessShareLock on the table, it 
> won't
>  * be possible for someone to truncate away the blocks we intend to
>  * visit.)
>  */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.

Agreed, it's not nice to have that optimization performed based on the
number of blocks in the generic layer. I feel it's not efficient for
zheap either, due to the TPD pages mentioned above, as the number of
blocks returned will be higher than the number of actual data blocks.
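Moved into the AM as suggested, the out-of-range filter for a block-based AM would look roughly like the following hedged sketch (the Mock* names are hypothetical; a table AM whose TIDs are not tied to physical blocks would substitute its own notion of a valid TID):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct MockItemPointer
{
    uint32_t ip_blkid;      /* block number */
    uint16_t ip_posid;      /* offset within the block */
} MockItemPointer;

/* Block-based AMs (heap, zheap) can bound-check a TID against the relation
 * size captured at scan start; since the caller holds at least
 * AccessShareLock, blocks below that bound cannot be truncated away. */
static bool
mock_tid_in_range(const MockItemPointer *tid, uint32_t nblocks_at_scan_start)
{
    return tid->ip_blkid < nblocks_at_scan_start;
}
```

Keeping this check behind the AM boundary lets each AM decide what "out of range" means, instead of nodeTidscan.c assuming every TID maps to a physical heap block.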

>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> 

Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-29 Thread Ashwin Agrawal
On Thu, Apr 25, 2019 at 11:53 PM Michael Paquier  wrote:
>
> On Thu, Apr 25, 2019 at 12:45:03PM -0400, Tom Lane wrote:
> > I still remain concerned that invoking catalog lookups from fd.c is a darn
> > bad idea, even if we have a fallback for it to work (for some value of
> > "work") in non-transactional states.  It's not really hard to envision
> > that kind of thing leading to infinite recursion.  I think it's safe
> > right now, because catalog fetches shouldn't lead to any temp-file
> > access, but that's sort of a rickety assumption isn't it?
>
> Introducing catalog lookups into fd.c which is not a layer designed
> for that is a choice that I find strange, and I fear that it may bite
> in the future.  I think that the choice proposed upthread to add
> an assertion on TempTablespacesAreSet() when calling a function
> working on temporary data is just but fine, and that we should just
> make sure that the gist code calls PrepareTempTablespaces()
> correctly.  So [1] is a proposal I find much more acceptable than the
> other one.

Well, the one thing I wish to point out explicitly is that just taking
the fd.c changes from [1] and running make check hits no assertions and
doesn't flag the existing issue in gistbuildbuffers.c. That means the
coverage is missing, and the same can happen again in the future.

[1]: https://postgr.es/m/11777.1556133...@sss.pgh.pa.us




Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Ashwin Agrawal
On Mon, Apr 29, 2019 at 10:36 AM Tom Lane  wrote:
>
> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> to see if it fixes the problem for you?

Yes, I will give it a try on Greenplum and report back the result.

Have we decided if this will be applied to back branches?




Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Ashwin Agrawal
On Sat, Apr 27, 2019 at 5:57 PM Tom Lane  wrote:
>
> I have spent a fair amount of time trying to replicate these failures
> locally, with little success.  I now think that the most promising theory
> is Munro's idea in [1] that the walreceiver is hanging up during its
> unsafe attempt to do ereport(FATAL) from inside a signal handler.  It's
> extremely plausible that that could result in a deadlock inside libc's
> malloc/free, or some similar place.  Moreover, if that's what's causing
> it, then the windows for trouble are fixed by the length of time that
> malloc might hold internal locks, which fits with the results I've gotten
> that inserting delays in various promising-looking places doesn't do a
> thing towards making this reproducible.

For Greenplum (based on 9.4, but current master code looks the same) we
recently saw this deadlock hit many times in CI for the walreceiver,
which I believe confirms the above finding.

#0  __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x7f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>,
p=0x26bb3b0, have_lock=0) at malloc.c:3962
#2  0x7f0637eeb53c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#3  0x7f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#4  0x7f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#5  0x7f063b5cede7 in _dl_fini () at dl-fini.c:235
#6  0x7f0637ea0ff8 in __run_exit_handlers (status=1,
listp=0x7f063822b5f8 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7  0x7f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x008c72c7 in proc_exit ()
#9  0x00a75867 in errfinish ()
#10 0x0089ea53 in ProcessWalRcvInterrupts ()
#11 0x0089eac5 in WalRcvShutdownHandler ()
#12 <signal handler called>
#13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>,
bytes=bytes@entry=16384) at malloc.c:3802
#14 0x7f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
#15 0x007754c3 in makeEmptyPGconn ()
#16 0x00779686 in PQconnectStart ()
#17 0x00779b8b in PQconnectdb ()
#18 0x008aae52 in libpqrcv_connect ()
#19 0x0089f735 in WalReceiverMain ()
#20 0x005c5eab in AuxiliaryProcessMain ()
#21 0x004cd5f1 in ServerLoop ()
#22 0x0086fb18 in PostmasterMain ()
#23 0x004d2e28 in main ()

ImmediateInterruptOK was removed from regular backends but not from the
walreceiver, and the walreceiver performing ereport(FATAL) inside a signal
handler is dangerous.




Re: What is an item pointer, anyway?

2019-04-26 Thread Ashwin Agrawal
On Fri, Apr 26, 2019 at 2:19 PM Peter Geoghegan  wrote:

> itemid.h introduces the struct ItemIdData as follows:
>
> /*
>  * An item pointer (also called line pointer) on a buffer page
>
> Meanwhile, itemptr.h introduces the struct ItemPointerData as follows:
>
> /*
>  * ItemPointer:
>  *
>  * This is a pointer to an item within a disk page of a known file
>  * (for example, a cross-link from an index to its parent table).
>
> It doesn't seem reasonable to assume that you should know the
> difference based on context. The two concepts are closely related. An
> ItemPointerData points to a block, as well as the, uh, item pointer
> within that block.
>
> This ambiguity is avoidable, and should be avoided.


Agree.


> ISTM that the
> least confusing way of removing the ambiguity would be to no longer
> refer to ItemIds as item pointers, without changing anything else.
>

How about we rename ItemPointerData to TupleIdentifier or ItemIdentifier
instead, and leave ItemPointer / Item confined to being an AM term, where an
item can be a tuple, a datum, or anything else?


Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-25 Thread Ashwin Agrawal
On Thu, Apr 25, 2019 at 9:45 AM Tom Lane  wrote:

> However, by that argument we should change all 3 of these functions to
> set up the data.  If we're eating the layering violation to the extent
> of letting OpenTemporaryFile call into commands/tablespace, then there's
> little reason for the other 2 not to do likewise.
>

I agree with that point; the same logic should be used for all three calls
irrespective of the approach we pick.

> I still remain concerned that invoking catalog lookups from fd.c is a darn
> bad idea, even if we have a fallback for it to work (for some value of
> "work") in non-transactional states.  It's not really hard to envision
> that kind of thing leading to infinite recursion.  I think it's safe
> right now, because catalog fetches shouldn't lead to any temp-file
> access, but that's sort of a rickety assumption isn't it?
>

Is there an (easy) way to assert that assumption? If yes, we could add such an
assert and make it less rickety.

Though I agree, any such coded exceptions/violations generally bite somewhere
later in the long run.


Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-25 Thread Ashwin Agrawal
On Wed, Apr 24, 2019 at 5:48 PM Peter Geoghegan  wrote:

> On Wed, Apr 24, 2019 at 12:17 PM Tom Lane  wrote:
> > After a bit more thought it seemed like another answer would be to
> > make all three of those functions assert that the caller did the
> > right thing, as per attached.  This addresses the layering-violation
> > complaint, but might be more of a pain in the rear for developers.
>
> In what sense is it not already a layering violation to call
> PrepareTempTablespaces() as often as we do? PrepareTempTablespaces()
> parses and validates the GUC variable and passes it to fd.c, but to me
> that seems almost the same as calling the fd.c function
> SetTempTablespaces() directly. PrepareTempTablespaces() allocates
> memory that it won't free itself within TopTransactionContext. I'm not
> seeing why the context that the PrepareTempTablespaces() catalog
> access occurs in actually matters.
>
> Like you, I find it hard to prefer one of the approaches over the
> other, though I don't really know how to assess this layering
> business. I'm glad that either approach will prevent oversights,
> though.
>

Just to provide my opinion, since we are at an intersection and can go
either way on this. The second approach (just adding asserts) only helps
if the code paths for ALL future callers get exercised and tests exist for
them, to expose potential breakage. The first approach, though, fixes the
issue for current and future callers, and exercising it with just a single
test already covers future callers as well. So the first approach sounds
more promising if we have to pick between the two.


Re: Zedstore - compressed in-core columnar storage

2019-04-24 Thread Ashwin Agrawal
On Tue, Apr 16, 2019 at 9:15 AM Tomas Vondra 
wrote:

>
> I'm not sure it's that clear cut, actually. Sure, it's not the usual
> (block,item) pair so it's not possible to jump to the exact location, so
> it's not the raw physical identifier as regular TID. But the data are
> organized in a btree, with the TID as a key, so it does actually provide
> some information about the location.
>

From a representation perspective, it's a logical identifier. But yes, since
it is used as the key to lay out the datums, there exists a pretty good
correlation between TIDs and physical location. One can consider the data
clustered based on TID.

> I've asked about BRIN indexes elsewhere in this thread, which I think is
> related to this question, because that index type relies on TID providing
> sufficient information about location. And I think BRIN indexes are going
> to be rather important for colstores (and formats like ORC have something
> very similar built-in).
>
> But maybe all we'll have to do is define the ranges differently - instead
> of "number of pages" we may define them as "number of rows" and it might
> be working.
>

BRIN indexes work for zedstore right now. A block range maps to just a range
of TIDs in zedstore, as pointed out above. When one converts a zstid to an
ItemPointer, one can get the "block number" from the ItemPointer, just like
from a normal heap TID. It doesn't reflect the direct physical location of
the row in zedstore, but that's fine.

It might be sub-optimal in some cases. For example, if one zedstore page
contains TIDs 1-1000 and another 1000-2000, and an entry in the BRIN index
covers TIDs 500-1500, we have to access both zedstore pages. It would be
better if the cutoff points in the BRIN index matched the physical pages of
the zedstore, but it still works, and is probably fine in practice.

The plan is to add an integrated BRIN index in zedstore, meaning keep min-max
values for appropriate columns within each page. This will not help eliminate
the IO the way an external BRIN index does, but it helps to skip decompression
and visibility checks etc. for blocks not matching the conditions.



Just to showcase that BRIN works for zedstore, I played with the hands-on
example mentioned in [1].



With btree index on zedstore:

                                    QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=4351.50..4351.51 rows=1 width=32) (actual time=1267.140..1267.140 rows=1 loops=1)
   ->  Index Scan using idx_ztemperature_log_log_timestamp on ztemperature_log  (cost=0.56..4122.28 rows=91686 width=4) (actual time=0.117..1244.112 rows=86400 loops=1)
         Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
 Planning Time: 0.240 ms
 Execution Time: 1269.016 ms
(5 rows)

With brin index on zedstore.
Note: the bitmap scan for zedstore currently scans all the columns; scanning
only the columns required by the query is yet to be implemented.

                                    QUERY PLAN
-----------------------------------------------------------------------------
 Finalize Aggregate  (cost=217538.85..217538.86 rows=1 width=32) (actual time=54.167..54.167 rows=1 loops=1)
   ->  Gather  (cost=217538.63..217538.84 rows=2 width=32) (actual time=53.967..55.184 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=216538.63..216538.64 rows=1 width=32) (actual time=42.956..42.957 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on ztemperature_log  (cost=59.19..216446.98 rows=36660 width=4) (actual time=3.571..35.904 rows=28800 loops=3)
                     Recheck Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
                     Rows Removed by Index Recheck: 3968
                     Heap Blocks: lossy=381
                     ->  Bitmap Index Scan on idx_ztemperature_log_log_timestamp  (cost=0.00..37.19 rows=98270 width=0) (actual time=1.201..1.201 rows=7680 loops=1)
                           Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
 Planning Time: 0.240 ms
 Execution Time: 55.341 ms
(13 rows)

 schema_name |             index_name             | index_ratio | index_size | table_size
-------------+------------------------------------+-------------+------------+------------
 public      | idx_ztemperature_log_log_timestamp |           0 | 80 kB      | 1235 MB
(1 row)


1]
https://www.postgresql.fastware.com/blog/brin-indexes-what-are-they-and-how-do-you-use-them


Re: Regression test PANICs with master-standby setup on same machine

2019-04-24 Thread Ashwin Agrawal
On Wed, Apr 24, 2019 at 9:25 AM Andres Freund  wrote:

>  The inability
> to reasonably test master/standby setups on a single machine is pretty
> jarring (yes, one can use basebackup tablespace maps - but that doesn't
> work well for new tablespaces).


+1, agree. A feature which can't be easily tested becomes a hurdle for
development, and this is one of them. For reference, the bug reported in [1]
is hard to test and fix without the ability to easily set up master/standby
on the same node. We discussed a few ways to eliminate the issue in thread [2],
but I wasn't able to find a workable solution. It would be really helpful
to lift this testing limitation.

1]
https://www.postgresql.org/message-id/flat/20190423.163949.36763221.horiguchi.kyotaro%40lab.ntt.co.jp#7fdeee86f3050df6315c04f5f6f93672
2]
https://www.postgresql.org/message-id/flat/CALfoeivGMTmCmSXRSWDf%3DujWS7L8QmoUoziv-A61f2R8DcmwiA%40mail.gmail.com#709b53c078ebe549cff2462c092a8f09


Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Mon, Apr 15, 2019 at 12:50 PM Peter Geoghegan  wrote:

> On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal 
> wrote:
> > Would like to know more specifics on this Peter. We may be having
> different context on hybrid row/column design.
>
> I'm confused about how close your idea of a TID is to the traditional
> definition from heapam (and even zheap). If it's a purely logical
> identifier, then why would it have two components like a TID? Is that
> just a short-term convenience or something?
>

TID is a purely logical identifier. Hence, I stated in the initial email that
for a Zedstore TID the block number / offset number split carries no meaning
at all. It's purely a 48-bit integer entity assigned to the datum of the first
column during insertion, based on where in the B-tree it gets inserted. The
rest of the column datums are inserted using this assigned TID value. Only the
system restrictions discussed by Heikki and Andres on the table AM thread
currently limit the values it can carry; from the zedstore design perspective
it is just an integer.



> > Yes, the plan to optimize out TID space per datum, either by prefix
> compression or delta compression or some other trick.
>
> It would be easier to do this if you knew for sure that the TID
> behaves almost the same as a bigserial column -- a purely logical
> monotonically increasing identifier. That's why I'm interested in what
> exactly you mean by TID, the stability of a TID value, etc. If a leaf
> page almost always stores a range covering no more than few hundred
> contiguous logical values,  you can justify aggressively compressing
> the representation in the B-Tree entries. Compression would still be
> based on prefix compression, but the representation itself can be
> specialized.
>

Yes, it is for sure a logically increasing number. With only inserts, the
numbers are monotonically increasing. With deletes and updates, inserts could
reuse previously freed TID values. Since the TID is logical, datums can easily
be moved around to split or merge pages as required.


Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Mon, Apr 15, 2019 at 11:18 AM Tomas Vondra 
wrote:

>
> Maybe. I'm not going to pretend I fully understand the internals. Does
> that mean the container contains ZSUncompressedBtreeItem as elements? Or
> just the plain Datum values?
>

First, your reading of code and all the comments/questions so far have been
highly encouraging. Thanks a lot for the same.

The container contains ZSUncompressedBtreeItem elements. Each item has to
store metadata like size, undo information, and such. We don't wish to
restrict compression to items from the same insertion session only. Hence,
no, it doesn't just store plain Datum values. We wish to treat these as
tuple-level operations, keep metadata per tuple, and be able to work at
tuple-level granularity rather than block level.

Definitely many more tricks can and need to be applied to optimize the
storage format; for example, for fixed-width columns there is no need to
store the size in every item. "Keep it simple" is the theme we have been
trying to maintain. Compression should ideally handle duplicate data pretty
easily and efficiently anyway, but we will try to optimize as much as we can
without relying on it.


Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Mon, Apr 15, 2019 at 10:33 AM Tomas Vondra 
wrote:

> On Mon, Apr 15, 2019 at 09:29:37AM -0700, Ashwin Agrawal wrote:
> >   On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra
> >wrote:
> >
> > On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
> > >On 11/04/2019 17:54, Tom Lane wrote:
> > >>Ashwin Agrawal  writes:
> > >>>Thank you for trying it out. Yes, noticed for certain patterns
> > pg_lzcompress() actually requires much larger output buffers. Like
> for
> > one 86 len source it required 2296 len output buffer. Current
> zedstore
> > code doesn’t handle this case and errors out. LZ4 for same patterns
> > works fine, would highly recommend using LZ4 only, as anyways speed
> is
> > very fast as well with it.
> > >>
> > >>You realize of course that *every* compression method has some
> inputs
> > that
> > >>it makes bigger.  If your code assumes that compression always
> > produces a
> > >>smaller string, that's a bug in your code, not the compression
> > algorithm.
> > >
> > >Of course. The code is not making that assumption, although clearly
> > >there is a bug there somewhere because it throws that error. It's
> > >early days..
> > >
> > >In practice it's easy to weasel out of that, by storing the data
> > >uncompressed, if compression would make it longer. Then you need an
> > >extra flag somewhere to indicate whether it's compressed or not. It
> > >doesn't break the theoretical limit because the actual stored length
> > >is then original length + 1 bit, but it's usually not hard to find a
> > >place for one extra bit.
> > >
> >
> > Don't we already have that flag, though? I see ZSCompressedBtreeItem
> has
> > t_flags, and there's ZSBT_COMPRESSED, but maybe it's more
> complicated.
> >
> >   The flag ZSBT_COMPRESSED differentiates between container (compressed)
> >   item and plain (uncompressed item). Current code is writtten such that
> >   within container (compressed) item, all the data is compressed. If need
> >   exists to store some part of uncompressed data inside container item,
> then
> >   this additional flag would be required to indicate the same. Hence its
> >   different than ZSBT_COMPRESSED. I am thinking one of the ways could be
> to
> >   just not store this datum in container item if can't be compressed and
> >   just store it as plain item with uncompressed data, this additional
> flag
> >   won't be required. Will know more once write code for this.
>
> I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
> means "this is a container". And then have another flag to track whether
> the container is compressed or not. But as I suggested elsewhere in this
> thread, I think it might be better to store some ID of the compression
> algorithm used instead of a simple flag.
>
> FWIW when I had to deal with incremental compression (adding data into
> already compressed buffers), which is what seems to be happening here, I
> found it very useful/efficient to allow partially compressed buffers and
> only trigger recompressin when absolutely needed.
>
> Applied to this case, the container would first store compressed chunk,
> followed by raw (uncompressed) data. Say, like this:
>
> ZSContainerData {
>
> // header etc.
>
> int nbytes; /* total bytes in data */
> int ncompressed;/* ncompressed <= nbytes, fully compressed when
>  * (ncompressed == nbytes) */
>
> char data[FLEXIBLE_ARRAY_MEMBER];
> }
>
> When adding a value to the buffer, it'd be simply appended to the data
> array. When the container would grow too much (can't fit on the page or
> something), recompression is triggered.
>

I think what you suggested here is exactly how it's handled currently; just
the mechanics are a little different. Plain items are added to the page as
insertions are performed. When the page becomes full, compression is
triggered and a container item is created to store the compressed data. New
insertions are then stored as plain items; once again, when the page becomes
full, they are compressed and a container item is created for them. So
already-compressed data is never compressed again. The on-page plain items
thus act as the raw data section you mentioned above. A page can have a mix
of N plain and N container items.


Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra 
wrote:

> On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
> >On 11/04/2019 17:54, Tom Lane wrote:
> >>Ashwin Agrawal  writes:
> >>>Thank you for trying it out. Yes, noticed for certain patterns
> pg_lzcompress() actually requires much larger output buffers. Like for one
> 86 len source it required 2296 len output buffer. Current zedstore code
> doesn’t handle this case and errors out. LZ4 for same patterns works fine,
> would highly recommend using LZ4 only, as anyways speed is very fast as
> well with it.
> >>
> >>You realize of course that *every* compression method has some inputs
> that
> >>it makes bigger.  If your code assumes that compression always produces a
> >>smaller string, that's a bug in your code, not the compression algorithm.
> >
> >Of course. The code is not making that assumption, although clearly
> >there is a bug there somewhere because it throws that error. It's
> >early days..
> >
> >In practice it's easy to weasel out of that, by storing the data
> >uncompressed, if compression would make it longer. Then you need an
> >extra flag somewhere to indicate whether it's compressed or not. It
> >doesn't break the theoretical limit because the actual stored length
> >is then original length + 1 bit, but it's usually not hard to find a
> >place for one extra bit.
> >
>
> Don't we already have that flag, though? I see ZSCompressedBtreeItem has
> t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.
>

The flag ZSBT_COMPRESSED differentiates between a container (compressed) item
and a plain (uncompressed) item. The current code is written such that within
a container (compressed) item, all the data is compressed. If a need exists to
store some part of the data uncompressed inside a container item, then this
additional flag would be required to indicate that; hence it is different from
ZSBT_COMPRESSED. I am thinking one way could be to just not store such a datum
in a container item if it can't be compressed, and instead store it as a plain
item with uncompressed data; then this additional flag won't be required. Will
know more once the code for this is written.


Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Sat, Apr 13, 2019 at 4:22 PM Peter Geoghegan  wrote:

> On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih 
> wrote:
> > Reading about it reminds me of this work -- TAG column storage(
> > http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf
> ).
> > Isn't this storage system inspired from there, with TID as the TAG?
> >
> > It is not referenced here so made me wonder.
>
> I don't think they're particularly similar, because that paper
> describes an architecture based on using purely logical row
> identifiers, which is not what a TID is. TID is a hybrid
> physical/logical identifier, sometimes called a "physiological"
> identifier, which will have significant overhead.


The storage system wasn't inspired by that paper, but yes, it seems it also
talks about laying out column data in B-trees, which is good to see. As
pointed out by Peter, the main aspect the paper focuses on, saving space for
the TAG, isn't something zedstore plans to leverage, it being more
restrictive. As discussed below, we can use other alternatives to save space.


> Ashwin said that
> ZedStore TIDs are logical identifiers, but I don't see how that's
> compatible with a hybrid row/column design (unless you map heap TID to
> logical row identifier using a separate B-Tree).
>

Would like to know more specifics on this, Peter. We may have different
context on the hybrid row/column design. When we said the design supports
hybrid row/column families, it did not mean within the same table. So inside
one table you cannot have some data stored row-oriented and some
column-oriented; for a given table, the structure will be homogeneous. But
the design can easily support storing all the columns together, a subset of
columns together, or each column separately, all connected by TID.


> The big idea with Graefe's TAG design is that there is practically no
> storage overhead for these logical identifiers, because each entry's
> identifier is calculated by adding its slot number to the page's
> tag/low key. The ZedStore design, in contrast, explicitly stores TID
> for every entry. ZedStore seems more flexible for that reason, but at
> the same time the per-datum overhead seems very high to me. Maybe
> prefix compression could help here, which a low key and high key can
> do rather well.
>

Yes, the plan is to optimize away the TID space per datum, either by prefix
compression, delta compression, or some other trick.


Re: finding changed blocks using WAL scanning

2019-04-11 Thread Ashwin Agrawal
On Thu, Apr 11, 2019 at 6:27 AM Robert Haas  wrote:

> On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
>  wrote:
> > I had in mind that you could have different overlapping incremental
> > backup jobs in existence at the same time.  Maybe a daily one to a
> > nearby disk and a weekly one to a faraway cloud.  Each one of these
> > would need a separate replication slot, so that the information that is
> > required for *that* incremental backup series is preserved between runs.
> >  So just one reserved replication slot that feeds the block summaries
> > wouldn't work.  Perhaps what would work is a flag on the replication
> > slot itself "keep block summaries for this slot".  Then when all the
> > slots with the block summary flag are past an LSN, you can clean up the
> > summaries before that LSN.
>
> I don't think that quite works.  There are two different LSNs.  One is
> the LSN of the oldest WAL archive that we need to keep around so that
> it can be summarized, and the other is the LSN of the oldest summary
> we need to keep around so it can be used for incremental backup
> purposes.  You can't keep both of those LSNs in the same slot.
> Furthermore, the LSN stored in the slot is defined as the amount of
> WAL we need to keep, not the amount of something else (summaries) that
> we need to keep.  Reusing that same field to mean something different
> sounds inadvisable.
>
> In other words, I think there are two problems which we need to
> clearly separate: one is retaining WAL so we can generate summaries,
> and the other is retaining summaries so we can generate incremental
> backups.  Even if we solve the second problem by using some kind of
> replication slot, we still need to solve the first problem somehow.
>

Just a thought on the first problem (it may not be simpler): could a
replication slot be enhanced to define X amount of WAL to retain, and after
reaching that limit, collect the summary and let the WAL be deleted?


Re: finding changed blocks using WAL scanning

2019-04-11 Thread Ashwin Agrawal
On Wed, Apr 10, 2019 at 2:50 PM Robert Haas  wrote:

> Over at
> https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com
> I mentioned parsing the WAL to extract block references so that
> incremental backup could efficiently determine which blocks needed to
> be copied.  Ashwin replied in
>
> http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com
> to mention that the same approach could be useful for pg_upgrade:
>

Thank you for initiating this separate thread. Just a typo above: not
pg_upgrade but pg_rewind.

Let me explain first the thought I have around how to leverage this for
pg_rewind, actually any type of incremental recovery to be exact. Would
love to hear thoughts on it.

Currently, incremental recovery of any form (a replica going down and coming
back up, or bringing back the old primary after failover to a replica)
requires *all* the WAL to be present from the point of disconnect. So it is
boolean in those terms: if the WAL is available, one can recover
incrementally; otherwise one has to perform a full basebackup. If we come up
with this mechanism to find and store the changed blocks from WAL, we can
provide an intermediate level of incremental recovery, which will be better
than full recovery.

WAL allows tuple-level granularity for recovery (if we ignore FPIs for a
moment). The modified-blocks information, when the WAL itself is no longer
available, provides block-level incremental recovery.

So pg_basebackup (or some other tool, or just an option to it) and pg_rewind
can leverage the changed blocks and perform the recovery when WAL can't be
retained due to space constraints.

pg_rewind can also be optimized: it currently copies blocks from source to
target that were present in the target's WAL for the rewind, and such blocks
could easily be skipped from being copied again.

Depending on the pattern and volume of changes in the WAL, instead of
replaying all the WAL for incremental recovery, just copying over the changed
blocks could prove more efficient.

It seems to me that there are basically two ways of storing this kind
> of information, plus a bunch of variants.  One way is to store files
> that cover a range of LSNs, and basically contain a synopsis of the
> WAL for those LSNs.  You omit all the actual data and just mention
> which blocks were changed by some record in that part of the WAL.  In
> this type of scheme, the storage required is roughly proportional to
> the volume of WAL for which you wish to retain data.  Pruning old data
> is easy; just remove the files that provide information about LSNs
> that you don't care about any more.  The other way is to store data
> about each block, or each range of blocks, or all the blocks that hash
> onto a certain slot; for each, store the newest LSN that has modified
> that block, or a block in that range, or a block that hashes onto that
> that slot.  In this system, storage is roughly proportional to the
> size of the database cluster, except maybe in the hashing case, but I
> *think* that case will degrade unless you basically expand the map to
> be roughly proportional to the size of the cluster anyway.  I may be
> wrong.
>
> Of these two variants, I am inclined to prefer the version where each
> file is a summary of the block references within some range of LSNs.
> It seems simpler to implement to me.  You just read a bunch of WAL
> files and then when you get tired you stop and emit an output file.
> You need to protect yourself against untimely crashes.  One way is to
> stick a checksum into the output file.  After you finish writing it,
> fsync() it before you start writing the next one.  After a restart,
> read the latest such file and see if the checksum is OK.  If not,
> regenerate it; if not, assume it's good and move on.  Files other than
> the last one can be assumed good.  Another way is to create the file
> with a temporary name, fsync() it, and then rename it into place and
> fsync() again.  The background worker that generates the files can
> have a GUC to remove them when they are older than some threshold
> amount of time, or you can keep them forever and let the user manually
> remove stuff they no longer want based on LSN.  That's pretty much it.
>

+1 for the first option. Seems simpler and more straightforward.

