Re: Allow logical replication to copy tables in binary format

2023-03-02 Thread Kuntal Ghosh
On Thu, Mar 2, 2023 at 10:30 AM Amit Kapila  wrote:
>
> > TBH I am not sure anymore if the complications justify the patch.
> >
> > It seems we have to choose from 2 bad choices:
> > - separate options = this works but would be more confusing for the user
> > - unified option = this would be simpler and faster, but risks
> > breaking existing applications currently using 'binary=true'
> >
>
> I would prefer a unified option as apart from other things you and
> others mentioned that will be less of a maintenance burden in the
> future.
>
+1
When someone sets binary=true while creating a subscription, the
expectation is that data transfer will happen in binary mode whenever
binary in/out functions are available. With the current
implementation, that's not happening in the table-sync phase, so it
makes sense to fix that behaviour in a major version release.
Existing applications that are using (or unknowingly misusing) the
feature have a workaround, as Amit mentioned.


-- 
Thanks & Regards,
Kuntal Ghosh




Re: Avoid multiple SetLatch() calls in procsignal_sigusr1_handler()

2023-03-01 Thread Kuntal Ghosh
On Tue, Feb 28, 2023 at 9:01 PM Bharath Rupireddy wrote:
>
> Most of the multiplexed SIGUSR1 handlers are setting latch explicitly
> when the procsignal_sigusr1_handler() can do that for them at the end.
> These multiplexed handlers are currently being used as SIGUSR1
> handlers, not as independent handlers, so no problem if SetLatch() is
> removed from them. A few others do it right by saying /* latch will be
> set by procsignal_sigusr1_handler */. Although, calling SetLatch() in
> quick succession does no harm (it just returns if the latch was
> previously set), it seems unnecessary.
>
+1
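
For illustration, a minimal sketch of the flow being described; the
handler name and signal reason below are placeholders, and details of
the real handler (errno save/restore, etc.) are omitted:

    static volatile sig_atomic_t somethingPending = false;

    /* A multiplexed handler only needs to record its reason ... */
    static void
    HandleSomethingInterrupt(void)
    {
        somethingPending = true;
        /* no SetLatch(MyLatch) needed here */
    }

    /* ... because the SIGUSR1 handler sets the latch once, at the end. */
    void
    procsignal_sigusr1_handler(SIGNAL_ARGS)
    {
        if (CheckProcSignal(PROCSIG_SOMETHING))   /* placeholder reason */
            HandleSomethingInterrupt();

        /* ... other multiplexed reasons ... */

        SetLatch(MyLatch);    /* a single latch set covers all of them */
    }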



-- 
Thanks & Regards,
Kuntal Ghosh




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-28 Thread Kuntal Ghosh
On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy wrote:
>
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */
s/fetched WAL buffers/fetched from WAL buffers


+ else if (nread < nbytes)
+ {
+ /*
+ * We read some of the requested bytes. Continue to read remaining
+ * bytes.
+ */
+ ptr += nread;
+ nbytes -= nread;
+ dst += nread;
+ *read_bytes += nread;
+ }

The 'if' condition should always be true. You can replace it with an
assertion instead.
s/Continue to read remaining/Continue to read the remaining
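
For what it's worth, a sketch of what that restructuring could look
like, reusing the variable names from the quoted hunk:

    else
    {
        /*
         * We read some of the requested bytes. Continue to read the
         * remaining bytes.
         */
        Assert(nread < nbytes);

        ptr += nread;
        nbytes -= nread;
        dst += nread;
        *read_bytes += nread;
    }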

The good thing about this patch is that it reduces read IO calls
without noticeably impacting write performance. It also takes us a
step closer to the enhancements mentioned in the thread. If
performance is a concern, we can introduce a GUC to enable/disable
this feature.

-- 
Thanks & Regards,
Kuntal Ghosh




Re: Invalid memory access in pg_stat_get_subscription

2022-06-08 Thread Kuntal Ghosh
Hello Tom,

On Wed, Jun 8, 2022 at 12:44 AM Tom Lane  wrote:
>
> Kuntal Ghosh  writes:
> > While exploring some code in the logical replication worker
> > implementation, I noticed that we're accessing invalid memory while
> > traversing LogicalRepCtx->workers[i].
> > For that array, we allocate room for max_logical_replication_workers
> > LogicalRepWorker entries in ApplyLauncherShmemSize. But in the for
> > loop, we access the (max_logical_replication_workers + 1)-th
> > location, which results in random crashes.
>
> I concur that that's a bug, but eyeing the code, it seems like an
> actual crash would be improbable.  Have you seen one?  Can you
> reproduce it?
Thank you for looking into it. Unfortunately, I'm not able to
reproduce the crash, but I've seen one crash while executing the
function. The crash occurred at the following line:
> if (!worker.proc || !IsBackendPid(worker.proc->pid))
(gdb) p worker.proc
$6 = (PGPROC *) 0x2bf0b9
The PGPROC structure was pointing to an invalid memory location.

-- 
Thanks & Regards,
Kuntal Ghosh




Invalid memory access in pg_stat_get_subscription

2022-06-07 Thread Kuntal Ghosh
Hello hackers,

While exploring some code in the logical replication worker
implementation, I noticed that we're accessing invalid memory while
traversing LogicalRepCtx->workers[i].
For that array, we allocate room for max_logical_replication_workers
LogicalRepWorker entries in ApplyLauncherShmemSize. But in the for
loop, we access the (max_logical_replication_workers + 1)-th location,
which results in random crashes.

Please find attached a patch that fixes the issue. I'm not sure
whether we should add a regression test for it.
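
For reference, a sketch of the kind of change the patch makes (my
paraphrase; the attached patch is authoritative):

    for (i = 0; i < max_logical_replication_workers; i++)   /* was: i <= ... */
    {
        /*
         * LogicalRepCtx->workers[] has exactly
         * max_logical_replication_workers slots (see
         * ApplyLauncherShmemSize), so the old "<=" bound read one
         * LogicalRepWorker past the end of the shared array.
         */

        /* ... examine LogicalRepCtx->workers[i] as before ... */
    }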

-- 
Thanks & Regards,
Kuntal Ghosh


0001-Fix-terminating-condition-while-fetching-the-replica.patch
Description: Binary data


Re: Add connection active, idle time to pg_stat_activity

2021-11-29 Thread Kuntal Ghosh
On Fri, Oct 22, 2021 at 1:53 PM Rafia Sabih  wrote:
> To provide this information I was digging into how the statistics
> collector is working and found out there is already information like
> total time that a connection is active as well as idle computed in
> pgstat_report_activity[1]. Ideally, this would be the values we would
> like to see per process in pg_stat_activity.
It's definitely useful to know how much time a backend has spent
executing queries. Once you have this, you can easily calculate the
idle time as (now() - backend_start) - active_time. But I'm wondering
why you need to distinguish between idle and idle-in-transaction -
what's the use case? Either the backend is doing some work or it sits
idle. Another useful piece of information would be when the last query
execution ended. From that, you can figure out whether a backend has
been idle for a long time since the last execution, as well as the
execution time of the last query (query_end - query_start).

You also need to update the documentation.

-- 
Thanks & Regards,
Kuntal Ghosh




Re: [BUG]Update Toast data failure in logical replication

2021-06-02 Thread Kuntal Ghosh
On Wed, Jun 2, 2021 at 3:10 PM Dilip Kumar  wrote:
>
> Yes, you are right.  I will change it in the next version, along with
> the test case.
>
+/*
+ * if the key hasn't changed and we're only logging the key, we're done.
+ * But if tuple has external data then we might have to detoast the key.
+ */
This doesn't really mention why we need to detoast the key even when
the key remains the same. I guess we can add some more details here.

-- 
Thanks & Regards,
Kuntal Ghosh




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Kuntal Ghosh
On Thu, Feb 18, 2021 at 9:32 PM Jan Wieck  wrote:
>
And, here is how it looks with the following configuration:
telnet_srv.port = 1433
telnet_srv.listen_addresses = '*'

telnet localhost 1433


   master
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
PostgreSQL Telnet Interface

database name: postgres
username: kuntal
password: changeme
> select 1;
?column?

1

SELECT 1
> select 1/0;
Message: ERROR - division by zero

A few comments on the extension code (although it's experimental):

1. In telnet_srv.c,
+ static int    pe_port;
..
+   DefineCustomIntVariable("telnet_srv.port",
+   "Telnet server port.",
+   NULL,
+   &pe_port,
+   pe_port,
+   1024,
+   65536,
+   PGC_POSTMASTER,
+   0,
+   NULL,
+   NULL,
+   NULL);

The variable pe_port should be initialized to a value within the
declared bounds (1024 to 65536). Otherwise, the following assert will
fail:
TRAP: FailedAssertion("newval >= conf->min", File: "guc.c", Line:
5541, PID: 12100)
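
For example, something along these lines would keep the boot value
inside the declared bounds (1433 just mirrors the port used in the
session above):

    static int  pe_port = 1433;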

2. The function pq_putbytes shouldn't be used by anyone other than
old-style COPY out.
+   pq_putbytes(msg, strlen(msg));
Otherwise, the following assert will fail in the same function:
/* Should only be called by old-style COPY OUT */
Assert(DoingCopyOut);

-- 
Thanks & Regards,
Kuntal Ghosh
Amazon Web Services




Re: Incorrect assumption in heap_prepare_freeze_tuple

2020-10-04 Thread Kuntal Ghosh
On Sun, Oct 4, 2020 at 12:33 AM Andres Freund  wrote:
>
> To get to this point heap_page_prune() has to have been called for the
> page. That removes all tuple [versions] that are DEAD. But not
> RECENTLY_DEAD. But RECENTLY_DEAD can only happen for tuples that are
> newer than OldestXmin. Thus the only tuples that the HTSV() we're
> talking about can return DEAD for are ones that were RECENTLY_DEAD
> in heap_page_prune().
>
Got it. Thank you for the explanations. :-)



-- 
Thanks & Regards,
Kuntal Ghosh




Re: Incorrect assumption in heap_prepare_freeze_tuple

2020-10-03 Thread Kuntal Ghosh
On Sat, Oct 3, 2020 at 1:06 PM Andres Freund  wrote:
> On 2020-10-03 12:58:01 +0530, Kuntal Ghosh wrote:
> > IIUC, there can be a HOT-updated tuple which is not initially pruned
> > by heap_page_prune but later diagnosed HEAPTUPLE_DEAD by
> > HeapTupleSatisfiesVacuum (Since OldestXmin can be updated by the time
> > we call HeapTupleSatisfiesVacuum and xmax becomes older than
> > OldestXmin).
>
> Hm? OldestXmin is constant over the course of vacuum, no?
>
Yeah, it's constant. And, my understanding was completely wrong.

Actually, I misunderstood the following commit message:

commit dd69597988859c51131e0cbff3e30432db4259e1
Date:   Thu May 2 10:07:13 2019 -0400

Fix some problems with VACUUM (INDEX_CLEANUP FALSE)

...
Change the logic for the case where a tuple is not initially pruned
by heap_page_prune but later diagnosed HEAPTUPLE_DEAD by
HeapTupleSatisfiesVacuum.
.
I thought the reason was that OldestXmin would be updated in between,
but that's just a stupid assumption I made. :-(

I still need to understand the scenario that the above commit is
referring to. If such tuples can't have a committed xmax older than
OldestXmin/cutoff_xid, then we're good. Otherwise, such tuples can
create problems if we're performing vacuum with index cleanup
disabled: if index cleanup is disabled, we set tupgone to false even
though the status of the tuple is HEAPTUPLE_DEAD, and we try to freeze
those tuples later.

You've also mentioned that the HEAPTUPLE_DEAD case I'm referring to
can only be hit for tuples that are *newer* than OldestXmin but become
DEAD (instead of RECENTLY_DEAD) because the inserting transaction
aborted. But I don't see that this is the only case where
HeapTupleSatisfiesVacuum returns HEAPTUPLE_DEAD. If
HeapTupleSatisfiesVacuumHorizon returns HEAPTUPLE_RECENTLY_DEAD and
the tuple's xmax (dead_after) precedes OldestXmin, we set it to
HEAPTUPLE_DEAD.

res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);

if (res == HEAPTUPLE_RECENTLY_DEAD)
{
    Assert(TransactionIdIsValid(dead_after));

    if (TransactionIdPrecedes(dead_after, OldestXmin))
        res = HEAPTUPLE_DEAD;
}
else
    Assert(!TransactionIdIsValid(dead_after));

Am I missing something here?


--
Thanks & Regards,
Kuntal Ghosh




Re: Incorrect assumption in heap_prepare_freeze_tuple

2020-10-03 Thread Kuntal Ghosh
On Sat, Oct 3, 2020 at 12:05 AM Andres Freund  wrote:
Thank you for your quick response and detailed explanation.

>
> >  * It is assumed that the caller has checked the tuple with
> >  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
> >  * (else we should be removing the tuple, not freezing it).
> >
> > Thus, when we see a committed xmax that precedes the cutoff_xid, we throw
> > the following data corruption error:
> > errmsg_internal("cannot freeze committed xmax %u", xid)
> >
> > However, in the caller (lazy_scan_heap), HeapTupleSatisfiesVacuum may
> > return HEAPTUPLE_DEAD for an updated/deleted tuple that got modified by a
> > transaction older than OldestXmin. And, if the tuple is HOT-updated, it
> > should only be removed by a hot-chain prune operation. So, we treat the
> > tuple as RECENTLY_DEAD and don't remove the tuple.
>
> This code is so terrible :(
>
> We really should just merge the HOT pruning and lazy_scan_heap()
> removal/freeze operations. That'd avoid this corner case and *also*
> would significantly reduce the WAL volume of VACUUM. And safe a good bit
> of CPU time.
>
+1

>
> > So, it may lead to an incorrect data corruption error. IIUC, following will
> > be the exact scenario where the error may happen,
> >
> > An updated/deleted tuple whose xmax is in between cutoff_xid and
> > OldestXmin. Since cutoff_xid depends on vacuum_freeze_min_age and
> > autovacuum_freeze_max_age, it'll not be encountered easily. But, I think
> > it can be reproduced with some xid burner patch.
>
> I don't think this case is possible (*). By definition, there cannot be any
> transactions needing tuples from before OldestXmin. Which means that the
> heap_page_prune() earlier in lazy_scan_heap() would have pruned away a
> DEAD tuple version that is part of a hot chain.
>
> The HEAPTUPLE_DEAD branch you're referring to really can only be hit for
> tuples that are *newer* than OldestXmin but become DEAD (instead of
> RECENTLY_DEAD) because the inserting transaction aborted.
>
>
> (*) with the exception of temp tables due to some recent changes, I am
> currently working on a fix for that.
>
IIUC, there can be a HOT-updated tuple which is not initially pruned
by heap_page_prune but later diagnosed HEAPTUPLE_DEAD by
HeapTupleSatisfiesVacuum (Since OldestXmin can be updated by the time
we call HeapTupleSatisfiesVacuum and xmax becomes older than
OldestXmin). So, there can be deleted tuples with xmax older than
OldestXmin. But, you're right that any tuple older than cutoff_xid
(since it was set earlier) will be pruned by heap_page_prune and hence
we shouldn't encounter the error. I'll study the rewrite_heap_tuple
path as well.

> What made you look at this? Did you hit the error?
Nope, I haven't encountered the error. Just trying to understand the code. :-)

-- 
Thanks & Regards,
Kuntal Ghosh




Incorrect assumption in heap_prepare_freeze_tuple

2020-10-02 Thread Kuntal Ghosh
Hello hackers,

In heap_prepare_freeze_tuple, we make the following assumption:

 * It is assumed that the caller has checked the tuple with
 * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
 * (else we should be removing the tuple, not freezing it).

Thus, when we see a committed xmax that precedes the cutoff_xid, we throw
the following data corruption error:
errmsg_internal("cannot freeze committed xmax %u", xid)

However, in the caller (lazy_scan_heap), HeapTupleSatisfiesVacuum may
return HEAPTUPLE_DEAD for an updated/deleted tuple that got modified by a
transaction older than OldestXmin. And, if the tuple is HOT-updated, it
should only be removed by a hot-chain prune operation. So, we treat the
tuple as RECENTLY_DEAD and don't remove the tuple.

So, it may lead to an incorrect data corruption error. IIUC, the
following is the exact scenario where the error may happen:

An updated/deleted tuple whose xmax is in between cutoff_xid and
OldestXmin. Since cutoff_xid depends on vacuum_freeze_min_age and
autovacuum_freeze_max_age, it'll not be encountered easily. But, I
think it can be reproduced with some xid burner patch.

I think the fix should be something like the following:
if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
-   TransactionIdDidCommit(xid))
+   TransactionIdDidCommit(xid) &&
+   !HeapTupleHeaderIsHotUpdated(tuple))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
 errmsg_internal("cannot freeze committed xmax %u",
 xid)));
-   freeze_xmax = true;
+
+   freeze_xmax = HeapTupleHeaderIsHotUpdated(tuple) ? false : true;

Attached a patch for the same. Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh


0001-Fix-sanity-check-for-HOT-updated-tuple-when-freezing.patch
Description: Binary data


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-16 Thread Kuntal Ghosh
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar  wrote:
>

A few review comments on 0006-Add-support-for-streaming*.patch:

+ subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
lseek can return a negative value in case of error, right?

+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ *
+ * Don't check for error from mkdir; it could fail if the directory
+ * already exists (maybe someone else just did the same thing).  If
+ * it doesn't work then we'll bomb out when opening the file
+ */
+ mkdir(tempdirpath, S_IRWXU);
If that's the only reason, perhaps we can use something like the following:

if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
throw error;

+
+ CloseTransientFile(stream_fd);
This might fail to close the file. We should handle that case.
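
For illustration, the kind of error handling being suggested might look
roughly like this (the message wording is mine, not from the patch):

    if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create directory \"%s\": %m", tempdirpath)));

    /* ... and later, when the file is no longer needed ... */
    if (CloseTransientFile(stream_fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file: %m")));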

Also, I think we need some implementation in dumpSubscription() to
dump the (streaming = 'on') option.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-04-15 Thread Kuntal Ghosh
On Wed, Apr 15, 2020 at 10:45 PM Andres Freund  wrote:
>
> Hi,
>
> On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> > I was thinking from this point of view - the sooner we introduce
> > parallelism in the process, the greater the benefits.
>
> I don't really agree. Sure, that's true from a theoretical perspective,
> but the incremental gains may be very small, and the cost in complexity
> very high. If we can get single threaded splitting of rows to be >4GB/s,
> which should very well be attainable, the rest of the COPY work is going
> to dominate the time.  We shouldn't add complexity to parallelize more
> of the line splitting, caring too much about scalable datastructures,
> etc when the bottleneck after some straightforward optimization is
> usually still in the parallelized part.
>
> I'd expect that for now we'd likely hit scalability issues in other
> parts of the system first (e.g. extension locks, buffer mapping).
>
Got your point. In this particular case, a single producer is fast
enough (or probably we can make it fast enough) to generate enough
chunks for multiple consumers so that they don't stay idle and wait
for work.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-04-15 Thread Kuntal Ghosh
On Wed, Apr 15, 2020 at 2:15 PM Ants Aasma  wrote:
>
> On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh  wrote:
> > 1. Each worker scans a distinct fixed sized chunk of the CSV file and
> > collects the following three stats from the chunk:
> > a) number of quotes
> > b) position of the first new line after even number of quotes
> > c) position of the first new line after odd number of quotes
> > 2. Once stats from all the chunks are collected, the leader identifies
> > the adjusted chunk boundaries by iterating over the stats linearly:
> > - For the k-th chunk, the leader adds the number of quotes in k-1 chunks.
> > - If the number is even, then the k-th chunk does not start in the
> > middle of a quoted field, and the first newline after an even number
> > of quotes (the second collected information) is the first record
> > delimiter in this chunk.
> > - Otherwise, if the number is odd, the first newline after an odd
> > number of quotes (the third collected information) is the first record
> > delimiter.
> > - The end position of the adjusted chunk is obtained based on the
> > starting position of the next adjusted chunk.
>
> The trouble is that, at least with current coding, the number of
> quotes in a chunk can depend on whether the chunk started in a quote
> or not. That's because escape characters only count inside quotes. See
> for example the following csv:
>
> foo,\"bar
> baz",\"xyz"
>
> This currently parses as one line and the number of parsed quotes
> doesn't change if you add a quote in front.
>
> But the general approach of doing the tokenization in parallel and
> then a serial pass over the tokenization would still work. The quote
> counting and new line finding just has to be done for both starting in
> quote and not starting in quote case.
>
Yeah, right.

> Using phases doesn't look like the correct approach - the tokenization
> can be prepared just in time for the serial pass and processing the
> chunk can proceed immediately after. This could all be done by having
> the data in a single ringbuffer with a processing pipeline where one
> process does the reading, then workers grab tokenization chunks as
> they become available, then one process handles determining the chunk
> boundaries, after which the chunks are processed.
>
I was thinking from this point of view - the sooner we introduce
parallelism in the process, the greater the benefits. Probably there
isn't any way to avoid a single pass over the data (phase 2 in the
above case) to tokenise the chunks. So yeah, if the reading and
tokenisation phase doesn't take much time, parallelising it will just
be overkill. As pointed out by Andres and you, using a lock-free
circular buffer implementation sounds like the way to go forward.
AFAIK, a CAS-based FIFO circular queue suffers from two problems - 1.
(as pointed out by you) slow workers may block producers; 2. since it
doesn't partition the queue among the workers, it does not achieve
good locality and cache-friendliness, which limits scalability on
NUMA systems.

> But I still don't think this is something to worry about for the first
> version. Just a better line splitting algorithm should go a looong way
> in feeding a large number of workers, even when inserting to an
> unindexed unlogged table. If we get the SIMD line splitting in, it
> will be enough to overwhelm most I/O subsystems available today.
>
Yeah. Parsing text is a great use case for data parallelism, which can
be achieved with SIMD instructions. Consider processing 8-bit ASCII
characters in a 512-bit SIMD word. A lot of code and complexity from
CopyReadLineText would surely go away. And further (I'm not sure on
this point), if we can use the schema of the table, perhaps JIT can
generate machine code to read fields efficiently based on their
types.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-04-14 Thread Kuntal Ghosh
Hello,

I was going through some literature on parsing CSV files in a fully
parallelized way and found (from [1]) an interesting approach
implemented in the open-source project ParaText [2]. The algorithm
follows a two-phase approach: the first pass identifies the adjusted
chunks in parallel by exploiting the simplicity of the CSV format, and
in the second phase the complete records within each adjusted chunk
are processed by one of the available workers. Here is the sketch:

1. Each worker scans a distinct fixed sized chunk of the CSV file and
collects the following three stats from the chunk:
a) number of quotes
b) position of the first new line after even number of quotes
c) position of the first new line after odd number of quotes
2. Once stats from all the chunks are collected, the leader identifies
the adjusted chunk boundaries by iterating over the stats linearly
(see the sketch after this list):
- For the k-th chunk, the leader adds up the number of quotes in the
preceding k-1 chunks.
- If the number is even, then the k-th chunk does not start in the
middle of a quoted field, and the first newline after an even number
of quotes (the second collected information) is the first record
delimiter in this chunk.
- Otherwise, if the number is odd, the first newline after an odd
number of quotes (the third collected information) is the first record
delimiter.
- The end position of the adjusted chunk is obtained based on the
starting position of the next adjusted chunk.
3. Once the boundaries of the chunks are determined (forming adjusted
chunks), individual worker may take up one adjusted chunk and process
the tuples independently.
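
To make step 2 concrete, here is a minimal C sketch of the serial
boundary-adjustment pass, assuming the per-chunk stats above have
already been gathered in parallel; the struct and function names are
purely illustrative, not taken from ParaText or PostgreSQL:

    #include <stdint.h>

    typedef struct ChunkStats
    {
        uint64_t    quote_count;    /* (a) number of quotes in the chunk */
        int64_t     nl_after_even;  /* (b) offset of first newline after an even number of quotes */
        int64_t     nl_after_odd;   /* (c) offset of first newline after an odd number of quotes */
    } ChunkStats;

    /*
     * Compute the adjusted start of every chunk, i.e. the byte just after
     * its first record delimiter.  Chunk k ends where chunk k+1 starts;
     * the last chunk ends at EOF.
     */
    static void
    adjust_chunk_boundaries(const ChunkStats *stats, int nchunks,
                            int64_t chunk_size, int64_t *adjusted_start)
    {
        uint64_t    quotes_so_far = 0;  /* quotes seen in chunks 0 .. k-1 */

        for (int k = 0; k < nchunks; k++)
        {
            int64_t     base = (int64_t) k * chunk_size;

            if (k == 0)
                adjusted_start[k] = 0;  /* the file starts at a record boundary */
            else if (quotes_so_far % 2 == 0)
                adjusted_start[k] = base + stats[k].nl_after_even + 1;
            else
                adjusted_start[k] = base + stats[k].nl_after_odd + 1;

            quotes_so_far += stats[k].quote_count;
        }
    }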

Although this approach parses the CSV in parallel, it requires two
scans of the CSV file. So, on a system with a spinning hard disk and a
small amount of RAM, as per my understanding, the algorithm will
perform very poorly. But if we use this algorithm to parse a CSV file
on a multi-core system with a large amount of RAM, the performance
might improve significantly [1].

Hence, I was trying to think whether we can leverage this idea for
implementing parallel COPY in PG. We can design an algorithm similar
to parallel hash-join where the workers pass through different phases.
1. Phase 1 - Read fixed size chunks in parallel, store the chunks and
the small stats about each chunk in the shared memory. If the shared
memory is full, go to phase 2.
2. Phase 2 - Allow a single worker to process the stats and decide the
actual chunk boundaries so that no tuple spans across two different
chunks. Go to phase 3.
3. Phase 3 - Each worker picks one adjusted chunk, parse and process
tuples from the same. Once done with one chunk, it picks the next one
and so on.
4. If there are still some unread contents, go back to phase 1.

We can probably use separate workers for phase 1 and phase 3 so that
they can work concurrently.

Advantages:
1. Each worker spends a significant amount of time in each phase, so
it gets the benefit of the instruction cache - at least in phase 1.
2. It also has the same advantage as parallel hash join - fast workers
get to do more work.
3. We can extend this solution to read data from STDIN. Of course,
phase 1 and phase 2 must then be performed by the leader process,
which can read from the socket.

Disadvantages:
1. Surely doesn't work if we don't have enough shared memory.
2. Probably, this approach is just impractical for PG due to certain
limitations.

Thoughts?

[1] 
https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf
[2] ParaText. https://github.com/wiseio/paratext.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-13 Thread Kuntal Ghosh
e > 0);
+ Assert(rb->size >= txn->size);
The same three assertions are already there in ReorderBufferLargestTopTXN().

+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ return ctx->streaming;
+}
Potential inline function.

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id;
Here also, do we need to declare these two variables as volatile?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-13 Thread Kuntal Ghosh
On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar  wrote:
> On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh wrote:
> >
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > This looks wrong. We should change the name of this Macro or we can
> > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
>
> I think this is in sync with the below code (SizeOfXlogOrigin), so it
> doesn't make much sense to add different terminology, no?
> #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
>
In that case, we can rename this to, for example, SizeOfXLogTransactionId.

Some review comments on 0002-Issue-individual-*.patch:

+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);

It seems we don't call the function if xid is not valid. In fact,

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
  }
  case XLOG_XACT_ASSIGNMENT:
  break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record for such cases?

+ * When wal_level=logical, write invalidations into WAL at each command end to
+ *  support the decoding of the in-progress transaction.  As of now it was
+ *  enough to log invalidation only at commit because we are only decoding the
+ *  transaction at the commit time.   We only need to log the catalog cache and
+ *  relcache invalidation.  There can not be any active MVCC scan in logical
+ *  decoding so we don't need to log the snapshot invalidation.
The alignment is not right.

 /*
  * CommandEndInvalidationMessages
- * Process queued-up invalidation messages at end of one command
- * in a transaction.
+ *  Process queued-up invalidation messages at end of one command
+ *  in a transaction.
Looks unnecessary changes.

  * Note:
- * This should be called during CommandCounterIncrement(),
- * after we have advanced the command ID.
+ *  This should be called during CommandCounterIncrement(),
+ *  after we have advanced the command ID.
  */
Looks unnecessary changes.

  if (transInvalInfo == NULL)
- return;
+ return;
Looks unnecessary changes.

+ /* prepare record */
+ memset(&xlrec, 0, sizeof(xlrec));
We should use MinSizeOfXactInvalidations, no?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-13 Thread Kuntal Ghosh
On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar  wrote:
>
> I have rebased the patch on the latest head.  I haven't yet changed
> anything for xid assignment thing because it is not yet concluded.
>
Some review comments on 0001-Immediately-WAL-log-*.patch:

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, each of these if conditions will only be evaluated if the
previous conditions are true. So, we could maintain some state machine
to avoid evaluating a condition multiple times inside a transaction.
But if the overhead is not much, it's probably not worth it, I guess.

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
  int i;

+ /* reset the subxact assignment flag (if needed) */
+ if (curinsert_flags & XLOG_INCLUDE_XID)
+ MarkSubTransactionAssigned();
The comment looks contradictory.

 XLogSetRecordFlags(uint8 flags)
 {
  Assert(begininsert_called);
- curinsert_flags = flags;
+ curinsert_flags |= flags;
 }
 I didn't understand why we need this change in this patch.

+ txid = XLogRecGetTopXid(record);
+
+ /*
+ * If the toplevel_xid is valid, we need to assign the subxact to the
+ * toplevel transaction. We need to do this for all records, hence we
+ * do it before the switch.
+ */
s/toplevel_xid/toplevel xid or s/toplevel_xid/txid

  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(r->toplevel_xid))
Perhaps, XLogRecGetTopXid() can be used.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-08 Thread Kuntal Ghosh
Hello Ashutosh, Fujita,

On Wed, Apr 8, 2020 at 3:49 PM Ashutosh Bapat wrote:
> On Wed, 8 Apr 2020 at 15:42, Etsuro Fujita  wrote:
>> On Wed, Apr 8, 2020 at 4:30 PM Kuntal Ghosh wrote:
>> > I'm getting the following warning during compilation.
>> >
>> > partbounds.c: In function ‘partition_bounds_merge’:
>> > partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ 
>> > [-Wunused-variable]
>> >   PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
>> >  ^
>> > For fixing the same, we can declare inner_binfo as
>> > PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.
>>
>> I'd propose to remove an assertion causing this (and  the
>> outer_binfo/inner_binfo variables) from partition_bounds_merge(),
>> rather than doing so, because the assertion is redundant, as we have
>> the same assertion in merge_list_bounds() and merge_range_bounds().
>> Please find attached a patch.
>
>
> I think it's better to have the assertion in all the three places and also in 
> merge_hash_bounds() whenever that comes along. The assertion in 
> merge_*_bounds() will be good to in case those functions are called from 
> places other than partition_bounds_merge(). The assertion in 
> partition_bounds_merge() will make sure that when the individual 
> merge_*_bounds() functions are called based on one of the bounds both of the 
> bounds have same strategy.

Both of your patches fix the problem. I don't have much exposure to
this area, so I can't comment on whether we should keep or remove the
assertion. But here is my opinion:

The code structure looks like the following:
Assert(condition A);
if (Condition B)
merge_*_bounds();

Inside merge_*_bounds(), you have both the above assert and the if
condition as another assert:
Assert(condition A and Condition B);

And merge_*_bounds() is called from only one place. So, something is
redundant here, and I'm inclined towards removing the assertion.
Another thing I noticed:

/* The partitioning strategies should be the same. */
Assert(outer_binfo->strategy == inner_binfo->strategy);

The comment just reads the assertion aloud which looks unnecessary.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-08 Thread Kuntal Ghosh
Hi,

On Wed, Apr 8, 2020 at 7:07 AM Etsuro Fujita  wrote:
>
> Pushed after modifying some comments further, based on the suggestions
> of Ashutosh.
I'm getting the following warning during compilation.

partbounds.c: In function ‘partition_bounds_merge’:
partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
  PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
 ^
To fix this, we can declare inner_binfo as PG_USED_FOR_ASSERTS_ONLY,
as it is not used for any other purpose.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU statistics

2020-04-07 Thread Kuntal Ghosh
Hello Tomas,

On Thu, Apr 2, 2020 at 5:59 PM Tomas Vondra wrote:
> Thank you for the patch, pushed.
>
In SimpleLruReadPage_ReadOnly, we first try to find the SLRU page in
the shared buffers under a shared lock, and then conditionally call
SimpleLruReadPage if a read is necessary. IMHO, we should update
hit_count when we find the buffer directly in
SimpleLruReadPage_ReadOnly. Am I missing something?

Attached a patch for the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Update-stats-in-SimpleLruReadPage_ReadOnly.patch
Description: Binary data


Re: WAL usage calculation patch

2020-03-31 Thread Kuntal Ghosh
On Tue, Mar 31, 2020 at 7:39 PM Julien Rouhaud  wrote:
>
> On Tue, Mar 31, 2020 at 12:21 PM Kuntal Ghosh wrote:
> >
> > On Mon, Mar 30, 2020 at 6:14 PM Julien Rouhaud  wrote:
> > >
> > @@ -448,6 +449,7 @@ XLogInsert(RmgrId rmid, uint8 info)
> >   bool doPageWrites;
> >   XLogRecPtr fpw_lsn;
> >   XLogRecData *rdt;
> > + int num_fpw = 0;
> >
> >   /*
> >   * Get values needed to decide whether to do full-page writes. Since
> > @@ -457,9 +459,9 @@ XLogInsert(RmgrId rmid, uint8 info)
> >   GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
> >
> >   rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
> > - &fpw_lsn);
> > + &fpw_lsn, &num_fpw);
> >
> > - EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
> > + EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpw);
> >   } while (EndPos == InvalidXLogRecPtr);
> >
> > I think there are some issues in the num_fpw calculation. For some
> > cases, we have to return from XLogInsert without inserting a record.
> > Basically, we've to recompute/reassemble the same record. In those
> > cases, num_fpw should be reset. Thoughts?
>
> Mmm, yes, but since the same record is being recomputed from the
> same RedoRecPtr, doesn't it mean that we need to reset the counter?
> Otherwise we would count the same FPW multiple times.

Yes. That was my point as well. I missed that you're already
resetting it inside the do-while loop before calling
XLogRecordAssemble. Sorry for the noise.




Re: WAL usage calculation patch

2020-03-31 Thread Kuntal Ghosh
On Mon, Mar 30, 2020 at 6:14 PM Julien Rouhaud  wrote:
>
@@ -448,6 +449,7 @@ XLogInsert(RmgrId rmid, uint8 info)
  bool doPageWrites;
  XLogRecPtr fpw_lsn;
  XLogRecData *rdt;
+ int num_fpw = 0;

  /*
  * Get values needed to decide whether to do full-page writes. Since
@@ -457,9 +459,9 @@ XLogInsert(RmgrId rmid, uint8 info)
  GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);

  rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
- &fpw_lsn);
+ &fpw_lsn, &num_fpw);

- EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
+ EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpw);
  } while (EndPos == InvalidXLogRecPtr);

I think there are some issues in the num_fpw calculation. In some
cases, we have to return from XLogInsert without inserting a record;
basically, we have to recompute/reassemble the same record. In those
cases, num_fpw should be reset. Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-15 Thread Kuntal Ghosh
On Mon, Mar 16, 2020 at 9:43 AM Dilip Kumar  wrote:
> On Mon, Mar 16, 2020 at 8:57 AM Masahiko Sawada wrote:
> > IsRelationExtensionLockHeld and IsPageLockHeld are used only when
> > assertion is enabled. So how about making CheckAndSetLockHeld work
> > only if USE_ASSERT_CHECKING to avoid overheads?
>
> That makes sense to me so updated the patch.
+1

In v10-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-.patch,

+ * Indicate that the lock is released for a particular type of locks.
s/lock is/locks are

+ /* Indicate that the lock is acquired for a certain type of locks. */
s/lock is/locks are

In v10-0002-*.patch,

+ * Flag to indicate if the page lock is held by this backend.  We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension.  However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
s/while holding the page lock except for relation extension/while
holding the page lock except for relation extension and page lock

+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
Same as above

Other than that, the patches look good to me. I've also done some
testing after applying the Test-group-deadlock patch provided by Amit
earlier in the thread. It works as expected.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Kuntal Ghosh
On Fri, Mar 13, 2020 at 8:42 AM Dilip Kumar  wrote:
> > On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh wrote:
> >
> > > + /*
> > > + * The relation extension or page lock can never participate in actual
> > > + * deadlock cycle.  See Asserts in LockAcquireExtended.  So, there is
> > > + * no advantage in checking wait edges from it.
> > > + */
> > > + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> > > + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> > > + return false;
> > > +
> > > Since this is true, we can also avoid these kind of locks in
> > > ExpandConstraints, right?
>
> I am not sure I understand this part.  Because topological sort will
> work on the soft edges we have created when we found the cycle,  but
> for relation extension/page lock we are completely ignoring hard/soft
> edge then it will never participate in topo sort as well.  Am I
> missing something?
>
No, I think you're right. We only add constraints if we've detected a
cycle in the graph. Hence, you don't need the check here.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Kuntal Ghosh
On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila  wrote:
>
> On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh wrote:
> > I think moving them inside a macro is a good idea. Also, I think we
> > should move all the Assert related code inside some debugging macro
> > similar to this:
> > #ifdef LOCK_DEBUG
> > 
> > #endif
> >
> If we move it under some macro, then those Asserts will be only
> enabled when that macro is defined.  I think we want there Asserts to
> be enabled always in assert enabled build, these will be like any
> other Asserts in the code.  What is the advantage of doing those under
> macro?
>
My concern is related to performance regression. We're using two
static variables in hot paths only for checking a few asserts. So, I'm
not sure whether we should enable them by default, especially when
asserts themselves are disabled.
-ResetRelExtLockHeldCount()
+ResetRelExtPageLockHeldCount()
 {
  RelationExtensionLockHeldCount = 0;
+ PageLockHeldCount = 0;
+}
Also, we're calling this method from frequently used functions like
Commit/AbortTransaction. So, it's better if these two static variables
share the same cache line and are reinitialized with a single
instruction.
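
Purely as an illustration of that point (a hypothetical layout, not the
patch itself), the two counters could live in one small struct so that
they share a cache line and can be reset together:

    static struct
    {
        int     relext_locks_held;
        int     page_locks_held;
    } HeldLockCounts;

    static void
    ResetRelExtPageLockHeldCount(void)
    {
        memset(&HeldLockCounts, 0, sizeof(HeldLockCounts));
    }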

>
> >   /*
> > + * The relation extension or page lock conflict even between the group
> > + * members.
> > + */
> > + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> > + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> > + {
> > + PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
> > + proclock);
> > + return true;
> > + }
> > This check includes the heavyweight locks that conflict even under
> > same parallel group. It also has another property that they can never
> > participate in deadlock cycles. And, the number of locks under this
> > category is likely to increase in future with new parallel features.
> > Hence, it could be used in multiple places. Should we move the
> > condition inside a macro and just call it from here?
> >
>
> Right, this is what I have suggested upthread. Do you have any
> suggestions for naming such a macro or function?  I could think of
> something like LocksConflictAmongGroupMembers or
> LocksNotParticipateInDeadlock. The first one suits more for its usage
> in LockCheckConflicts and the second in the deadlock.c code. So none
> of those sound perfect to me.
>
Actually, I'm not able to come up with a good suggestion. I'm trying
to think of a generic name, similar to strong or weak locks, but with
the following properties:
a. Locks that don't participate in deadlock detection
b. Locks that conflict even within the same parallel group

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-12 Thread Kuntal Ghosh
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila  wrote:
>
> On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar  wrote:
> >
> > I have fixed this in the attached patch set.
> >
>
> I have modified your
> v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> modifications are (a) Change src/backend/storage/lmgr/README to
> reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> slightly simplifies the code, (c) moved the deadlock.c check a few
> lines up and (d) changed a few comments.
>
> It might be better if we can move the checks related to extension and
> page lock in a separate API or macro.  What do you think?
>
I think moving them inside a macro is a good idea. Also, I think we
should move all the Assert related code inside some debugging macro
similar to this:
#ifdef LOCK_DEBUG

#endif

+ /*
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle.  See Asserts in LockAcquireExtended.  So, there is
+ * no advantage in checking wait edges from it.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ return false;
+
Since this is true, we can also avoid these kinds of locks in
ExpandConstraints, right? It'll certainly reduce some complexity in
the topological sort.

  /*
+ * The relation extension or page lock conflict even between the group
+ * members.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
This check includes the heavyweight locks that conflict even within
the same parallel group. These locks also have another property: they
can never participate in deadlock cycles. And the number of locks in
this category is likely to increase in the future with new parallel
features, so the check could be useful in multiple places. Should we
move the condition inside a macro and just call it from here?
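
One possible shape for such a macro, just to make the idea concrete
(the name is only a placeholder):

    #define LockTagIsRelextOrPage(lock) \
        ((LOCK_LOCKTAG(*(lock)) == LOCKTAG_RELATION_EXTEND) || \
         (LOCK_LOCKTAG(*(lock)) == LOCKTAG_PAGE))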

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: backend type in log_line_prefix?

2020-03-10 Thread Kuntal Ghosh
On Tue, Mar 10, 2020 at 9:11 PM Peter Eisentraut wrote:
> On 2020-03-09 16:20, Kuntal Ghosh wrote:
> > In v3-0001-Refactor-ps_status.c-API.patch,
> > - * postgres: walsender   
> > This part is still valid, right?
> Sure but I figured this comment was in the context of the explanation of
> how the old API was being abused, so it's no longer necessary.
>
Makes sense.

> set_ps_display() checks update_process_title itself and does nothing if
> it's off, so this should work okay.
Right.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: backend type in log_line_prefix?

2020-03-09 Thread Kuntal Ghosh
Hello,

On Sat, Mar 7, 2020 at 8:38 PM Peter Eisentraut wrote:
>
> Updated patch set because of conflicts.
>
Thank you for the patch. This feature is really helpful. Here are some
minor comments:

In v3-0001-Refactor-ps_status.c-API.patch,

- *
- * For a walsender, the ps display is set in the following form:
- *
- * postgres: walsender   
This part is still valid, right?

+ init_ps_display(ps_data.data);
+ pfree(ps_data.data);
+
+ set_ps_display("initializing");
As per the existing behaviour, if update_process_title is true, we
display "authentication" as the initial string. On the other hand,
this patch, irrespective of the GUC variable, always displays
"initializing" as the initial string, and in PerformAuthentication it
sets the display to "authentication". Is this intended? Should we
check the GUC here as well?

In v3-0002-Unify-several-ways-to-tracking-backend-type.patch,
+ * If fixed_part is NULL, a default will be obtained from BackendType.
s/BackendType/MyBackendType?

In v3-0003-Add-backend-type-to-csvlog-and-optionally-log_lin.patch,
+ Backend process type
In other places you've written "backend type".

In v3-0004-Remove-am_syslogger-global-variable.patch,
+ * This is exported so that elog.c can call it when BackendType is B_LOGGER.
s/BackendType/MyBackendType?

Done some basic testing. Working as expected.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-09 Thread Kuntal Ghosh
On Mon, Mar 9, 2020 at 1:46 PM Michael Paquier  wrote:
>
> On Fri, Mar 06, 2020 at 05:22:16PM +0900, Michael Paquier wrote:
> > Thanks for the suggestion.  Avoiding dead code makes sense as well
> > here.  I'll think about this stuff a bit more first.
>
> Okay, after pondering more on that, here is a first cut with a patch
> refactoring restore_command build to src/common/.  One thing I have
> changed compared to the other versions is that BuildRestoreCommand()
> now returns a boolean to tell if the command was successfully built or
> not.
>
Yeah. If we're returning only -1 and 0, it's better to use a boolean.
If we're planning to provide some information about which parameter is
missing, a few negative integer values might be useful. But, I feel
that's unnecessary in this case.

> A second thing.  As of now the interface of BuildRestoreCommand()
> assumes that the resulting buffer has an allocation of MAXPGPATH.
> This should be fine because that's an assumption we rely on a lot in
> the code, be it frontend or backend.  However, it could also be a trap
> for any caller of this routine if they allocate something smaller.
> Wouldn't it be cleaner to pass down the size of the result buffer
> directly to BuildRestoreCommand() and use the length given by the
> caller of the routine as a sanity check?
>
That's a good suggestion. But it's unlikely that a caller would pass
something longer than MAXPGPATH, and we indeed use that value a lot in
the code. IMHO, it's okay to have that assumption here as well.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: ALTER TABLE ... SET STORAGE does not propagate to indexes

2020-02-25 Thread Kuntal Ghosh
On Tue, Feb 25, 2020 at 1:09 PM Peter Eisentraut wrote:
>
> On 2020-02-24 12:21, Kuntal Ghosh wrote:
> > I've reproduced the issue on head. And, the patch seems to solve the
> > problem. The patch looks good to me. But, I've a small doubt regarding
> > the changes in test_decoding regression file.
> >
> > diff --git a/contrib/test_decoding/sql/toast.sql
> > b/contrib/test_decoding/sql/toast.sql
> > ..
> > -INSERT INTO toasted_several(toasted_key) VALUES(repeat('9876543210', 
> > 1));
> > -SELECT pg_column_size(toasted_key) > 2^16 FROM toasted_several;
> > +INSERT INTO toasted_several(toasted_key) VALUES(repeat('9876543210', 269));
> > +SELECT pg_column_size(toasted_key) > 2^11 FROM toasted_several;
> >
> > This actually tests whether we can decode "old" tuples bigger than the
> > max heap tuple size correctly which is around 8KB. But, the above
> > changes will make the tuple size around 3KB. So, it'll not be able to
> > test that particular scenario. Thoughts?
>
> OK, this is interesting.  The details of this are somewhat unfamiliar to
> me, but it appears that due to TOAST_INDEX_HACK in indextuple.c, an
> index tuple cannot be larger than 8191 bytes when untoasted (but not
> uncompressed).
>
> What the test case above is testing is a situation where the heap tuple
> is stored toasted uncompressed (storage external) but the index tuple is
> not (probably compressed inline).  This is exactly the situation that I
> was contending should not be possible, because it cannot be dumped or
> restored.
>
Yeah. If we only commit this patch to fix the issue, we're going to
put a restriction on the above situation, i.e., the index entry for an
external attribute has to be stored as an external (i.e. uncompressed)
value. So, a lot of existing workloads might start failing after an
upgrade. I think there should be an option to store the index entry
for an external attribute as a compressed inline value.

> An alternative would be that we make this situation fully supported.
> Then we'd probably need at least ALTER INDEX ... ALTER COLUMN ... SET
> STORAGE, and some pg_dump support.
>
> Thoughts?
Yes. We need support for this syntax along with the bug-fix patch.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: ALTER TABLE ... SET STORAGE does not propagate to indexes

2020-02-24 Thread Kuntal Ghosh
Hello Peter,

On Mon, Feb 24, 2020 at 12:59 PM Peter Eisentraut wrote:
>
> On 2020-01-06 13:32, Peter Eisentraut wrote:
> > Attached is a patch that attempts to fix this by propagating the storage
> > change to existing indexes.  This triggers a few regression test
> > failures (going from no error to error), which I attempted to fix up,
> > but I haven't analyzed what the tests were trying to do, so it might
> > need another look.
>
> Attached is a more polished patch, with tests.

I've reproduced the issue on HEAD, and the patch seems to solve the
problem. The patch looks good to me, but I have a small doubt
regarding the changes in the test_decoding regression file.

diff --git a/contrib/test_decoding/sql/toast.sql
b/contrib/test_decoding/sql/toast.sql
..
-INSERT INTO toasted_several(toasted_key) VALUES(repeat('9876543210', 1));
-SELECT pg_column_size(toasted_key) > 2^16 FROM toasted_several;
+INSERT INTO toasted_several(toasted_key) VALUES(repeat('9876543210', 269));
+SELECT pg_column_size(toasted_key) > 2^11 FROM toasted_several;

This actually tests whether we can correctly decode "old" tuples
bigger than the max heap tuple size, which is around 8KB. But the
above change will make the tuple size around 3KB, so it'll no longer
test that particular scenario. Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-14 Thread Kuntal Ghosh
On Sun, Feb 9, 2020 at 9:18 AM Amit Kapila  wrote:
>
> It seems for this we formed a cache of max_cached_tuplebufs number of
> objects and we don't need to allocate more than that number of tuples
> of size MaxHeapTupleSize because we will anyway return that memory to
> aset.c.
>
In the approach suggested by Amit (approach 1), once we have allocated
max_cached_tuplebufs tuples of MaxHeapTupleSize, we can use the actual
tuple length for subsequent allocations. So, if we have m
subtransactions, the worst-case memory usage will be:

(max_cached_tuplebufs * MaxHeapTupleSize) cache +
(Maximum changes in a subtransaction before spilling) * m * (Actual tuple size)

= 64 MB cache + 4095 * m * (Actual tuple size)

In the approach suggested by Andres (approach 2), we're going to
reduce the size of a cached tuple to 1024 bytes. So, if we have m
sub-transactions, the worst-case memory usage will be:

(max_cached_tuplebufs * 1024 bytes) cache + (Maximum changes in a
subtransaction before spilling) * m * 1024 bytes

= 8 MB cache + 4095 * m * 1024 (considering the size of the tuple is
less than 1024 bytes)

Once the cache is filled, for 1000 sub-transactions operating on
tuples of, say, 100 bytes, approach 1 will allocate about 390 MB of
memory whereas approach 2 will allocate about 4 GB. Unless there is an
obvious error that I'm missing, I think we should implement the first
approach.
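
A minimal sketch of what approach 1 amounts to for the sizing logic in
PG 10's ReorderBufferGetTupleBuf (simplified, using tuple_len as the
requested tuple length; the actual patch may differ):

    alloc_len = tuple_len + SizeofHeapTupleHeader;

    /*
     * Round up to MaxHeapTupleSize (so the buffer is cacheable on release)
     * only while fewer than max_cached_tuplebufs buffers have been cached;
     * after that, allocate only what the tuple actually needs.
     */
    if (alloc_len < MaxHeapTupleSize &&
        rb->nr_cached_tuplebufs < max_cached_tuplebufs)
        alloc_len = MaxHeapTupleSize;

    tuple = (ReorderBufferTupleBuf *)
        MemoryContextAlloc(rb->context,
                           sizeof(ReorderBufferTupleBuf) +
                           MAXIMUM_ALIGNOF + alloc_len);
    tuple->alloc_tuple_size = alloc_len;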

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-09 Thread Kuntal Ghosh
Hello,

On Sat, Feb 8, 2020 at 1:18 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-02-07 20:02:01 +0100, Tomas Vondra wrote:
> > On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> > > > And, the issue got reproduced with the same error:
> > > > WARNING:  problem in Generation Tuples: number of free chunks 0 in
> > > > block 0x7fe9e9e74010 exceeds 1018 allocated
> > >
> > > That seems like a problem in generation.c - because this should be
> > > unreachable, I think?
>
> > That's rather strange. How could we print this message? The code looks
> > like this
> >
> >   if (block->nfree >= block->nchunks)
> > elog(WARNING, "problem in Generation %s: number of free chunks %d in 
> > block %p exceeds %d allocated",
> >  name, block->nfree, block, block->nchunks);
> >
> > so this says 0 >= 1018. Or am I missing something?
>
> Indeed, it's pretty weird. I can't reproduce it either. Kuntal, which
> exact git version did you repro this on? What precise settings /
> platform?
>

I've used the following steps:

Platform:
OS: Ubuntu 64-bit 18.04.2
VMWare Fusion Version 8.5.10 on my MacOS 10.14.6
16GB RAM with 8 core processors

git checkout -b pg11 remotes/origin/REL_11_STABLE
git reset --hard a4ccc1cef5a04cc (Generational memory allocator)
git apply Set-alloc_len-as-MaxHeapTupleSize.patch

Then, I performed the same test. I've attached the test.sql file for the same.

With this test, I wanted to check how the generational memory
allocator patch is solving the issue. As reported by you as well
earlier in the thread, each of the 4096*10 in-memory tuples
doesn't allocate MaxHeapTupleSize, but instead something like ~30
bytes with this patch. IMHO, this is the change that is making the
difference. If we allocate MaxHeapTupleSize instead for all the
tuples, we'll encounter the same out-of-memory issue.

I haven't looked into the issue in generational tuple context yet.
It's possible that the changes I've done in the attached patch
don't make sense and break the logic of generational memory allocator.
:-)

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Set-alloc_len-as-MaxHeapTupleSize.patch
Description: Binary data


test.sql
Description: Binary data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-07 Thread Kuntal Ghosh
On Tue, Feb 4, 2020 at 2:40 PM Amit Kapila  wrote:
>
> I don't think we can just back-patch that part of code as it is linked
> to the way we are maintaining a cache (~8MB) for frequently allocated
> objects.  See the comments around the definition of
> max_cached_tuplebufs.  But probably, we can do something once we reach
> such a limit, basically, once we know that we have already allocated
> max_cached_tuplebufs number of tuples of size MaxHeapTupleSize, we
> don't need to allocate more of that size.  Does this make sense?
>

Yeah, this makes sense. I've attached a patch that implements the
same. It solves the problem reported earlier. This solution will at
least slow down the process of going OOM even for very small sized
tuples.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


0001-Restrict-memory-allocation-in-reorderbuffer-context.patch
Description: Binary data


Fix comment for max_cached_tuplebufs definition

2020-02-07 Thread Kuntal Ghosh
Hello Hackers,

While working on an issue in logical decoding, I found some
inconsistencies in the comment defining max_cached_tuplebufs in
reorderbuffer.c. The variable only exists up to PG10, because after
that the definition was removed by the generational memory allocator
patch. The variable is defined as follows in reorderbuffer.c:
static const Size max_cached_tuplebufs = 4096 * 2;  /* ~8MB */

And it gets compared with rb->nr_cached_tuplebufs in
ReorderBufferReturnTupleBuf as follows:
if (tuple->alloc_tuple_size == MaxHeapTupleSize &&
    rb->nr_cached_tuplebufs < max_cached_tuplebufs)
{
    rb->nr_cached_tuplebufs++;
}

So, what this variable actually tracks is 4096 * 2 times
MaxHeapTupleSize worth of memory, which is approximately 64 MB. I've
attached a patch to modify the comment.

But, I'm not sure whether the intention was to keep only an 8 MB
cache. In that case, I can come up with another patch.
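
As a quick sanity check of the size (assuming the default 8 kB block
size, where MaxHeapTupleSize is 8160 bytes):

SELECT pg_size_pretty(4096::bigint * 2 * 8160) AS cached_tuplebufs_worst_case;
-- reports roughly 64 MB, not 8 MB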

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


0001-Fix-comment-for-max_cached_tuplebufs-definition.patch
Description: Binary data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-03 Thread Kuntal Ghosh
On Sun, Jan 12, 2020 at 9:51 AM Tom Lane  wrote:
>
> Tomas Vondra  writes:
> > On Sat, Jan 11, 2020 at 10:53:57PM -0500, Tom Lane wrote:
> >> remind me where the win came from, exactly?
>
> > Well, the problem is that in 10 we allocate tuple data in the main
> > memory ReorderBuffer context, and when the transaction gets decoded we
> > pfree() it. But in AllocSet that only moves the data to the freelists,
> > it does not release it entirely. So with the right allocation pattern
> > (sufficiently diverse chunk sizes) this can easily result in allocation
> > of large amount of memory that is never released.
>
> > I don't know if this is what's happening in this particular test, but I
> > wouldn't be surprised by it.
>
> Nah, don't think I believe that: the test inserts a bunch of tuples,
> but they look like they will all be *exactly* the same size.
>
> CREATE TABLE decoding_test(x integer, y text);
> ...
>
> FOR i IN 1..10 LOOP
> BEGIN
> INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
> EXCEPTION
> when division_by_zero then perform 'dummy';
> END;
>
I performed the same test in pg11 and reproduced the issue on the
commit prior to a4ccc1cef5a04 (Generational memory allocator).

ulimit -s 1024
ulimit -v 30

wal_level = logical
max_replication_slots = 4

And executed the following code snippet (shared by Amit Khandekar
earlier in the thread).

SELECT pg_create_logical_replication_slot('test_slot',
'test_decoding');

CREATE TABLE decoding_test(x integer, y text);
do $$
BEGIN
FOR i IN 1..10 LOOP
BEGIN
INSERT INTO decoding_test(x) SELECT
generate_series(1,3000);
EXCEPTION
when division_by_zero then perform 'dummy';
END;
END LOOP;
END $$;

SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL) LIMIT 10;

I got the following error:
ERROR:  out of memory
DETAIL:  Failed on request of size 8208.

After that, I applied the "Generational memory allocator" patch and
that solved the issue. From the error message, it is evident that the
underlying code is trying to allocate MaxHeapTupleSize bytes of memory
for each tuple. So, I re-introduced the following lines (which were removed by
a4ccc1cef5a04) on top of the patch:

--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)

alloc_len = tuple_len + SizeofHeapTupleHeader;

+   if (alloc_len < MaxHeapTupleSize)
+   alloc_len = MaxHeapTupleSize;

And, the issue got reproduced with the same error:
WARNING:  problem in Generation Tuples: number of free chunks 0 in
block 0x7fe9e9e74010 exceeds 1018 allocated
.
ERROR:  out of memory
DETAIL:  Failed on request of size 8208.

I don't understand the code well enough to comment whether we can
back-patch only this part of the code. But, this seems to allocate a
huge amount of memory per chunk even though the tuple is small.

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-12 Thread Kuntal Ghosh
On Tue, Nov 12, 2019 at 4:12 PM Alexey Kondratov
 wrote:
>
> On 04.11.2019 13:05, Kuntal Ghosh wrote:
> > On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar  wrote:
> >> So your result shows that with "streaming on", performance is
> >> degrading?  By any chance did you try to see where is the bottleneck?
> >>
> > Right. But, as we increase the logical_decoding_work_mem, the
> > performance improves. I've not analyzed the bottleneck yet. I'm
> > looking into the same.
>
> My guess is that 64 kB is just too small value. In the table schema used
> for tests every rows takes at least 24 bytes for storing column values.
> Thus, with this logical_decoding_work_mem value the limit should be hit
> after about 2500+ rows, or about 400 times during transaction of 100
> rows size.
>
> It is just too frequent, while ReorderBufferStreamTXN includes a whole
> bunch of logic, e.g. it always starts internal transaction:
>
> /*
>   * Decoding needs access to syscaches et al., which in turn use
>   * heavyweight locks and such. Thus we need to have enough state around to
>   * keep track of those.  The easiest way is to simply use a transaction
>   * internally.  That also allows us to easily enforce that nothing writes
>   * to the database by checking for xid assignments. ...
>   */
>
> Also it issues separated stream_start/stop messages around each streamed
> transaction chunk. So if streaming starts and stops too frequently it
> adds additional overhead and may even interfere with current in-progress
> transaction.
>
Yeah, I've also found the same. With each stream_start/stop message, it
writes 1 byte of checksum and 4 bytes for the number of sub-transactions,
which increases the write amplification significantly.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Ordering of header file inclusion

2019-11-08 Thread Kuntal Ghosh
On Sat, Nov 2, 2019 at 7:42 AM vignesh C  wrote:
>
> >
> Thanks Amit for committing the changes.
> I found couple of more inconsistencies,  the attached patch includes
> the fix for the same.
>
Thanks for the patch. It seems you have to rebase it as it
doesn't apply on the latest HEAD. Apart from that, the changes look
good to me.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: auxiliary processes in pg_stat_ssl

2019-11-04 Thread Kuntal Ghosh
On Mon, Nov 4, 2019 at 9:09 PM Euler Taveira  wrote:
> >
> > But this seems pointless.  Should we not hide those?  Seems this only
> > happened as an unintended side-effect of fc70a4b0df38.  It appears to me
> > that we should redefine that view to restrict backend_type that's
> > 'client backend' (maybe include 'wal receiver'/'wal sender' also, not
> > sure.)
> >
> Yep, it is pointless. BackendType that open connections to server are:
> autovacuum worker, client backend, background worker, wal sender. I
> also notice that pg_stat_gssapi is in the same boat as pg_stat_ssl and
> we should constraint the rows to backend types that open connections.
> I'm attaching a patch to list only connections in those system views.
>
Yeah, we should hide those. As Robert mentioned, I think checking
whether 'client_port IS NOT NULL' is a better approach than checking
the backend_type. The patch looks good to me.
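
A minimal sketch of that filtering idea (just to illustrate the
predicate, not the actual view definition in the patch):

-- show SSL state only for processes that actually have a client connection
SELECT s.*
  FROM pg_stat_ssl s
  JOIN pg_stat_activity a USING (pid)
 WHERE a.client_port IS NOT NULL;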



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-04 Thread Kuntal Ghosh
On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar  wrote:
>
> So your result shows that with "streaming on", performance is
> degrading?  By any chance did you try to see where is the bottleneck?
>
Right. But, as we increase the logical_decoding_work_mem, the
performance improves. I've not analyzed the bottleneck yet. I'm
looking into the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-04 Thread Kuntal Ghosh
Hello hackers,

I've done some performance testing of this feature. Following is my
test case (taken from an earlier thread):

postgres=# CREATE TABLE large_test (num1 bigint, num2 double
precision, num3 double precision);
postgres=# \timing on
postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
num2, num3) SELECT round(random()*10), random(), random()*142 FROM
generate_series(1, 100) s(i);

I've kept the publisher and the subscriber on two different systems.

HEAD:
With 100 tuples,
Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
With 1000 tuples (10 times more),
Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442

With the memory accounting patch, following are the performance results:
With 10 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
9648.223 ms (00:09.648), Spill count: 2315
logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
9895.161 ms (00:09.895), Spill count 3
With 100 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
105761.978 ms (01:45.762), Spill count: 23149
logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
89985.342 ms (01:29.985), Spill count: 23

With logical decoding of in-progress transactions patch and with
streaming on, following are the performance results:
With 10 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
20779.601 ms (00:20.780)
logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
9559.953 ms (00:09.560)
With 100 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
196261.892 ms (03:16.262)
logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
90079.286 ms (01:30.079)
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Custom reloptions and lock modes

2019-09-20 Thread Kuntal Ghosh
On Fri, Sep 20, 2019 at 12:38 PM Michael Paquier  wrote:
>
> > One small thing:
> >
> >   add_int_reloption(bl_relopt_kind, buf,
> >"Number of bits generated for each index column",
> > -  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS);
> > +  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS,
> > +  AccessExclusiveLock);
> > Do we need a comment to explain why we're using AccessExclusiveLock in
> > this case?
>
> Because that's the safest default to use here?  That seemed obvious to
> me.
Okay. Sounds good.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Custom reloptions and lock modes

2019-09-19 Thread Kuntal Ghosh
On Fri, Sep 20, 2019 at 7:08 AM Michael Paquier  wrote:
>
> Hence attached is a patch set to address those issues:
> - 0001 makes sure that any existing module creating a custom reloption
> has the lock mode set to AccessExclusiveMode, which would be a sane
> default anyway.  I think that we should just back-patch that and close
> any holes.
Looks good to me. The patch solves the issue and passes the
regression tests. IMHO, it should be back-patched to all the branches.

> - 0002 is a patch which we could use to extend the existing reloption
> APIs to set the lock mode used.  I am aware of the recent work done by
> Nikolay in CC to rework this API set, but I am unsure where we are
> going there, and the resulting patch is actually very short, being
> 20-line long with the current infrastructure.  That could go into
> HEAD.  Table AMs have been added in v12 so custom reloptions could
> gain more in popularity, but as we are very close to the release it
> would not be cool to break those APIs.  The patch simplicity could
> also be a reason sufficient for a back-patch, and I don't think that
> there are many users of them yet.
>
I think this is a good approach for now and can be committed on HEAD only.

One small thing:

  add_int_reloption(bl_relopt_kind, buf,
   "Number of bits generated for each index column",
-  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS);
+  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS,
+  AccessExclusiveLock);
Do we need a comment to explain why we're using AccessExclusiveLock in
this case?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: subscriptionCheck failures on nightjar

2019-09-19 Thread Kuntal Ghosh
Hello hackers,

It seems there is a pattern how the error is occurring in different
systems. Following are the relevant log snippets:

nightjar:
sub3 LOG:  received replication command: CREATE_REPLICATION_SLOT
"sub3_16414_sync_16394" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
sub3 LOG:  logical decoding found consistent point at 0/160B578
sub1 PANIC:  could not open file
"pg_logical/snapshots/0-160B578.snap": No such file or directory

dromedary scenario 1:
sub3_16414_sync_16399 LOG:  received replication command:
CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
sub3_16414_sync_16399 LOG:  logical decoding found consistent point at 0/15EA694
sub2 PANIC:  could not open file
"pg_logical/snapshots/0-15EA694.snap": No such file or directory


dromedary scenario 2:
sub3_16414_sync_16399 LOG:  received replication command:
CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
sub3_16414_sync_16399 LOG:  logical decoding found consistent point at 0/15EA694
sub1 PANIC:  could not open file
"pg_logical/snapshots/0-15EA694.snap": No such file or directory

While subscription 3 is being created, it eventually reaches a
consistent snapshot point and prints the corresponding WAL location.
It seems sub1/sub2 immediately fail to serialize the snapshot to the
.snap file having the same WAL location.

Is this helpful?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: subscriptionCheck failures on nightjar

2019-09-18 Thread Kuntal Ghosh
Hello Michael,

On Wed, Sep 18, 2019 at 6:28 AM Michael Paquier  wrote:
>
> On my side, I have let this thing run for a couple of hours with a
> patched version to include a sleep between the rename and the sync but
> I could not reproduce it either:
> #!/bin/bash
> attempt=0
> while true; do
> attempt=$((attempt+1))
> echo "Attempt $attempt"
> cd $HOME/postgres/src/test/recovery/
> PROVE_TESTS=t/006_logical_decoding.pl make check > /dev/null 2>&1
> ERRNUM=$?
> if [ $ERRNUM != 0 ]; then
> echo "Failed at attempt $attempt"
> exit $ERRNUM
> fi
> done
I think the failing test is src/test/subscription/t/010_truncate.pl.
I've tried to reproduce the same failure using your script in OS X
10.14 and Ubuntu 18.04.2 (Linux version 5.0.0-23-generic), but
couldn't reproduce the same.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Kuntal Ghosh
Hello Thomas,

On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro  wrote:
>
> > 1. In UndoLogAllocateInRecovery, when we find the current log number
> > from the list of registered blocks, we don't check whether the
> > block->in_use flag is true or not. In XLogResetInsertion, we just
> > reset in_use flag without reseting the blocks[]->rnode information.
> > So, if we don't check the in_use flag, it's possible that we'll
> > consult some block information from the previous WAL record. IMHO,
> > just adding an in_use check in UndoLogAllocateInRecovery will solve
> > the problem.
>
> Agreed.  I added a line to break out of that loop if !block->in_use.
>
I think we should skip the block if !block->in_use, because the undo
buffer can also be registered in a subsequent block. For different
operations, we can use a different block_id to register the undo buffer
in the redo record.

> BTW I am planning to simplify that code considerably, based on a plan
> to introduce a new rule: there can be only one undo record and
> therefore only one undo allocation per WAL record.
>
Okay. In that case, we need to rethink the cases for multi-inserts and
non-inplace updates, both of which currently insert multiple undo
records corresponding to a single WAL record. For multi-inserts, it can
be solved easily by moving all the offset information into the payload.
But, for non-inplace updates, we insert one undo record for the update
and one for the insert. I'm wondering whether we have to insert two WAL
records - one for the update and one for the new insert.

> > 2. A transaction, inserts one undo record and generated a WAL record
> > for the same, say at WAL location 0/2000A000. Next, the undo record
> > gets discarded and WAL is generated to update the meta.discard pointer
> > at location 0/2000B000  At the same time, an ongoing checkpoint with
> > checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
> > Now, the system crashes.
> > Now, the recovery starts from the location 0/2000. When the
> > recovery of 0/2000A000 happens, it sees the undo record that it's
> > about to insert, is already discarded as per meta.discard (flushed by
> > checkpoint). In this case, should we just skip inserting the undo
> > record?
>
> I see two options:
>
> 1.  We make it so that if you're allocating in recovery and discard >
> insert, we'll just set discard = insert so you can proceed.  The code
> in undofile_get_segment_file() already copes with missing files during
> recovery.
>
Interesting. This should work.

>
> > 3. Currently, we create a backup image of the unlogged part of the
> > undo log's metadata only when some backend allocates some space from
> > the undo log (in UndoLogAllocate). This helps us restore the unlogged
> > meta part after a checkpoint.
> > When we perform an undo action, we also update the undo action
> > progress and emit an WAL record. The same operation can performed by
> > the undo worker which doesn't allocate any space from the undo log.
> > So, if an undo worker emits an WAL record to update undo action
> > progress after a checkpoint, it'll not be able to WAL log the backup
> > image of the meta unlogged part. IMHO, this breaks the recovery logic
> > of unlogged part of undo meta.
>
> I thought that was OK because those undo data updates don't depend on
> the insert pointer.  But I see what you mean: the next modification of
> the page that DOES depend on the insert pointer might not log the
> meta-data if it's not the first WAL record to touch it after a
> checkpoint.  Rats.  I'll have to think about that some more.
Cool.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-15 Thread Kuntal Ghosh
Hello Thomas,

While testing zheap over undo apis, we've found the following
issues/scenarios that might need some fixes/discussions:

1. In UndoLogAllocateInRecovery, when we find the current log number
from the list of registered blocks, we don't check whether the
block->in_use flag is true or not. In XLogResetInsertion, we just
reset the in_use flag without resetting the blocks[]->rnode information.
So, if we don't check the in_use flag, it's possible that we'll
consult some block information from the previous WAL record. IMHO,
just adding an in_use check in UndoLogAllocateInRecovery will solve
the problem.

2. A transaction inserts one undo record and generates a WAL record
for the same, say at WAL location 0/2000A000. Next, the undo record
gets discarded and WAL is generated to update the meta.discard pointer
at location 0/2000B000. At the same time, an ongoing checkpoint with
checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
Now, the system crashes.
Now, the recovery starts from the location 0/2000. When the
recovery of 0/2000A000 happens, it sees that the undo record it's
about to insert is already discarded as per meta.discard (flushed by
the checkpoint). In this case, should we just skip inserting the undo
record?

3. Currently, we create a backup image of the unlogged part of the
undo log's metadata only when some backend allocates some space from
the undo log (in UndoLogAllocate). This helps us restore the unlogged
meta part after a checkpoint.
When we perform an undo action, we also update the undo action
progress and emit a WAL record. The same operation can be performed by
the undo worker, which doesn't allocate any space from the undo log.
So, if an undo worker emits a WAL record to update the undo action
progress after a checkpoint, it'll not be able to WAL-log the backup
image of the unlogged part of the metadata. IMHO, this breaks the
recovery logic of the unlogged part of the undo metadata.

Thoughts?

On Mon, Sep 2, 2019 at 9:47 AM Thomas Munro  wrote:
>
> On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  
> wrote:
> > I'm getting the following assert failure while performing the recovery
> > with the same.
> > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> > File: "undolog.c", Line: 997)"
> >
> > I found that we don't emit an WAL record when we update the
> > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> > crash recovery, some new transaction may use that undo log which is
> > wrong, IMHO. Am I missing something?
>
> Thanks, right, that status logging is wrong, will fix in next version.
>
> --
> Thomas Munro
> https://enterprisedb.com



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] Invalid permission check in pg_stats for functional indexes

2019-09-03 Thread Kuntal Ghosh
On Wed, Sep 4, 2019 at 12:23 AM Pierre Ducroquet  wrote:
>
> All my apologies for that. I submitted this patch some time ago but forgot to
> add it to the commit fest. Attached to this mail is a rebased version.
>
Thank you for the new version of the patch. It looks good to me.
Moving the status to ready for committer.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] psql: add tab completion for \df slash command suffixes

2019-09-03 Thread Kuntal Ghosh
Hello Ian,

On Fri, Aug 30, 2019 at 10:31 AM Ian Barwick
 wrote:
> I just noticed "\df[TAB]" fails to offer any tab-completion for
> the possible suffixes ("\dfa", "\dfn", "\dfp", "\dft" and "\dfw").
> Trivial patch attached, which applies back to Pg96, and separate
> patches for Pg95 and Pg94.
>
> I'll add this to the next commitfest.
I've reviewed the patch. It works as expected in master. But, the
syntax "\dfp" doesn't exist beyond Pg11. So, you've to remove the same
from the patches for Pg95 and Pg94. I guess the same patch should be
applied on Pg10 and Pg96.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] Invalid permission check in pg_stats for functional indexes

2019-09-03 Thread Kuntal Ghosh
Hello Pierre,

> When using a functional index on a table, we realized that the permission
> check done in pg_stats was incorrect and thus preventing valid access to the
> statistics from users.
>
> The attached patch fixes this by introducing a second path in privilege check
> in pg_stats view.
The patch doesn't apply on the latest HEAD [1].
IIUC, the patch introduces an additional privilege check for the
underlying objects involved in the expression/functional index. If the
user has 'select' privileges on all of the columns/objects included in
the expression/functional index, then its statistics should be visible
in the pg_stats view. I've applied the patch manually and tested the
feature. It works as expected.
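
For reference, a scenario of this shape (object names are purely
illustrative) exercises the new check: a functional index whose
expression touches only columns the user can read should have its
statistics visible in pg_stats.

CREATE TABLE person (id int, name text);
CREATE INDEX person_lower_name_idx ON person (lower(name));
INSERT INTO person SELECT g, 'name_' || g FROM generate_series(1, 1000) g;
ANALYZE person;

CREATE ROLE stats_reader;
GRANT SELECT ON person TO stats_reader;

SET ROLE stats_reader;
-- with the patch, the expression statistics show up here
SELECT tablename, attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'person_lower_name_idx';
RESET ROLE;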

> I have not written a regression test yet, mainly because I'm not 100% certain
> where to write it. Given some hints, I would happily add it to this patch.
>
Yeah, it'll be good to have some regression tests for this. I'm
also not sure which regression test file best suits these tests.

[1] http://cfbot.cputube.org/patch_24_2274.log

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-30 Thread Kuntal Ghosh
Hello Thomas,

I was doing some testing for the scenario where the undo written by a
transaction overflows to multiple undo logs. For that I've modified
the following macro:
#define UndoLogMaxSize (1024 * 1024) /* 1MB undo log size */
(I should have used the provided pg_force_switch_undo though..)

I'm getting the following assert failure while performing the recovery
with the same.
"TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
File: "undolog.c", Line: 997)"

I found that we don't emit a WAL record when we update
slot->meta.status to UNDO_LOG_STATUS_FULL. If we don't do that, after
crash recovery, some new transaction may use that undo log, which is
wrong, IMHO. Am I missing something?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Kuntal Ghosh
Hello Andres,

On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> > +/* Each worker queue is a binary heap. */
> > +typedef struct
> > +{
> > + binaryheap *bh;
> > + union
> > + {
> > + UndoXidQueue *xid_elems;
> > + UndoSizeQueue *size_elems;
> > + UndoErrorQueue *error_elems;
> > + }   q_choice;
> > +} UndoWorkerQueue;
>
> As we IIRC have decided to change this into a rbtree, I'll ignore
> related parts of the current code.  What is the status of that work?
> I've checked the git trees, without seeing anything? Your last mail with
> patches
> https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com
> doesn't seem to contain that either?
>
Yeah, we're changing this into a rbtree. This is still work-in-progress.

>
> > .
> > +#define GetErrorQueueNthElem(n) \
> > +( \
> > + AssertMacro(!ErrorQueueIsEmpty()), \
> > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[ERROR_QUEUE].bh, n)) \
> > +)
>
>
> -ETOOMANYMACROS
>
> I think nearly all of these shouldn't exist. See further below.
>
>
> > +#define SetErrorQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr, 
> > e_retry_at, e_occurred_at) \
> > +( \
> > + GetErrorQueueElem(elem).dbid = e_dbid, \
> > + GetErrorQueueElem(elem).full_xid = e_full_xid, \
> > + GetErrorQueueElem(elem).start_urec_ptr = e_start_urec_ptr, \
> > + GetErrorQueueElem(elem).next_retry_at = e_retry_at, \
> > + GetErrorQueueElem(elem).err_occurred_at = e_occurred_at \
> > +)
>
> It's very very rarely a good idea to have macros that evaluate their
> arguments multiple times. It'll also never be a good idea to get the
> same element multiple times from a queue.  If needed - I'm very doubtful
> of that, given that there's a single caller - it should be a static
> inline function that gets the element once, stores it in a local
> variable, and then updates all the fields.
>
Noted. Earlier, Robert also raised the point of using so many macros.
He also suggested using a single type of object that stores all the
information we need. It'll make things simpler and easier to
understand. In the upcoming patch set, we're removing all these
changes.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-06 Thread Kuntal Ghosh
Hello Thomas,

On Tue, Aug 6, 2019 at 9:50 AM Thomas Munro  wrote:
>
> On Tue, Aug 6, 2019 at 4:35 AM Andres Freund  wrote:
> > On 2019-08-05 20:58:05 +1200, Thomas Munro wrote:
> > > 1.  Commit dafaa3efb75 "Implement genuine serializable isolation
> > > level." (2011) locked the root tuple, because it set t_self to *tid.
> > > Possibly due to confusion about the effect of the preceding line
> > > ItemPointerSetOffsetNumber(tid, offnum).
> > >
> > > 2.  Commit commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013)
> > > fixed that by adding ItemPointerSetOffsetNumber(&heapTuple->t_self,
> > > offnum).
> >
> > Hm. It's not at all sure that it's ok to report the non-root tuple tid
> > here. I.e. I'm fairly sure there was a reason to not just set it to the
> > actual tid. I think I might have written that up on the list at some
> > point. Let me dig in memory and list. Obviously possible that that was
> > also obsoleted by parallel changes.
>
> Adding Heikki and Kevin.
>
> I haven't found your earlier discussion about that yet, but would be
> keen to read it if you can find it.   I wondered if your argument
> might have had something to do with heap pruning, but I can't find a
> problem there.  It's not as though the TID of any visible tuples
> change, it's just that some dead stuff goes away and the root line
> pointer is changed to LP_REDIRECT if it wasn't already.
>
> As for the argument for locking the tuple we emit rather than the HOT
> root, I think the questions are: (1) How exactly do we get away with
> locking only one version in a regular non-HOT update chain?  (2) Is it
> OK to do that with a HOT root?
>
> The answer to the first question is given in README-SSI[1].
> Unfortunately it doesn't discuss HOT directly, but I suspect the
> answer is no, HOT is not special here.  By my reading, it relies on
> the version you lock being the version visible to your snapshot, which
> is important because later updates have to touch that tuple to write
> the next version.  That doesn't apply to some arbitrarily older tuple
> that happens to be a HOT root.  Concretely, heap_update() does
> CheckForSerializableConflictIn(relation, &oldtup, buffer), which is
> only going to produce a rw conflict if T1 took an SIREAD on precisely
> the version T2 locks in that path, not some arbitrarily older version
> that happens to be a HOT root.  A HOT root might never be considered
> again by concurrent writers, no?
>
If I understand the problem, this is the same serialization issue as
with in-place updates for zheap. I had a discussion with Kevin
regarding the same in the thread at [1]. It seems that if we lock the
HOT root tid, we may report some false-positive serialization errors.


> > > This must be in want of an isolation test, but I haven't yet tried to
> > > get my head around how to write a test that would show the difference.
> >
> > Indeed.
>
> One practical problem is that the only way to reach
> PredicateLockTuple() is from an index scan, and the index scan locks
> the index page (or the whole index, depending on
> rd_indam->ampredlocks).  So I think if you want to see a serialization
> anomaly you'll need multiple indexes (so that index page locks don't
> hide the problem), a long enough HOT chain and then probably several
> transactions to be able to miss a cycle that should be picked up by
> the logic in [1].  I'm out of steam for this problem today though.
>
> The simple test from the report[3] that resulted in commit 81fbbfe3352
> doesn't work for me (ie with permutation "r1" "r2" "w1" "w2" "c1" "c2"
> twice in a row).  The best I've come up with so far is an assertion
> that we predicate-lock the same row version that we emitted to the
> user, when reached via an index lookup that visits a HOT row.  The
> test outputs 'f' for master, but 't' with the change to heapam.c.
>
Here is an example from the multiple-row-versions isolation test which
fails if we perform in-place updates for zheap. I think the same will
be relevant if we lock the root tuple id instead of the tuple itself.
Step 1: T1-> BEGIN; Read FROM t where id=100;
Step 2: T2-> BEGIN; UPDATE t where id=100; COMMIT; (creates T1->T2)
Step 3: T3-> BEGIN; UPDATE t where id=100; Read FROM t where id=50;
Step 4: T4-> BEGIN; UPDATE t where id= 50; Read FROM t where id=1;
COMMIT;  (creates T3->T4)
Step 5: T3-> COMMIT;
Step 6: T1-> UPDATE t where id=1; COMMIT;  (creates T4->T1)

At step 6, when the update statement is executed, T1 is rolled back
because of T3->T4->T1.

But for zheap, step 3 also creates a dependency T1->T3 because of
in-place update. When T4 commits in step 4, it marks T3 as doomed
because of T1 --> T3 --> T4. Hence, in step 5, T3 is rolled back.

[1]  Re: In-place updates and serializable transactions:
https://www.postgresql.org/message-id/CAGz5QCJzreUqJqHeXrbEs6xb0zCNKBHhOj6D9Tjd3btJTzydxg%40mail.gmail.com

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Kuntal Ghosh
On Thu, Aug 1, 2019 at 2:36 AM Andres Freund  wrote:
> > > >
> > > > I think the only part its doing for sub-transaction is
> > > > SubTransGetTopmostTransaction(xid). If xid passed to this function is
> > > > already top most transaction which is case for zheap and zedstore, then
> > > > there is no downside to keeping that code here in common place.
> > >
> > > Well, it's far from a cheap function. It'll do unnecessary on-disk
> > > lookups in many cases. I'd call that quite a downside.
> > >
> >
> > Okay, agree, its costly function and better to avoid the call if possible.
> >
> > Instead of moving the handling out of the function, how do feel about
> > adding boolean isTopTransactionId argument to function
> > CheckForSerializableConflictOut(). The AMs, which implicitly know, only
> > pass top transaction Id to this function, can pass true and avoid the
> > function call to SubTransGetTopmostTransaction(xid). With this
> > subtransaction code remains in generic place and AMs intending to use it
> > continue to leverage the common code, plus explicitly clarifies the
> > behavior as well.
>
> Looking at the code as of master, we currently have:
>
> - PredicateLockTuple() calls SubTransGetTopmostTransaction() to figure
>   out a whether the tuple has been locked by the current
>   transaction. That check afaict just should be
>   TransactionIdIsCurrentTransactionId(), without all the other
>   stuff that's done today.
>
Yeah, this is the only part where predicate locking uses the subxids.
Since predicate locking always uses the top xid, IMHO, it'll be good
to make this API independent of subxids.

>   TransactionIdIsCurrentTransactionId() imo ought to be optimized to
>   always check for the top level transactionid first - that's a good bet
>   today, but even moreso for the upcoming AMs that won't have separate
>   xids for subtransactions.  Alternatively we shouldn't make that a
>   binary search for each subtrans level, but just have a small
>   simplehash hashtable for xids.
A check for the top transaction id first and the usage of a simplehash
table sound like good optimizations. But, I'm not sure whether these
changes should be part of this patch or a separate one.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Kuntal Ghosh
Hello Thomas,

Here are some review comments on 0003-Add-undo-log-manager.patch. I've
tried to avoid duplicate comments as much as possible.

1. In UndoLogAllocate,
+ * time this backend as needed to write to an undo log at all or because
s/as/has

+ * Maintain our tracking of the and the previous transaction start
Do you mean current log's transaction start as well?

2. In UndoLogAllocateInRecovery,
we try to find the current log from the first undo buffer. So, after a
log switch, we always have to register at least one buffer from the
current undo log first. If we're updating something in the previous
log, the respective buffer should be registered after that. I think we
should document this in the comments.

3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full),
it seems the 'full' parameter is not used anywhere. Do we still need this?

+ /* It's been recycled.  SO it must have been entirely discarded. */
s/SO/So

4. In CleanUpUndoCheckPointFiles,
we can emit a debug2 message with something similar to : 'removed
unreachable undo metadata files'

+ if (unlink(path) != 0)
+ elog(ERROR, "could not unlink file \"%s\": %m", path);
According to my observation, whenever we deal with a file operation,
we usually emit an ereport message with errcode_for_file_access().
Should we change it to ereport? There are other file operations as
well, including read(), OpenTransientFile(), etc.

5. In CheckPointUndoLogs,
+ /* Capture snapshot while holding each mutex. */
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ serialized[num_logs++] = slot->meta;
+ LWLockRelease(&slot->mutex);
why do we need an exclusive lock to read something from the slot? A
share lock seems to be sufficient.

pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called
after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE)
without calling pgstat_report_wait_end(). I think you've done the
same to avoid an extra function call. But, it differs from other
places in the PG code. Perhaps, we should follow this approach
everywhere.

6. In StartupUndoLogs,
+ if (fd < 0)
+ elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
assuming your agreement upon changing above elog to ereport, the
message should be more user friendly. May be something like 'cannot
open pg_undo file'.

+ if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta))
The usage of sizeof doesn't look like a problem. But, we can save
some extra padding bytes at the end if we use the (offsetof + sizeof)
approach, similar to other places in PG.

7. In free_undo_log_slot,
+ /*
+ * When removing an undo log from a slot in shared memory, we acquire
+ * UndoLogLock, log->mutex and log->discard_lock, so that other code can
+ * hold any one of those locks to prevent the slot from being recycled.
+ */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ Assert(slot->logno != InvalidUndoLogNumber);
+ slot->logno = InvalidUndoLogNumber;
+ memset(&slot->meta, 0, sizeof(slot->meta));
+ LWLockRelease(&slot->mutex);
+ LWLockRelease(UndoLogLock);
you've not taken the discard_lock as mentioned in the comment.

8. In find_undo_log_slot,
+ * 1.  If the calling code knows that it is attached to this lock or is the
s/lock/slot

+ * 2.  All other code should acquire log->mutex before accessing any members,
+ * and after doing so, check that the logno hasn't moved.  If it is not, the
+ * entire undo log must be assumed to be discarded (as if this function
+ * returned NULL) and the caller must behave accordingly.
Perhaps, you meant '..check that the logno remains same. If it is not..'.

+ /*
+ * If we didn't find it, then it must already have been entirely
+ * discarded.  We create a negative cache entry so that we can answer
+ * this question quickly next time.
+ *
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */
This is an interesting thought. But, I'm wondering how we are going to
search the discarded logno in the simple hash. I guess that's why it's
in the TODO list.

9. In attach_undo_log,
+ * For now we have a simple linked list of unattached undo logs for each
+ * persistence level.  We'll grovel though it to find something for the
+ * tablespace you asked for.  If you're not using multiple tablespaces
s/though/through

+ if (slot == NULL)
+ {
+ if (UndoLogShared->next_logno > MaxUndoLogNumber)
+ {
+ /*
+ * You've used up all 16 exabytes of undo log addressing space.
+ * This is a difficult state to reach using only 16 exabytes of
+ * WAL.
+ */
+ elog(ERROR, "undo log address space exhausted");
+ }
looks like a potential unlikely() condition.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Some reloptions non-initialized when loaded

2019-06-19 Thread Kuntal Ghosh
Hello,

On Wed, Jun 19, 2019 at 10:23 AM Michael Paquier  wrote:
>
> Hi all,
>
> While looking at this code, I have noticed that a couple of reloptions
> which are not toast-specific don't get properly initialized.
> toast_tuple_target and parallel_workers are the ones standing out.
>
Do we also need to initialize vacuum_cleanup_index_scale_factor?



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Adaptive query optimization

2019-06-12 Thread Kuntal Ghosh
On Thu, Jun 13, 2019 at 5:49 AM Tomas Vondra
 wrote:
>
> For example, we might require 1000 samples for a given node (say, scan
> with some quals), before we start using it to tweak the estimates. Once
> we get the number of estimates, we can continue collecting more data,
> and once in a while update the correction. This would require some care,
> of course, because we need to know what coefficient was used to compute
> the estimate, but that's solvable by having some sort of epoch.
>
> Of course, the question is what number should we use, but overall this
> would be a much lower-overhead way to do the learning.
>
> Unfortunately, the learning as implemented in the patch does not allow
> this. It pretty much requires dedicated learning phase with generated
> workload, in a single process.
>
> But I think that's solvable, assuming we:
>
> 1) Store the data in shared memory, instead of a file. Collect data from
> all backends, instead of just a single one, etc.
>
> 2) Make the decision for individual entries, depending on how many
> samples we have for it.
>
Sounds good. I was trying to think whether we can maintain a running
coefficient. In that way, we don't have to store the samples. But,
calculating a running coefficient for more than two variables (with
some single pass algorithm) seems to be a hard problem. Moreover, it
can introduce significant misestimation. Your suggested approach works
better.
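
To make the idea a bit more concrete, here is a minimal SQL sketch of
such a per-node feedback store (the table, its columns and the
1000-sample threshold are purely illustrative assumptions, not part of
the patch):

-- hypothetical per-node feedback, one row per observed execution
CREATE TABLE aqo_node_feedback (
    node_signature text,             -- e.g. hash of (relation, qual list)
    epoch          integer,          -- correction in effect when the sample was taken
    estimated_rows double precision,
    actual_rows    double precision
);

-- derive a correction only for nodes with enough samples from the same epoch
SELECT node_signature,
       sum(actual_rows) / nullif(sum(estimated_rows), 0) AS correction
  FROM aqo_node_feedback
 WHERE epoch = 1
 GROUP BY node_signature
HAVING count(*) >= 1000;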

> I suggest we try to solve one issue at a time. I agree advising which
> indexes to create is a very interesting (and valuable) thing, but I see
> it as an extension of the AQO feature. That is, basic AQO (tweaking row
> estimates) can work without it.
>
+1

> >> Now, if someone uses this same scan in a join, like for example
> >>
> >>SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
> >> WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
> >>   AND (t2.x = ? AND t2.y = ?)
> >>
> >> then we can still apply the same correction to the t1 scan (I think).
> >> But then we can also collect data for the t1-t2 join, and compute a
> >> correction coefficient in a similar way. It requires a bit of care
> >> because we need to compensate for misestimates of inputs, but I think
> >> that's doable.
> >>
> >That'll be an interesting work. For the above query, we can definitely
> >calculate the correction coefficient of t1-t2 join given (t1.a = ? AND
> >t1.b = ? AND t1.c < ?) and
> >(t2.x = ? AND t2.y = ?) are true. But, I'm not sure how we can
> >extrapolate that value for t1-t2 join.
>
> I'm not sure I see the problem? Essentially, we need to know the sizes
> of the join inputs, i.e.
>
> t1 WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
>
> t2 WHERE (t2.x = ? AND t2.y = ?)
>
> (which we know, and we know how to correct the estimate), and then the
> selectivity of the join condition. Which we also know.
>
> Obviously, there's a chance those parts (clauses at the scan / join
> level) are correlated, which could make this less accurate.
This is exactly what my concern is. The base predicate selectivities
of t1 and t2 should have an impact on the calculation of the
correction coefficient. If those selectivities are low, the
misestimation (which is actual/estimate) should not affect the t1-t2
join correction coefficient much.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Adaptive query optimization

2019-06-12 Thread Kuntal Ghosh
ent
> when computing the scan estimate. (And we need to be careful about
> collection new data, because the estimates will include this correction.
> But that can be done by tracking "epoch" of the plan.)
>
> Now, if someone uses this same scan in a join, like for example
>
>SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
> WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
>   AND (t2.x = ? AND t2.y = ?)
>
> then we can still apply the same correction to the t1 scan (I think).
> But then we can also collect data for the t1-t2 join, and compute a
> correction coefficient in a similar way. It requires a bit of care
> because we need to compensate for misestimates of inputs, but I think
> that's doable.
>
That'll be interesting work. For the above query, we can definitely
calculate the correction coefficient of t1-t2 join given (t1.a = ? AND
t1.b = ? AND t1.c < ?) and
(t2.x = ? AND t2.y = ?) are true. But, I'm not sure how we can
extrapolate that value to the t1-t2 join in general.
>
> As far as I know Oleg's AQO is now used by Amason.
> So it is something more than just PoC. But certainly there are still many 
> problems
> and my experiments with JOB benchmark shown that there are still a lot of 
> things to improve.
>
Nice.

[1] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.921&rep=rep1&type=pdf
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [pg_rewind] cp: cannot stat ‘pg_wal/RECOVERYHISTORY’: No such file or directory

2019-06-10 Thread Kuntal Ghosh
On Mon, Jun 10, 2019 at 7:19 PM Robert Haas  wrote:
>
> On Mon, Jun 10, 2019 at 7:08 AM Kuntal Ghosh  
> wrote:
> > This can surely be fixed from the script. While configuring the old
> > master as a standby server, clear/modify the settings in
> > postgresql.auto.conf. But, it contradicts with the comment in the file
> > which forbids the user from editing the file.
>
> The user isn't really forbidden from editing the file.  They can do so
> safely when the server is down.
>
> This whole thing looks like a nonissue to me.  If you set it up wrong,
> it won't work.  So don't do that.
>
Yeah. Sounds fair.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Why to index a "Recently DEAD" tuple when creating index

2019-06-10 Thread Kuntal Ghosh
On Mon, Jun 10, 2019 at 5:26 PM Tom Lane  wrote:
>
> Kuntal Ghosh  writes:
> >> 2.   If we only support "Read Committed" isolation level,  is there a safe 
> >> way to not index such data?
>
> > I can't think of a case where the RECENTLY_DELETED tuple needs to be
> > indexed in "Read Committed" case.
>
> I think you're making dangerously optimistic assumptions about how
> long a query snapshot might survive in READ COMMITTED mode.
>
> In particular, I suspect you're reasoning that the new index couldn't
> be used except by a freshly-created query plan, which is possibly
> true, and that such a plan must be used with a freshly-created
> snapshot, which is simply wrong.  A counterexample could be built
> using a SQL or PL function that's marked STABLE, because such a
> function is defined to be executed using the calling query's
> snapshot.  But it'll make query plans using current reality.
>
Wow, I had not thought of this scenario. I was also not aware of this
different snapshot usage. I'll debug it. Thank you, Tom.

So, the READ COMMITTED mode will also cause this kind of issue.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [pg_rewind] cp: cannot stat ‘pg_wal/RECOVERYHISTORY’: No such file or directory

2019-06-10 Thread Kuntal Ghosh
Hello,

On Wed, Jun 5, 2019 at 11:55 AM tushar  wrote:
I can see two different problems in this setup.

> > 2)Slave Setup  ->  ./pg_basebackup -PR -X stream -c fast -h 127.0.0.1
> > -U centos -p 5432 -D slave
> > restore_command='cp %p /tmp/archive_dir1/%f'
> > "
> > 7)Modify old master/postgresql.conf file -
> > restore_command='cp %p /tmp/archive_dir1/%f'
When we define a restore command, we tell the server how to copy a
WAL file from the archive. So, it should be
restore_command='cp /tmp/archive_dir1/%f %p'

This is the reason you're getting the following errors:
> > cp: cannot stat ‘pg_wal/RECOVERYHISTORY’: No such file or directory
> > cp: cannot stat ‘pg_wal/RECOVERYXLOG’: No such file or directory


> > 2019-05-27 18:55:47.424 IST [25513] FATAL:  the database system is
> > starting up
> > 2019-05-27 18:55:47.425 IST [25512] FATAL:  could not connect to the
> > primary server: FATAL:  the database system is starting up
This case looks interesting.

1. Master is running on port 5432.
2. A standby is created using pg_basebackup with the -R option. So,
pg_basebackup appends the primary connection settings to
postgresql.auto.conf so that the streaming replication can use the
same settings later on.
cat postgresql.auto.conf -> primary_conninfo = 'port=5432'
3. The standby is started in port 5433.
4. Standby is promoted and old master is stopped.
5. Using pg_rewind, the old master is synchronized with the promoted
standby. As part of the process, it has copied the
postgresql.auto.conf of promoted standby in the old master.
6. Now, the old master is configured as a standby but the
postgresql.auto.conf still contains the following settings:
cat postgresql.auto.conf -> primary_conninfo = 'port=5432'
So, the old master tries to connect to the server on port 5432 and
finds itself which is still in recovery.

This can surely be fixed from the script. While configuring the old
master as a standby server, clear/modify the settings in
postgresql.auto.conf. But, that contradicts the comment in the file
which forbids the user from editing it.

Any thoughts?
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Why to index a "Recently DEAD" tuple when creating index

2019-06-10 Thread Kuntal Ghosh
On Mon, Jun 10, 2019 at 2:12 PM Alex  wrote:
> On Mon, Jun 10, 2019 at 4:10 PM Kuntal Ghosh  
> wrote:
>> I think what I'm trying to say is different.
>>
>> For my case, the sequence is as following:
>> 1. Transaction A has deleted a tuple, say t1 and got committed.
>> 2. Index A has been created successfully.
>> 3. Now, transaction B starts and use the index A to fetch the tuple
>> t1. While doing visibility check, transaction B gets to know that t1
>> has been deleted by a committed transaction A, so it can't see the
>> tuple. But, it creates a dependency edge that transaction A precedes
>> transaction B. This edge is required to detect a serializable conflict
>> failure.
>>
>> If you don't create the index entry, it'll not be able to create that edge.
>
>
> Thanks,  I got the difference now, but still not get the necessity of it.
> 1.   Assume we don't index it,  in which situation we can get a wrong result?

Consider the following sequence of three different transactions X, A and B:

1. Transaction X reads a tuple t2.
2. Transaction A updates the tuple t2, deletes a tuple t1 and gets
committed. So, transaction X precedes transaction A, i.e., X <- A.
3. Index A is created successfully.
4. Transaction B starts and uses the index A to fetch tuple t1. But,
it's already deleted by the committed transaction A. So, transaction A
precedes transaction B, i.e., A <- B.
5. At this point you have a dangerous structure X <- A <- B (see the
definition of a dangerous structure in
src/backend/storage/lmgr/README-SSI) in the graph, which can produce
an anomaly. For example, if X now tries to update another tuple
previously read by B, you'll have the dependency B <- X. But, you
already have X <- B, which leads to a serialization conflict.
Postgres tries to resolve this anomaly by rolling back one of the
transactions.

In your case, it'll be difficult to detect.

> 2.   If we only support "Read Committed" isolation level,  is there a safe 
> way to not index such data?
>
I can't think of a case where the RECENTLY_DEAD tuple needs to be
indexed in the "Read Committed" case. So, your suggestion is likely to
work logically at the "Read Committed" isolation level. But, I'm not
sure whether you'll encounter any assertion failures in the vacuum
path or concurrent index paths.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Why to index a "Recently DEAD" tuple when creating index

2019-06-10 Thread Kuntal Ghosh
On Mon, Jun 10, 2019 at 1:30 PM Alex  wrote:
>
>
>
> On Mon, Jun 10, 2019 at 3:28 PM Kuntal Ghosh  
> wrote:
>>
>> On Mon, Jun 10, 2019 at 12:15 PM Alex  wrote:
>>>
>>>  HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */
>>>
>>>  It is a tuple which has been deleted AND committed but before the delete 
>>> there is a transaction started but not committed. Let call this transaction 
>>> as Transaction A.
>>>
>>> if we create index on this time, Let's call this index as Index A, it still 
>>> index this record.  my question is why need this.
>>>
>> In this case, the changes of the tuple is not visible yet. Now suppose, your 
>> transaction A is serializable and you've another serializable transaction B 
>> which can see the index A. It generates a plan that requires to fetch the 
>> deleted tuple through an index scan. If the tuple is not present in the 
>> index, how are you going to create a conflict edge between transaction A and 
>> transaction B?
>>
>> Basically, you need to identify the following clause to detect serializable 
>> conflicts:
>> Transaction A precedes transaction B. (Because, transaction A has deleted a 
>> tuple and it's not visible to transaction B)
>>
>
> thanks Ghosh.  Looks your answer is similar with my previous point 
> (transaction is  serializable).   actually if the transaction B can't see the 
> “deleted" which has been committed,  should it see the index A which is 
> created after the "delete" transaction?
>
I think what I'm trying to say is different.

For my case, the sequence is as follows:
1. Transaction A has deleted a tuple, say t1, and got committed.
2. Index A has been created successfully.
3. Now, transaction B starts and uses the index A to fetch the tuple
t1. While doing the visibility check, transaction B gets to know that
t1 has been deleted by the committed transaction A, so it can't see
the tuple. But, it creates a dependency edge that transaction A
precedes transaction B. This edge is required to detect a
serialization failure.

If you don't create the index entry, it won't be possible to create that edge.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Why to index a "Recently DEAD" tuple when creating index

2019-06-10 Thread Kuntal Ghosh
On Mon, Jun 10, 2019 at 12:15 PM Alex  wrote:

>  HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */
>
>  It is a tuple which has been deleted AND committed but before the delete
> there is a transaction started but not committed. Let call this transaction
> as Transaction A.
>
> if we create index on this time, Let's call this index as Index A, it
> still index this record.  my question is why need this.
>
In this case, the changes to the tuple are not visible yet. Now suppose
your transaction A is serializable and you have another serializable
transaction B which can see the index A. It generates a plan that requires
fetching the deleted tuple through an index scan. If the tuple is not
present in the index, how are you going to create a conflict edge between
transaction A and transaction B?

Basically, you need to identify the following clause to detect serializable
conflicts:
Transaction A precedes transaction B. (Because, transaction A has deleted a
tuple and it's not visible to transaction B)

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: Questions of 'for update'

2019-06-10 Thread Kuntal Ghosh
On Mon, Jun 10, 2019 at 12:42 PM Etsuro Fujita 
wrote:

> Hi,
>
> On Mon, Jun 10, 2019 at 3:50 PM Kuntal Ghosh 
> wrote:
> > On Mon, Jun 10, 2019 at 11:31 AM Zhenghua Lyu  wrote:
> >> 2. Is the case above a bug or a feature?
> >>
> > IMHO, it looks like an expected behaviour of a correct transaction
> management implementation.
>
> This is documented behavior; see the Caution for The Locking Clause on
> the SELECT reference page:
> https://www.postgresql.org/docs/11/sql-select.html
>
>
Great. It also suggests a workaround.
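
For reference, a sketch of the documented workaround applied to the table
in this thread (untested here; note that it locks every row of t, not just
the rows that are finally returned):

select * from (select * from t for update) ss order by c limit 10;

Because the rows are locked below the Sort and Limit nodes, any re-checked
values get sorted afterwards, so the output order holds.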


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: Questions of 'for update'

2019-06-09 Thread Kuntal Ghosh
Hello,

On Mon, Jun 10, 2019 at 11:31 AM Zhenghua Lyu  wrote:

>
> 1. why after emitting `lockrows` plannode,  the result can no longer be
> assumed sorted?
>
The plan corresponding to your select query is as follows:
  QUERY PLAN
---
Limit
  ->  LockRows
->  Sort
  Sort Key: c
  ->  Seq Scan on t

In the LockRows node, the executor tries to lock each tuple provided
by the Sort node. In the meantime, it's possible that some transaction
updates a tuple (which is to be locked by the current transaction) and gets
committed. These changes will be visible to the current transaction if it
runs at an isolation level lower than REPEATABLE READ. So, the
current transaction needs to check whether the updated tuple still
satisfies the quals (in your query there are no quals, so it always
does). If it does, it returns the updated tuple.
Since the sort was performed by an earlier node, the output is no
longer guaranteed to be sorted.



> 2. Is the case above a bug or a feature?
>
IMHO, it looks like an expected behaviour of a correct transaction
management implementation. The argument can be that the snapshot is
consistent throughout all the nodes. Whatever tuple you've fetched from the
bottom level is locked correctly.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: Avoiding deadlock errors in CREATE INDEX CONCURRENTLY

2019-05-15 Thread Kuntal Ghosh
Hello,

On Wed, May 15, 2019 at 1:45 PM Goel, Dhruv  wrote:

>
>
> Proposed Solution:
>
> We remove the third wait state completely from the concurrent index build.
> When we mark the index as ready, we also mark “indcheckxmin” to true which
> essentially enforces Postgres to not use this index for older snapshots.
>
>
>
I think there is a problem in the proposed solution. When phase 3 is
reached, the index is valid. But it might not contain tuples deleted
just before the reference snapshot was taken. Hence, we wait for those
transactions that might have an older snapshot. The TransactionXmin of these
transactions can be greater than the xmin of the pg_index entry for this
index.
Instead of waiting in the third phase, if we just set indcheckxmin to true,
the above transactions will be able to use the index, which is wrong
(because they won't find, via the index, the recently deleted tuples that
are still live according to their snapshots).

The respective code from get_relation_info:
if (index->indcheckxmin &&
    !TransactionIdPrecedes(HeapTupleHeaderGetXmin(indexRelation->rd_indextuple->t_data),
                           TransactionXmin))
{
    /* don't use this index */
}
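
In case it helps, the flag and the xmin it is compared against can be
inspected directly (illustrative query only; substitute the index name):

select indexrelid::regclass as index, indcheckxmin, xmin
from pg_index
where indexrelid = 'your_index_name'::regclass;

Here xmin is the system column of the pg_index row, i.e. the value the
planner compares against TransactionXmin.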

Please let me know if I'm missing something.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: PANIC :Call AbortTransaction when transaction id is no normal

2019-05-13 Thread Kuntal Ghosh
On Mon, May 13, 2019 at 7:07 PM Tom Lane  wrote:

> Michael Paquier  writes:
> > On Mon, May 13, 2019 at 01:25:19PM +0530, Kuntal Ghosh wrote:
> >> If we fix the issue in this way, we're certainly not going to do all
> >> those portal,locks,memory,resource owner cleanups that are done
> >> inside AbortTransaction() for a normal transaction ID. But, I'm not
> >> sure how relevant those steps are since the database is anyway
> >> shutting down.
>
> > And it is happening in bootstrap, meaning that the data folder is most
> > likely toast, and needs to be reinitialized.
>
> Indeed, initdb is going to remove the data directory if the bootstrap run
> crashes.
>
> But ... that code's been like that for decades and nobody's complained
> before.  Why are we worried about bootstrap's response to signals at all?
>
+1

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: PANIC :Call AbortTransaction when transaction id is no normal

2019-05-13 Thread Kuntal Ghosh
Hello,

On Mon, May 13, 2019 at 10:15 AM Thunder  wrote:

> I try to fix this issue and check whether it's normal transaction id
> before we do abort.
>
> diff --git a/src/backend/access/transam/xact.c
> b/src/backend/access/transam/xact.c
> index 20feeec327..dbf2bf567a 100644
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -4504,8 +4504,13 @@ RollbackAndReleaseCurrentSubTransaction(void)
>  void
>  AbortOutOfAnyTransaction(void)
>  {
> +   TransactionId xid = GetCurrentTransactionIdIfAny();
> TransactionState s = CurrentTransactionState;
>
> +   /* Check to see if the transaction ID is a permanent one because
> we cannot abort it */
> +   if (!TransactionIdIsNormal(xid))
> +   return;
> +
> /* Ensure we're not running in a doomed memory context */
> AtAbort_Memory();
>
> Can we fix in this way?
>
If we fix the issue in this way, we're certainly not going to do all those
portal, locks, memory, and resource owner cleanups that are done
inside AbortTransaction() for a normal transaction ID. But, I'm not sure
how relevant those steps are since the database is anyway shutting down.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: POC: Cleaning up orphaned files using undo logs

2019-05-10 Thread Kuntal Ghosh
Hello Thomas,

In pg_buffercache contrib module, the file pg_buffercache--1.3--1.4.sql is
missing. AFAICS, this file should be added as part of the following commit:
Add SmgrId to smgropen() and BufferTag
<https://github.com/EnterpriseDB/zheap/commit/d2eb13a80cb67d06c8798976c17fd760be0da332>

Otherwise, I'm not able to compile the contrib modules. I've also attached
the patch to fix the same.


On Fri, May 10, 2019 at 11:48 AM Thomas Munro 
wrote:

> On Thu, May 9, 2019 at 6:34 PM Dilip Kumar  wrote:
> > Patches can be applied on top of undo branch [1] commit:
> > (cb777466d008e656f03771cf16ec7ef9d6f2778b)
>
> Hello all,
>
> Here is a new patch set which includes all of the patches discussed in
> this thread in one go, rebased on today's master.  To summarise the
> main layers, from the top down we have:
>
>  0013:   undo-based orphaned file clean-up ($SUBJECT, a demo of
> undo technology)
>  0009-0010:  undo processing (execution of undo actions when rolling back)
>  0008:   undo records
>  0001-0007:  undo storage
>
> The main changes to the storage layer since the last time I posted the
> full patch stack:
>
> * pg_upgrade support: you can't have any live undo logs (much like 2PC
> transactions, we want to be free to change the format), but some work
> was required to make sure that all "discarded" undo record pointers
> from the old cluster still appear as discarded in the new cluster, as
> well as any from the new cluster
>
> * tweaks to various other src/bin tools that are aware of files under
> pgdata and were confused by undo segment files
>
> * the fsync of undo log segment files when they're created or recycled
> is now handed off to the checkpointer (this was identified as a
> performance problem for zheap)
>
> * code tidy-up, removing dead code (undo log rewind, prevlen, prevlog
> were no longer needed by patches higher up in the stack), removing
> global variables, noisy LOG messages about undo segment files now
> reduced to DEBUG1
>
> * new extension contrib/undoinspect, for developer use, showing what
> will be undone if you abort:
>
> postgres=# begin;
> BEGIN
> postgres=# create table t();
> CREATE TABLE
> postgres=# select * from undoinspect();
>  urecptr  |  rmgr   | flags | xid |
> description
>
> --+-+---+-+-
>  32FA | Storage | P,T   | 487 | CREATE dbid=12934,
> tsid=1663, relfile=16393
> (1 row)
>
> One silly detail: I had to change the default max_worker_processes
> from 8 to 12, because otherwise a couple of tests run with fewer
> parallel workers than they expect, due to undo worker processes using
> up slots.  There is probably a better solution to that problem.
>
> I put the patches in a tarball here, but they are also available from
> https://github.com/EnterpriseDB/zheap/tree/undo.
>
> --
> Thomas Munro
> https://enterprisedb.com
>


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


0001-Add-missing-file-in-pg_buffercache.patch
Description: Binary data


Re: Regression test PANICs with master-standby setup on same machine

2019-04-22 Thread Kuntal Ghosh
On Mon, Apr 22, 2019 at 6:07 PM Kyotaro HORIGUCHI
 wrote:
> If you don't have a problem using TAP test suite, tablespace is
> allowed with a bit restricted steps using the first patch in my
> just posted patchset[1]. This will work for you if you are okay
> with creating a standby after creating a tablespace. See the
> second patch in the patchset.
>
> If you stick on shell script, the following steps allow tablespaces.
>
> 1. Create tablespace directories for both master and standby.
> 2. Create a master then start.
> 3. Create tablespaces on the master.
> 4. Create a standby using pg_basebackup --tablespace_mapping==
> 5. Start the standby.
>
> [1] 
> https://www.postgresql.org/message-id/20190422.211933.156769089.horiguchi.kyot...@lab.ntt.co.jp
>
Thank you for the info. I'll try the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Regression test PANICs with master-standby setup on same machine

2019-04-22 Thread Kuntal Ghosh
Hello hackers,

With a master-standby setup configured on the same machine, I'm
getting a panic in tablespace test while running make installcheck.

1. CREATE TABLESPACE regress_tblspacewith LOCATION 'blah';
2. DROP TABLESPACE regress_tblspacewith;
3. CREATE TABLESPACE regress_tblspace LOCATION 'blah';
-- do some operations in this tablespace
4. DROP TABLESPACE regress_tblspace;

The master panics at the last statement when standby has completed
applying the WAL up to step 2 but hasn't started step 3.
PANIC:  could not fsync file
"pg_tblspc/16387/PG_12_201904072/16384/16446": No such file or
directory

The reason is that both tablespaces point to the same location. When
the master tries to delete the new tablespace (and requests a checkpoint),
the corresponding folder has already been deleted by the standby while
applying the WAL that drops the old tablespace. I'm able to reproduce the
issue with the attached script.

sh standby-server-setup.sh
make installcheck

I accept that configuring master-standby on the same machine for this
test is not okay. But, can we avoid the PANIC somehow? Or, is this
intentional and I should not include testtablespace in this case?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
echo "Cleaning.."
pkill -9 postgres
rm -rf *_logfile
rm /tmp/failover.log
rm -rf /tmp/archive_dir
mkdir /tmp/archive_dir

export PGPORT=5432			#MASTER PORT
PGSQL_DIR=$(pwd)
PGSQL_BIN=$PGSQL_DIR/install/bin
PGSQL_MASTER=$PGSQL_BIN/master		#DATA FOLDER FOR PRIMARY/MASTER SERVER
PGSQL_STANDBY=$PGSQL_BIN/standby	#DATA FOLDER FOR BACKUP/STANDBY SERVER

#Cleanup the master and slave data directory and create a new one.
rm -rf $PGSQL_MASTER $PGSQL_STANDBY
mkdir $PGSQL_MASTER $PGSQL_STANDBY
chmod 700 $PGSQL_MASTER
chmod 700 $PGSQL_STANDBY

#Initialize MASTER
$PGSQL_BIN/initdb -D $PGSQL_MASTER
echo "wal_level = hot_standby" >> $PGSQL_MASTER/postgresql.conf
echo "max_wal_senders = 3" >> $PGSQL_MASTER/postgresql.conf
echo "wal_keep_segments = 50" >> $PGSQL_MASTER/postgresql.conf
echo "hot_standby = on" >> $PGSQL_MASTER/postgresql.conf
#echo "max_standby_streaming_delay= -1" >> $PGSQL_MASTER/postgresql.conf
#echo "wal_consistency_checking='all'" >> $PGSQL_MASTER/postgresql.conf
echo "max_wal_size= 10GB" >> $PGSQL_MASTER/postgresql.conf
#echo "log_min_messages= debug2" >> $PGSQL_MASTER/postgresql.conf

#Setup replication settings

#Start Master
export PGPORT=5432
echo "Starting Master.."
$PGSQL_BIN/pg_ctl -D $PGSQL_MASTER -c -w -l master_logfile start

#Perform Backup in the Standy Server
$PGSQL_BIN/pg_basebackup --pgdata=$PGSQL_STANDBY -P
echo "primary_conninfo= 'port=5432'" >> $PGSQL_STANDBY/postgresql.conf
touch $PGSQL_STANDBY/standby.signal
#echo "standby_mode = on" >> $PGSQL_STANDBY/postgresql.conf
#Start STANDBY
export PGPORT=5433
echo "Starting Slave.."
$PGSQL_BIN/pg_ctl -D $PGSQL_STANDBY -c -w -l slave_logfile start


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-02-26 Thread Kuntal Ghosh
On Tue, Feb 26, 2019 at 6:46 PM Simon Riggs  wrote:
>
> On Thu, 21 Feb 2019 at 15:38, Kuntal Ghosh  wrote:
>
>>
>> Thank you for the patch. It seems to me that while performing COPY
>> FREEZE, if we've copied tuples in a previously emptied page
>
>
> There won't be any previously emptied pages because of the pre-conditions 
> required for FREEZE.
>
Right, I missed that part. Thanks for pointing that out. But, this
optimization is still possible for copying frozen tuples in a newly
created page, right? If the current backend allocates a new page and copies a
bunch of frozen tuples into that page, it can set PD_ALL_VISIBLE in
the same operation.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-02-21 Thread Kuntal Ghosh
Hello Pavan,

Thank you for the patch. It seems to me that while performing COPY
FREEZE, if we've copied tuples into a previously emptied page, we can
set PageSetAllVisible(page) in heap_multi_insert itself. Something
like,

bool init = (ItemPointerGetOffsetNumber(&(heaptuples[ndone]->t_self)) == FirstOffsetNumber &&
             PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1);
if (init && (options & HEAP_INSERT_FROZEN))
    PageSetAllVisible(page);

Later, you can skip the pages for
CheckAndSetAllVisibleBulkInsertState() where the PD_ALL_VISIBLE flag is
already set. Do you think it's correct?


On Thu, Feb 21, 2019 at 11:35 AM Pavan Deolasee
 wrote:
>
> Hi,
>
> Jeff Janes raised an issue [1] about PD_ALL_VISIBLE not being set correctly 
> while loading data via COPY FREEZE and had also posted a draft patch.
>
> I now have what I think is a more complete patch. I took a slightly different 
> approach and instead of setting PD_ALL_VISIBLE bit initially and then not 
> clearing it during insertion, we now recheck the page for all-frozen, 
> all-visible tuples just before switching to a new page. This allows us to 
> then also mark set the visibility map bit, like we do in vacuumlazy.c
>
> Some special treatment is required to handle the last page before bulk insert 
> it shutdown. We could have chosen not to do anything special for the last 
> page and let it remain unfrozen, but I thought it makes sense to take that 
> extra effort so that we can completely freeze the table and set all VM bits 
> at the end of COPY FREEZE.
>
> Let me know what you think.
>
> Thanks,
> Pavan
>
> [1] 
> https://www.postgresql.org/message-id/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com
>
> --
>  Pavan Deolasee   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: In-place updates and serializable transactions

2018-11-18 Thread Kuntal Ghosh
On Fri, Nov 16, 2018 at 4:07 AM Kevin Grittner  wrote:
>
> I would say this should be considered a resounding success.  We should
> probably add an alternative result file to cover this case, but
> otherwise I don't see anything which requires action.
>
As of now, we've added an extra expected file for the same.

> Congratulations on making this work so well!
>
Thanks for developing the serializable module in a nice decoupled architecture.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: In-place updates and serializable transactions

2018-11-15 Thread Kuntal Ghosh
On Thu, Nov 15, 2018 at 3:09 AM Kevin Grittner  wrote:
>
> On Wed, Nov 14, 2018 at 5:43 AM Joshua Yanovski
>  wrote:
>
> > This is only a personal anecdote, but from my own experience with 
> > serializability, this sort of blind update isn't often contended in 
> > realistic workloads.
>
> > So, if this only affects transactions with blind updates, I doubt it will 
> > cause much pain in real workloads (even though it might look bad in 
> > benchmarks which include a mix of blind writes and rmw operations).  
> > Particularly if it only happens if you explicitly opt into zheap storage.
>
> I agree with all of that, but will be very interested in what
> failures, if any, kick out from the "isolation" test set when all
> tables are created using zheap.  I added all the common failure
> patterns I had seen to that set, and other have filled in some corner
> cases I missed since then, so if everything there passes I would not
> worry about it at all.  If we do see some failures, we can take
> another look to see whether any action is needed.
Thanks Kevin for your explanation. The isolation test suites are really
helpful for testing serializable scenarios for zheap.

The test multiple-row-versions is failing because of the
above-discussed scenario. I've attached the regression diff file and
the result output file for the same. Here is a brief summary of the
test w.r.t. heap:

Step 1: T1-> BEGIN; Read FROM t where id=100;
Step 2: T2-> BEGIN; UPDATE t where id=100; COMMIT; (creates T1->T2)
Step 3: T3-> BEGIN; UPDATE t where id=100; Read FROM t where id=50;
Step 4: T4-> BEGIN; UPDATE t where id= 50; Read FROM t where id=1;
COMMIT;  (creates T3->T4)
Step 5: T3-> COMMIT;
Step 6: T1-> UPDATE t where id=1; COMMIT;  (creates T4->T1,)

At step 6, when the update statement is executed, T1 is rolled back
because of T3->T4->T1.

But for zheap, step 3 also creates a dependency T1->T3 because of
in-place update. When T4 commits in step 4, it marks T3 as doomed
because of T1 --> T3 --> T4. Hence, in step 5, T3 is rolled back.
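
For anyone who wants to reproduce this by hand, the steps roughly translate
to the following psql transcript (only a sketch, not the actual
multiple-row-versions spec: the table, values, and statements are made up,
and every session runs at the serializable isolation level):

-- setup (once)
create table t (id int primary key, val int);
insert into t values (1, 0), (50, 0), (100, 0);

-- T1
begin isolation level serializable;
select val from t where id = 100;                   -- step 1

-- T2
begin isolation level serializable;
update t set val = val + 1 where id = 100; commit;  -- step 2, creates T1 -> T2

-- T3
begin isolation level serializable;
update t set val = val + 1 where id = 100;
select val from t where id = 50;                    -- step 3

-- T4
begin isolation level serializable;
update t set val = val + 1 where id = 50;
select val from t where id = 1; commit;             -- step 4, creates T3 -> T4

-- T3
commit;                                             -- step 5 (zheap rolls back T3 here)

-- T1
update t set val = val + 1 where id = 1; commit;    -- step 6 (heap rolls back T1 here)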


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


multiple-row-versions.out
Description: Binary data


regression.diffs
Description: Binary data


Re: In-place updates and serializable transactions

2018-11-15 Thread Kuntal Ghosh
On Wed, Nov 14, 2018 at 5:13 PM Joshua Yanovski
 wrote:
>
> This is only a personal anecdote, but from my own experience with 
> serializability, this sort of blind update isn't often contended in realistic 
> workloads.  The reason is that (again, IME), most blind writes are either 
> insertions, or "read-writes in disguise" (the client read an old value in a 
> different transaction); in the latter case, the data in question are often 
> logically "owned" by the client, and will therefore rarely be contended.  I 
> think there are two major exceptions to this: transactions that perform 
> certain kinds of monotonic updates (for instance, marking a row complete in a 
> worklist irrespective of whether it was already completed), and automatic 
> bulk updates.  However, these were exactly the classes of transactions that 
> we already ran under a lower isolation level than serializability, since they 
> have tightly constrained shapes and don't benefit much from the additional 
> guarantees.
>
> So, if this only affects transactions with blind updates, I doubt it will 
> cause much pain in real workloads (even though it might look bad in 
> benchmarks which include a mix of blind writes and rmw operations).  
> Particularly if it only happens if you explicitly opt into zheap storage.
>
Thanks Joshua for sharing your input on this. I'm not aware of any
realistic workloads for serializable transactions. So, it is really
helpful.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



In-place updates and serializable transactions

2018-11-13 Thread Kuntal Ghosh
Hello hackers,

Currently, we're working on the serializable implementation for
zheap. As mentioned in the README-SSI documentation[1], there is one
aspect of PostgreSQL's SSI implementation that can make its conflict
detection behaviour differ from that of storage engines that support
in-place updates.

The heap storage in PostgreSQL does not use "update in place" with a
rollback log for its MVCC implementation.  Where possible it uses
"HOT" updates on the same page (if there is room and no indexed value
is changed). For non-HOT updates the old tuple is expired in place and
a new tuple is inserted at a new location.  Because of this
difference, a tuple lock in PostgreSQL doesn't automatically lock any
other versions of a row. We can take the following example from the
doc to understand the situation in more detail:

T1 ---rw---> T2 ---ww--->T3

If transaction T1 reads a row version (thus acquiring a predicate lock
on it) and a second transaction T2 updates that row version (thus
creating a rw-conflict graph edge from T1 to T2), must a third
transaction T3 which re-updates the new version of the row also have a
rw-conflict in from T1 to prevent anomalies?  In other words,  does it
matter whether we recognize the edge T1 --rw--> T3? The document also
includes a nice proof for why we don't try to copy or expand a tuple
lock to any other versions of the row or why we don't have to
explicitly recognize the edge T1 --rw--> T3.

In PostgreSQL, the predicate locking is implemented using the tuple
id. In zheap, since we perform updates in-place, we don't change the
tuple id. So, in the above example, we easily recognize the edge
T1 --rw--> T3. This may increase the number of false positives in
certain cases. In the above example, if we introduce another
transaction T4 such that T3 --rw--> T4 and T4 gets committed first,
then for zheap, T3 will be rolled back because of the dangerous structure
T1 --rw--> T3 --rw--> T4. But for heap, T3 can be committed (isolation
test case [2]). IMHO, this seems to be an acceptable behavior.

In brief, due to in-place updates, in some cases, the false positives
may increase for serializable transactions. Any thoughts?

[1] src/backend/storage/lmgr/README-SSI
[2] src/test/isolation/specs/multiple-row-versions.spec
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: zheap: a new storage format for PostgreSQL

2018-11-11 Thread Kuntal Ghosh
On Sat, Nov 10, 2018 at 8:51 PM Daniel Westermann
 wrote:
>
> >>Thanks. Initializing the variable seems like the right fix here.
>
> ... just had a warning when recompiling from the latest sources on CentOS 7:
>
> labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing 
> -fwrapv -fexcess-precision=standard -O2 -I../../../../src/include  
> -D_GNU_SOURCE -I/usr/include/libxml2   -c -o tpd.o tpd.c
> tpd.c: In function ‘TPDFreePage’:
> tpd.c:1003:15: warning: variable ‘curblkno’ set but not used 
> [-Wunused-but-set-variable]
>   BlockNumber  curblkno = InvalidBlockNumber;
>^
Thanks Daniel for testing zheap and reporting the issue. We'll push a
fix for the same.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

2018-11-05 Thread Kuntal Ghosh
Hello Thomas,

On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
 wrote:
>
> It passes make check on Unix and Windows, though currently it's
> failing some of the TAP tests for reasons I'm looking into (possibly
> due to bugs in the lower level patches, not sure).
>
I looked into the regression failures when the TAP tests are enabled.
It seems that we're not estimating and allocating the shared memory
for the rollback hash tables correctly. I've attached a patch to fix the
same.


--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


0001-Fix-shared-memory-size-for-rollback-hash-table.patch
Description: Binary data


Re: Unordered wait event ClogGroupUpdate

2018-10-23 Thread Kuntal Ghosh
On Wed, Oct 24, 2018 at 10:48 AM Michael Paquier  wrote:
> > That's a valid argument. Additionally, I've found
> > WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING and
> > WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING are added in a
> > non-alphabetical order in WaitEventIPC enum.
> And those ones are also incorrect after another lookup:
> - WAIT_EVENT_PARALLEL_FINISH
> - WAIT_EVENT_HASH_GROW_BATCHES_DECIDING
> - WAIT_EVENT_LOGICAL_APPLY_MAIN
> I don't see more of them..
Nice. Same here.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Unordered wait event ClogGroupUpdate

2018-10-23 Thread Kuntal Ghosh
On Wed, Oct 24, 2018 at 5:56 AM Michael Paquier  wrote:
> baaf272 has added support for group updates in clog, however it has
> added the wait event WAIT_EVENT_CLOG_GROUP_UPDATE in a non-alphabetical
> order.  There are many events, so keeping things in order helps users in
> finding them.
That's a valid argument. Additionally, I've found
WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING and
WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING are added in a
non-alphabetical order in WaitEventIPC enum.

> Are there any objections to the attached, which reorders things
> properly?  This is a patch for HEAD, for v11 I propose to only fix the
> documentation side of things to avoid an ABI breakage.
+1

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



In pageinspect, perform clean-up after testing gin-related functions

2018-07-11 Thread Kuntal Ghosh
Hello all,

In pageinspect/sql/gin.sql, we don't drop the table test1 at the end
of the test. IMHO, we should clean up at the end of a test. I've
attached the patch to perform the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From 1e4be3749eed9ff9d59f775d2bd4ad2b409a6d3c Mon Sep 17 00:00:00 2001
From: Kuntal Ghosh 
Date: Wed, 11 Jul 2018 12:21:13 +0530
Subject: [PATCH] In pageinspect, drop table after testing gin-related
 functions

---
 contrib/pageinspect/expected/gin.out | 1 +
 contrib/pageinspect/sql/gin.sql  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/contrib/pageinspect/expected/gin.out b/contrib/pageinspect/expected/gin.out
index 82f63b2..ef7570b 100644
--- a/contrib/pageinspect/expected/gin.out
+++ b/contrib/pageinspect/expected/gin.out
@@ -35,3 +35,4 @@ FROM gin_leafpage_items(get_raw_page('test1_y_idx',
 -[ RECORD 1 ]
 ?column? | t
 
+DROP TABLE test1;
diff --git a/contrib/pageinspect/sql/gin.sql b/contrib/pageinspect/sql/gin.sql
index d516ed3..423f5c5 100644
--- a/contrib/pageinspect/sql/gin.sql
+++ b/contrib/pageinspect/sql/gin.sql
@@ -17,3 +17,5 @@ SELECT COUNT(*) > 0
 FROM gin_leafpage_items(get_raw_page('test1_y_idx',
 (pg_relation_size('test1_y_idx') /
  current_setting('block_size')::bigint)::int - 1));
+
+DROP TABLE test1;
-- 
1.8.3.1



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-28 Thread Kuntal Ghosh
On Fri, Jun 29, 2018 at 11:04 AM, Andrey V. Lepikhov
 wrote:
> On 29.06.2018 10:00, Kuntal Ghosh wrote:
>>
>> On Wed, Jun 27, 2018 at 12:10 PM, Andrey V. Lepikhov
>>  wrote:
>>>
>>> I prepare third version of the patches. Summary:
>>> 1. Mask DEAD tuples at a page during consistency checking (See comments
>>> for
>>> the mask_dead_tuples() function).
>>
>>
>> - ItemIdSetDead(lp);
>> + if (target_index_deletion_factor > 0)
>> + ItemIdMarkDead(lp);
>> + else
>> + ItemIdSetDead(lp);
>> IIUC, you want to hold the storage of DEAD tuples to form the ScanKey
>> which is required for the index scan in the second phase of quick
>> vacuum strategy. To achieve that, you've only marked the tuple as DEAD
>> during pruning. Hence, PageRepairFragmentation cannot claim the space
>> for DEAD tuples since they still have storage associated with them.
>> But, you've WAL-logged it as DEAD tuples having no storage. So, when
>> the WAL record is replayed in standby(or crash recovery), the tuples
>> will be marked as DEAD having no storage and their space may be
>> reclaimed by PageRepairFragmentation. Now, if you do byte-by-byte
>> comparison with wal_consistency tool, it may fail even for normal
>> tuple as well. Please let me know if you feel the same way.
>>
> Thanks for your analysis.
> In this development version of the patch I expect the same prune() strategy
> on a master and standby (i.e. target_index_deletion_factor is equal for
> both).
> In this case storage of a DEAD tuple holds during replay or recovery in the
> same way.
Okay. That's acceptable for now.

> On some future step of development I plan to use more flexible prune()
> strategy. This will require to append a 'isDeadStorageHold' field to the WAL
> record.
That sounds interesting. I'll be waiting for your next patches.

A few minor comments:
+ qsort((void *) vacrelstats->dead_tuples, vacrelstats->num_dead_tuples,
+       sizeof(ItemPointerData), tid_comparator);
+ ivinfo.isSorted = true;
My understanding is that vacrelstats->dead_tuples are already sorted
based on their tids. I'm referring to the following part in
lazy_scan_heap(),
for (blk=0 to nblocks)
{
 for (offset=1 to max offset)
 {
  if certain conditions are met store the tuple in
vacrelstats->dead_tuples using lazy_record_dead_tuple();
 }
}
So, you don't have to sort them again, right?

+ slot = MakeSingleTupleTableSlot(RelationGetDescr(hrel));
+ econtext->ecxt_scantuple = slot;
+ ExecDropSingleTupleTableSlot(slot);
You don't have to do this for every tuple. Before storing the tuple,
you can just call MemoryContextReset(econtext->ecxt_per_tuple_memory);
Perhaps, you can check IndexBuildHeapRangeScan for the details.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-28 Thread Kuntal Ghosh
On Wed, Jun 27, 2018 at 12:10 PM, Andrey V. Lepikhov
 wrote:
> I prepare third version of the patches. Summary:
> 1. Mask DEAD tuples at a page during consistency checking (See comments for
> the mask_dead_tuples() function).

- ItemIdSetDead(lp);
+ if (target_index_deletion_factor > 0)
+ ItemIdMarkDead(lp);
+ else
+ ItemIdSetDead(lp);
IIUC, you want to hold the storage of DEAD tuples to form the ScanKey
which is required for the index scan in the second phase of quick
vacuum strategy. To achieve that, you've only marked the tuple as DEAD
during pruning. Hence, PageRepairFragmentation cannot claim the space
for DEAD tuples since they still have storage associated with them.
But, you've WAL-logged it as DEAD tuples having no storage. So, when
the WAL record is replayed in standby(or crash recovery), the tuples
will be marked as DEAD having no storage and their space may be
reclaimed by PageRepairFragmentation. Now, if you do byte-by-byte
comparison with wal_consistency tool, it may fail even for normal
tuple as well. Please let me know if you feel the same way.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Incorrect fsync handling in pg_basebackup's tar_finish

2018-06-25 Thread Kuntal Ghosh
On Mon, Jun 25, 2018 at 6:47 PM, Michael Paquier  wrote:
> Yes, there is a second one.  I just looked at walmethods.c and I did not
> spot any other issues.  What do you think about the updated version
> attached?
> --
I've also verified the same. The patch looks good to me.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Incorrect fsync handling in pg_basebackup's tar_finish

2018-06-25 Thread Kuntal Ghosh
On Mon, Jun 25, 2018 at 2:27 PM, Magnus Hagander  wrote:
>
>
> On Mon, Jun 25, 2018 at 4:43 AM, Michael Paquier 
> wrote:
>>
>> Hi all,
>>
>> I was just looking at the code of pg_basebackup, and noticed that we
>> don't actually check if the two last empty blocks of any tar file
>> produced are correctly fsync'd or not:
>> @@ -957,7 +957,10 @@ tar_finish(void)
>>
>>  /* sync the empty blocks as well, since they're after the last file */
>>  if (tar_data->sync)
>> -   fsync(tar_data->fd);
>> +   {
>> +   if (fsync(tar_data->fd) != 0)
>> +   return false;
>> +   }
>>
>> That looks incorrect to me, hence shouldn't something like the attached
>> be done?  Magnus and others, any opinions?
On the same note, in tar_close() we fsync on close, but we're not
checking the return status of that fsync. Should we introduce the same check
there as well?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Loaded footgun open_datasync on Windows

2018-06-25 Thread Kuntal Ghosh
On Wed, Jun 20, 2018 at 7:36 PM, Laurenz Albe  wrote:
> Kuntal Ghosh wrote:
> [pg_test_fsync doesn't use pgwin32_open and hence doesn't respect O_DSYNC]
>> On Wed, Jun 6, 2018 at 2:39 PM, Amit Kapila  wrote:
>> > On Wed, Jun 6, 2018 at 10:18 AM, Michael Paquier  
>> > wrote:
>> > > On Wed, Jun 06, 2018 at 09:58:34AM +0530, Amit Kapila wrote:
>> > > > One another alternative could be that we
>> > > > define open as pgwin32_open (for WIN32) wherever we need it.
>> > >
>> > > Which is what basically happens on any *nix platform, are you foreseeing
>> > > anything bad here?
>> >
>> > Nothing apparent, but I think we should try to find out why at the first
>> > place this has been made backend specific.
>>
>> It seems the "#ifndef FRONTEND" restriction was added around
>> pgwin32_open() for building libpq with Visual C++ or Borland C++.  The
>> restriction was added in commit 422d4819 to build libpq with VC++[1].
>> Later, in commit fd7c3f67e0bc4, the support for Borland C++ was also
>> added.
>>
>> So, I'm not sure whether removing that restriction will work for all
>> front-end modules.
>
> Thanks for the research, and sorry for my long silence.
>
> If I remove the "#ifndef FRONTEND", building with MSVC fails for all
> call sites that use the two-argument version of open(2).
>
> If I use the three-argument version throughout, which should be no problem,
> PostgreSQL builds just fine with MSVC.
>
I don't have enough experience on MSVC/Windows to comment on the same.

> How about checking what the buildfarm thinks about the attached?
+1


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()

2018-06-13 Thread Kuntal Ghosh
On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada  wrote:
> Hi,
>
> Three functions: brin_summarize_new_values, brin_summarize_range and
> brin_desummarize_range can be called during recovery as follows.
>
> =# select brin_summarize_new_values('a_idx');
> ERROR:  cannot acquire lock mode ShareUpdateExclusiveLock on database
> objects while recovery is in progress
> HINT:  Only RowExclusiveLock or less can be acquired on database
> objects during recovery.
>
> I think we should complaint "recovery is in progress" error in this
> case rather than erroring due to lock modes.
+1



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Loaded footgun open_datasync on Windows

2018-06-06 Thread Kuntal Ghosh
On Wed, Jun 6, 2018 at 2:39 PM, Amit Kapila  wrote:
> On Wed, Jun 6, 2018 at 10:18 AM, Michael Paquier 
> wrote:
>>
>> On Wed, Jun 06, 2018 at 09:58:34AM +0530, Amit Kapila wrote:
>>
>> >> It could be
>> >> risky for existing callers of open() for tool maintainers, or on the
>> >> contrary people could welcome a wrapper of open() which is
>> >> concurrent-safe in their own tools.
>> >
>> > I am not sure if we can safely assume that because using these functions
>> > would allow users to concurrently delete the files, but may be it is
>> > okay
>> > for all the FRONTEND modules.  One another alternative could be that we
>> > define open as pgwin32_open (for WIN32) wherever we need it.
>>
>> Which is what basically happens on any *nix platform, are you foreseeing
>> anything bad here?
>>
>>
>
> Nothing apparent, but I think we should try to find out why at the first
> place this has been made backend specific.
>
It seems the "#ifndef FRONTEND" restriction was added around
pgwin32_open() for building libpq with Visual C++ or Borland C++.  The
restriction was added in commit 422d4819 to build libpq with VC++[1].
Later, in commit fd7c3f67e0bc4, the support for Borland C++ was also
added.

So, I'm not sure whether removing that restriction will work for all
front-end modules.

[1] 
https://www.postgresql.org/message-id/flat/D90A5A6C612A39408103E6ECDD77B829408D4F%40voyager.corporate.connx.com#d90a5a6c612a39408103e6ecdd77b829408...@voyager.corporate.connx.com
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: New committers announced at PGCon 2018

2018-06-01 Thread Kuntal Ghosh
Congratulations all.

On Sat, 2 Jun 2018 at 2:35 AM, Tom Lane  wrote:

> The core team is pleased to announce the appointment of seven
> new Postgres committers:
>
> Etsuro Fujita
> Peter Geoghegan
> Amit Kapila
> Alexander Korotkov
> Thomas Munro
> Michael Paquier
> Tomas Vondra
>
> Congratulations to all!
>
> regards, tom lane
>
> --
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: Inconsistent behavior in serializable snapshot

2018-03-11 Thread Kuntal Ghosh
On Sun, Mar 11, 2018 at 7:52 PM, Kuntal Ghosh
 wrote:
> Hello hackers,
>
> While working on serializable transaction isolation, I've noticed some
> strange behavior in the first permutation mentioned in
> isolation/specs/read-only-anomaly-2.spec file.
>
> setup
> {
> CREATE TABLE bank_account (id TEXT PRIMARY KEY, balance DECIMAL NOT NULL);
> INSERT INTO bank_account (id, balance) VALUES ('X', 0), ('Y', 0);
> }
>
> # without s3, s1 and s2 commit
> permutation "s2rx" "s2ry" "s1ry" "s1wy" "s1c" "s2wx" "s2c" "s3c"
>
> Here, we can see a serial order T1 <- T2 without any conflict.
> However, if I perform "VACUUM FREEZE bank_account" after the setup
> step, s2wx throws a conflict error:
> ERROR:  could not serialize access due to read/write dependencies
> among transactions
> DETAIL:  Reason code: Canceled on identification as a pivot, during write.
> HINT:  The transaction might succeed if retried.
>
> Is this an expected behavior?
Got the answer.

In this case, when we perform VACUUM FREEZE on the table, the planner
chooses a plan with a sequential scan (instead of an Index Scan) to scan the
table for the SELECT and UPDATE statements. Hence, we have to take a
relation-level SIReadLock instead of page- and tuple-level SIReadLocks. This
causes the serialization conflict.
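
A quick way to see this granularity change is to look at the predicate
locks from a third session while the serializable transactions are still
open (illustrative query):

select locktype, relation::regclass, page, tuple
from pg_locks
where mode = 'SIReadLock';

The locktype column shows the relation-level vs. page/tuple-level
difference described above.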


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Inconsistent behavior in serializable snapshot

2018-03-11 Thread Kuntal Ghosh
Hello hackers,

While working on serializable transaction isolation, I've noticed some
strange behavior in the first permutation mentioned in
isolation/specs/read-only-anomaly-2.spec file.

setup
{
CREATE TABLE bank_account (id TEXT PRIMARY KEY, balance DECIMAL NOT NULL);
INSERT INTO bank_account (id, balance) VALUES ('X', 0), ('Y', 0);
}

# without s3, s1 and s2 commit
permutation "s2rx" "s2ry" "s1ry" "s1wy" "s1c" "s2wx" "s2c" "s3c"

Here, we can see a serial order T1 <- T2 without any conflict.
However, if I perform "VACUUM FREEZE bank_account" after the setup
step, s2wx throws a conflict error:
ERROR:  could not serialize access due to read/write dependencies
among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during write.
HINT:  The transaction might succeed if retried.

Is this an expected behavior?
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: zheap: a new storage format for PostgreSQL

2018-03-02 Thread Kuntal Ghosh
On Fri, Mar 2, 2018 at 2:42 AM, Alexander Korotkov
 wrote:
>
> I think results representation should be improved.  You show total size of 
> the database, but it's hard to understand how bloat degree was really 
> decreased, assuming that there are both update and append-only tables.  So, I 
> propose to show the results in per table manner.
>
> What is total number of transactions processed in both cases?  It would be 
> also more fair to compare sizes for the same number of processed transactions.
>
> Also, what are index sizes?  What are undo log sizes for zheap?
>
I've added the table sizes and TPS in the performance results. As of
now, we've just performed stress testing using pgbench. We have plans
to perform other tests, including:
1. Introduce random delay in the transactions instead of keeping a
transaction open for 15 minutes.
2. Combination of ROLLBACK and COMMIT (As suggested by Fabien)
3. PGbench tests for fixed number of transaction.
4. Modify the distribution (As suggested by Alexander Korotkov)

Do let me know if any other tests are required.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


zheap_perf_data_1.pdf
Description: Adobe PDF document


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

2017-12-12 Thread Kuntal Ghosh
On Wed, Dec 13, 2017 at 10:33 AM, Kuntal Ghosh
 wrote:
> On Tue, Dec 12, 2017 at 5:20 PM, Amit Kapila  wrote:
>> On Tue, Dec 12, 2017 at 4:00 PM, Kuntal Ghosh
>>  wrote:
>>> On Mon, Dec 11, 2017 at 2:26 PM, Thomas Munro
>>>  wrote:
>>>> On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila  
>>>> wrote:
>>>>
>>>>> Thanks for looking into it.  I will see if we can write some test.  In
>>>>> the meantime if possible, can you please request Patrick Hemmer to
>>>>> verify the attached patch?
>>>>
>>>> Our discussion was on the #postgresql Freenode channel.  I pointed him
>>>> at this thread, but I'm not sure if he'll see my message or be able to
>>>> test.
>>> After discussing with Amit, I'm able to reproduce the scenario in a
>>> master-standby setup. The issue occurs when we perform parallel
>>> index(-only) scan on a BTP_HALF_DEAD -marked page. (If a page is
>>> marked as BTP_DELETED, it's already unlinked from the index).
>>>
>>> When a btree page is deleted during vacuum, it's first marked as
>>> BTP_HALF_DEAD in _bt_mark_page_halfdead and then marked as BTP_DELETED
>>> in _bt_unlink_halfdead_page without releasing cleanup lock on the
>>> buffer. Hence, any scan node cannot lock the same buffer. So, the
>>> issue can't be reproduced on master.
>>>
>>> However, after replaying XLOG_BTREE_MARK_PAGE_HALFDEAD record, standby
>>> releases the lock on the same buffer. If we force parallelism, an
>>> index scan on the same page will cause hang the standby server.
>>> Following is a (unpleasant)way to reproduce the issue:
>>>
>>> In master (with autovacuum = off):
>>> 1. create table t1(a int primary key);
>>> 2. insert into t1 select * from generate_series(1,1000); --generates 3
>>> leaf nodes (block no 1,2,4) and 1 root node (block no 3)
>>> 3. delete from t1 where a>=367 and a<=735; --delete all tuples pointed by 
>>> leaf 2
>>> 4. analyze t1; --update the stats
>>> 5. explain analyze select * from t1 where a>=1 and a<=1000; --ensures
>>> that the next vacuum will consider leaf 2 for page deletion
>>
>> What do you mean by next vacuum, here autovacuum is off?  Are you
>> missing any step which manually performs the vacuum?
>>
> Yeah, you've to manually vacuum the table.
> 6. vacuum t1.
>
>>> Now, put a break point at _bt_unlink_halfdead_page, so that vacuum
>>> can't unlink the page.
>>>
>>> In standby,
>>> 1. force parallelism.
>>> 2. explain analyze select * from t1 where a>=1 and a<=1000; and the
>>> parallel workers hang at the above-discussed place!
>>>
I've also verified the backward scan case with the query provided by
Thomas. On the standby, replacing step 2 with
explain analyze select * from t1 where a+1>a order by a desc;
also makes the parallel workers hang.
The patch fixes the issue.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

2017-12-12 Thread Kuntal Ghosh
On Tue, Dec 12, 2017 at 5:20 PM, Amit Kapila  wrote:
> On Tue, Dec 12, 2017 at 4:00 PM, Kuntal Ghosh
>  wrote:
>> On Mon, Dec 11, 2017 at 2:26 PM, Thomas Munro
>>  wrote:
>>> On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila  
>>> wrote:
>>>
>>>> Thanks for looking into it.  I will see if we can write some test.  In
>>>> the meantime if possible, can you please request Patrick Hemmer to
>>>> verify the attached patch?
>>>
>>> Our discussion was on the #postgresql Freenode channel.  I pointed him
>>> at this thread, but I'm not sure if he'll see my message or be able to
>>> test.
>> After discussing with Amit, I'm able to reproduce the scenario in a
>> master-standby setup. The issue occurs when we perform parallel
>> index(-only) scan on a BTP_HALF_DEAD -marked page. (If a page is
>> marked as BTP_DELETED, it's already unlinked from the index).
>>
>> When a btree page is deleted during vacuum, it's first marked as
>> BTP_HALF_DEAD in _bt_mark_page_halfdead and then marked as BTP_DELETED
>> in _bt_unlink_halfdead_page without releasing cleanup lock on the
>> buffer. Hence, any scan node cannot lock the same buffer. So, the
>> issue can't be reproduced on master.
>>
>> However, after replaying XLOG_BTREE_MARK_PAGE_HALFDEAD record, standby
>> releases the lock on the same buffer. If we force parallelism, an
>> index scan on the same page will cause hang the standby server.
>> Following is a (unpleasant)way to reproduce the issue:
>>
>> In master (with autovacuum = off):
>> 1. create table t1(a int primary key);
>> 2. insert into t1 select * from generate_series(1,1000); --generates 3
>> leaf nodes (block no 1,2,4) and 1 root node (block no 3)
>> 3. delete from t1 where a>=367 and a<=735; --delete all tuples pointed by 
>> leaf 2
>> 4. analyze t1; --update the stats
>> 5. explain analyze select * from t1 where a>=1 and a<=1000; --ensures
>> that the next vacuum will consider leaf 2 for page deletion
>
> What do you mean by next vacuum, here autovacuum is off?  Are you
> missing any step which manually performs the vacuum?
>
Yeah, you've to manually vacuum the table.
6. vacuum t1.

>> Now, put a break point at _bt_unlink_halfdead_page, so that vacuum
>> can't unlink the page.
>>
>> In standby,
>> 1. force parallelism.
>> 2. explain analyze select * from t1 where a>=1 and a<=1000; and the
>> parallel workers hang at the above-discussed place!
>>



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

2017-12-12 Thread Kuntal Ghosh
On Mon, Dec 11, 2017 at 2:26 PM, Thomas Munro
 wrote:
> On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila  wrote:
>
>> Thanks for looking into it.  I will see if we can write some test.  In
>> the meantime if possible, can you please request Patrick Hemmer to
>> verify the attached patch?
>
> Our discussion was on the #postgresql Freenode channel.  I pointed him
> at this thread, but I'm not sure if he'll see my message or be able to
> test.
After discussing with Amit, I'm able to reproduce the scenario in a
master-standby setup. The issue occurs when we perform parallel
index(-only) scan on a BTP_HALF_DEAD -marked page. (If a page is
marked as BTP_DELETED, it's already unlinked from the index).

When a btree page is deleted during vacuum, it's first marked as
BTP_HALF_DEAD in _bt_mark_page_halfdead and then marked as BTP_DELETED
in _bt_unlink_halfdead_page without releasing cleanup lock on the
buffer. Hence, any scan node cannot lock the same buffer. So, the
issue can't be reproduced on master.

However, after replaying the XLOG_BTREE_MARK_PAGE_HALFDEAD record, the standby
releases the lock on the same buffer. If we force parallelism, an
index scan on the same page will cause the standby server to hang.
The following is an (unpleasant) way to reproduce the issue:

In master (with autovacuum = off):
1. create table t1(a int primary key);
2. insert into t1 select * from generate_series(1,1000); --generates 3
leaf nodes (block no 1,2,4) and 1 root node (block no 3)
3. delete from t1 where a>=367 and a<=735; --delete all tuples pointed by leaf 2
4. analyze t1; --update the stats
5. explain analyze select * from t1 where a>=1 and a<=1000; --ensures
that the next vacuum will consider leaf 2 for page deletion
Now, put a break point at _bt_unlink_halfdead_page, so that vacuum
can't unlink the page.

In standby,
1. force parallelism.
2. explain analyze select * from t1 where a>=1 and a<=1000; and the
parallel workers hang at the above-discussed place!

The attached patch fixes the issue. One comment on the same:
+ else if (scan->parallel_scan != NULL)
+ {
+ /* allow next page be processed by parallel worker */
+ _bt_parallel_release(scan, opaque->btpo_next);
+ }

  /* nope, keep going */
  if (scan->parallel_scan != NULL)

IMHO, there is no need to add an extra if condition.
_bt_parallel_release can be included in the next one.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing initplans

2017-11-16 Thread Kuntal Ghosh
On Thu, Nov 16, 2017 at 3:34 PM, Amit Kapila  wrote:
> On Wed, Nov 15, 2017 at 12:25 AM, Robert Haas  wrote:
>> On Tue, Nov 14, 2017 at 11:00 AM, Tom Lane  wrote:
>>> Yeah, I'm sure it is.  I have a vague recollection that there might be
>>> existing regression test cases exercising such things, though I won't
>>> swear to that.  The "orderstest" bit in subselect.sql looks like it
>>> might be meant to do that...
>>
>
> I agree that such cases can cause a problem with fixed memory.
>
I'm able to come up with a test case to produce the problem.

regression[107494]=# select ten,count(*) from tenk1 a where a.ten in (select
b.ten from tenk1 b where (select a.ten from tenk1 c where c.ten =
a.ten limit 1) = b.ten limit 1) group by a.ten;

QUERY PLAN
---
HashAggregate
  Group Key: a.ten
  ->  Seq Scan on tenk1 a
Filter: (SubPlan 2)
SubPlan 2
  ->  Limit
InitPlan 1 (returns $1)
  ->  Limit
->  Seq Scan on tenk1 c
  Filter: (ten = a.ten)
->  Seq Scan on tenk1 b
  Filter: ($1 = ten)
In the above case, Subplan 2 returns the same value of a.ten that is
used in the initplan as a reference. So, this query is nothing but select
ten,count(*) from tenk1 group by a.ten. The output should be 10 rows,
each having count=1000.

By enforcing parallelism, I got the following plan:
  QUERY PLAN
--
HashAggregate
  Group Key: a.ten
  ->  Seq Scan on tenk1 a
Filter: (SubPlan 2)
SubPlan 2
  ->  Limit
InitPlan 1 (returns $1)
  ->  Limit
->  Seq Scan on tenk1 c
  Filter: (ten = a.ten)
->  Gather
  Workers Planned: 2
  Params Evaluated: $1
  ->  Parallel Seq Scan on tenk1 b
Filter: ($1 = ten)

When I set parallel_leader_participation to off, I get the incorrect
result as follows:
 ten | count
-----+-------
   0 |  1000
(1 row)
In the above case, the initplan gets evaluated only once, from the
ExecInitParallelPlan path. Hence, we see the incorrect result.
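
For reference, one combination of settings that can force such a plan (an
assumption about the GUCs used; the exact settings aren't shown here):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set max_parallel_workers_per_gather = 2;
set parallel_leader_participation = off;  -- the case that returns the wrong result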

>> Here's an updated patch that attempts to work around this problem using DSA.
>>
>
> There were a couple of problems with your changes:
> 1.
> BufferUsage *buffer_usage; /* points to bufusage area in DSM */
> + dsa_handle param_exec; /* serialized PARAM_EXEC parameters */
> @@ -35,12 +36,13 @@ typedef struct ParallelExecutorInfo
>  } ParallelExecutorInfo;
>
> This should be dsa_pointer, otherwise, the value returned by
> SerializeParamExecParams will get truncated.
>
> 2. In ExecParallelReinitialize(), we need to evaluate the params
> before serializing them.
>
> 3. I think we should free the dsa pointer at the end of the gather.
>
> Attached patch fixes the mentioned problems.
>
>> It could use a test case that actually tickles the new logic in
>> ExecParallelReinitialize ... this is mostly just to show the concept.
>>
>
> Thanks, it was quite helpful.
>
I've tested the above-mentioned scenario with this patch and it is
working fine. Also, I've created a text column named 'vartext',
inserted some random-length texts (max length 100), and tweaked the
above query as follows:
select ten,count(*) from tenk1 a where a.ten in (select
b.ten from tenk1 b where (select a.vartext from tenk1 c where c.ten =
a.ten limit 1) = b.vartext limit 1) group by a.ten;
This query is equivalent to select ten,count(*) from tenk1 group by
a.ten. It also produced the expected result without throwing any
error.
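
For completeness, the extra column can be set up with something like the
following (my guess at the statements; the exact setup isn't shown here):

alter table tenk1 add column vartext text;
update tenk1 set vartext = left(repeat(md5(random()::text), 4),
                                (random() * 99)::int + 1);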


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



  1   2   >