Farewell greeting

2021-06-27 Thread tsunakawa.ta...@fujitsu.com
Hello all,


I'm moving to another company on July 1st.  I may not be allowed to do open 
source activities there, so let me say goodbye for now.  Thank you all for 
your help.  I really enjoyed learning PostgreSQL and participating in its 
development.

It's a pity that I may not be able to be part of PostgreSQL's great history 
until it becomes the most popular database (in the DB-Engines ranking).  
However, if possible, I'd like to come back as just MauMau.

I hope you all will succeed.


Regards
Takayuki Tsunakawa





RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-17 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> On Tue, Jun 15, 2021 at 5:51 AM tsunakawa.ta...@fujitsu.com
>  wrote:
> > Postgres can do that, but other implementations cannot necessarily do it,
> > I'm afraid.  But before that, the FDW interface documentation doesn't
> > describe anything about how to handle interrupts.  Actually, odbc_fdw and
> > possibly other FDWs don't respond to interrupts.
> 
> Well, I'd consider that a bug.

I kind of hesitate to call it a bug...  Unlike libpq, JDBC (for jdbc_fdw) 
doesn't have an asynchronous interface, and the Oracle and PostgreSQL ODBC 
drivers don't support an asynchronous interface either.  Even with libpq, 
COMMIT (and other SQL commands) is not always cancellable, e.g., when the 
(NFS) storage server hangs while writing WAL.


> > What we're talking about here is mainly whether commit should return
> > success or failure when some participants failed to commit in the second
> > phase of 2PC.  That's new to Postgres, isn't it?  Anyway, we should respect
> > existing relevant specifications and (well-known) implementations before
> > we conclude that we have to devise our own behavior.
> 
> Sure ... but we can only decide to do things that the implementation
> can support, and running code that might fail after we've committed
> locally isn't one of them.

Yes, I understand that Postgres may not be able to conform to specifications or 
well-known implementations in all aspects.  I'm just suggesting that we take the 
stance "We carefully considered established industry specifications that we 
could base our design on, did our best to design the desirable behavior learned 
from them, but couldn't implement a few parts", rather than "We did what we 
like and can do."


> I think your comparison here is quite unfair. We work hard to add
> overhead in hot paths where it might cost, but the FDW case involves a
> network round-trip anyway, so the cost of an if-statement would surely
> be insignificant. I feel like you want to assume without any evidence
> that a local resolver can never be quick enough, even though the cost
> of IPC between local processes shouldn't be that high compared to a
> network round trip. But you also want to suppose that we can run code
> that might fail late in the commit process even though there is lots
> of evidence that this will cause problems, starting with the code
> comments that clearly say so.

There may be better examples.  What I wanted to say is just that I believe it's 
not up to PG developers' standards to allow serial prepare and commit.  Let's 
make it clear what makes it difficult to do the 2PC from each client session in 
normal operation, without going through the resolver.


Regards
Takayuki Tsunakawa



RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-15 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> Well, we're talking about running this commit routine from within
> CommitTransaction(), right? So I think it is in fact running in the
> server. And if that's so, then you have to worry about how to make it
> respond to interrupts. You can't just call some functions
> DBMSX_xa_commit() and wait for infinite time for it to return. Look at
> pgfdw_get_result() for an example of what real code to do this looks
> like.

Postgres can do that, but other implementations cannot necessarily do it, I'm 
afraid.  But before that, the FDW interface documentation doesn't describe 
anything about how to handle interrupts.  Actually, odbc_fdw and possibly other 
FDWs don't respond to interrupts.
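
(For reference, below is a minimal sketch of the kind of interruptible wait 
that pgfdw_get_result() performs with libpq.  The function name and the error 
wording are made up for illustration, and timeout/retry handling is omitted.)

#include "postgres.h"
#include "libpq-fe.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

/* wait for a command result while honouring query-cancel interrupts */
static PGresult *
get_result_interruptible(PGconn *conn)
{
    while (PQisBusy(conn))
    {
        int         rc;

        /* sleep until the socket becomes readable or our latch is set */
        rc = WaitLatchOrSocket(MyLatch,
                               WL_LATCH_SET | WL_SOCKET_READABLE |
                               WL_EXIT_ON_PM_DEATH,
                               PQsocket(conn), -1L, PG_WAIT_EXTENSION);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();     /* react to a cancel request here */
        }

        if ((rc & WL_SOCKET_READABLE) && !PQconsumeInput(conn))
            ereport(ERROR,
                    (errmsg("could not read from foreign server: %s",
                            PQerrorMessage(conn))));
    }

    return PQgetResult(conn);
}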


> Honestly, I am not quite sure what any specification has to say about
> this. We're talking about what happens when a user does something with
> a foreign table and then type COMMIT. That's all about providing a set
> of behaviors that are consistent with how PostgreSQL works in other
> situations. You can't negotiate away the requirement to handle errors
> in a way that works with PostgreSQL's infrastructure, or the
> requirement that any length operation handle interrupts properly, by
> appealing to a specification.

What we're talking about here is mainly whether commit should return success or 
failure when some participants failed to commit in the second phase of 2PC.  
That's new to Postgres, isn't it?  Anyway, we should respect existing relevant 
specifications and (well-known) implementations before we conclude that we have 
to devise our own behavior.


> That sounds more like a limitation of the present implementation than
> a fundamental problem. We shouldn't reject the idea of having a
> resolver process handle this just because the initial implementation
> might be slow. If there's no fundamental problem with the idea,
> parallelism and concurrency can be improved in separate patches at a
> later time. It's much more important at this stage to reject ideas
> that are not theoretically sound.

We talked about that, and unfortunately, I haven't seen a good and feasible 
idea to enhance the current approach, which involves the resolver from the 
beginning of 2PC processing.  Honestly, I don't understand why such a "one 
prepare, one commit in turn" serialization approach can be allowed in 
PostgreSQL, where developers pursue the best performance and even try to refrain 
from adding an if statement in a hot path.  As I showed and Ikeda-san said, 
other implementations have each client session send prepare and commit 
requests.  That's a natural way to achieve reasonable concurrency and 
performance.


Regards
Takayuki Tsunakawa



RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-13 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> On Thu, Jun 10, 2021 at 9:58 PM tsunakawa.ta...@fujitsu.com
>  wrote:
> > The question I have been asking is how.  With that said, we should only
> > have two options; one is the return value of the FDW commit routine, and
> > the other is via ereport(ERROR).  I suggested the possibility of the
> > former, because if the FDW does ereport(ERROR), Postgres core (transaction
> > manager) may have difficulty in handling the rest of the participants.
> 
> I don't think that is going to work. It is very difficult to write
> code that doesn't ever ERROR in PostgreSQL. It is not impossible if
> the operation is trivial enough, but I think you're greatly
> underestimating the complexity of committing the remote transaction.
> If somebody had designed PostgreSQL so that every function returns a
> return code and every time you call some other function you check that
> return code and pass any error up to your own caller, then there would
> be no problem here. But in fact the design was that at the first sign
> of trouble you throw an ERROR. It's not easy to depart from that
> programming model in just one place.

> > I'm not completely sure about this.  I thought (and said) that the only
> > thing the FDW does would be to send a commit request through an existing
> > connection.  So, I think it's not a severe restriction to require FDWs to
> > not do ereport(ERROR) during commits (of the second phase of 2PC.)
> 
> To send a commit request through an existing connection, you have to
> send some bytes over the network using a send() or write() system
> call. That can fail. Then you have to read the response back over the
> network using recv() or read(). That can also fail. You also need to
> parse the result that you get from the remote side, which can also
> fail, because you could get back garbage for some reason. And
> depending on the details, you might first need to construct the
> message you're going to send, which might be able to fail too. Also,
> the data might be encrypted using SSL, so you might have to decrypt
> it, which can also fail, and you might need to encrypt data before
> sending it, which can fail. In fact, if you're using the OpenSSL,
> trying to call SSL_read() or SSL_write() can both read and write data
> from the socket, even multiple times, so you have extra opportunities
> to fail.

I know sending a commit request may get an error from various underlying 
functions, but we're talking about the client side here, not the Postgres server 
side that could unexpectedly ereport(ERROR) somewhere.  So, the new FDW commit 
routine won't lose control and can return an error code as its return value.  
For instance, the FDW commit routine for DBMS-X would typically be:

int
DBMSXCommit(...)
{
    int         ret;

    /* extract info from the argument to pass to xa_commit() */

    /*
     * This is the actual commit function, which is exposed to the app
     * server (e.g. Tuxedo) through the xa_commit() interface.
     */
    ret = DBMSX_xa_commit(...);

    /*
     * Map xa_commit() return values to the corresponding return values of
     * the FDW commit routine.
     */
    switch (ret)
    {
        case XA_RMERR:
            ret = ...;
            break;
        ...
    }

    return ret;
}


> I think that's a valid concern, but we also have to have a plan that
> is realistic. Some things are indeed not possible in PostgreSQL's
> design. Also, some of these problems are things everyone has to
> somehow confront. There's no database doing 2PC that can't have a
> situation where one of the machines disappears unexpectedly due to
> some natural disaster or administrator interference. It might be the
> case that our inability to do certain things safely during transaction
> commit puts us out of compliance with the spec, but it can't be the
> case that some other system has no possible failures during
> transaction commit. The problem of the network potentially being
> disconnected between one packet and the next exists in every system.

So, we need to design how commit behaves from the user's perspective.  That's 
the functional design.  We should figure out what the desirable response of 
commit is first, and then see if we can implement it or have to compromise in 
some way.  I think we can reference the X/Open TX standard and/or the JTS (Java 
Transaction Service) specification (I haven't had a chance to read them yet, 
though.)  Just in case we can't find the requested commit behavior for the 
volcano case in those specifications, ... (I'm hesitant to say this because it 
may be hard) it's desirable to follow representative products such as Tuxedo 
and GlassFish (the reference implementation of the Java EE specs.)


> > I don't think the resolver-based approach would bring us far enough.  It's

RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-10 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> That is completely unrealistic. As Sawada-san has pointed out
> repeatedly, there are tons of things that can go wrong even after the
> remote side has prepared the transaction. Preparing a transaction only
> promises that the remote side will let you commit the transaction upon
> request. It doesn't guarantee that you'll be able to make the request.
> Like Sawada-san says, network problems, out of memory issues, or many
> other things could stop that from happening. Someone could come along
> in another session and run "ROLLBACK PREPARED" on the remote side, and
> now the "COMMIT PREPARED" will never succeed no matter how many times
> you try it. At least, not unless someone goes and creates a new
> prepared transaction with the same 2PC identifier, but then you won't
> be committing the correct transaction anyway. Or someone could take
> the remote server and drop it in a volcano. How do you propose that we
> avoid giving the user an error after the remote server has been
> dropped into a volcano, even though the local node is still alive?

I understand that.  As I cited yesterday and possibly before, that's why 
xa_commit() returns various return codes.  So, I have never suggested that FDWs 
should not report an error and always report success for the commit request.  
They should be allowed to report an error.

The question I have been asking is how.  With that said, we should only have 
two options; one is the return value of the FDW commit routine, and the other 
is via ereport(ERROR).  I suggested the possibility of the former, because if 
the FDW does ereport(ERROR), Postgres core (transaction manager) may have 
difficulty in handling the rest of the participants.


> Also, leaving aside theoretical arguments, I think it's not
> realistically possible for an FDW author to write code to commit a
> prepared transaction that will be safe in the context of running late
> in PrepareTransaction(), after we've already done
> RecordTransactionCommit(). Such code can't avoid throwing errors
> because it can't avoid performing operations and allocating memory.

I'm not completely sure about this.  I thought (and said) that the only thing 
the FDW does would be to send a commit request through an existing connection.  
So, I think it's not a severe restriction to require FDWs to not do 
ereport(ERROR) during commits (of the second phase of 2PC.)


> It's already been mentioned that, if an ERROR is thrown, it would be
> reported to the user in place of the COMMIT acknowledgement that they
> are expecting. Now, it has also been suggested that we could downgrade
> the ERROR to a WARNING and still report the COMMIT. That doesn't sound
> easy to do, because when the ERROR happens, control is going to jump
> to AbortTransaction(). But even if you could hack it so it works like
> that, it doesn't really solve the problem. What about all of the other
> servers where the prepared transaction also needs to be committed? In
> the design of PostgreSQL, in all circumstances, the way you recover
> from an error is to abort the transaction. That is what brings the
> system back to a clean state. You can't simply ignore the requirement
> to abort the transaction and keep doing more work. It will never be
> reliable, and Tom will instantaneously demand that any code that works like
> that be reverted -- and for good reason.

(I took "abort" as the same as "rollback" here.)  Once we've sent commit 
requests to some participants, we can't abort the transaction.  If one FDW 
returned an error halfway, we need to send commit requests to the rest of 
participants.

It's a design question, as I repeatedly said, whether and how we should report 
the error of some participants to the client.  For instance, how should we 
report the errors of multiple participants?  Concatenate those error messages?

Anyway, we should design the interface first, giving it much thought and 
respecting the ideas of predecessors (TX/XA, MS DTC, JTA/JTS).  Otherwise, we 
may end up saying "We implemented it like this, so the interface is like this 
and it can only behave like this, although you may find it strange..."  That 
might be a situation similar to what your comment "the design of PostgreSQL, in 
all circumstances, the way you recover from an error is to abort the 
transaction" suggests -- Postgres doesn't have statement-level rollback.


> I am not sure that it's 100% impossible to find a way to solve this
> problem without just having the resolver do all the work, but I think
> it's going to be extremely difficult. We tried to figure out some
> vaguely similar things while working on undo, and it really didn't go
> very well. The later stages of CommitTransaction() and
> AbortTransaction() are places where very few kinds of code are safe to
> execute, and finding a way to patch around that problem is not simple
> either. If the resolver performance is poor, perhaps we could try to
> find a way to improve it. I don't know. But I don't think 

RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-10 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> If we accept each elementary-commit (via FDW connection) to fail, the
> parent(?) there's no way the root 2pc-commit can succeed.  How can we
> ignore the fdw-error in that case?

No, we don't ignore the error during FDW commit.  As mentioned at the end of 
this mail, the question is how the FDW reports the error to the caller 
(transaction manager in Postgres core), and how we should handle it.

As below, GlassFish catches the resource manager's error during commit, retries 
the commit if the error is transient or a communication failure, and hands off 
the processing of the failed commit to the recovery manager.  (I used all of my 
energy today; I'd be grateful if someone could figure out whether GlassFish 
reports the error to the application.)


[XATerminatorImpl.java]
public void commit(Xid xid, boolean onePhase) throws XAException {
...
} else {
coord.commit();
}


[TopCoordinator.java]
// Commit all participants.  If a fatal error occurs during
// this method, then the process must be ended with a fatal error.
...
try {
participants.distributeCommit();
} catch (Throwable exc) {


[RegisteredResources.java]
void distributeCommit() throws HeuristicMixed, HeuristicHazard, NotPrepared 
{
...
// Browse through the participants, committing them. The following is
// intended to be done asynchronously as a group of operations.
...
// Tell the resource to commit.
// Catch any exceptions here; keep going until
// no exception is left.
...
// If the exception is neither TRANSIENT or
// COMM_FAILURE, it is unexpected, so display a
// message and give up with this Resource.
...
// For TRANSIENT or COMM_FAILURE, wait
// for a while, then retry the commit.
...
// If the retry limit has been exceeded,
// end the process with a fatal error.
...
if (!transactionCompleted) {
if (coord != null)
RecoveryManager.addToIncompleTx(coord, true);


> > No.  Taking the description literally and considering the relevant XA
> > specification, it's not about the remote commit failure.  The remote server
> > is not allowed to fail the commit once it has reported successful prepare,
> > which is the contract of 2PC.  HeuristicMixedException is about the manual
> > resolution, typically by the DBA, using the DBMS-specific tool or the
> > standard commit()/rollback() API.
> 
> Mmm. The above seems as if saying that 2pc-comit does not interact
> with remotes.  The interface contract does not cover everything that
> happens in the real world. If remote-commit fails, that is just an
> issue outside of the 2pc world.  In reality remote-commit may fail for
> all reasons.

The following part of the XA specification is relevant.  We're considering 
modeling the FDW 2PC interface on XA because it seems to be the only standard 
interface, and thus one that other FDWs could naturally take advantage of, 
aren't we?  Then, we need to take care of such things as this.  The interface 
design is not easy.  So, proper design and its review should come first, before 
going deeper into the huge code patch.

2.3.3 Heuristic Branch Completion 
--
Some RMs may employ heuristic decision-making: an RM that has prepared to 
commit a transaction branch may decide to commit or roll back its work 
independently of the TM. It could then unlock shared resources. This may leave 
them in an inconsistent state. When the TM ultimately directs an RM to complete 
the branch, the RM may respond that it has already done so. The RM reports 
whether it committed the branch, rolled it back, or completed it with mixed 
results (committed some work and rolled back other work). 

An RM that reports heuristic completion to the TM must not discard its 
knowledge of the transaction branch. The TM calls the RM once more to authorise 
it to forget the branch. This requirement means that the RM must notify the TM 
of all heuristic decisions, even those that match the decision the TM 
requested. The referenced OSI DTP specifications (model) and (service) define 
heuristics more precisely. 
--


> https://www.ibm.com/docs/ja/db2-for-zos/11?topic=support-example-distributed-transaction-that-uses-jta-methods
> This suggests that both XAResoruce.prepare() and commit() can throw a
> exception.

Yes, XAResource.commit() can throw an exception:

void commit(Xid xid, boolean onePhase) throws XAException 

Throws: XAException 
An error has occurred. Possible XAExceptions are XA_HEURHAZ, XA_HEURCOM, 
XA_HEURRB, XA_HEURMIX, XAER_RMERR, XAER_RMFAIL, XAER_NOTA, 
XAER_INVAL, or XAER_PROTO. 


RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-09 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> Maybe it's better to start a new thread to discuss this topic. If your
> idea is good, we can lower all error that happened after writing the
> commit record to warning, reducing the cases where the client gets
> confusion by receiving an error after the commit.

No.  It's an important part because it determines the 2PC behavior and 
performance.  This discussion had started from the concern about performance 
before Ikeda-san reported pathological results.  Don't rush forward, hoping 
someone will commit the current patch.  I'm afraid you just don't want to 
change your design and code.  Let's face the real issue.

As I said before, and as Ikeda-san's performance benchmark results show, I have 
to say the design hasn't been worked out sufficiently.  I talked with Fujii-san the other 
day about this patch.  The patch is already huge and it's difficult to decode 
how the patch works, e.g., what kind of new WALs it emits, how many disk writes 
it adds, how the error is handled, whether/how it's different from the textbook 
or other existing designs, etc.  What happened to my request to add such design 
description to the following page, so that reviewers can consider the design 
before spending much time on looking at the code?  What's the situation of the 
new FDW API that should naturally accommodate other FDW implementations?

Atomic Commit of Distributed Transactions
https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

Design should come first.  I don't think it's a sincere attitude to require 
reviewers to spend a long time reading the design out of huge code.


Regards
Takayuki Tsunakawa


RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-09 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> On Tue, Jun 8, 2021 at 5:28 PM tsunakawa.ta...@fujitsu.com
>  wrote:
> > Then, in what kind of scenario are we talking about the difficulty, and
> > how is it difficult to handle, when we adopt either method 1 or 2?  (I'd
> > just like to have the same clear picture.)
> 
> IMO, even though FDW's commit/rollback transaction code could be
> simple in some cases, I think we need to think that any kind of errors
> (or even FATAL or PANIC) could be thrown from the FDW code. It could
> be an error due to a temporary network problem, remote server down,
> driver’s unexpected error, or out of memory etc. Errors that happened
> after the local transaction commit doesn't affect the global
> transaction decision, as you mentioned. But the proccess or system
> could be in a bad state. Also, users might expect the process to exit
> on error by setting  exit_on_error = on. Your idea sounds like that we
> have to ignore any errors happening after the local commit if they
> don’t affect the transaction outcome. It’s too scary to me and I think
> that it's a bad idea to blindly ignore all possible errors under such
> conditions. That could make the thing worse and will likely be
> foot-gun. It would be good if we can prove that it’s safe to ignore
> those errors but not sure how we can at least for me.
> 
> This situation is true even today; an error could happen after
> committing the transaction. But I personally don’t want to add the
> code that increases the likelihood.

I'm not talking about code simplicity here (actually, I haven't reviewed the 
code around prepare and commit in the patch yet...)  Also, I don't understand 
well what point you're trying to make and what realistic situations you have in 
mind by citing exit_on_error, FATAL, PANIC and so on.  I just asked (in a 
different part) why the client has to know the error.

Just to be clear, I'm not saying that we should hide the error completely 
behind the scenes.  For example, you can allow the FDW to emit a WARNING if the 
DBMS-specific client driver returns an error when committing.  Further, if you 
want to allow the FDW to throw an ERROR when committing, the transaction 
manager in core can catch it with PG_TRY(), so that it can report back the 
successful commit of the global transaction to the client while it leaves the 
handling of the FDW's failed commit to the resolver.  (I don't think we'd like 
to use PG_TRY() during transaction commit for performance reasons, though.)
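
(A minimal sketch of that idea follows; it is not the actual patch.  
FdwXactParticipant, CommitForeignTransaction() and HandOffToResolver() are 
hypothetical names used only for illustration, while PG_TRY/PG_CATCH and 
FlushErrorState() are PostgreSQL's standard error-handling facilities.)

static void
CommitAllForeignParticipants(List *participants)
{
    ListCell   *lc;

    foreach(lc, participants)
    {
        FdwXactParticipant *participant = (FdwXactParticipant *) lfirst(lc);

        PG_TRY();
        {
            /* ask the FDW to commit its prepared remote transaction */
            CommitForeignTransaction(participant);
        }
        PG_CATCH();
        {
            /* clear the error instead of aborting the local transaction */
            FlushErrorState();

            /* let the resolver retry this participant in the background */
            HandOffToResolver(participant);
        }
        PG_END_TRY();
    }
}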

Even if I concede a lot and we say we want to report the error of the 
committing FDW to the client, we can then use SQLSTATE 02xxx (Warning) and 
attach the error message.


> Just to be clear, with your idea, we will ignore only ERROR or also
> FATAL and PANIC? And if an error happens during committing one of the
> prepared transactions on the foreign server, will we proceed with
> committing other transactions or return OK to the client?

Neither FATAL nor PANIC can be ignored.  When FATAL occurs, which means the 
termination of a particular session, the committing of the remote transaction 
should be taken over by the resolver.  Not to mention PANIC; we can't do 
anything about it.  Otherwise, we proceed with committing the other FDWs, hand 
off the task of committing the failed FDW to the resolver, and report success 
to the client.  If you're not convinced, I'd like to ask you to investigate the 
code of some Java EE app server, say GlassFish, and share with us how it 
handles an error during commit.


Regards
Takayuki Tsunakawa



RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-08 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> I think the discussion is based the behavior that any process that is
> responsible for finishing the 2pc-commit continue retrying remote
> commits until all of the remote-commits succeed.

Thank you for coming back.  We're talking about the first attempt to prepare 
and commit in each transaction, not the retry case.


> > Throws: HeuristicMixedException
> > Thrown to indicate that a heuristic decision was made and that some
> > relevant updates have been committed while others have been rolled back.

> I'm not sure about how JTA works in detail, but doesn't
> UserTransaction.commit() return HeuristicMixedExcpetion when some of
> relevant updates have been committed but other not? Isn't it the same
> state with the case where some of the remote servers failed on
> remote-commit while others are succeeded?

No.  Taking the description literally and considering the relevant XA 
specification, it's not about the remote commit failure.  The remote server is 
not allowed to fail the commit once it has reported successful prepare, which 
is the contract of 2PC.  HeuristicMixedException is about the manual 
resolution, typically by the DBA, using the DBMS-specific tool or the standard 
commit()/rollback() API.


> (I guess that
> UserTransaction.commit() would throw RollbackException if
> remote-prepare has been failed for any of the remotes.)

Correct.


Regards
Takayuki Tsunakawa



RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-08 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.ta...@fujitsu.com
>  wrote:
> > Why does the client have to know the error on a remote server, whereas the
> > global transaction itself is destined to commit?
> 
> It's not necessarily on a remote server. It could be a problem with
> the local server.

Then, in what kind of scenario are we talking about the difficulty, and how is 
it difficult to handle, when we adopt either method 1 or 2?  (I'd just like 
to have the same clear picture.)  For example,

1. All FDWs prepared successfully.
2. The local transaction prepared successfully, too.
3. Some FDWs committed successfully.
4. One FDW failed to send the commit request because the remote server went 
down.


Regards
Takayuki Tsunakawa



RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-07 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> I think we should not reinterpret the severity of the error and lower
> it. Especially, in this case, any kind of errors can be thrown. It
> could be such a serious error that FDW developer wants to report to
> the client. Do we lower even PANIC to a lower severity such as
> WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas
> lowering ERROR (and FATAL) to WARNING, why do we regard only them as
> non-error?

Why does the client have to know the error on a remote server, whereas the 
global transaction itself is destined to commit?

FYI, the tx_commit() in the X/Open TX interface and the 
UserTransaction.commit() in JTA don't return such an error, IIRC.  Do TX_FAIL 
and SystemException serve such a purpose?  I don't think so.


[Tuxedo manual (Japanese)]
https://docs.oracle.com/cd/F25597_01/document/products/tuxedo/tux80j/atmi/rf3c91.htm


[JTA]
public interface javax.transaction.UserTransaction
    public void commit()
        throws RollbackException, HeuristicMixedException,
               HeuristicRollbackException, SecurityException,
               IllegalStateException, SystemException

Throws: RollbackException 
Thrown to indicate that the transaction has been rolled back rather than 
committed. 

Throws: HeuristicMixedException 
Thrown to indicate that a heuristic decision was made and that some relevant 
updates have been committed while others have been rolled back. 

Throws: HeuristicRollbackException 
Thrown to indicate that a heuristic decision was made and that all relevant 
updates have been rolled back. 

Throws: SecurityException 
Thrown to indicate that the thread is not allowed to commit the transaction. 

Throws: IllegalStateException 
Thrown if the current thread is not associated with a transaction. 

Throws: SystemException 
Thrown if the transaction manager encounters an unexpected error condition. 


Regards
Takayuki Tsunakawa



RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-04 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> On Fri, Jun 4, 2021 at 5:04 PM tsunakawa.ta...@fujitsu.com
>  wrote:
> > Why does the user have to get an error?  Once the local transaction has
> > been prepared, which means all remote ones also have been prepared, the
> > whole transaction is determined to commit.  So, the user doesn't have to
> > receive an error as long as the local node is alive.
> 
> I think we should neither ignore the error thrown by FDW code nor
> lower the error level (e.g., ERROR to WARNING).

Why?  (Forgive me for asking relentlessly... by imagining me as a cute 
7-year-old boy/girl asking "Why Dad?")


> > How do non-2PC and 2PC cases differ in the rarity of the error?
> 
> I think the main difference would be that in 2PC case there will be
> network communications possibly with multiple servers after the local
> commit.

Then, it's the same failure mode.  That is, the same failure could occur for 
both cases.  That doesn't require us to differentiate between them.  Let's 
ignore this point from now on.


Regards
Takayuki Tsunakawa



RE: Fast COPY FROM based on batch insert

2021-06-04 Thread tsunakawa.ta...@fujitsu.com
From: Andrey Lepikhov 
> We still have slow 'COPY FROM' operation for foreign tables in current master.
> Now we have a foreign batch insert operation, and I tried to rewrite the
> patch [1] with this machinery.

I haven't looked at the patch, but nice performance.

However, I see the following problems.  What do you think about them?

1)
No wonder the user would wonder, "Why are INSERTs run on the remote server?  I 
ran COPY."


2)
Without the FDW API for COPY, other FDWs won't get a chance to optimize for 
bulk data loading.  For example, oracle_fdw might use conventional path insert 
for the FDW batch insert, and the direct path insert for the FDW COPY.


3)
INSERT and COPY in Postgres differ in whether rules are invoked:

https://www.postgresql.org/docs/devel/sql-copy.html

"COPY FROM will invoke any triggers and check constraints on the destination 
table. However, it will not invoke rules."


Regards
Takayuki Tsunakawa



RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-04 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> 1. the backend continues attempting to commit all prepared foreign
> transactions until all of them are committed.
> 2. the backend attempts to commit all prepared foreign transactions
> once. If an error happens, leave them for the resolver.
> 3. the backend asks the resolver that launched per foreign server to
> commit the prepared foreign transactions (and backend waits or doesn't
> wait for the commit completion depending on the setting).
> 
> With ideas 1 and 2, since the backend itself commits all foreign
> transactions the resolver process cannot be a bottleneck, and probably
> the code can get more simple as backends don't need to communicate
> with resolver processes.
> 
> However, those have two problems we need to deal with:
> 

> First, users could get an error if an error happens during the backend
> committing prepared foreign transaction but the local transaction is
> already committed and some foreign transactions could also be
> committed, confusing users. There were two opinions to this problem:
> FDW developers should be responsible for writing FDW code such that
> any error doesn't happen during committing foreign transactions, and
> users can accept that confusion since an error could happen after
> writing the commit WAL even today without this 2PC feature. 

Why does the user have to get an error?  Once the local transaction has been 
prepared, which means all remote ones also have been prepared, the whole 
transaction is determined to commit.  So, the user doesn't have to receive an 
error as long as the local node is alive.


> For the
> former point, I'm not sure it's always doable since even palloc()
> could raise an error and it seems hard to require all FDW developers
> to understand all possible paths of raising an error.

No, this is a matter of discipline to ensure consistency, just in case we 
really have to return an error to the user.


> And for the
> latter point, that's true but I think those cases are
> should-not-happen cases (i.g., rare cases) whereas the likelihood of
> an error during committing prepared transactions is not low (e.g., by
> network connectivity problem). I think we need to assume that that is
> not a rare case.

How do non-2PC and 2PC cases differ in the rarity of the error?


> The second problem is whether we can cancel committing foreign
> transactions by pg_cancel_backend() (or pressing Ctl-c). If the
> backend process commits prepared foreign transactions, it's FDW
> developers' responsibility to write code that is interruptible. I’m
> not sure it’s feasible for drivers for other databases.

That's true not only for prepare and commit but also for other queries.  Why do 
we have to treat prepare and commit specially?


> Through the long discussion on this thread, I've been thought we got a
> consensus on idea 3 but sometimes ideas 1 and 2 are proposed again for

I don't remember seeing any consensus yet?

> With the current patch, we commit prepared foreign transactions
> asynchronously. But maybe we need to compare the performance of ideas
> 1 (and 2) to idea 3 with synchronous foreign transaction resolution.

+1


Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2021-05-27 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> I'm still not sure why the execution time with 0 workers (or serial execution 
> or
> no parallelism involved) on my testing system is 112 sec compared to 58 sec on
> Hou-San's system for the same use case. Maybe the testing system I'm using
> is not of the latest configuration compared to others.

What's the setting of wal_level on your two systems?  I thought it could be 
that you set it to > minimal, while Hou-san set it to minimal.  (I forgot the 
results for 2 and 4 workers, though.)


Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2021-05-27 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> I think we can discuss this in a separate thread and see what other
> hackers think.

OK, as long as we don't get stuck in the current direction.  (Our goal is not 
just to avoid degrading performance, but to outperform serial execution, isn't 
it?)


> If the idea is to give the user control of whether or not to use the
> separate RING BUFFER for bulk inserts/writes, then how about giving it
> as a rel option? Currently BAS_BULKWRITE (GetBulkInsertState), is
> being used by CTAS, Refresh Mat View, Table Rewrites (ATRewriteTable)
> and COPY. Furthermore, we could make the rel option an integer and
> allow users to provide the size of the ring buffer they want to choose
> for a particular bulk insert operation (of course with a max limit
> which is not exceeding the shared buffers or some reasonable amount
> not exceeding the RAM of the system).

I think it's not a table property but an execution property.  So, it'd be 
appropriate to control it with the SET command, just like the DBA sets work_mem 
and maintenance_work_mem for specific maintenance operations.

I'll stop on this here...


Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2021-05-27 Thread tsunakawa.ta...@fujitsu.com
From: houzj.f...@fujitsu.com 
> Although, the 4 workers case still has performance degradation compared to
> serial case.
> 
> SERIAL: 58759.213 ms
> PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms  [SKIP FSM]:
> 58633.924 ms
> PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms   [SKIP FSM]:
> 66,960.305 ms

Can you see any difference in table sizes?


Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2021-05-27 Thread tsunakawa.ta...@fujitsu.com
From: Dilip Kumar  
Basically you are creating a new table and loading data to it and that means 
you will be less likely to access those data soon so for such thing spoiling 
buffer cache may not be a good idea.
--

Some people, including me, would say that the table will be accessed soon and 
that's why the data is loaded quickly during minimal maintenance hours.


--
I was just suggesting only for experiments for identifying the root cause.
--

I thought this was a good chance to possibly change things for the better (^^).
I guess the user would simply think like this: "I just want to finish CTAS as 
quickly as possible, so I configured it to take advantage of parallelism.  I 
want CTAS to make the most use of our resources.  Why does Postgres try to 
limit resource usage (by using the ring buffer) against my will?"


Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2021-05-27 Thread tsunakawa.ta...@fujitsu.com
From: Dilip Kumar 
> I think some other cause of contention on relation extension locks are
> 1. CTAS is using a buffer strategy and due to that, it might need to
> evict out the buffer frequently for getting the new block in.  Maybe
> we can identify by turning off the buffer strategy for CTAS and
> increasing the shared buffer so that data fits in memory.

Yes, both Bharath-san (on a rich-man's machine) and I (on a poor-man's VM) saw 
that it's effective.  I think we should remove this shackle from CTAS.

The question is why CTAS chose to use the BULKWRITE strategy in the past.  We 
need to know that to make a better decision.  I can understand why VACUUM uses 
a ring buffer, because it should want to act humbly as a background maintenance 
task so as not to cause trouble for frontend tasks.  But why does CTAS have to 
be humble?  If CTAS needs to be modest, why doesn't it use the BULKREAD 
strategy for its SELECT?


Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2021-05-26 Thread tsunakawa.ta...@fujitsu.com
Thank you for the detailed analysis, I'll look into it too.  (The times have 
changed...)

From: Bharath Rupireddy 
> Well, one might think to add more blocks at a time, say
> Min(1024, lockWaiters * 128/256/512) than currently extraBlocks =
> Min(512, lockWaiters * 20);. This will work (i.e. we don't see any
> regression with parallel inserts in CTAS patches), but it can't be a
> practical solution. Because the total pages for the relation will be
> more with many pages having more free space. Furthermore, the future
> sequential scans on that relation might take a lot of time.

> Otherwise, the similar speed up can be observed when the BAS_BULKWRITE
> is increased a bit from the current 16MB to some other reasonable
> value. I earlier tried these experiments.
> 
> Otherwise, as I said in [1], we can also increase the number of extra
> blocks added at a time, say Min(1024, lockWaiters * 128/256/512) than
> currently extraBlocks = Min(512, lockWaiters * 20);. This will also
> give some speedup and we don't see any regression with parallel
> inserts in CTAS patches.
> 
> But, I'm not so sure that the hackers will agree any of the above as a
> practical solution to the "relation extension" problem.

I think I understand your concern about resource consumption and impact on 
other concurrently running jobs (OLTP, data analysis.)

OTOH, what's the situation like when the user wants to run CTAS and, further, 
wants to speed it up by using parallelism?  Isn't it okay to let the (parallel) 
CTAS use as much as it wants?  At least, I think we can provide another mode 
for it, like Oracle provides conventional path mode and direct path mode for 
INSERT and data loading.

What do we want to do to maximize parallel CTAS speedup if we were a bit 
unshackled from the current constraints (alignment with existing code, impact 
on other concurrent workloads)?

* Use as many shared buffers as possible to decrease WAL flush.
Otherwise, INSERT SELECT may be faster?

* Minimize relation extension (= increase the block count per extension)
posix_fallocate() would help too.

* Allocate added pages among parallel workers, and each worker fills pages to 
their full capacity.
The worker that extended the relation stores the page numbers of added pages in 
shared memory for parallel execution.  Each worker gets a page from there after 
waiting for the relation extension lock, instead of using FSM.
The last pages that the workers used will be filled halfway, but the amount of 
unused space should be low compared to the total table size.
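
A rough sketch of the last idea, with entirely hypothetical names: the worker 
that extends the relation pushes the new block numbers into a small 
shared-memory queue, and the other workers pop blocks from it instead of 
consulting the FSM.  Queue sizing, overflow handling and the actual relation 
extension are omitted.

#include "postgres.h"
#include "storage/block.h"
#include "storage/spin.h"

typedef struct CtasSharedBlockQueue
{
    slock_t     mutex;          /* protects the fields below */
    int         nblocks;        /* number of block numbers currently queued */
    BlockNumber blocks[FLEXIBLE_ARRAY_MEMBER];
} CtasSharedBlockQueue;

/* called by the worker that has just extended the relation by "count" blocks */
static void
ctas_push_blocks(CtasSharedBlockQueue *queue, BlockNumber first, int count)
{
    SpinLockAcquire(&queue->mutex);
    for (int i = 0; i < count; i++)
        queue->blocks[queue->nblocks++] = first + i;
    SpinLockRelease(&queue->mutex);
}

/* called by any worker that needs a fresh page; false means "extend again" */
static bool
ctas_pop_block(CtasSharedBlockQueue *queue, BlockNumber *blkno)
{
    bool        found = false;

    SpinLockAcquire(&queue->mutex);
    if (queue->nblocks > 0)
    {
        *blkno = queue->blocks[--queue->nblocks];
        found = true;
    }
    SpinLockRelease(&queue->mutex);

    return found;
}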



Regards
Takayuki Tsunakawa



RE: Batch insert in CTAS/MatView code

2021-05-26 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> I think the "New Table Access Methods for Multi and Single Inserts"
> patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat
> View, Refresh Mat View and so on. It also has a patch for multi
> inserts in CTAS and Refresh Mat View
> (v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch).
> Please see that thread and feel free to review it.

Ouch, I didn't notice that the patch was reborn in that thread.  OK.

Could you add it to the CF if you haven't yet?


Regards
Takayuki Tsunakawa



RE: Batch insert in CTAS/MatView code

2021-05-26 Thread tsunakawa.ta...@fujitsu.com
Hello Paul-san,

From: Daniel Gustafsson 
> In an off-list discussion with Paul, we decided to withdraw this patch for now
> and instead create a new entry when there is a re-worked patch.  This has
> now
> been done in the CF app.

Would you mind if I take over this patch for PG 15?  I find this promising, as 
Bharath-san demonstrated good performance by combining your patch and the 
parallel CTAS that Bharath-san has been working on.  We'd like to do things 
that enhance parallelism.

Please allow me to start posting the revised patches next week if I can't get 
your reply.


Regards
Takayuki Tsunakawa





RE: Parallel Inserts in CREATE TABLE AS

2021-05-25 Thread tsunakawa.ta...@fujitsu.com
From: houzj.f...@fujitsu.com 
> + /*
> +  * We don't need to skip contacting FSM while inserting tuples
> for
> +  * parallel mode, while extending the relations, workers
> instead of
> +  * blocking on a page while another worker is inserting, can
> check the
> +  * FSM for another page that can accommodate the tuples.
> This results
> +  * in major benefit for parallel inserts.
> +  */
> + myState->ti_options = 0;
> 
> I am not quite sure that disabling the " SKIP FSM " in parallel worker will 
> bring
> performance gain.
> In my test environment, if I change this code to use option "
> TABLE_INSERT_SKIP_FSM ", then there
> seems no performance degradation.

+1, probably.

Does the code comment describe a situation like this?

1. Worker 1 is inserting into page 1.

2. Worker 2 tries to insert into page 1, but cannot acquire the buffer content 
lock of page 1 because worker 1 holds it.

3. Worker 2 looks up FSM to find a page with enough free space.

But isn't the FSM still empty during CTAS?



Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2021-05-25 Thread tsunakawa.ta...@fujitsu.com
Bharath-san, all,


Hmm, I didn't experience performance degradation on my poor-man's Linux VM (4 
CPU, 4 GB RAM, HDD)...

[benchmark preparation]
autovacuum = off
shared_buffers = 1GB
checkpoint_timeout = 1h
max_wal_size = 8GB
min_wal_size = 8GB
(other settings to enable parallelism)
CREATE UNLOGGED TABLE a (c char(1100));
INSERT INTO a SELECT i FROM generate_series(1, 30) i;
(the table size is 335 MB)

[benchmark]
CREATE TABLE b AS SELECT * FROM a;
DROP TABLE a;
CHECKPOINT;
(measure only CTAS)


[results]
parallel_leader_participation = off
  workers  time(ms)
  0  3921
  2  3290
  4  3132
parallel_leader_participation = on
  workers  time(ms)
  2  3266
  4  3247


Although this may be a controversial and possibly crazy idea, the following 
change brought a 4-11% speedup.  This is because I thought parallel workers 
might contend for WAL flushes as a result of them using the limited ring buffer 
and flushing dirty buffers when the ring buffer is filled.  Can we take 
advantage of this?

[GetBulkInsertState]
/*  bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);*/
bistate->strategy = NULL;


[results]
parallel_leader_participation = off
  workers  time(ms)
  0  3695  (5% reduction)
  2  3135  (4% reduction)
  4  2767  (11% reduction)


Regards
Takayuki Tsunakawa



RE: Skip partition tuple routing with constant partition key

2021-05-24 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> Ah, Maybe I found the issue.
> Attaching a new patch, please have a try on this patch.

Thanks, it has compiled perfectly without any warning.


Regards
Takayuki Tsunakawa



RE: Skip partition tuple routing with constant partition key

2021-05-24 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> It seems a little strange, I have compiled it alone in two different linux 
> machine
> and did
> not find such an error. Did you compile it on a windows machine ?

On Linux, it produces:

gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -O0 -I../../../src/include  -D_GNU_SOURCE   -c -o heap.o heap.c -MMD -MP -MF .deps/heap.Po
In file included from heap.c:86:
../../../src/include/utils/partcache.h:54: error: redefinition of typedef 'PartKeyContext'
../../../src/include/partitioning/partdefs.h:26: note: previous declaration of 'PartKeyContext' was here


Regards
Takayuki Tsunakawa



RE: Skip partition tuple routing with constant partition key

2021-05-24 Thread tsunakawa.ta...@fujitsu.com
From: houzj.f...@fujitsu.com 
> I think this patch can solve the performance degradation of key expression 
> after
> applying the [Save the last partition] patch.
> Besides, this could be a separate patch which can improve some more cases.
> Thoughts ?

Thank you for proposing an impressive improvement so quickly!  Yes, I'm 
inclined to adopt Amit-san's patch as a base because it's compact and readable, 
plus add this patch of yours to complement the partition key function case.

But ...

* Applying your patch alone produced a compilation error.  I'm sorry I 
mistakenly deleted the compile log, but it said something like "There's a 
redeclaration of PartKeyContext in partcache.h; the original definition is in 
partdef.h"

* Hmm, this may be too much to expect, but I wonder if we can make the patch 
more compact...


Regards
Takayuki Tsunakawa



RE: Bug in query rewriter - hasModifyingCTE not getting set

2021-05-21 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> "tsunakawa.ta...@fujitsu.com"  writes:
> > Finally, I think I've understood what you meant.  Yes, the current code 
> > seems
> to be wrong.
> 
> I'm fairly skeptical of this claim, because that code has stood for a
> long time.  Can you provide an example (not involving hasModifyingCTE)
> in which it's wrong?

Hmm, I can't think of an example.  I wonder whether attaching WITH before 
INSERT SELECT and putting WITH between INSERT and SELECT produce the same 
results.  Maybe that's why the regression test succeeds with the patch.

To confirm, the question is that when we have the following rule in place and 
the client issues the query:

[rule]
CREATE RULE myrule AS
ON {INSERT | UPDATE | DELETE} TO orig_table
DO INSTEAD
INSERT INTO some_table SELECT ...;

[original query]
WITH t AS (
SELECT and/or NOTIFY
)
{INSERT INTO | UPDATE | DELETE FROM} orig_table ...;

which of the following two queries do we expect?

[generated query 1]
WITH t AS (
SELECT and/or NOTIFY
)
INSERT INTO some_table SELECT ...;

[generated query 2]
INSERT INTO some_table
WITH t AS (
SELECT and/or NOTIFY
)
SELECT ...;

Although both may produce the same results, I naturally expected query 1, 
because (1) WITH was originally attached before the top-level query, and (2) 
the top-level query has been replaced with a rule action, so it's natural that 
the WITH is attached before the rule action.  A super-abbreviated description 
is:

x -> y  (rule)
WITH t x  (original query)
WITH t y  (generated query 1)
one-part-of-y WITH t another-part-of-y  (generated query 2)

As we said, we agree to fail the query if it's the above generated query 2 and 
the WITH contains a data-modifying CTE, if we cannot be confident in accepting 
the change to the WITH position.  Which do you think we should choose?


Regards
Takayuki Tsunakawa





RE: Bug in query rewriter - hasModifyingCTE not getting set

2021-05-20 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> I think either the bit about rule_action is unnecessary, or most of
> the code immediately above this is wrong, because it's only updating
> flags in sub_action.  Why do you think it's necessary to change
> rule_action in addition to sub_action?

Finally, I think I've understood what you meant.  Yes, the current code seems 
to be wrong.  rule_action is different from sub_action only when the rule 
action (the query specified in CREATE RULE) is INSERT SELECT.  In that case, 
rule_action points to the entire INSERT SELECT, while sub_action points to the 
SELECT part.  So, we should add the CTE list and set 
hasModifyingCTE/hasRecursive flags in rule_action.
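
A minimal sketch of the idea (not the attached patch itself): in 
rewriteRuleAction(), after the original query's CTE list has been added to 
sub_action, also propagate the CTE property flags to rule_action when it is a 
separate Query node, i.e., the INSERT SELECT case.

    if (rule_action != sub_action)
    {
        rule_action->hasModifyingCTE = parsetree->hasModifyingCTE;
        rule_action->hasRecursive = parsetree->hasRecursive;
    }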


> Hm.  So after looking at this more, the problem is that the rewrite
> is producing something equivalent to
> 
> INSERT INTO bug6051_2
> (WITH t1 AS (DELETE FROM bug6051 RETURNING *) SELECT * FROM t1);

Yes.  In this case, the WITH clause must be put before INSERT.

The attached patch is based on your version.  It includes cosmetic changes to 
use = instead of |= for boolean variable assignments.  make check passed.  
Also, Greg-san's original failed test case succeeded.  I confirmed that the 
hasModifyingCTE of the top-level rewritten query is set to true now by looking 
at the output of debug_print_rewritten and debug_print_plan.


Regards
Takayuki Tsunakawa



v3-0001-propagate-CTE-property-flags-in-rewriter.patch
Description: v3-0001-propagate-CTE-property-flags-in-rewriter.patch


RE: Skip partition tuple routing with constant partition key

2021-05-19 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> On Tue, May 18, 2021 at 11:11 AM houzj.f...@fujitsu.com
>  wrote:
> > For some big data scenarios, we sometimes transfer data from one table
> > (which only stores non-expired data) to another table (historical data)
> > for future analysis.
> > In this case, we import data into the historical table regularly (it could
> > be every day or every half day), and the data is likely to be imported
> > with a date label specified, so all of the data to be imported this time
> > belongs to the same partition, which is partitioned by time range.
> 
> Is directing that data directly into the appropriate partition not an
> acceptable solution to address this particular use case?  Yeah, I know
> we should avoid encouraging users to perform DML directly on
> partitions, but...

Yes, I want to make/keep it possible for application developers to be unaware 
of partitions.  I believe that's why David-san, Alvaro-san, and you have made 
great efforts to improve partitioning performance.  So, I'm +1 for what Hou-san 
is trying to achieve.

Is there something you're concerned about?  The amount and/or complexity of 
added code?


Regards
Takayuki Tsunakawa



RE: Bug in query rewriter - hasModifyingCTE not getting set

2021-05-17 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> In view of this, maybe the right thing is to disallow modifying CTEs
> in rule actions in the first place.  I see we already do that for
> views (i.e. ON SELECT rules), but they're not really any safer in
> other types of rules.

By views, you meant something like the following, didn't you?

postgres=# create view myview as with t as (delete from b) select * from a;
ERROR:  views must not contain data-modifying statements in WITH

OTOH, the examples Greg-san showed contain the CTE not in the rule action but 
in the query that the rule is applied to.  So, I think the solution would be 
something different.


>  Given that non-SELECT rules are an undertested
> legacy thing, I'm not that excited about moving mountains to make
> this case possible.

> That semantic issue doesn't get any less pressing just because the query
> was generated by rewrite.  So I now think that what we have to do is
> throw an error if we have a modifying CTE and sub_action is different
> from rule_action.  Not quite sure how to phrase the error though.

So, how about just throwing an error when the original query (not the rule 
action) has a data-modifying CTE?  The error message would be something like "a 
query containing a data-modifying CTE cannot be executed because there is some 
rule applicable to the relation".  This may be overkill and too many regression 
tests might fail, so we may have to add some condition to determine if we error 
out.

Or, I thought Greg-san's patch would suffice.  What problem do you see in it?

I couldn't imagine what "mountains" are.  Could you tell me what's that?


Regards
Takayuki Tsunakawa





RE: Support for VACUUMing Foreign Tables

2021-05-16 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> This can be useful in situations like where there are many remote
> postgres servers that are connected to a single coordinator on which
> foreign tables are defined for each of the remote tables. In this
> case, the DBA (or whoever is responsible to do that job) doesn't have
> to figure out which remote server should be logged onto to perform the
> VACUUM. They can issue VACUUM command on the foreign table from the
> coordinator server.

I thought the FDW mechanism was initially, and probably still is, designed to 
access data on other data sources that are operated independently, mostly for 
data integration.  Are you thinking of shared-nothing clustering that consists 
of tightly coupled servers, since you're referring to a coordinator server?  
(Is EDB (re-)starting the sharding scale-out?)


> There are other databases that have MVCC implemented for which the
> bloat clean up might be necessary at some point. They may not have the
> same terminology that postgres has for cleaning up the bloat. For
> instance, MySQL (instead of VACUUM they have OPTIMIZE TABLE command)
> which can be connected to postgres_fdw using supported fdw

MySQL's OPTIMIZE TABLE looks like VACUUM FULL, not plain VACUUM, although I'm 
not completely sure.

How would the various options map to the FDW interface, such as FREEZE, 
VERBOSE, vacuum_truncate, index_cleanup?  Also, how would the following GUC 
settings propagate to the foreign server?

SET vacuum_freeze_table_age = 0;
SET vacuum_freeze_min_age = 0;
VACUUM mytable;

I think people who want to run manual VACUUM will want to control VACUUM 
behavior.  But I'm afraid VACUUM is too specific to Postgres to be a good fit 
for the FDW interface.
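
For example, a single local VACUUM invocation can already carry several 
Postgres-specific knobs, all of which a foreign-table VACUUM would somehow have to 
transport through the FDW interface (mytable is a placeholder):

VACUUM (FREEZE, VERBOSE, INDEX_CLEANUP ON, TRUNCATE OFF) mytable;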


What's our stance toward the FDW interface?  I've been assuming 1) so far.

1) We carefully choose FDW routines so that many other data sources can provide 
implementations for.  We want to allow access to various data sources through 
the frame of Postgres.

2) We don't care other data sources.  We are happy if multiple Postgres 
instances can interoperate with each other.  Other data source providers can 
choose to implement suitable routines when they can fit their implementations 
into the Postgres world.



Regards
Takayuki Tsunakawa



RE: FDW and connections

2021-05-16 Thread tsunakawa.ta...@fujitsu.com
From: Phil Godfrin 
My question is - how does the call to GetConnection() know what port to use? 
Lets say we're using PGBouncer to connect on the local server at port 6432, but 
there is no pgbouncer listening at the foreign server, what port gets passed? 
My first thought is whatever the client connects port is, but I believe 
pgbouncer ultimately hands of the connection to whatever port you have defined 
for the local database...
This gets important when one has an HAProxy instance between the local and 
foreign servers which is interrogating the port number to decide which ip:port 
to send the request to, ultimately the master or replicant at the foreign 
remoter server.
So how does the port number get propagated from local to foreign server???
--


postgres_fdw uses libpq as a client to connect to the foreign server.  So, as 
the following says, you can specify the libpq's "port" parameter in CREATE 
SERVER.  If it's omitted as in your case, the default 5432 will be used.

F.35.1.1. Connection Options
https://www.postgresql.org/docs/devel/postgres-fdw.html

"A foreign server using the postgres_fdw foreign data wrapper can have the same 
options that libpq accepts in connection strings, as described in Section 
34.1.2, except that these options are not allowed or have special handling:"


I'm afraid it's better to post user-level questions like this to 
pgsql-gene...@lists.postgresql.org.


Regards
Takayuki Tsunakawa



RE: Support for VACUUMing Foreign Tables

2021-05-13 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> I think it will be useful to allow foreign tables to be VACUUMed if
> the underlying FDW supports, currently VACUUM doesn't support foreign
> tables, see [1].

Could you let us imagine more concretely how useful it will be?  While TRUNCATE 
can be part of an application's data processing as alternative to DELETE, I 
think VACUUM is purely the data storage maintenance that's performed by the DBA 
and can be done naturally locally on the server where the table resides.  (The 
existing ANALYZE on FDW is an exception; it's useful to also have data 
statistics locally.)


> this may not be much useful for FDWs that connect to remote non-MVCC
> databases where the concept of VACUUM may not apply, but for
> postgres_fdw and others it might help.

Can you show some examples of "others"?  I believe we should be careful not to 
make the FDW interface a swamp for functions that are only convenient for 
PostgreSQL.

How about adding a routine to the FDW interface that allows executing an 
arbitrary command, like the following?  VACUUM will be able to use this.

PGresult *DoCommandPassThrough(ForeignTable *table, const char *command);

Or, maybe it's more flexible to use ForeignServer instead of ForeignTable.


Regards
Takayuki Tsunakawa



RE: Inaccurate error message when set fdw batch_size to 0

2021-05-09 Thread tsunakawa.ta...@fujitsu.com
From: houzj.f...@fujitsu.com  
> So, is it better to change the error message to “fetch_size requires a 
> positive integer value” ?
> I also found fetch_size has the similar issue, attaching a patch to fix this.

I have a faint memory that I fixed them after receiving the same feedback from 
someone else, strange...  Anyway, thanks.


Regards
Takayuki Tsunakawa





RE: batch fdw insert bug (Postgres 14)

2021-05-09 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> I think the simplest fix is simply to pstrdup() the query when building
> fmstate in create_foreign_modify. We've already been doing that for
> orig_query anyway. That seems cleaner than printing the last query we
> build (which would be confusing I think).
> 
> I've pushed a fix doing that. We only need that for INSERT queries, and
> we might even restrict that to cases with batching if needed.

Thank you for investigating and committing the fix.  (I'm relieved that the 
feature was not reverted.)


Regards
Takayuki Tsunakawa



RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-05-07 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> On Wed, May 5, 2021 at 10:54 PM tsunakawa.ta...@fujitsu.com
>  wrote:
> > As proposed in this thread and/or "Parallel INSERT SELECT take 2", we
> thought of detecting parallel unsafe function execution during SQL statement
> execution, instead of imposing much overhead to check parallel safety during
> query planning.  Specifically, we add parallel safety check in fmgr_info()
> and/or FunctionCallInvoke().
> 
> I haven't read that thread, but I don't understand how that can work.
> The reason we need to detect it at plan time is because we might need
> to use a different plan. At execution time it's too late for that.

(I forgot to say this in my previous email.  Robert-san, thank you very much 
for taking time to look at this and giving feedback.  It was sad that we had to 
revert our parallel INSERT SELECT for redesign at the very end of the last CF.  
We need advice and suggestions from knowledgeable and thoughtful people like 
Tom-san, Andres-san and you in early stages to not repeat the tragedy.)

I'd really like you to have a look at the first mail in [1], and to get your 
feedback like "this part should be like ... instead" and "this part would 
probably work, I think."  Without feedback from leading developers, I'm 
somewhat at a loss if and how we can proceed with the proposed approach.

To put it shortly, we found that it can take non-negligible time for the 
planner to check the parallel safety of the target table of INSERT SELECT when 
it has many (hundreds or thousands of) partitions.  The check also added much 
complicated code, too.  So, we got inclined to take Tom-san's suggestion -- let 
the user specify the parallel safety of the target table with CREATE/ALTER 
TABLE and the planner just decides a query plan based on it.  Caching the 
results of parallel safety checks in relcache or a new shared hash table didn't 
seem to work well to me, or it should be beyond my brain at least.

We may think that it's okay to just believe the user-specified parallel safety. 
 But I thought we could step up and do our best to check the parallel safety 
during statement execution, if it's not very invasive in terms of performance 
and code complexity.  The aforementioned idea is that if the parallel processes 
find the called functions parallel unsafe, they error out.  All ancillary 
objects of the target table, data types, constraints, indexes, triggers, etc., 
come down to some UDF, so it should be enough to check the parallel safety when 
the UDF is called.


> Also, it seems potentially quite expensive. A query may be planned
> once and executed many times. Also, a single query execution may call
> the same SQL function many times. I think we don't want to incur the
> overhead of an extra syscache lookup every time anyone calls any
> function. A very simple expression like a+b+c+d+e involves four
> function calls, and + need not be a built-in, if the data type is
> user-defined. And that might be happening for every row in a table
> with millions of rows.

We (optimistically) expect that the overhead won't be serious, because the 
parallel safety information is already at hand in the FmgrInfo struct when the 
function is called.  We don't have to look up the syscache every time the 
function is called.

Of course, adding even a single if statement may lead to a disaster in a 
critical path, so we need to assess the performance.  I'd also appreciate it if 
you could suggest a good workload to experiment with in the thread above.
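
As a starting point, something along the lines of your a+b+c+d+e example over a large 
table should stress the per-call path well (a sketch; the table name and row count are 
arbitrary):

-- Many built-in '+' calls per row, repeated across millions of rows in a parallel plan.
CREATE TABLE calls AS
    SELECT g AS a, g AS b, g AS c, g AS d, g AS e
    FROM generate_series(1, 10000000) AS g;
ANALYZE calls;
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE) SELECT count(a + b + c + d + e) FROM calls;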



[1]
Parallel INSERT SELECT take 2
https://www.postgresql.org/message-id/tyapr01mb29905a9ab82cc8ba50ab0f80fe...@tyapr01mb2990.jpnprd01.prod.outlook.com


Regards
Takayuki Tsunakawa



RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-05-06 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> On Wed, Apr 28, 2021 at 9:42 PM houzj.f...@fujitsu.com
>  wrote:
> > So, If we do not want to lock down the parallel safety of built-in 
> > functions.
> > It seems we can try to fetch the proparallel from pg_proc for built-in 
> > function
> > in fmgr_info_cxt_security too. To avoid recursive safety check when fetching
> > proparallel from pg_proc, we can add a Global variable to mark is it in a
> recursive state.
> > And we skip safety check in a recursive state, In this approach, parallel 
> > safety
> > will not be locked, and there are no new members in FmgrBuiltin.
> >
> > Attaching the patch about this approach [0001-approach-1].
> > Thoughts ?
> 
> This seems to be full of complicated if-tests that don't seem
> necessary and aren't explained by the comments. Also, introducing a
> system cache lookup here seems completely unacceptable from a
> reliability point of view, and I bet it's not too good for
> performance, either.

Agreed.  Also, PG_TRY() would be relatively heavyweight here.  I'm inclined to 
avoid this approach.


> > I also attached another approach patch [0001-approach-2] about adding
> > parallel safety in FmgrBuiltin, because this approach seems faster and
> > we can combine some bool member into a bitflag to avoid enlarging the
> > FmgrBuiltin array, though this approach will lock down the parallel safety
> > of built-in function.
> 
> This doesn't seem like a good idea either.

This looks good to me.  What makes you think so?

That said, I actually think we want to avoid even this change.  That is, I'm 
wondering if we can skip the parallel safety check for built-in functions.

Can anyone think of the need to check the parallel safety of built-in functions 
in the context of parallel INSERT SELECT?  The planner already checks (or can 
check) the parallel safety of the SELECT part with max_parallel_hazard().  
Regarding the INSERT part, we're trying to rely on the parallel safety of the 
target table that the user specified with CREATE/ALTER TABLE.  I don't see 
where we need to check the parallel safety of built-in functions.


Regards
Takayuki Tsunakawa



RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-05-05 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> On Tue, May 4, 2021 at 11:47 PM Greg Nancarrow 
> wrote:
> > Problem is, for built-in functions, the changes are allowed, but for
> > some properties (like strict) the allowed changes don't actually take
> > effect (this is what Amit was referring to - so why allow those
> > changes?).
> > It's because some of the function properties are cached in
> > FmgrBuiltins[] (for a "fast-path" lookup for built-ins), according to
> > their state at build time (from pg_proc.dat), but ALTER FUNCTION is
> > just changing it in the system catalogs. Also, with sufficient
> > privileges, a built-in function can be redefined, yet the original
> > function (whose info is cached in FmgrBuiltins[]) is always invoked,
> > not the newly-defined version.
> 
> I agree. I think that's not ideal. I think we should consider putting
> some more restrictions on updating system catalog changes, and I also
> think that if we can get out of having strict need to be part of
> FmgrBuiltins[] that would be good. But what I don't agree with is the
> idea that since strict already has this problem, it's OK to do the
> same thing with parallel-safety. That seems to me to be making a bad
> situation worse, and I can't see what problem it actually solves.

Let me divide the points:


(1) Is it better to get hardcoded function properties out of fmgr_builtins[]?
It's hardly worth doing so, or even thinking about it.  It's no business for users 
to change system objects, in this case system functions.

Also, hardcoding is a worthwhile strategy for good performance or other 
inevitable reasons.  Postgres is using it as in the system catalog relcache 
below.

[relcache.c]
/*
 *  hardcoded tuple descriptors, contents generated by genbki.pl
 */
static const FormData_pg_attribute Desc_pg_class[Natts_pg_class] = 
{Schema_pg_class};
static const FormData_pg_attribute Desc_pg_attribute[Natts_pg_attribute] = 
{Schema_pg_attribute};
...


(2) Should it be disallowed for users to change system function properties with 
ALTER FUNCTION?
Maybe yes, but it's not an important issue for achieving parallel INSERT SELECT 
at the moment.  So, I think this can be discussed in an independent separate 
thread.

As a reminder, Postgres has safeguards against modifying system objects, as 
follows.

test=# drop table^C
test=# drop function pg_wal_replay_pause();
ERROR:  cannot drop function pg_wal_replay_pause() because it is required by 
the database system
test=# drop table pg_largeobject;
ERROR:  permission denied: "pg_largeobject" is a system catalog

OTOH, Postgres doesn't disallow changing the system table column values 
directly, such as UPDATE pg_proc SET   But it's warned in the manual that 
such operations are dangerous.  So, we don't have to care about it.

Chapter 52. System Catalogs
https://www.postgresql.org/docs/devel/catalogs.html

"You can drop and recreate the tables, add columns, insert and update values, 
and severely mess up your system that way. Normally, one should not change the 
system catalogs by hand, there are normally SQL commands to do that. (For 
example, CREATE DATABASE inserts a row into the pg_database catalog — and 
actually creates the database on disk.) There are some exceptions for 
particularly esoteric operations, but many of those have been made available as 
SQL commands over time, and so the need for direct manipulation of the system 
catalogs is ever decreasing."


(3) Why do we want to have parallel-safety in fmgr_builtins[]?
As proposed in this thread and/or "Parallel INSERT SELECT take 2", we thought 
of detecting parallel unsafe function execution during SQL statement execution, 
instead of imposing much overhead to check parallel safety during query 
planning.  Specifically, we add parallel safety check in fmgr_info() and/or 
FunctionCallInvoke().

(Alternatively, I think we can conclude that we assume parallel unsafe built-in 
functions won't be used in parallel DML.  In that case, we don't change 
FmgrBuiltin and we just skip the parallel safety check for built-in functions 
when the function is called.  Would you buy this?)


Regards
Takayuki Tsunakawa




RE: A test for replay of regression tests

2021-04-23 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> We have automated tests for many specific replication and recovery scenarios,
> but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably along
> with other custom tests) to exercise it, sometimes with
> wal_consistency_check enabled.  So, why don't we automate that?  Aside
> from exercising the WAL decoding machinery (which brought me here), that'd
> hopefully provide some decent improvements in coverage of the various redo
> routines, many of which are not currently exercised at all.
> 
> I'm not quite sure where it belongs, though.  The attached initial sketch 
> patch

I think that's a good attempt.  It'd be better to confirm that the database 
contents are identical on the primary and standby.  But... I remember when I 
ran make installcheck with a standby connected, then ran pg_dumpall on both the 
primary and standby and compare the two output files, about 40 lines of 
difference were observed.  Those differences were all about the sequence 
values.  I didn't think about whether I should/can remove the differences.


+# XXX The tablespace tests don't currently work when the standby shares a
+# filesystem with the primary due to colliding absolute paths.  We'll skip
+# that for now.

Maybe we had better have a hidden feature that creates tablespace contents in

/tablespace_location/PG_..._<name>/

where <name> is the value of cluster_name or application_name.

Or, we may provide a visible feature that allows users to store tablespace 
contents in a specified directory regardless of the LOCATION value in CREATE 
TABLESPACE.  For instance, add a GUC like

table_space_directory = '/some_dir'

Then, the tablespace contents are stored in /some_dir/<...>/.  This 
may be useful if a DBaaS provider wants to offer some tablespace-based feature, 
say tablespace transparent data encryption, but doesn't want users to specify 
filesystem directories.


Regards
Takayuki Tsunakawa




RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-22 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> For approach 1): I think it could result in infinite recursion.
> 
> For example:
> If we first access one built-in function A which have not been cached,
> it need access the pg_proc, When accessing the pg_proc, it internally still 
> need
> some built-in function B to scan.
> At this time, if B is not cached , it still need to fetch function B's 
> parallel flag by
> accessing the pg_proc.proparallel.
> Then it could result in infinite recursion.
> 
> So, I think we can consider the approach 2)

Hmm, that makes sense.  That's a problem structure similar to that of relcache. 
 Only one choice is left, then, unless there's a better idea.



Regards
Takayuki Tsunakawa






RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-22 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> I agree that it's better to mark the function with correct parallel safety 
> lable.
> Especially for the above functions which will be executed in parallel mode.
> It will be friendly to developer and user who is working on something related 
> to
> parallel test.
> 
> So, I attached the patch to mark the above functions parallel safe.

Thank you, the patch looks good.  Please register it with the next CF if not 
yet.


Regards
Takayuki Tsunakawa






RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-22 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> Amit Kapila  writes:
> > IIUC, the idea here is to check for parallel safety of functions at
> > someplace in the code during function invocation so that if we execute
> > any parallel unsafe/restricted function via parallel worker then we
> > error out. If so, isn't it possible to deal with built-in and
> > non-built-in functions in the same way?
> 
> Yeah, one of the reasons I doubt this is a great idea is that you'd
> still have to fetch the pg_proc row for non-built-in functions.
> 
> The obvious place to install such a check is fmgr_info(), which is
> fetching said row anyway for other purposes, so it's really hard to
> see how adding anything to FmgrBuiltin is going to help.

Thank you, fmgr_info() looks like the best place to do the parallel safety 
check.  Having a quick look at its callers, I didn't find any concerning place 
(of course, we can't be relieved until the regression test succeeds.)  Also, 
with fmgr_info(), we don't have to find other places to add the check to deal 
with functions calls in execExpr.c and execExprInterp.c.  This is beautiful.

But the current fmgr_info() does not check the parallel safety of builtin 
functions.  It does not have information to do that.  There are two options.  
Which do you think is better?  I think 2.

1) fmgr_info() reads pg_proc like for non-builtin functions
This ruins the effort for the fast path for builtin functions.  I can't imagine 
how large the adverse impact on performance would be, but I'm worried.

The benefit is that ALTER FUNCTION on builtin functions takes effect.  But such 
operations are nonsensical, so I don't think we want to gain such a benefit.


2) Gen_fmgrtab.pl adds a member for proparallel in FmgrBuiltin
But we don't want to enlarge FmgrBuiltin struct.  So, change the existing bool 
members strict and retset into one member of type char, and represent the 
original values with some bit flags.  Then we use that member for proparallel 
as well.  (As a result, one byte is left for future use.)


I think we'll try 2).  I'd be grateful if you could point out anything I need 
to be careful about.


Regards
Takayuki Tsunakawa






RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-20 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> [ raised eyebrow... ]  I find it very hard to understand why that would
> be necessary, or even a good idea.  Not least because there's no spare
> room there; you'd have to incur a substantial enlargement of the
> array to add another flag.  But also, that would indeed lock down
> the value of the parallel-safety flag, and that seems like a fairly
> bad idea.

You're right, FmgrBuiltins is already fully packed (24 bytes on 64-bit 
machines).  Enlarging the frequently accessed fmgr_builtins array may have an 
unexpectedly large adverse effect on performance.

I wanted to check the parallel safety of functions, which various objects (data 
type, index, trigger, etc.) come down to, in FunctionCallInvoke() and a few 
other places.  But maybe we should skip the check for built-in functions.  That's a matter 
of where we draw a line between where we check and where we don't.


Regards
Takayuki Tsunakawa






RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-20 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> Bharath Rupireddy  writes:
> > IMO, the reason for not checking the parallel safety of the support
> > functions is that the functions themselves can have whole lot of other
> > functions (can be nested as well) which might be quite hard to check
> > at the planning time. That is why the job of marking an aggregate as
> > parallel safe is best left to the user.
> 
> Yes.  I think the documentation is perfectly clear that this is
> intentional; I don't see a need to change it.

OK, that's what I expected.  I understood from this that Postgres's stance 
toward parallel safety is that Postgres makes a best effort to check parallel 
safety (as far as it doesn't hurt performance much, and perhaps the core code 
doesn't get very complex), and the user should be responsible for the actual 
parallel safety of ancillary objects (in this case, support functions for an 
aggregate) of the target object that he/she marked as parallel safe.


> >> Should we add a member for parallel safety in fmgr_builtins[], and disallow
> ALTER FUNCTION to change the parallel safety of builtin UDFs?
> 
> No.  You'd have to be superuser anyway to do that, and we're not in the
> habit of trying to put training wheels on superusers.

Understood.  However, we may add the parallel safety member in fmgr_builtins[] 
in another thread for parallel INSERT SELECT.  I'd appreciate your comment on 
this if you see any concern.


> Don't have an opinion about the other points yet.

I'd like to have your comments on them, too.  But I understand you must be so 
busy at least until the beta release of PG 14.


Regards
Takayuki Tsunakawa






[bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-20 Thread tsunakawa.ta...@fujitsu.com
Hello,


I think we've found a few existing problems with handling the parallel safety 
of functions while doing an experiment.  Could I hear your opinions on what we 
should do?  I'd be willing to create and submit a patch to fix them.

The experiment is to add a parallel safety check in FunctionCallInvoke() and 
run the regression test with force_parallel_mode=regress.  The added check 
errors out with ereport(ERROR) when the about-to-be-called function is parallel 
unsafe and the process is currently in parallel mode.  6 test cases failed 
because the following parallel-unsafe functions were called:

dsnowball_init
balkifnull
int44out
text_w_default_out
widget_out

The first function is created in src/backend/snowball/snowball_create.sql for 
full text search.  The remaining functions are created during the regression 
test run.

The relevant issues follow.


(1)
All the above functions are actually parallel safe looking at their 
implementations.  It seems that their CREATE FUNCTION statements are just 
missing PARALLEL SAFE specifications, so I think I'll add them.  
dsnowball_lexize() may also be parallel safe.
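
A sketch of what adding the missing markings would look like (the signatures below are 
illustrative only; the real definitions live in the regression test and snowball SQL 
files):

-- For an already-created function:
ALTER FUNCTION widget_out(widget) PARALLEL SAFE;

-- Or directly in the CREATE FUNCTION statement:
CREATE FUNCTION my_type_out(my_type) RETURNS cstring
    AS 'MODULE_PATHNAME', 'my_type_out'
    LANGUAGE C STRICT IMMUTABLE PARALLEL SAFE;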


(2)
I'm afraid the above phenomenon reveals that postgres overlooks parallel safety 
checks in some places.  Specifically, we noticed the following:

* User-defined aggregate
CREATE AGGREGATE allows specifying the parallel safety of the aggregate itself, and 
the planner checks it, but the support function of the aggregate is not 
checked.  OTOH, the document clearly says:

https://www.postgresql.org/docs/devel/xaggr.html

"Worth noting also is that for an aggregate to be executed in parallel, the 
aggregate itself must be marked PARALLEL SAFE. The parallel-safety markings on 
its support functions are not consulted."

https://www.postgresql.org/docs/devel/sql-createaggregate.html

"An aggregate will not be considered for parallelization if it is marked 
PARALLEL UNSAFE (which is the default!) or PARALLEL RESTRICTED. Note that the 
parallel-safety markings of the aggregate's support functions are not consulted 
by the planner, only the marking of the aggregate itself."

Can we check the parallel safety of aggregate support functions during 
statement execution and error out?  Is there any reason not to do so?  (A small 
example follows at the end of this list.)

* User-defined data type
The input, output, send, receive, and other functions of a UDT are not checked 
for parallel safety.  Is there any good reason to not check them other than the 
concern about performance?

* Functions for full text search
Should CREATE TEXT SEARCH TEMPLATE ensure that the functions are parallel safe? 
 (Those functions could be changed to parallel unsafe later with ALTER 
FUNCTION, though.)
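
Coming back to the aggregate case, the marking mismatch is easy to set up; the planner 
consults only the PARALLEL marking of the aggregate, not that of its transition 
function (a self-contained sketch):

-- The transition function is deliberately left PARALLEL UNSAFE ...
CREATE FUNCTION my_sum_sfunc(bigint, integer) RETURNS bigint
    LANGUAGE sql PARALLEL UNSAFE
    AS 'SELECT $1 + $2';

-- ... yet the aggregate itself can be declared PARALLEL SAFE, and only this
-- marking is consulted when the planner considers parallel aggregation.
CREATE AGGREGATE my_sum(integer) (
    SFUNC    = my_sum_sfunc,
    STYPE    = bigint,
    INITCOND = '0',
    PARALLEL = SAFE
);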


(3) Built-in UDFs are not checked for parallel safety
The functions defined in fmgr_builtins[], which are derived from pg_proc.dat, 
are not checked.  Most of them are marked parallel safe, but some are parallel 
unsafe/restricted.

Besides, changing their parallel safety with ALTER FUNCTION PARALLEL does not 
affect the selection of query plan.  This is because fmgr_builtins[] does not 
have a member for parallel safety.

Should we add a member for parallel safety in fmgr_builtins[], and disallow 
ALTER FUNCTION to change the parallel safety of builtin UDFs?


Regards
Takayuki Tsunakawa






RE: Schema variables - new implementation for Postgres 15

2021-04-16 Thread tsunakawa.ta...@fujitsu.com
From: Pavel Stehule 
--
I am sorry, but in this topic we have different opinions. The variables in 
PLpgSQL are not transactional too (same is true in Perl, Python, ...). Session 
variables in Oracle, MS SQL, DB2, MySQL are not transactional too. My primary 
focus is PLpgSQL - and I would like to use schema variables as global plpgsql 
variables (from PLpgSQL perspective) - that means in Postgres's perspective 
session variables. But in Postgres, I have to write features that will work 
with others PL too - PLPython, PLPerl, ... Statement SET in ANSI/SQL standard 
(SQL/PSM) doesn't expect transactional behaviour for variables too. 
Unfortunately SET keyword is used in Postgres for GUC, and isn't possible to 
reuse without a compatibility break.

The PostgreSQL configuration is transactional, but it is a different feature 
designed for different purposes. Using GUC like session variables is just a 
workaround. It can be useful for some cases, sure. But it is not usual 
behaviour. And for other cases the transactional behaviour is not practical. 
Schema variables are not replacement of GUC, schema variables are not 
replacement of temporal tables. There is a prepared patch for global temp 
tables. I hope so this patch can be committed to Postgres 15. Global temp 
tables fixes almost all disadvantages of temporary tables in Postgres. So the 
schema variable is not a one row table. It is a different creature - designed 
to support the server's side procedural features.
--

+1
I understand (and wish) this feature is intended to ease migration from Oracle 
PL/SQL, which will further increase the popularity of Postgres.  So, the 
transactional behavior is not necessary unless Oracle has such a feature.

Furthermore, Postgres already has some non-transactional SQL commands, as 
demonstrated below.  So, I don't think we need to reject non-transactional LET.

* Sequence operation: SELECT nextval/setval
* SET [SESSION]
* SET ROLE
* SET SESSION AUTHORIZATION
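
For instance, sequence advancement is never rolled back, which is easy to demonstrate:

CREATE SEQUENCE s;
BEGIN;
SELECT nextval('s');   -- returns 1
ROLLBACK;
SELECT nextval('s');   -- returns 2: the aborted call was not undone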


--
I have prepared a patch that allows non default transactional behaviour (but 
this behaviour should not be default - I didn't design schema variables as temp 
tables replacement). This patch increases the length of the current patch about 
1/4, and I have enough work with rebasing with the current patch, so I didn't 
send it to commitfest now. If schema variables will be inside core, this day 
I'll send the patch that allows transactional behaviour for schema variables - 
I promise.
--

I prefer the simpler, targeted one without transactional behavior initially, 
because added complexity might prevent this feature from being committed in PG 
15.


Regards
Takayuki Tsunakawa



RE: Schema variables - new implementation for Postgres 15

2021-04-15 Thread tsunakawa.ta...@fujitsu.com
From: Pavel Stehule 
--
do $$
declare x int ;
begin
  for i in 1..1000000
  loop
let ooo = i;
  end loop;
end;
$$;

variant 1 .. 1500 ms
variant 2 with PLpgSQL support .. 140 ms
variant 2 without PLpgSQL support 9000 ms
--


That's impressive!  But 1 million variable assignments took only 140 ms?  That 
means one assignment took only about 140 nanoseconds, which is close to a single 
DRAM access.  Can PL/pgSQL processing really be so fast?


Regards
Takayuki Tsunakawa



Parallel INSERT SELECT take 2

2021-04-11 Thread tsunakawa.ta...@fujitsu.com
This is another try of [1].


BACKGROUND


We want to realize parallel INSERT SELECT in the following steps:
1) INSERT + parallel SELECT
2) Parallel INSERT + parallel SELECT

Below are example use cases.  We don't expect high concurrency or an empty data 
source.
* Data loading (ETL or ELT) into an analytics database, typically a data 
warehouse.
* Batch processing in an OLTP database.


PROBLEMS


(1) The overhead of checking parallel-safety could be large
We have to check the target table and its child partitions for parallel safety. 
 That is, we make sure those relations don't have parallel-unsafe domains, 
constraints, indexes, or triggers.

What we should check is the relations into which the statement actually inserts 
data.  However, the planner does not know which relations will be actually 
inserted into.  So, the planner has to check all descendant partitions of a 
target table.  When the target table has many partitions, this overhead could 
be unacceptable when compared to the benefit gained from parallelism.


(2) There's no mechanism for parallel workers to assign an XID
Parallel workers need an XID of the current (sub)transaction when actually 
inserting a tuple (i.e., calling heap_insert()).  When the leader has not got 
the XID yet, the worker may have to assign a new XID and communicate it to the 
leader and other workers so that all parallel processes use the same XID.


SOLUTION TO (1)


The candidate ideas are:

1) Caching the result of parallel-safety check
The planner stores the result of checking parallel safety for each relation in 
relcache, or some purpose-built hash table in shared memory.

The problems are:

* Even if the target relation turns out to be parallel safe by looking at those 
data structures, we cannot assume it remains true until the SQL statement 
finishes.  For instance, other sessions might add a parallel-unsafe index to 
its descendant relations.  Other examples include that when the user changes 
the parallel safety of indexes or triggers by running ALTER FUNCTION on the 
underlying index AM function or trigger function, the relcache entry of the 
table or index is not invalidated, so the correct parallel safety is not 
maintained in the cache.
In that case, when the executor encounters a parallel-unsafe object, it can 
change the cached state as being parallel-unsafe and error out.

* Can't ensure fast access.  With relcache, the first access in each session 
has to undergo the overhead of parallel-safety check.  With a hash table in 
shared memory, the number of relations stored there would be limited, so the 
first access after database startup or the hash table entry eviction similarly 
experiences slowness.

* With a new hash table, some lwlock for concurrent access must be added, which 
can have an adverse effect on performance.


2) Enabling users to declare that the table allows parallel data modification
Add a table property that represents parallel safety of the table for DML 
statement execution.  Users specify it as follows:

CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };

This property is recorded in pg_class's relparallel column as 'u', 'r', or 's', 
just like pg_proc's proparallel.  The default is UNSAFE.

The planner assumes that all of the table, its descendant partitions, and their 
ancillary objects have the specified parallel safety or safer one.  The user is 
responsible for its correctness.  If the parallel processes find an object that 
is less safe than the assumed parallel safety during statement execution, they 
throw an ERROR and abort the statement execution.

The objects that relate to the parallel safety of a DML target table are as 
follows:

* Column default expression
* DOMAIN type CHECK expression
* CHECK constraints on column
* Partition key
* Partition key support function
* Index expression
* Index predicate
* Index AM function
* Operator function
* Trigger function

When the parallel safety of some of these objects is changed, it's costly to 
reflect it on the parallel safety of tables that depend on them.  So, we don't 
do it.  Instead, we provide a utility function 
pg_get_parallel_safety('table_name') that returns records of (objid, classid, 
parallel_safety) that represent the parallel safety of objects that determine 
the parallel safety of the specified table.  The function only outputs objects 
that are not parallel safe.  Otherwise, it will consume excessive memory while 
accumulating the output.  The user can use this function to identify 
problematic objects when a parallel DML fails or is not parallelized in an 
expected manner.
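
A sketch of how the proposed function might be used (pg_get_parallel_safety() is part 
of this proposal and does not exist yet; the table name is a placeholder):

-- List the objects that are not parallel safe and therefore gate parallel DML
-- on this target table; an empty result means nothing known blocks parallelism.
SELECT * FROM pg_get_parallel_safety('orders');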

How does the executor detect parallel unsafe objects?  There are two ways:

1) At loading time
When the executor loads the 

RE: 2019-03 CF now in progress

2021-04-08 Thread tsunakawa.ta...@fujitsu.com
From: David Steele 
> Overall, 118 of 262 entries were closed during this commitfest (45%).
> That includes 91 patches committed since March 1, which is pretty
> fantastic. Thank you to everyone, especially the committers, for your
> hard work!

The number of committed patches in the last CF is record-breaking in recent 
years:

PG 14: 124
PG 13: 90
PG 12: 100
PG 11: 117
PG 10: 116
PG 9.6: 74
PG 9.5: 102

I take off my hat to the hard work of committers and CFM.  OTOH, there seems to 
be many useful-looking features pending as ready for committer.  I look forward 
to seeing them committed early in PG 15.


Regards
Takayuki Tsunakawa




RE: Stronger safeguard for archive recovery not to miss data

2021-04-05 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> I didn't see that, but found the following article.
> 
> https://stackoverflow.com/questions/2590794/gcov-warning-merge-mismat
> ch-for-summaries
...
> It seems like your working directory needs some cleanup.

That could very well be.  It'd be safer to run "make coverage-clean" between 
builds.

Alternatively, I thought that the TAP tests that run simultaneously might have 
conflicted on the .gcda file.  If this is the case, we may not be able to 
eliminate the failure possibility.  (Does make -j 1 circumvent this?)

https://www.postgresql.org/docs/devel/install-procedure.html

"If using GCC, all programs and libraries are compiled with code coverage 
testing instrumentation."

33.4. TAP Tests
https://www.postgresql.org/docs/devel/regress-tap.html

"Generically speaking, the TAP tests will test the executables in a 
previously-installed installation tree if you say make installcheck, or will 
build a new local installation tree from current sources if you say make check. 
In either case they will initialize a local instance (data directory) and 
transiently run a server in it. Some of these tests run more than one server."

"It's important to realize that the TAP tests will start test server(s) even 
when you say make installcheck; this is unlike the traditional non-TAP testing 
infrastructure, which expects to use an already-running test server in that 
case. Some PostgreSQL subdirectories contain both traditional-style and 
TAP-style tests, meaning that make installcheck will produce a mix of results 
from temporary servers and the already-running test server."



Regards
Takayuki Tsunakawa






RE: Stronger safeguard for archive recovery not to miss data

2021-04-05 Thread tsunakawa.ta...@fujitsu.com
From: osumi.takami...@fujitsu.com 
> By the way, when I build postgres with this patch and enable-coverage option,
> the results of RT becomes unstable. Does someone know the reason ?
> When it fails, I get stderr like below
> 
> t/001_start_stop.pl .. 10/24
> #   Failed test 'pg_ctl start: no stderr'
> #   at t/001_start_stop.pl line 48.
> #  got:
> 'profiling:/home/k5user/new_disk/recheck/PostgreSQL-Source-Dev/src/bac
> kend/executor/execMain.gcda:Merge mismatch for function 15
> # '
> # expected: ''
> t/001_start_stop.pl .. 24/24 # Looks like you failed 1 test of 24.
> t/001_start_stop.pl .. Dubious, test returned 1 (wstat 256, 0x100) Failed 1/24
> subtests
> 
> Similar phenomena was observed in [1] and its solution seems to upgrade my
> gcc higher than 7. And, I did so but still get this unstable error with
> enable-coverage. This didn't happen when I remove enable-option and the
> make check-world passes.

Can you share the steps you took?  e.g.,

$ configure --enable-coverage ...
$ make world
$ make check-world
$ patch -p1 < your_patch
$ make world
$ make check-world

A bit of Googling shows that the same error message has shown up in the tests 
of other software than Postgres.  It doesn't seem like this failure is due to 
your patch.



Regards
Takayuki Tsunakawa




RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2021-04-01 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> Attaching a small patch that emits a warning when the persistence setting of a
> partitioned table is changed and also added a note into the docs. Please have 
> a
> look at it.

Thank you.  However, I think I'll withdraw this CF entry since I'm not sure I 
can take time for the time being to research and fix other various ALTER 
TABLE/INDEX issues.  I'll mark this as withdrawn around the middle of next week 
unless someone wants to continue this.


Regards
Takayuki Tsunakawa




RE: Add client connection check during the execution of the query

2021-04-01 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> > Following PostmasterIsAlive(), maybe it's better to name it as
> pq_connection_is_alive() pq_client_is_alive(), or pq_frontend_is_alive() like
> the pqcomm.c's head comment uses the word frontend?
> 
> I think it's OK, because it matches the name of the GUC.  I'm more concerned
> about the name of the GUC.  Will we still be happy with this name if a future
> releases sends a heartbeat message?  I think that is still OK, so I'm happy 
> with
> these names for now, but if someone has a better name, please speak up very
> soon.

OK, agreed.


> > (4)
> > I think the new GUC works for walsender as well.  If so, how do we explain
> the relationship between the new GUC and wal_receiver_timeout and
> recommend the settings of them?
> 
> No, it only works while executing a query.  (Is there something in logical
> decoding, perhaps, that I have failed to consider?)

When I saw the following code, I thought the new GUC takes effect at the same 
time as wal_sender_timeout.  But they don't become effective simultaneously.  
The new GUC is effective before the logical replication starts.  And 
wal_sender_timeout takes effect after the physical/logical replication starts.  
So, there's no problem.

[PostgresMain]
if (am_walsender)
{
    if (!exec_replication_command(query_string))
        exec_simple_query(query_string);
}
else
    exec_simple_query(query_string);


The patch looks committable to me.


> PS The "from" headers in emails received from Fujitsu seems to have the
> names stripped, somewhere in the tubes of the internet.  I see the full 
> version
> when people from Fujitsu quote other people from Fujitsu.
> I copied one of those into the commit message, complete with its magnificent
> kanji characters (perhaps these are the cause of the filtering?), and I hope
> that's OK with you.

Certainly, it seems that only my email address appears in the sender field on 
the mailer.  I'll check my Outlook settings.


Regards
Takayuki Tsunakawa




RE: Add client connection check during the execution of the query

2021-04-01 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> I changed my mind.  Let's commit the pleasingly simple Linux-only feature for
> now, and extend to it to send some kind of no-op message in a later release.
> So this is the version I'd like to go with.
> Objections?

+1, as some of our users experienced the problem that the server kept 
processing (IIRC, a buggy PL/pgSQL procedure that loops indefinitely) after 
they killed the client.

TBH, Linux and Windows will be sufficient.  But I'm for providing a good 
feature on a specific OS first.


(1)
+   rc = poll(&pollfd, 1, 0);
+   if (rc < 0)
+   {
+       elog(COMMERROR, "could not poll socket: %m");
+       return false;

I think it's customary to use ereport() and errcode_for_socket_access().


(2)
pq_check_connection()

Following PostmasterIsAlive(), maybe it's better to name it as 
pq_connection_is_alive(), pq_client_is_alive(), or pq_frontend_is_alive() like 
the pqcomm.c's head comment uses the word frontend?


(3)
 #include 
+#include <poll.h>
 #ifndef WIN32
 #include 
 #endif

poll.h should be included between #ifndef WIN32 and #endif?


(4)
I think the new GUC works for walsender as well.  If so, how do we explain the 
relationship between the new GUC and wal_receiver_timeout and recommend the 
settings of them?


Regards
Takayuki Tsunakawa




RE: libpq debug log

2021-03-30 Thread tsunakawa.ta...@fujitsu.com
From: 'alvhe...@alvh.no-ip.org' 
> > got expected ERROR:  cannot insert multiple commands into a prepared
> statement
> > got expected division-by-zero
> 
> Eh?  this is not at all expected, of course, but clearly not PQtrace's
> fault.  I'll look tomorrow.

Iwata-san and I were starting to investigate the issue.  I guessed the first 
message was not expected.  Please let us know if there's something we can help.

(I was amazed that you finally committed this great feature, libpq pipeline, 
while you are caring about many patches.)


Regards
Takayuki Tsunakawa




RE: libpq debug log

2021-03-30 Thread tsunakawa.ta...@fujitsu.com
From: 'alvhe...@alvh.no-ip.org' 
> Okay, pushed this patch and the new testing for it based on
> libpq_pipeline.  We'll see how the buildfarm likes it.

Thank you very much!  I appreciate you taking your valuable time while I 
imagine you must be pretty busy with taking care of other (possibly more 
important) patches.

TBH, when Tom-san suggested a drastic change, I was afraid we might not be able to 
complete this in PG 14.  But in the end, I'm very happy that the patch has 
become much leaner and cleaner.

And congratulations on your first commit, Iwata-san!  I hope you can have time 
and energy to try adding a connection parameter to enable tracing, which 
eliminates application modification.


> I didn't like the idea of silently skipping the redacted fields, so I
> changed the code to print  or  instead.  I also made the
> redacting occur in the last mile (pqTraceOutputInt32 / String) rather
> that in their callers: it seemed quite odd to advance the cursor in the
> "else" branch.
> 
> I refactored the duplicate code that appeared for Notice and Error.
> In that function, we redact not only the 'L' field (what Iwata-san was
> doing) but also 'F' (file) and 'R' (routine) because those things can
> move around for reasons that are not interesting to testing this code.
> 
> In the libpq_pipeline commit I added 'pipeline_abort' and 'transaction'
> to the cases that generate traces, which adds coverage for
> NoticeResponse and ErrorResponse.

These make sense to me.  Thank you for repeatedly polishing and making the 
patch much better.



Regards
Takayuki Tsunakawa




RE: libpq debug log

2021-03-29 Thread tsunakawa.ta...@fujitsu.com
From: 'alvhe...@alvh.no-ip.org' 
> > > (Hey, what the heck is that "Z" at the end of the message?  I thought
> > > the error ended right at the \x00 ...)
> >
> > We'll investigate these issues.
> 
> For what it's worth, I did fix this problem in patch 0005 that I
> attached.  The problem was that one "continue" should have been "break",
> and also a "while ( .. )" needed to be made an infinite loop.  It was
> easy to catch these problems once I added (in 0006) the check that the
> bytes consumed equal message length, as I had suggested a couple of
> weeks ago :-)  I also changed the code for Notice, but I didn't actually
> verify that one.

You already fixed the issue, didn't you?  Thank you.  Your suggestion of checking 
the length has surely proved to be correct.


Regards
Takayuki Tsunakawa




RE: libpq debug log

2021-03-29 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> It's better to be short as far as it is clear enough. Actually '<' to
> 'F' and '>' to 'B' is clear enough to me. So I don't need a longer
> notation.

Agreed, because the message format description in the PG manual uses F and B.


Regards
Takayuki Tsunakawa






RE: libpq debug log

2021-03-28 Thread tsunakawa.ta...@fujitsu.com
From: alvhe...@alvh.no-ip.org 
> > Proposed changes on top of v29.
> 
> This last one uses libpq_pipeline -t and verifies the output against an 
> expected
> trace file.  Applies on top of all the previous patches.  I attach the whole 
> lot,
> so that the CF bot has a chance to run it.

Thank you for polishing the patch.

Iwata-san,
Please review Alvaro-san's code, and I think you can integrate all patches into 
one except for 0002 and 0007.  Those two patches may be separate or merged into 
one as a test patch.


> I did notice another problem for comparison of expected trace files, which is
> that RowDescription includes table OIDs for some columns.  I think we would
> need to have a flag to suppress that too, somehow, although the answer to what
> should we do is not as clear as for the other two cases.

I'm afraid this may render the test comparison almost impossible.  Tests that 
access system catalogs and large objects probably output OIDs.



Regards
Takayuki Tsunakawa




RE: libpq debug log

2021-03-28 Thread tsunakawa.ta...@fujitsu.com
From: alvhe...@alvh.no-ip.org 
> I added an option to the new libpq_pipeline program that it activates
> libpq trace.  It works nicely and I think we can add that to the
> regression tests.

That's good news.  Thank you.



> 1. The trace output for the error message won't be very nice, because it
> prints line numbers.  So if I want to match the output to an "expected"
> file, it would break every time somebody edits the source file where the
> error originates and the ereport() line is moved.  For example:

> (Hey, what the heck is that "Z" at the end of the message?  I thought
> the error ended right at the \x00 ...)

We'll investigate these issues.


> 2. The < and > characters are not good for visual inspection.  I
> replaced them with F and B and I think it's much clearer.  Compare:
> I think the second one is much easier on the eye.

Yes, agreed.  I too thought of something like "C->S" and "S->C" because client 
and server should be more familiar for users than frontend and backend.


Regards
Takayuki Tsunakawa




RE: Global snapshots

2021-03-24 Thread tsunakawa.ta...@fujitsu.com
From: Andrey V. Lepikhov 
> Current state of the patch set rebased on master, 5aed6a1fc2.
> 
> It is development version. Here some problems with visibility still detected 
> in
> two tests:
> 1. CSN Snapshot module - TAP test on time skew.
> 2. Clock SI implementation - TAP test on emulation of bank transaction.

I'm sorry to be late to respond.  Thank you for the update.

As discussed at the HighGo meeting, what do you think we should do about this 
patch set, now that we agreed that Clock-SI is covered by Microsoft's patent?  
I'd appreciate it if you could share some idea to change part of the algorithm 
and circumvent the patent.

Otherwise, why don't we discuss alternatives, such as the Commitment Ordering?

I have a hunch that YugabyteDB's method seems promising, which I wrote in the 
following wiki.  Of course, we should make efforts to see if it's patented 
before diving deeper into the design or implementation.

Scaleout Design - PostgreSQL wiki
https://wiki.postgresql.org/wiki/Scaleout_Design


Regards
Takayuki Tsunakawa




RE: Disable WAL logging to speed up data loading

2021-03-24 Thread tsunakawa.ta...@fujitsu.com
From: Stephen Frost 
> * tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote:
> > As Laurenz-san kindly replied, the database server refuses to start with a
> clear message.  So, it's similarly very clear what happens.  The user will 
> never
> unknowingly resume operation with possibly corrupt data.
> 
> No, instead they'll just have a completely broken system that has to be 
> rebuilt or
> restored from a backup.  That doesn't strike me as a good result.

Your understanding is correct.  What I wanted to confirm in my answer is that the 
behavior is clear: the server refuses to start, and the user knows he/she 
should restore the database from backup.


> > So, I understood the point boils down to elegance.  Could I ask what makes
> you feel ALTER TABLE UNLOGGED/LOGGED is (more) elegant?  I'm purely
> asking as a user.
> 
> The impact is localized to those specific tables.  The rest of the system 
> should
> come up cleanly and there won't be corruption, instead merely the lack of data
> in UNLOGGED tables.

So, I took your point to be about the ease and speed of restoring after a crash: the 
user just has to restore the lost table data using COPY FROM, from files that 
were saved before the data loading job using COPY TO.

In that sense, the backup and restoration of the whole database is an option 
for users when they have some instant backup and restore feature available.


> > (I don't want to digress, but if we consider the number of options for
> > wal_level as an issue, I feel it's not elegant to have separate
> > "replica" and "logical".)
> 
> Do you know of a way to avoid having those two distinct levels while still 
> writing
> only the WAL needed depending on if a system is doing logical replication or
> not..?  If you've got suggestions on how to eliminate one of those levels, I'm
> sure there would be interest in doing so.  I don't see the fact that we have
> those two levels as justification for adding another spelling of 'minimal'.

Sorry, I have almost no knowledge of logical replication implementation.  So, 
being ignorant of its intricacies, I have felt, as a user, "Why do I have to 
set wal_level = logical, because streaming replication and logical replication 
are both replication features?  If the implementation needs some additional WAL 
for logical replication, why doesn't the server automatically emit the WAL when 
the target table of DML statements is in a publication?"


> > The elegance of wal_level = none is that the user doesn't have to
> > remember to add ALTER TABLE to the data loading job when they add load
> > target tables/partitions.  If they build and use their own (shell)
> > scripts to load data, that won't be burdon or forgotten.  But what
> > would they have to do when they use ETL tools like Talend, Pentaho,
> > and Informatica Power Center?  Do those tools allow users to add
> > custom processing like ALTER TABLE to the data loading job steps for
> > each table?  (AFAIK, not.)
> 
> I don't buy the argument that having to 'remember' to do an ALTER TABLE is
> such a burden when it means that the database will still be consistent and
> operational after a crash.

That depends on whether an instant backup and restore feature is at hand.  If 
the user is comfortable with it, wal_level = none is easier and more 
attractive.  At least, I don't want the feature to be denied.


> As for data loading tools, surely they support loading data into UNLOGGED
> tables and it's certainly not hard to have a script run around and flip those
> tables to LOGGED after they're loaded, and I do actually believe some of those
> tools support building processes of which one step could be such a command
> (I'm fairly confident Pentaho, in particular, does as I remember building such
> pipelines myself...).

Oh, Pentaho has such a feature, doesn't it?  But isn't it a separate step from 
the data output step?  Here, I assume ETL tools allow users to compose a data 
loading job from multiple steps: data input, transformation, data output, etc.  
I guess the user can't directly incorporate ALTER TABLE into the data output 
step, and has to add separate custom steps for ALTER TABLE.  That's burdensome 
and easy to forget, I think.
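
For reference, the per-table sequence being debated, which an ETL job would have to 
wrap around its data-output step, is roughly the following (table and file names are 
placeholders):

ALTER TABLE sales_2021_03 SET UNLOGGED;   -- before the load
COPY sales_2021_03 FROM '/staging/sales_2021_03.csv' WITH (FORMAT csv);
ALTER TABLE sales_2021_03 SET LOGGED;     -- after the load; rewrites and WAL-logs the table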


Regards
Takayuki Tsunakawa






RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-22 Thread tsunakawa.ta...@fujitsu.com
From: Justin Pryzby 
> On Mon, Mar 22, 2021 at 08:18:56PM -0700, Zhihong Yu wrote:
> > with data_dest_cb callback. It is used for send text representation of a
> > tuple to a custom destination.
> >
> > send text -> sending text
> 
> I would say "It is used to send the text representation ..."

I took Justin-san's suggestion.  (It feels like I'm in a junior English 
class...)


> It's actually a pointer:
> src/include/commands/copy.h:typedef struct CopyToStateData *CopyToState;
> 
> There's many data structures like this, where a structure is typedefed with a
> "Data" suffix and the pointer is typedefed without the "Data"

Yes.  Thank you for good explanation, Justin-san.



Regards
Takayuki Tsunakawa




v22-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v22-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


RE: Disable WAL logging to speed up data loading

2021-03-22 Thread tsunakawa.ta...@fujitsu.com
From: Stephen Frost 
> First- what are you expecting would actually happen during crash recovery in
> this specific case with your proposed new WAL level?
...
> I'm not suggesting it's somehow more crash safe- but it's at least very clear
> what happens in such a case, to wit: the entire table is cleared on crash
> recovery.

As Laurenz-san kindly replied, the database server refuses to start with a 
clear message.  So, it's similarly very clear what happens.  The user will 
never unknowingly resume operation with possibly corrupt data.


> We're talking about two different ways to accomplish essentially the same
> thing- one which introduces a new WAL level, vs. one which adds an
> optimization for a WAL level we already have.  That the second is more elegant
> is more-or-less entirely the point I'm making here, so it seems pretty 
> relevant.

So, I understood the point boils down to elegance.  Could I ask what makes you 
feel ALTER TABLE UNLOGGED/LOGGED is (more) elegant?  I'm purely asking as a 
user.

(I don't want to digress, but if we consider the number of options for 
wal_level as an issue, I feel it's not elegant to have separate "replica" and 
"logical".)


> Under the proposed 'none', you basically have to throw out the entire cluster 
> on
> a crash, all because you don't want to use 'UNLOGGED' when you created the
> tables you want to load data into, or 'TRUNCATE' them in the transaction where
> you start the data load, either of which gives us enough indication and which
> we have infrastructure around dealing with in the event of a crash during the
> load without everything else having to be tossed and everything restored from 
> a
> backup.  That's both a better user experience from the perspective of having
> fewer WAL levels to understand and from just a general administration
> perspective so you don't have to go all the way back to a backup to bring the
> system back up.

The elegance of wal_level = none is that the user doesn't have to remember to 
add ALTER TABLE to the data loading job when they add load target 
tables/partitions.  If they build and use their own (shell) scripts to load 
data, that won't be a burden or get forgotten.  But what would they have to do when 
they use ETL tools like Talend, Pentaho, and Informatica Power Center?  Do 
those tools allow users to add custom processing like ALTER TABLE to the data 
loading job steps for each table?  (AFAIK, not.)

wal_level = none is convenient and attractive for users who can backup and 
restore the entire database instantly with a storage or filesystem snapshot 
feature.


Regards
Takayuki Tsunakawa






RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-22 Thread tsunakawa.ta...@fujitsu.com
From: Andrey Lepikhov 
> Macros _() at the postgresExecForeignCopy routine:
> if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0)
> 
> uses gettext. Under linux it is compiled ok, because (as i understood)
> uses standard implementation of gettext:
> objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext'
> gettext@@GLIBC_2.2.5
> 
> but in MacOS (and maybe somewhere else) we need to explicitly link
> libintl library in the Makefile:
> SHLIB_LINK += $(filter -lintl, $(LIBS))
> 
> Also, we may not use gettext at all in this part of the code.

I'm afraid so, because no extension in contrib/ has a po/ directory.  I just 
removed _() and rebased the patch on HEAD.


Regards
Takayuki Tsunakawa




v21-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v21-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


RE: Disable WAL logging to speed up data loading

2021-03-21 Thread tsunakawa.ta...@fujitsu.com
From: Stephen Frost 
> The argument here seems to stem from the idea that issueing a 'TRUNCATE'
> inside the transaction before starting the 'COPY' command is 'too hard'.


> I could be wrong and perhaps opinions will change in the future, but it really
> doesn't seem like the there's much support for adding a new WAL level just to
> avoid doing the TRUNCATE.  Appealing to what other database systems
> support can be helpful in some cases but that's usually when we're looking at 
> a
> new capability which multiple other systems have implemented.  This isn't
> actually a new capability at all- the WAL level that you're looking for 
> already
> exists and already gives the optimization you're looking for, as long as you 
> issue
> a TRUNCATE at the start of the transaction.

No, we can't ask users to use TRUNCATE, because they want to add data to an existing table.


Regards
Takayuki Tsunakawa






RE: Disable WAL logging to speed up data loading

2021-03-19 Thread tsunakawa.ta...@fujitsu.com
From: Laurenz Albe 
> Now I'm not saying that this feature should not go in (I set it to
> "ready for committer", because I see no technical flaw with the
> implementation), but it remains debatable if we want the feature or not.

Oh, yes, thank you very much for supporting this and other relevant two threads!


> I certainly can see David's point of view.  And we don't view MySQL as
> a role model that we want to emulate.

Yes, what MySQL was over ten years ago would not be a role model for us.  OTOH, 
recent MySQL under Oracle has been improving a lot -- adopting InnoDB as the 
default storage engine, a transactional data dictionary, etc.  (Somewhat 
offtopic, but their documentation quality is great.)


> All these things are annoying to users, but we'd rather take that than
> the complaints that a database got corrupted because somebody didn't read
> the documentation carefully.

Hmm, if that were the case, then some people would say the unlogged-table based 
approach is also dangerous, saying "Users don't read the manual carefully 
and easily think that making all tables unlogged is good for performance."


Regards
Takayuki Tsunakawa



RE: libpq debug log

2021-03-19 Thread tsunakawa.ta...@fujitsu.com
Alvaro-san, Iwata-san,
cc: Tom-san, Horiguchi-san,


Thank you, it finally looks complete to me!

Alvaro-san,
We appreciate your help and patience, sometimes rewriting large part of the 
patch.  Could you do the final check (and possibly tweak) for commit?



Regards
Takayuki Tsunakawa





RE: Disable WAL logging to speed up data loading

2021-03-19 Thread tsunakawa.ta...@fujitsu.com
From: David Steele 
> After reading through the thread (but not reading the patch) I am -1 on
> this proposal.
> 
> The feature seems ripe for abuse and misunderstanding, and as has been
> noted in the thread, there are a variety of alternatives that can
> provide a similar effect.
> 
> It doesn't help that at several points along the way new WAL records
> have been found that still need to be included even when wal_level =
> none. It's not clear to me how we know when we have found them all.
> 
> The patch is marked Ready for Committer but as far as I can see there
> are no committers in favor of it and quite a few who are not.

I can understand that people are worried about not having WAL.  But as far as I 
remember, I'm afraid those concerns were emotional rather than logical, i.e., 
something like "something may happen."  As for the concrete concerns that 
Stephen-san, Magnus-san, Horiguchi-san, Sawada-san and others raised, Osumi-san 
addressed them based on their advice and review, both in this thread and other 
threads.

I also understand that we want to respect people's feelings and keep PostgreSQL 
worry-free.  At the same time, I'd like our feeling to be understood as well: we 
want Postgres to have this convenient, easy-to-use feature.  MySQL recently 
introduced it.  Why can't Postgres?


> Perhaps it would be better to look at some of the more targeted
> approaches mentioned in the thread and see if any of them can be
> used/improved to achieve the desired result?

Other methods are not as easy to use, and they are more complex to implement.

What fate does this type of feature usually meet?


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-18 Thread tsunakawa.ta...@fujitsu.com
Hello Iwata-san,


Thanks to removing the static arrays, the code got much smaller.  I'm happy 
with this result.


(1)
+  ( for messages from client to server,
+  and  for messages from server to client),

I think this was meant to say "> or <".  So, maybe you should remove "," at the 
end of the first line, and replace "and" with "or".


(2)
+   case 8 :/* GSSENCRequest or SSLRequest */
+   /* These messsage does not reach here. */

messsage does -> messages do


(3)
+   fprintf(f, "\tAuthentication");

Why don't you move this \t in each message type to the end of:

+   fprintf(conn->Pfdebug, "%s\t%s\t%d", timestr, prefix, length);

Plus, don't we say in the manual that fields (timestamp, direction, length, 
message type, and message content) are separated by a tab?
Also, the code doesn't seem to separate the message type and content with a tab.


(4)
Where did the processing for unknown message go in pqTraceOutputMessage()?  Add 
default label?


(5)
+   case 16:/* CancelRequest */
+   fprintf(conn->Pfdebug, "%s\t>\t%d\tCancelRequest", 
timestr, length);

I think this should follow pqTraceOutputMessage().  That is, each case label 
should only print the message type name, in case other message types are added in the 
future.



Regards
Takayuki Tsunakawa






RE: libpq debug log

2021-03-18 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> + pqTraceOutputR(const char *message, FILE *f)
> + {
> + int cursor = 0;
> +
> + pqTraceOutputInt32(message, &cursor, f);
> 
> I don't understand the reason for splitting message and &cursor here.
> 
> + pqTraceOutputR(const char *message, FILE *f)
> + {
> + char *p = message;
> +
> + pqTraceOutputInt32(&p, f);
> 
> works well.

Yes, that would also work.  But I like the separate cursor + fixed starting 
point here, probably because it's sometimes confusing to see the pointer value 
changed inside functions.  (And some people are allergic to pointers themselves, 
not to mention a pointer to a pointer...)  Also, libpq uses cursors for its 
network I/O buffers.  So, I think the patch can stay as it is now.


> +static void
> +pqTraceOutputT(const char *message, int end, FILE *f)
> +{
> + int cursor = 0;
> + int nfields;
> + int i;
> +
> + nfields = pqTraceOutputInt16(message, &cursor, f);
> +
> + for (i = 0; i < nfields; i++)
> + {
> + pqTraceOutputString(message, &cursor, end, f);
> + pqTraceOutputInt32(message, &cursor, f);
> + pqTraceOutputInt16(message, &cursor, f);
> + pqTraceOutputInt32(message, &cursor, f);
> + pqTraceOutputInt16(message, &cursor, f);
> + pqTraceOutputInt32(message, &cursor, f);
> + pqTraceOutputInt16(message, &cursor, f);
> + }
> +}
> 
> I didn't looked closer, but lookong the usage of the variable "end",
> something's wrong in the function.

Ah, end doesn't serve any purpose.  I guess the compiler emitted a warning "end 
is not used".  I think end can be removed.



Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-18 Thread tsunakawa.ta...@fujitsu.com
From: Iwata, Aya/岩田 彩 
> > Yes, precisely, 2 bytes for the double quotes needs to be subtracted
> > as
> > follows:
> >
> > len = fprintf(...);
> > *cursor += (len - 2);
> 
> Thank you for your advice. I changed pqTraceOutputString set cursor to fprintf
> return -2.
> And I removed cursor movement from that function.

Ouch, not 2 but 3, to include a single whitespace at the beginning.
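To spell out the arithmetic (a sketch only; pfdebug/data/cursor are the names used in 
the patch, and the leading space plus the two double quotes are just the decoration 
currently printed around the string):

	len = fprintf(pfdebug, " \"%s\"", data + cursor);
	/*
	 * len - 3 = length of the string itself (drop the space and the two quotes);
	 * + 1 skips the string's null terminator in the message.
	 */
	cursor += (len - 3) + 1;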

The rest looks good.  I hope we're almost at the finish line.


Regards
Takayuki Tsunakawa




RE: libpq debug log

2021-03-17 Thread tsunakawa.ta...@fujitsu.com
From: Alvaro Herrera 
> In pqTraceOutputString(), you can use the return value from fprintf to
> move the cursor -- no need to count chars.

Yes, precisely; the 2 bytes for the double quotes need to be subtracted as follows:

len = fprintf(...);
*cursor += (len - 2);


> I still think that the message-type specific functions should print the
> message type, rather than having the string arrays.

I sort of agree, in order to get rid of the big arrays.  But to minimize duplicated 
code, I think the code structure will look like this:

fprintf(timestamp, length);
switch (type)
{
case '?':
pqTraceOutput?(...);
break;
case '?':
/* No message content */
fprintf("message_type_name");
break;
}   

void
pqTraceOutput?(...)
{
fprintf("message_type_name");
print message content;  
}

The order of length and message type is reversed.  The .sgml should also be 
changed accordingly.  What do you think?  (I can understand that removing the 
clumsy static arrays matters more than the order of the output fields.)
Iwata-san, 
Why don't you submit v27 patch with the current arrays kept, and then v28 with 
the arrays removed soon after?


From: Kyotaro Horiguchi 
> It would help when the value is "255", which is confusing between -1
> (or 255) in byte or 255 in 2-byte word. Or when the value looks like
> broken. I'd suggest "b"yte for 1 byte, "s"hort for 2 bytes, "l"ong for
> 4 bytes ('l' is confusing with '1', but anyway it is not needed)..

I don't have a strong opinion on this.  (I kind of think I want to see 
unprefixed numbers; readers will look at the protocol reference anyway.)  I'd 
like to leave this up to Iwata-san and Alvaro-san.


Regards
Takayuki Tsunakawa



RE: libpq debug log

2021-03-17 Thread tsunakawa.ta...@fujitsu.com
I'll look at the comments from Alvaro-san and Horiguchi-san.  Here are my 
review comments:


(23)
+   /* Trace message only when there is first 1 byte */
+   if (conn->Pfdebug && conn->outCount < conn->outMsgStart)
+   {
+   if (conn->outCount < conn->outMsgStart)
+   pqTraceOutputMessage(conn, conn->outBuffer + 
conn->outCount, true);
+   else
+   pqTraceOutputNoTypeByteMessage(conn,
+   
conn->outBuffer + conn->outMsgStart);
+   }

The inner else path doesn't seem to be reached because both the outer and inner 
if contain the same condition.  I think you want to remove the second condition 
from the outer if.


(24) pqTraceOutputNoTypeByteMessage
+   case 16:/* CancelRequest */
+   fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", 
timestr, length);
+   pqTraceOutputInt32(message, &cursor, conn->Pfdebug);
+   pqTraceOutputInt32(message, &cursor, conn->Pfdebug);
+   break;

Another Int32 field needs to be output, as follows:

--
Int32(80877102)
The cancel request code. The value is chosen to contain 1234 in the most 
significant 16 bits, and 5678 in the least significant 16 bits. (To avoid 
confusion, this code must not be the same as any protocol version number.)

Int32
The process ID of the target backend.

Int32
The secret key for the target backend.
--
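
So I'd expect the tracing to cover three Int32 fields after the length, roughly like 
this (a sketch in the style of the current patch; where exactly the cursor starts is 
up to the patch):

	case 16:	/* CancelRequest */
		fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", timestr, length);
		pqTraceOutputInt32(message, &cursor, conn->Pfdebug);	/* cancel request code (80877102) */
		pqTraceOutputInt32(message, &cursor, conn->Pfdebug);	/* process ID of the target backend */
		pqTraceOutputInt32(message, &cursor, conn->Pfdebug);	/* secret key */
		break;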


(25)
+   case 8 :/* GSSENRequest or SSLRequest */

GSSENRequest -> GSSENCRequest


(26)
+static void
+pqTraceOutputByte1(const char *data, int *cursor, FILE *pfdebug)
+{
+   const char *v = data + *cursor;
+   /*

+static void
+pqTraceOutputf(const char *message, int end, FILE *f)
+{
+   int cursor = 0;
+   pqTraceOutputString(message, &cursor, end, f);
+}

Put an empty line to separate the declaration part and execution part.


(27)
+   const char  *message_type = "UnkownMessage";
+
+   id = message[LogCursor++];
+
+   memcpy(&length, message + LogCursor, 4);
+   length = (int) pg_ntoh32(length);
+   LogCursor += 4;
+   LogEnd = length - 4;

+   /* Get a protocol type from first byte identifier */
+   if (toServer &&
+   id < lengthof(protocol_message_type_f) &&
+   protocol_message_type_f[(unsigned char)id] != NULL)
+   message_type = protocol_message_type_f[(unsigned char)id];
+   else if (!toServer &&
+   id < lengthof(protocol_message_type_b) &&
+   protocol_message_type_b[(unsigned char)id] != NULL)
+   message_type = protocol_message_type_b[(unsigned char)id];
+   else
+   {
+   fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+   return;
+   }
+

The initial value "UnkownMessage" is not used.  So, you can initialize 
message_type with NULL and do something like this:

+if (...)
+   ...
+   else if (...)
+   ...
+
+   if (message_type == NULL)
+   {
+   fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+   return;
+

Plus, I think this should be done before looking at the message length.


(28)
pqTraceOutputBinary() is only used in pqTraceOutputNchar().  Do they have to be 
separated?


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-16 Thread tsunakawa.ta...@fujitsu.com
Alvaro-san,


Thank you for taking the time to look at an incomplete patch.  I had been 
planning to ask you for a final check before commit after Iwata-san reflected my 
review comments.

I discussed your comments below with Iwata-san.  Let me convey her opinions.  
(She is now focusing on fixing the patch.)  We'd appreciate it if you could tweak 
her completed patch.


From: Alvaro Herrera 
> * There's no cross-check that the protocol message length word
>   corresponds to what we actually print.  I think one easy way to
>   cross-check that would be to return the "count" from the type-specific
>   routines, and verify that it matches "length" in pqTraceOutputMessage.
>   (If the separate routines are removed and the code put back in one
>   giant switch, then the "count" is already available when the switch
>   ends.)

We don't think the length check is necessary, because 1) for FE->BE, correct 
messages are always assembled, and 2) for BE->FE, the parsing and decomposition 
of messages have already checked the messages.


> * If we have separate functions for each message type, then we can have
>   that function supply the message name by itself, rather than have the
>   separate protocol_message_type_f / _b arrays that print it.

I felt those two arrays were clumsy and considered suggesting their removal.  
But I didn't, because the functions or case labels for each message type would have to 
duplicate the printing of header fields: timestamp, message type, and length.  
Maybe we can change the order of output to "timestamp length message_type 
content", but I kind of prefer the current order.  What do you think?  (I can 
understand removing the clumsy static arrays should be better than the order of 
output fields.)


> * If we make each type-specific function print the message name, then we
>   need to make the same function print the message length, because it
>   comes after.  So the function would have to receive the length (so
>   that it can be printed).  But I think we should still have the
>   cross-check I mentioned in my first point occur in the main
>   pqTraceOutputMessage, not the type-specific function, for fear that we
>   will forget the cross-check in one of the functions and never realize
>   that we did.

As mentioned above, we think the current structure is better for smaller and 
more readable code.


> * I would make the pqTraceOutputInt16() function and siblings advance
>   the cursor themselves, actually.  I think something like this:
>   static int
>   pqTraceOutputInt16(const char *data, int *cursor, FILE *pfdebug)
>   {
>   uint16  tmp;
>   int   result;
> 
>   memcpy(&tmp, data + *cursor, 2);
> *cursor += 2;
> result = (int) pg_ntoh16(tmp);
> fprintf(pfdebug, " #%d", result);
> 
> return result;
>   }
>   (So the caller has to pass the original "data" pointer, not
>   "data+cursor").  This means the caller no longer has to do "cursor+=N"
>   at each place.  Also, you get rid of the moveStrCursor() which does
>   not look good.

That makes sense, because in fact the patch mistakenly added 4 when it should 
have added 2.  Also, the code would become smaller.


> * I'm not fond of the idea of prefixing "#" for 16bit ints and no
>   prefix for 32bit ints.  Seems obscure and the output looks weird.
>   I would use a one-letter prefix for each type, "w" for 32-bit ints
>   (stands for "word") and "h" for 16-bit ints (stands for "half-word").
>   Message length goes unadorned.  Then you'd have lines like this
> 
> 2021-03-15 08:10:44.324296  <   RowDescription  35 h1 "spcoptions"
> w1213 h5 w1009 h65535 w-1 h0
> 2021-03-15 08:10:44.324335  <   DataRow 32 h1 w22
> '{random_page_cost=3.0}'

Yes, actually I felt something similar.  On second thought, I think we don't 
need any prefix, because it doesn't help users.  So we're removing the prefix.  
We don't have a strong opinion on adding one.


> * I don't like that pqTraceOutputS/H print nothing when !toServer.  I
> think they should complain.

Yes, the caller should not call the function if there's no message content.  
That way, the function doesn't need the toServer argument.


Regards
Takayuki Tsunakawa



RE: libpq debug log

2021-03-16 Thread tsunakawa.ta...@fujitsu.com
I've finished the review.  Here are the last set of comments.


(16)
(15) is also true for the processing of 'H' message.


(17) pqTraceOutputD
+   for (i = 0; i < nfields; i++)
+   {
+   len = pqTraceOutputInt32(message + cursor, f);
+   cursor += 4;
+   if (len == -1)
+   break;
+   pqTraceOutputNchar(message + cursor, f, len);
+   cursor += len;
+   }

Not break but continue, because non-NULL columns may follow a NULL column.
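That is, the loop should look roughly like this (same names as the patch):

	for (i = 0; i < nfields; i++)
	{
		len = pqTraceOutputInt32(message + cursor, f);
		cursor += 4;
		if (len == -1)
			continue;	/* NULL column: no value bytes follow, but later columns may be non-NULL */
		pqTraceOutputNchar(message + cursor, f, len);
		cursor += len;
	}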


(18)
+   case 'E':   /* Execute(F) or Error Return(B) */
+   pqTraceOutputE(message + LogCursor, LogEnd, 
conn->Pfdebug, toServer);
+   break;

Error Return -> ErrorResponse


(19) pqTraceOutputF
Check the protocol again.  Two for loops should be required, one for format 
codes and another for arguments.


(20)
+   if ( len != -1)

Remove extra space before len.


(21)
I guess you intentionally omitted the following messages because they are only 
used during connection establishment.  I'm fine with it.  I wrote this just in 
case you missed them.

GSSENCRequest (F)
Int32(8)

GSSResponse (F)
Byte1('p')

PasswordMessage (F)
Byte1('p')

SASLInitialResponse (F)
Byte1('p')

SASLResponse (F)
Byte1('p')


(22)
+/* NotificationResponse */
+static void
+pqTraceOutputA(const char *message, int end, FILE *f)
+{
+   int cursor = 0;
+
+   pqTraceOutputInt16(message + cursor, f);
+   cursor += 2;
+   pqTraceOutputString(message + cursor, f);

Not Int16 but Int32.


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-16 Thread tsunakawa.ta...@fujitsu.com
I've not finished reviewing yet, but there seem to be many mistakes.  I'm 
sending a second set of review comments now so you can fix them in parallel.


(8)
+   charid = '\0';

This initialization is not required because id will always be assigned a value 
shortly.


(9)
+static int
+pqTraceOutputInt32(const char *data, FILE *pfdebug)
+{
+   int result;
+
+   memcpy(&result, data, 4);
+   result = (int) pg_ntoh32(result);
+   fputc(' ', pfdebug);
+   fprintf(pfdebug, "%d", result);
+
+   return result;
+}

fputc() and fprintf() can be merged into one fprintf() call for efficiency.

TBH, it feels strange that an "output" function returns a value.  But I 
understood this as a positive compromise for reducing code; without this, the 
caller has to do memcpy() and endianness conversion.


(10)
+/* BackendKeyData */
+static void
+pqTraceOutputK(const char *message, FILE *f)
+{
+   int cursor = 0;
+
+   pqTraceOutputInt32(message + cursor, f);
+   cursor += 4;
+   pqTraceOutputInt32(message + cursor, f);
+}

I don't think you need to always use a cursor variable in order to align with 
more complex messages.  That is, you can just write:

+   pqTraceOutputInt32(message, f);
+   pqTraceOutputInt32(message + 4, f);


(11)
+   default:
+   fprintf(conn->Pfdebug, "Invalid Protocol:%c\n", id);
+   break;
+

(This is related to (7))
You can remove this default label if you exit the function before the switch 
statement when the message type is unknown.  Make sure to output \n in that 
case.


(12) pqTraceOutputB
+   for (i = 0; i < nparams; i++)
+   {
+   nbytes = pqTraceOutputInt32(message + cursor, f);
+   cursor += 4;
+   if (nbytes == -1)
+   break;
+   pqTraceOutputNchar(message + cursor, f, nbytes);
+   cursor += nbytes;
+   }

Not break but continue, because non-NULL parameters may follow a NULL parameter.


(13) pqTraceOutputB
+   pqTraceOutputInt16(message + cursor, f);
+   cursor += 4;
+   pqTraceOutputInt16(message + cursor, f);
+}

This part doesn't seem to match the following description.

--
After the last parameter, the following fields appear:

Int16
The number of result-column format codes that follow (denoted R below). This 
can be zero to indicate that there are no result columns or that the result 
columns should all use the default format (text); or one, in which case the 
specified format code is applied to all result columns (if any); or it can 
equal the actual number of result columns of the query.

Int16[R]
The result-column format codes. Each must presently be zero (text) or one 
(binary).
--


(14)
The processing for CancelRequest message is missing?


(15)
+/* CopyInResponse */
+static void
+pqTraceOutputG(const char *message, int end, FILE *f)
+{
+   int cursor = 0;
+
+   pqTraceOutputByte1(message + cursor, f);
+   cursor++;
+
+   while (end > cursor)
+   {
+   pqTraceOutputInt16(message + cursor, f);
+   cursor += 2;
+   }
+}
+

According to the following description, a for loop should be used.

--
Int16
The number of columns in the data to be copied (denoted N below).

Int16[N]
The format codes to be used for each column. Each must presently be zero (text) 
or one (binary). All must be zero if the overall copy format is textual.
--
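
Something along these lines (a sketch; this assumes pqTraceOutputInt16 returns the 
decoded value, as pqTraceOutputInt32 does in the patch):

	pqTraceOutputByte1(message + cursor, f);		/* overall COPY format: 0=text, 1=binary */
	cursor++;
	nfields = pqTraceOutputInt16(message + cursor, f);	/* number of columns N */
	cursor += 2;
	for (i = 0; i < nfields; i++)
	{
		pqTraceOutputInt16(message + cursor, f);	/* per-column format code */
		cursor += 2;
	}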




Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-15 Thread tsunakawa.ta...@fujitsu.com
I'm looking at the last file libpq-trace.c.  I'll continue the review after 
lunch.  Below are some comments so far.


(1)
-  Enables  tracing of the client/server communication to a debugging file 
stream.
+  Enables tracing of the client/server communication to a debugging file
+  stream.
+  Only available when protocol version 3 or higher is used.

The last line is unnecessary now that v2 is not supported.


(2)
@@ -113,6 +114,7 @@ install: all installdirs install-lib
$(INSTALL_DATA) $(srcdir)/libpq-fe.h '$(DESTDIR)$(includedir)'
$(INSTALL_DATA) $(srcdir)/libpq-events.h '$(DESTDIR)$(includedir)'
$(INSTALL_DATA) $(srcdir)/libpq-int.h '$(DESTDIR)$(includedir_internal)'
+   $(INSTALL_DATA) $(srcdir)/libpq-trace.h 
'$(DESTDIR)$(includedir_internal)'
$(INSTALL_DATA) $(srcdir)/pqexpbuffer.h 
'$(DESTDIR)$(includedir_internal)'
$(INSTALL_DATA) $(srcdir)/pg_service.conf.sample 
'$(DESTDIR)$(datadir)/pg_service.conf.sample'

Why is this necessary?


(3) libpq-trace.h
+#ifdef __cplusplus
+extern "C"
+{

This is unnecessary because pqTraceOutputMessage() is for libpq's internal use. 
 This extern is for the user's C++ application to call public libpq functions.

+#include "libpq-fe.h"
+#include "libpq-int.h"

I think these can be removed by declaring the struct and function as follows:

struct pg_conn;
extern void pqTraceOutputMessage(struct pg_conn *conn, const char *message, 
bool toServer);

But... after doing the above, only this declaration is left in libpq-trace.h.  
Why don't you put your original declaration using PGconn in libpq-int.h and 
remove libpq-trace.h?


(4)
+   /* Trace message only when there is first 1 byte */
+   if (conn->Pfdebug && (conn->outCount < conn->outMsgStart))
+   pqTraceOutputMessage(conn, conn->outBuffer + conn->outCount, 
true);

() surrounding the condition after && can be removed.


(5)
+static const char *pqGetProtocolMsgType(unsigned char c,
+   
bool toServer);

This is unnecessary because the function definition precedes its only call 
site.


(6)
+ * We accumulate frontend message pieces in an array as the libpq code writes
+ * them, and log the complete message when pqTraceOutputFeMsg is called.
+ * For backend, we print the pieces as soon as we receive them from the server.

The first paragraph is no longer true.  I think this entire comment is 
unnecessary.


(7)
+static const char *
+pqGetProtocolMsgType(unsigned char c, bool toServer)
+{
+   if (toServer == true &&
+   c < lengthof(protocol_message_type_f) &&
+   protocol_message_type_f[c] != NULL)
+   return protocol_message_type_f[c];
+
+   if (toServer == false &&
+   c < lengthof(protocol_message_type_b) &&
+   protocol_message_type_b[c] != NULL)
+   return protocol_message_type_b[c];
+
+   return "UnknownMessage:";
+}

"== true/false" can be removed.  libpq doesn't appear to use such style.

Why don't we embed this processing in pqTraceOutputMessage(), since this 
function is short and called only once?  An added benefit is that you can print 
an invalid message type byte like this:

fprintf(..., "Unknown message: %02x\n", ...);


Regards
Takayuki Tsunakawa





RE: Enhance traceability of wal_level changes for backup management

2021-03-11 Thread tsunakawa.ta...@fujitsu.com
From: David Steele 
> As a backup software author, I don't see this feature as very useful.
> 
> The problem is that there are lots of ways for WAL to go missing so
> monitoring the WAL archive for gaps is essential and this feature would
> not replace that requirement. The only extra information you'd get is
> the ability to classify the most recent gap as "intentional", maybe.

But how do you know there's any missing WAL?  I think there are the following 
cases of missing WAL:

1. A WAL segment file is missing.  e.g., 0001 and 0003 exist, but 
0002 doesn't.

2. All consecutive WAL segment files appear to exist, but some WAL records are 
missing.
This occurs ?only? when some WAL-optimized statements are run while wal_level = 
minimal.

Currently, backup management tools can detect 1 by scanning through the WAL 
archive directory.  But they can't notice 2.  The patch addresses this.

This is what some other people suggested in the thread for wal_level = none 
that Osumi-san referred to at the beginning of this thread.  I don't think this 
detection feature is a blocker for wal_level = none.  So, I think this can 
be withdrawn if wal_level = none is accepted.


Regards
Takayuki Tsunakawa





RE: [Patch] Optimize dropping of relation buffers using dlist

2021-03-11 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> About the patch, it would be better to change the type of
> BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current
> value
> doesn't harm.

OK, attached, to be prepared for the distant future when NBuffers becomes 
64-bit.


Regards
Takayuki Tsunakawa



v2-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch
Description:  v2-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch


RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-11 Thread tsunakawa.ta...@fujitsu.com
From: Justin Pryzby 
> Could you rebase again and send an updated patch ?
> I could do it if you want.

Rebased and attached.  Fortunately, there was no rebase conflict this time.  
make check passed for PG core and postgres_fdw.


Regards
Takayuki Tsunakawa



v20-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v20-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2021-03-11 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> > uint64
> 
> +1

Thank you, the patch is attached (we tend to forget how large our world is... 
64-bit).  We're sorry to have caused you trouble.


Regards
Takayuki Tsunakawa



v1-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch
Description:  v1-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2021-03-11 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> On Fri, Mar 12, 2021 at 5:20 PM Amit Kapila  wrote:
> > uint64
> 
> +1

+1
I'll send a patch later.


Regards
Takayuki Tsunakawa



RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> Now, coming back to Hou-San's patch to introduce a GUC and reloption
> for this feature, I think both of those make sense to me because when
> the feature is enabled via GUC, one might want to disable it for
> partitioned tables? Do we agree on that part or someone thinks
> otherwise?
> 
> The other points to bikeshed could be:
> 1. The name of GUC and reloption. The two proposals at hand are
> enable_parallel_dml and enable_parallel_insert. I would prefer the
> second (enable_parallel_insert) because updates/deletes might not have
> a similar overhead.
> 
> 2. Should we keep the default value of GUC to on or off? It is
> currently off. I am fine keeping it off for this release and we can
> always turn it on in the later releases if required. Having said that,
> I see the value in keeping it on because in many cases Insert ...
> Select will be used for large data and there we will see a benefit of
> parallelism and users facing trouble (who have a very large number of
> partitions with less data to query) can still disable the parallel
> insert for that particular table. Also, the other benefit of keeping
> it on till at least the beta period is that this functionality will
> get tested and if we found reports of regression then we can turn it
> off for this release as well.
> 
> Thoughts?


TBH, I'm a bit unsure, but the names with _insert would be OK.

The reason why Oracle has the ENABLE PARALLEL DML clause, with parallel DML 
disabled by default, is probably the following critical restriction (they also 
have other, less severe restrictions on parallel execution).  Our implementation 
does not have anything like this.

"Serial or parallel statements that attempt to access a table that has been 
modified in parallel within the same transaction are rejected with an error 
message."

"A transaction can contain multiple parallel DML statements that modify 
different tables, but after a parallel DML statement modifies a table, no 
subsequent serial or parallel statement (DML or query) can access the same 
table again in that transaction."


OTOH, I'm a bit concerned about whether we would have a common reason to 
disable parallel INSERT, UPDATE and DELETE.  Oracle states space usage 
difference as follows.  I wonder if something similar would apply to our 
parallel UPDATE/DELETE, particularly UPDATE.  At least, we already know 
parallel INSERT results in larger tables and indexes compared to serial 
execution.

"Parallel UPDATE uses the existing free space in the object, while direct-path 
INSERT gets new extents for the data."

"Space usage characteristics may be different in parallel than serial execution 
because multiple concurrent child transactions modify the object."

Even if that's the case, we can add _update parameters, although it is unfortunate 
that users would then have to be aware of multiple parameters.

Or, when it turns out that we need _update and/or _delete parameters, we 
can opt to rename _insert to _dml, and keep the _insert parameter as an old, 
hidden synonym.


Regards
Takayuki Tsunakawa



RE: libpq debug log

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> Right. So something like this?
> 
> unsigned char p;
> 
> p = buf + *cursor;
> result = (uint32) (*p << 24) + (*(p + 1)) << 16 + ...);

Yes, that would work (if p is a pointer), but I think memcpy() is enough, as 
pqGetInt() does.


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> The output functions copy message bytes into local variable but the
> same effect can be obtained by just casing via typed pointer type.
> 
>   uint32 tmp4;
>   ..
>   memcpy(&tmp4, buf + *cursor, 4);
>   result = (int) pg_ntoh32(tmp4);
> 
> can be written as
> 
>   result = pg_ntoh32(* (uint32 *) (buf + *cursor));

I'm afraid we need to memcpy() because of memory alignment.

> I think we can get rid of copying in the output functions for String
> and Bytes in different ways.

I haven't looked at this code, but you sound right.


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
Alvaro-san, Tom-san, Horiguchi-san,

From: Kyotaro Horiguchi 
> +1 for the thanks for the quick work.  I have some random comments
> after a quick look on it.

Thank you very much for the many comments.  And we're sorry to have caused you 
trouble.  Yesterday I told Iwata-san to modify the patch to use the logging 
function interface that Tom-san, Alvaro-san and I agreed on, said that I'd review 
it after the new revision was posted, and meant to let others know that she is 
modifying the patch again so that they wouldn't start reviewing it.  But I should 
have made that announcement here on -hackers.

We're catching up with your feedback and will post the updated patch.  Then I'll 
review it.

We appreciate your help so much.


Regards
Takayuki Tsunakawa





RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> After some more on how to support parallel insert into fk relation.
> It seems we do not have a cheap way to implement this feature.
> 
> In RI_FKey_check, Currently, postgres execute "select xx for key share" to
> check that foreign key exists in PK table.
> However "select for update/share" is considered as parallel unsafe. It may be
> dangerous to do this in parallel mode, we may want to change this.

Hmm, I guess the parallel leader and workers can execute SELECT FOR KEY SHARE, 
if the parallelism infrastructure allows execution of SPI calls.  The lock 
manager supports tuple locking in the parallel leader and workers through group 
locking.  Also, tuple locking doesn't require combo CIDs, which are necessary 
for parallel UPDATE and DELETE.

Perhaps the reason why SELECT FOR UPDATE/SHARE is treated as parallel-unsafe is 
that tuple locking modifies data pages to store lock information in the tuple 
header.  But now, page modification is possible in parallel processes, so I 
think we can consider SELECT FOR UPDATE/SHARE parallel-safe.  (I may be too optimistic.)


> And also, "select for update/share" is considered as "not read only" which 
> will
> force readonly = false in _SPI_execute_plan.

read_only is used to do CCI.  Can we arrange to skip CCI?


> At the same time, " simplifying foreign key/RI checks " thread is trying to
> replace "select xx for key share" with index_beginscan()+table_tuple_lock() (I
> think it’s parallel safe).
> May be we can try to support parallel insert fk relation after " simplifying 
> foreign
> key/RI checks " patch applied ?

Why do you think it's parallel safe?

Can you try running parallel INSERT SELECT on the target table with FK and see 
if any problem happens?

If some problem occurs due to the tuple locking, I think we can work around it 
by avoiding tuple locking.  That is, we make parallel INSERT SELECT lock the 
parent tables in exclusive mode so that the check tuples won't be deleted.  
Some people may not like this, but it's worth considering because parallel 
INSERT SELECT would not have to be run concurrently with short OLTP 
transactions.  Anyway, tuple locking partly disturbs parallel INSERT speedup 
because it modifies pages in the parent tables and emits WAL.

Surprisingly, Oracle doesn't support parallel INSERT SELECT on a table with FK 
as follows.  SQL Server doesn't mention anything, so I guess it's supported.  
This is a good chance for PostgreSQL to exceed Oracle.

https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D4CFC1F2-44D3-4BE3-B5ED-6A309EB8BF06

Table 8-1 Referential Integrity Restrictions
DML Statement   Issued on ParentIssued on Child Self-Referential
INSERT  (Not applicable)Not parallelizedNot parallelized


Regards
Takayuki Tsunakawa





RE: POC: Cleaning up orphaned files using undo logs

2021-03-09 Thread tsunakawa.ta...@fujitsu.com
I'm crawling like a snail through the patch set.  Below is my first set of 
review comments, which are all minor.


(1)
+ 
tablespacetemporary

temporary -> undo


(2)
  undo_tablespaces (string)
+
...
+The value is a list of names of tablespaces.  When there is more than
+one name in the list, PostgreSQL chooses an
+arbitrary one.  If the name doesn't correspond to an existing
+tablespace, the next name is tried, and so on until all names have
+been tried.  If no valid tablespace is specified, an error is raised.
+The validation of the name doesn't happen until the first attempt to
+write undo data.

CREATE privilege needs to be mentioned like temp_tablespaces.


(3)
+The variable can only be changed before the first statement is
+executed in a transaction.

Does it include any first statement that doesn't emit undo?

(4)
+  One row for each undo log, showing current pointers,
+   transactions and backends.
+   See  for details.

I think this can just be written like "showing usage information about the undo 
log" just like other statistics views.  That way, we can avoid having to modify 
this sentence when we want to change the content of the view later.


(5)
+ discard
+ text
+ Location of the oldest data in this undo log.

The name does not match the description intuitively.  Maybe "oldest"?

BTW, how does this information help users?  (I don't mean to say we shouldn't 
output information that users cannot interpret; other DBMSs output such 
information probably for technical support staff.)


(6)
+ insert
+ text
+ Location where the next data will be written in this undo
+  log.
...
+ end
+ text
+ Location one byte past the end of the allocated physical storage
+  backing this undo log.

Again, how can these be used?  If they are useful to calculate the amount of 
used space, shouldn't they be bigint?


(7)
@@ -65,7 +65,7 @@
smgrid integer
 
 
- Block storage manager ID.  0 for regular relation data.
+ Block storage manager ID.  0 for regular relation data.
 
  

I guess this change is mistakenly included?


(8)
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
@@ -216,6 +216,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 

@@ -286,6 +286,7 @@



+   


It looks like this list needs to be ordered alphabetically.  So, the new line 
is better placed between pg_test_timing and pg_upgrade?


(9)
I don't want to be disliked because of being picky, but maybe pg_undo_dump 
should be pg_undodump.  Existing commands don't use '_' to separate words after 
pg_, except for pg_test_fsync and pg_test_timing.


(10)
+   This utility can only be run by the user who installed the server, because
+   it requires read-only access to the data directory.

I guess you copied this from pg_waldump or pg_resetwal, but I'm afraid this 
should be as follows, which is an excerpt from pg_controldata's page.  (The 
pages for pg_waldump and pg_resetwal should be fixed in a separate thread.)

This utility can only be run by the user who initialized the cluster because it 
requires read access to the data directory.


(11)
+The -m option cannot be used if
+either -c or -l is used.

-l -> -r
Or, why don't we align option characters with pg_waldump?  pg_waldump uses -r 
to filter by rmgr.  pg_undodump can output record contents by default like 
pg_waldump.  Considering pg_dump and pg_dumpall also output all data by 
default, that seems how PostgreSQL commands behave.


(12)
+   startsegendseg

startseg and endseg are not described.


(13)
+Undo files backing undo logs in the default tablespace are stored under
...
+Undo log files contain standard page headers as described in the next section,

The terminology fluctuates between "undo file" and "undo log file".  I think 
"undo data file", as used below, fits best.  What do you think?

+  UndoFileRead
+  Waiting for a read from an undo data file.


(14)
+Undo data exists in a 64-bit address space divided into 2^34 undo
+logs, each with a theoretical capacity of 1TB.  The first time a
+backend writes undo, it attaches to an existing undo log whose
+capacity is not yet exhausted and which is not currently being used by
+any other backend; or if no suitable undo log already exists, it
+creates a new one.  To avoid wasting space, each undo log is further
+divided into 1MB segment files, so that segments which are no longer
+needed can be removed (possibly recycling the underlying file by
+renaming it) and segments which are not yet needed do not need to be
+physically created on disk.  An undo segment file has a name like
+04.000120, where
+04 is the undo log number and
+000120 is the offset of the first byte
+held in the file.

The number of undo logs is not 2^34 but 2^24 (2^64 / 2^40, since each log covers 1 TB).


(15) 

RE: [PoC] Non-volatile WAL buffer

2021-03-08 Thread tsunakawa.ta...@fujitsu.com
From: Takashi Menjo 
> > The other question is whether simply placing WAL on DAX (without any
> > code changes) is safe. If it's not, then all the "speedups" are
> > computed with respect to unsafe configuration and so are useless. And
> > BTT should be used instead, which would of course produce very different
> results.
> 
> I think it's safe, thanks to the checksum in the header of WAL record (xl_crc 
> in
> struct XLogRecord). In DAX mode, user data (WAL record
> here) is written to the PMEM device by a smaller unit (probably a byte or a
> cache line) than the traditional 512-byte disk sector. So a torn-write such 
> that
> "some bytes in a sector persist, other bytes not"
> can occur when crash. AFAICS, however, the checksum for WAL records can
> also support such a torn-write case.

I'm afraid I may be misunderstanding, so let me ask a naive question.

I understood "simply placing WAL on DAX (without any code changes)" means 
placing WAL files on DAX-aware filesystems such as ext4 and xfs, withoug 
modifying Postgres source code.  That is, use the PMEM as a high performance 
storage device.  Is this correct?

Second, does it what you represented as "master" in your test results?

I'd simply like to know what percentage of performance improvement we can 
expect by utilizing PMDK and modifying Postgres source code, and how much 
improvement we consider worthwhile.


Regards
Takayuki Tsunakawa





RE: Implementing Incremental View Maintenance

2021-03-08 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> It's probably time to move forward with the plan of pushing the
> results into a commitfest.postgresql.org API, and then making Magnus
> et al write the email spam code with a preferences screen linked to
> your community account :-D

+1
I wish to see all the patch status information on the CF app.


Regards
Takayuki Tsunakawa



RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-07 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> For now, I have left 0005 and 0006 patches, we can come back to those once we
> are done with the first set of patches. The first patch looks good to me and I
> think we can commit it and then bikeshed about GUC/reloption patch.

Agreed, it looks good to me, too.


Regards
Takayuki Tsunakawa



RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-05 Thread tsunakawa.ta...@fujitsu.com
From: Justin Pryzby 
> I think this change to the regression tests is suspicous:
> 
> -CONTEXT:  remote SQL command: INSERT INTO public.loc2(f1, f2) VALUES
> ($1, $2)
> -COPY rem2, line 1: "-1 xyzzy"
> +CONTEXT:  COPY loc2, line 1: "-1   xyzzy"
> +remote SQL command: COPY public.loc2(f1, f2) FROM STDIN
> +COPY rem2, line 2
> 
> I think it shouldn't say "COPY rem2, line 2" but rather a remote version of 
> the
> same:
> |COPY loc2, line 1: "-1   xyzzy"

No, the output is OK.  The remote message is included as the first line of the 
CONTEXT message field.  The last line of the CONTEXT field is something that 
was added by the local COPY command.  (Anyway, useful enough information is 
included in the message -- the constraint violation and the data that caused 
it.)


> I have rebased this on my side over yesterday's libpq changes - I'll send it 
> if
> you want, but it's probably just as easy if you do it.

I've managed to rebase it, although it took unexpectedly long.  The patch is 
attached.  It passes make check against core and postgres_fdw.  I'll turn the 
CF status back to ready for committer shortly.



Regards
Takayuki Tsunakawa



v19-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v19-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-04 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> From the wiki[1], CCI is to let statements can not see the rows they modify.
> 
> Here is an example of the case 1):
> (Note table referenced and referencing are both empty)
> -
> postgres=# with cte as (insert into referenced values(1)) insert into
> referencing values(1);
> -
> The INSERT here will first modify the referenced table(pk table) and then
> modify the referencing table.
> When modifying the referencing table, it has to check if the tuple to be 
> insert
> exists in referenced table.
> At this moment, CCI can let the modification on referenced in WITH visible
> when check the existence of the tuple.
> If we do not CCI here, postgres will report an constraint error because it 
> can not
> find the tuple in referenced table.

Ah, got it.  Thank you.  I'd regret it if Postgres had to give up parallel 
execution because of SQL statements like the above, which don't seem to be commonly used.


> >  I'm worried about having this dependency in RI check, because the planner
> may allow parallel INSERT in these cases in the future.
> 
> If we support parallel insert that can have other modifications in the 
> future, I
> think we will also be able to share or increment command ID in parallel wokers
> in that time.
> And it seems we can remove this dependency in that time.
> How about add some more comments about this to remind future developer.
> /*
>  * If extra parallel modification is support in the future, this dependency 
> should
> be removed.
>  */

Agreed.

I'm excited to see PostgreSQL's parallel DML work in wider use cases and 
satisfy users' expectations.


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-04 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> But I think passing the message start address explicitly might be better than
> having it understand the buffering behavior in enough detail to know where to
> find the message.  Part of the point here
> (IMO) is to decouple the tracing logic from the core libpq logic, in hopes of 
> not
> having common-mode bugs.

Ouch, you're perfectly right.  Then let's make the signature:

void pqLogMessage(PGconn *conn, const char *message, bool toServer);

Thank you!


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-04 Thread tsunakawa.ta...@fujitsu.com
From: tsunakawa.ta...@fujitsu.com 
> I understood that the former is pqParseInput3() and the latter is
> pqPutMsgEnd().  They call the logging function when conn->pfdebug is not
> NULL.  Its signature is like this (that will be defined in libpq-trace.c):
> 
> void pqLogMessage(const char *message, bool toServer);

Oops, the logging facility needs the destination (conn->pfdebug).  So, it will 
be:

void pqLogMessage(PGconn *conn, bool toServer);

The function should know where to extract the message from.


Regards
Takayuki Tsunakawa





RE: libpq debug log

2021-03-04 Thread tsunakawa.ta...@fujitsu.com
Tom-san, Alvaro-san,


From: Tom Lane 
> I took a quick look through the v22 patch, and TBH I don't like much of 
> anything
> at all about the proposed architecture.  It's retained most of the flavor of 
> the
> way it was done before, which was a hangover from the old process-on-the-fly
> scheme.

Yes, the patch just follows the current style of interspersing.  (But some 
places are removed so it's a bit cleaner.)

Anyway, I think your suggestion would be better, resulting in a much cleaner 
separation of message handling and logging.


> I think the right way to do this, now that we've killed off v2 protocol 
> support, is to
> rely on the fact that libpq fully buffers both incoming and outgoing messages.
> We should nuke every last bit of the existing code that intersperses tracing 
> logic
> with normal message decoding and construction, and instead have two tracing
> entry points:

FWIW, I was surprised and glad when I saw the commit message to remove the v2 
protocol.


> (1) Log a received message.  This is called as soon as we know (from the
> length word) that we have the whole message available, and it's just passed a
> pointer into the input buffer.  It should examine the input message for itself
> and print the details.
> 
> (2) Log a message-to-be-sent.  Again, call it as soon as we've constructed a
> complete message in the output buffer, and let it examine and print that
> message for itself.

I understood that the former is pqParseInput3() and the latter is 
pqPutMsgEnd().  They call the logging function when conn->pfdebug is not NULL.  
Its signature is like this (that will be defined in libpq-trace.c):

void pqLogMessage(const char *message, bool toServer);

The message argument points to the message type byte.  toServer is true for 
messages sent to the server.  The signature could alternatively be one of the 
following, but this is a matter of taste:

void pqLogMessage(char message_type, const char *message, bool toServer);
void pqLogMessage(char message_type, int length, const char *message, bool 
toServer);

This function simply outputs the timestamp, message direction, message type, 
length, and message contents.  The output processing of message contents 
branches on a message type with a switch-case statement.  Perhaps the output 
processing of each message should be separated into its own function like 
pqLogMessageD() for 'D' message so that the indentation won't get deep in the 
main logging function.
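
Structurally, I'm imagining something like this (a rough sketch only, using the 
PGconn-taking signature settled on upthread; pqLogMessageD and the exact header 
format are just placeholders, and messages without a type byte would need separate 
handling):

void
pqLogMessage(PGconn *conn, const char *message, bool toServer)
{
	char		type = message[0];
	uint32		length;

	memcpy(&length, message + 1, 4);
	length = pg_ntoh32(length);

	/* common header: timestamp (omitted in this sketch), direction, length */
	fprintf(conn->Pfdebug, "%c\t%u", toServer ? '>' : '<', length);

	switch (type)
	{
		case 'D':		/* DataRow */
			pqLogMessageD(conn, message + 5, (int) length - 4);
			break;
		/* ... one case per message type ... */
		default:
			fprintf(conn->Pfdebug, "\tUnknownMessage: %02x", (unsigned char) type);
			break;
	}
	fputc('\n', conn->Pfdebug);
}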


> Both of these pieces of logic could be written directly from the protocol
> specification, without any interaction with the main libpq code paths, which
> would be a Good Thing IMO.  The current intertwined approach is somewhere
> between useless and counterproductive if you're in need of tracing down a
> libpq bug.  (In other words, I'm suggesting that the apparent need for
> duplicate logic would be good not bad, and indeed that it'd be best to write 
> the
> tracing logic without consulting the existing libpq code at all.)

Yes, the only thing I'm concerned about is having two places that have to be 
aware of the message format -- message encoding/decoding and logging.  But the 
cleaner separation would be worth it.

Alvaro-san, what do you think?  Anyway, we'll soon post the updated patch that 
fixes the repeated ParseComplete issue based on the current patch.  After that, 
we'll send a patch based on Tom-san's idea.


Regards
Takayuki Tsunakawa





RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-03 Thread tsunakawa.ta...@fujitsu.com
From: Zhihong Yu  
> This feature enables bulk COPY into foreign table in the case of
> multi inserts is possible
> 
> 'is possible' -> 'if possible'
> 
> FDWAPI was extended by next routines:
> 
> next routines -> the following routines

Thank you, fixed slightly differently.  (I feel the need for pgEnglish again.)


> +   if ((!OK && PQresultStatus(res) != PGRES_FATAL_ERROR) ||
> 
> Is PGRES_FATAL_ERROR handled somewhere else ? I don't seem to find that in 
> the patch.

Good catch.  ok doesn't need to be consulted here, because failure during row 
transmission causes PQputCopyEnd() to receive non-NULL for its second argument, 
which in turn makes PQgetResult() return non-COMMAND_OK.
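
For reference, the end-of-COPY sequence is roughly the following (my own simplified 
illustration, not the exact patch code; the pgfdw_report_error() calls and the sql 
variable are just placeholders for the error-reporting path):

	/* ok is false when a local error occurred while sending rows */
	if (PQputCopyEnd(conn, ok ? NULL : "canceled by server") <= 0)
		pgfdw_report_error(ERROR, NULL, conn, false, sql);	/* could not even send end-of-COPY */

	res = PQgetResult(conn);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		pgfdw_report_error(ERROR, res, conn, true, sql);	/* covers both the forced cancel and remote errors */
	PQclear(res);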


Regards
Takayuki Tsunakawa



v18-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v18-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-03 Thread tsunakawa.ta...@fujitsu.com
Excuse me for asking probably stupid questions...

From: houzj.f...@fujitsu.com 
> As far as I can see, it’s only necessary to increment command id when the
> INSERT command modified the referenced table.

Why do we have to increment the command ID when the INSERT's target table is a 
referenced table?


> And INSERT command only have one target table, the modification on other
> tables can happen in the following cases.
> 
> 1) has modifyingcte which modifies the referenced table
> 2) has modifying function which modifies the referenced table.
> (If I missed something please let me know)

Also, why do we need CCI in these cases?  What kind of problem would happen if 
we don't do CCI?


> Since the above two cases are not supported in parallel mode(parallel unsafe).
> IMO, It seems it’s not necessary to increment command id in parallel mode, we
> can just skip commandCounterIncrement when in parallel mode.
> 
> + /*
> +  * We do not need to increment the command counter
> +  * in parallel mode, because any other modifications
> +  * other than the insert event itself are parallel unsafe.
> +  * So, there is no chance to modify the pk relation.
> +  */
> + if (IsInParallelMode())
> + needCCI = false;

I'm worried about having this dependency in RI check, because the planner may 
allow parallel INSERT in these cases in the future.


Regards
Takayuki Tsunakawa




