Hi Kristian!
On Mon, Jul 15, 2019 at 4:00 PM Kristian Nielsen
wrote:
>
> sachin.set...@mariadb.com writes:
>
> > revision-id: 7cabdc461b24fdebe599799d7964efa4b53815e3
> > (mariadb-10.1.39-91-g7cabdc461b2)
> >
> > MDEV-6860 Parallel async replication hangs on a Galera node
> >
> > Wait for previous commit before preparing next transaction for galera
>
> > diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> > index 8fef2d66635..7d38c36b840 100644
> > --- a/sql/rpl_parallel.cc
> > +++ b/sql/rpl_parallel.cc
> > @@ -1181,7 +1181,7 @@ handle_rpl_parallel_thread(void *arg)
> >before, then wait now for the prior transaction to complete its
> >commit.
> > */
> > -if (rgi->speculation == rpl_group_info::SPECULATE_WAIT &&
> > +if ((rgi->speculation == rpl_group_info::SPECULATE_WAIT ||
> > WSREP_ON) &&
> > (err= thd->wait_for_prior_commit()))
>
> Ouch! That's killing _all_ parallel replication when WSREP_ON :-/
>
> Do you really need to do this? It seems quite a restriction for replicating
> to Galera if parallel replication is not allowed.
>
Actually this was just a temporary fix; it has not been reviewed by Andrei.
> (I wonder if this isn't just another symptom of the underlying problem that
> Galera has never been integrated properly into MariaDB and the group commit
> algorithm / transaction master?).
So the actual issue is that Galera sends the event (the writeset) during the
prepare phase, and this creates a deadlock.
For example, let us consider the replication setup A -> B <==> C (A -> B is
optimistic parallel replication; B and C are Galera cluster nodes).
Let's assume 2 inserts (T1 with gtid x-x-1 and T2 with gtid x-x-2) from master A
arrive at slave B.
The 2nd insert prepares faster than the 1st insert, so it has already sent the
writeset to node C. Now it is in the queue, waiting for its turn to commit.
Meanwhile the first insert does its prepare on Galera
(wsrep_run_wsrep_commit), but it is stuck because transaction T2 still
hasn't run post_commit on Galera,
so the Galera state is still S_WAITING.
T2 can't run post_commit on Galera because it is waiting for T1 to commit,
and T1 can't commit because it is waiting in the prepare stage for
transaction T2 to clear the Galera state.
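
To make the cycle above explicit, here is a minimal sketch in C++ that models
the two waits as a wait-for graph and checks it for a cycle. It is only an
illustration, not server code: the names WaitGraph, build_waits, has_deadlock
and check_scenarios are hypothetical, and the wait_before_prepare flag is an
assumption about what the temporary patch effectively does (T2 waits for T1's
commit before it prepares, so it never blocks T1 on the Galera side).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Wait-for graph: an edge "X -> Y" means transaction X cannot proceed
// until transaction Y makes progress.
using WaitGraph = std::map<std::string, std::vector<std::string>>;

// Hypothetical model of T1 (x-x-1) and T2 (x-x-2) on node B.
// wait_before_prepare == true models the patched behaviour (assumption):
// T2 waits for T1's commit before it enters the Galera prepare step.
WaitGraph build_waits(bool wait_before_prepare) {
    WaitGraph g;
    // Commit ordering on the slave: T2 may only commit after T1.
    g["T2"].push_back("T1");
    if (!wait_before_prepare) {
        // T2 prepared first and has not run post_commit yet, so T1's
        // prepare on Galera is stuck behind T2.
        g["T1"].push_back("T2");
    }
    return g;
}

// Depth-first search; returns true if a cycle is reachable from `node`.
static bool dfs(const WaitGraph& g, const std::string& node,
                std::set<std::string>& on_path, std::set<std::string>& done) {
    if (on_path.count(node)) return true;   // back edge, so there is a cycle
    if (done.count(node)) return false;     // already explored, no cycle here
    on_path.insert(node);
    auto it = g.find(node);
    if (it != g.end()) {
        for (const auto& next : it->second) {
            if (dfs(g, next, on_path, done)) return true;
        }
    }
    on_path.erase(node);
    done.insert(node);
    return false;
}

// A deadlock in this model is simply a cycle in the wait-for graph.
bool has_deadlock(const WaitGraph& g) {
    std::set<std::string> on_path, done;
    for (const auto& kv : g) {
        if (dfs(g, kv.first, on_path, done)) return true;
    }
    return false;
}

// Usage: the unpatched ordering has a cycle, the patched ordering does not.
void check_scenarios() {
    assert(has_deadlock(build_waits(false)));
    assert(!has_deadlock(build_waits(true)));
}
```

The model only has the two edges described above, so it says nothing about
how much parallelism is lost by the extra wait; it just shows why removing
the T1 -> T2 edge breaks the cycle.
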
Backtrace from gdb
Gtid_seq_no= 2
Thread 34 (Thread 0x7fcd966d2700 (LWP 23891)):
#0 0x7fcda6d56415 in pthread_cond_wait@@GLIBC_2.3.2 () from
/usr/lib/libpthread.so.0
#1 0x5569d607d380 in safe_cond_wait (cond=0x7fcd854078e8,
mp=0x7fcd85407838, file=0x5569d6240360
"/home/sachin/10.1/server/include/mysql/psi/mysql_thread.h",
line=1154) at /home/sachin/10.1/server/mysys/thr_mutex.c:493
#2 0x5569d5aec4d0 in inline_mysql_cond_wait (that=0x7fcd854078e8,
mutex=0x7fcd85407838, src_file=0x5569d6240cb8
"/home/sachin/10.1/server/sql/log.cc", src_line=7387) at
/home/sachin/10.1/server/include/mysql/psi/mysql_thread.h:1154
#3 0x5569d5afeee5 in MYSQL_BIN_LOG::queue_for_group_commit
(this=0x5569d692d7c0 , orig_entry=0x7fcd966cf440) at
/home/sachin/10.1/server/sql/log.cc:7387
#4 0x5569d5aff5c9 in
MYSQL_BIN_LOG::write_transaction_to_binlog_events (this=0x5569d692d7c0
, entry=0x7fcd966cf440) at
/home/sachin/10.1/server/sql/log.cc:7607
#5 0x5569d5afecff in MYSQL_BIN_LOG::write_transaction_to_binlog
(this=0x5569d692d7c0 , thd=0x7fcd84c068b0,
cache_mngr=0x7fcd84c72c70, end_ev=0x7fcd966cf5e0, all=true,
using_stmt_cache=true, using_trx_cache=true) at
/home/sachin/10.1/server/sql/log.cc:7290
#6 0x5569d5af0ce6 in binlog_flush_cache (thd=0x7fcd84c068b0,
cache_mngr=0x7fcd84c72c70, end_ev=0x7fcd966cf5e0, all=true,
using_stmt=true, using_trx=true) at
/home/sachin/10.1/server/sql/log.cc:1751
#7 0x5569d5af11bb in binlog_commit_flush_xid_caches
(thd=0x7fcd84c068b0, cache_mngr=0x7fcd84c72c70, all=true, xid=2) at
/home/sachin/10.1/server/sql/log.cc:1859
#8 0x5569d5b045c8 in MYSQL_BIN_LOG::log_and_order
(this=0x5569d692d7c0 , thd=0x7fcd84c068b0, xid=2,
all=true, need_prepare_ordered=false, need_commit_ordered=true) at
/home/sachin/10.1/server/sql/log.cc:9575
#9 0x5569d5a1ec0d in ha_commit_trans (thd=0x7fcd84c068b0,
all=true) at /home/sachin/10.1/server/sql/handler.cc:1497
#10 0x5569d5925e7e in trans_commit (thd=0x7fcd84c068b0) at
/home/sachin/10.1/server/sql/transaction.cc:235
#11 0x5569d5b1b1fa in Xid_log_event::do_apply_event
(this=0x7fcd8542a770, rgi=0x7fcd85407800) at
/home/sachin/10.1/server/sql/log_event.cc:7720
#12 0x5569d5743fa1 in Log_event::apply_event (this=0x7fcd8542a770,
rgi=0x7fcd85407800) at /home/sachin/10.1/server/sql/log_event.h:1343
#13 0x5569d573987e in apply_event_and_update_pos_apply
(ev=0x7fcd8542a770, thd=0x7fcd84c068b0, rgi=0x7fcd85407800, reason=0)
at /home/sachin/10.1/server/sql/slave.cc:3479
#14 0x5569d5739deb in apply_event_and_update_pos_for_parallel
(ev=0x7fcd8542a770, thd=0x7fcd84c068b0, rgi=0x7fcd85407800) at
/home/sachin/10.1/server/sql/slave.cc:3623
#15 0x5569d597bfbe in rpt_handle_event (qev=0x7fcd85424770,
rpt=0x7fcd85421c88) at /home/sachin/10.1/server/sql/rpl_parallel.cc:50
#16 0x00