Re: [Maria-developers] set_time() on slave and unversioned -> versioned replication

2019-04-04 Thread Sergei Golubchik
Hi, Aleksey!

On Apr 04, Aleksey Midenkov wrote:
> On Thu, Apr 4, 2019 at 5:08 PM Sergei Golubchik  wrote:
> 
> > > > And I didn't want to force the master to include microseconds in
> > > > every single event for every single user just in case someone
> > > > would decide to do unversioned->versioned replication.
> >
> But is it really an issue: do you know setups where replication
> communication is a bottleneck?

It's few percent increase of the binlog size. Not much.

> > And it's a general principle - there will be definitely less than 1% of
> > users, who will use this. Less than 0.1% too. Most probably, less than
> > 0.01%. So, the remaining 99.99% should not pay the price for it.
> >
> Btw, it would be good to see the stats. We have some feedback plugin
> that does the job, don't we?

Yes.

> > > > Also, I thought that processing of 100 Query_log_event's in
> > > > a second is not realistic.
> > >
> > > Now it fails just with several events. I guess because
> > > system_time.sec_part is not reset to 0 initially.
> >
> > You said yourself that is is reset:
> >
> >   start_time_sec_part= system_time.sec_part= 0;
> >
> > Initially it's reset in THD::THD() too:
> >
> >   system_time.start.val= system_time.sec= system_time.sec_part= 0;
> >
> It is synchronized with hardware clock on each set_start_time().

It must be a bug. Hardware clock shouldn't overwrite
the counter as it comes from the slave.

Regards,
Sergei
Chief Architect MariaDB
and secur...@mariadb.org

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


[Maria-developers] MDEV-18822: possible slowdown with optimizer trace

2019-04-04 Thread Sergey Petrunia
Hi Varun,

So I was thinking of the impact of the optimizer trace, and decided to look at
how the most common check "if optimizer trace is enabled" is done.

The code at SQL layer does this kind of check 

  if (thd->trace_started())

(it is wrapped somewhere with "unlikely" but this is not consistent)


THD::trace_started is defined as:


bool THD::trace_started()
{
  return opt_trace.is_started();
}

which opt_trace here is Opt_trace_context. looking there:

class Opt_trace_context 
{
  ...
  bool is_started()
  {
return current_trace && is_enabled();
  }

following:

bool Opt_trace_context::is_enabled()
{
  if (current_trace)
return current_trace->is_enabled();
  return false;
}

current_trace is Opt_trace_stmt, looking there:

bool Opt_trace_stmt::is_enabled()
{
  return I_S_disabled == 0;
}

This is quite convoluted, and could probably made shorter.

Disassembling sql_select.cc, I can see that the calls to thd->trace_started()
are not inlined:

   if (unlikely(thd->trace_started()))
   10ac9:   4c 89 ffmovrdi,r15
   10acc:   e8 00 00 00 00  call   10ad1 
<_ZL32best_extension_by_limited_searchP4JOINyjddjjj+0x1a1>
   10ad1:   84 c0   test   al,al
   10ad3:   0f 85 fc 02 00 00   jne10dd5 
<_ZL32best_extension_by_limited_searchP4JOINyjddjjj+0x4a5>


if (unlikely(thd->trace_started()))
   1197b:   4c 89 f7movrdi,r14
   1197e:   e8 00 00 00 00  call   11983 
<_Z11choose_planP4JOINy+0x8a3>
   11983:   84 c0   test   al,al
   11985:   0f 85 2f 01 00 00   jne11aba 
<_Z11choose_planP4JOINy+0x9da>

  if (thd->trace_started())
   2d6f7:   48 8b bd b0 fc ff ffmovrdi,QWORD PTR [rbp-0x350]
   2d6fe:   e8 00 00 00 00  call   2d703 
<_ZN4JOIN14optimize_innerEv+0x1963>
   2d703:   84 c0   test   al,al
   2d705:   0f 85 6c 09 00 00   jne2e077 
<_ZN4JOIN14optimize_innerEv+0x22d7>

if (thd->trace_started())
   2d7da:   48 8b bd b0 fc ff ffmovrdi,QWORD PTR [rbp-0x350]
   2d7e1:   e8 00 00 00 00  call   2d7e6 
<_ZN4JOIN14optimize_innerEv+0x1a46>
   2d7e6:   84 c0   test   al,al
   2d7e8:   0f 85 ff 1a 00 00   jne2f2ed 
<_ZN4JOIN14optimize_innerEv+0x354d>

But does it all matter? 

I think the action plan should be:

1. Reproduce the slowdown as it was reported in MDEV-18822. 
2. If #1 succeeds in several re-runs, declare THD::trace_started as:

class THD {
public:
  bool trace_started; // initialize this to false in constructor and never 
touch it 
  inline bool trace_started() const { return trace_started; }
};
 
and then use search-replace to change "thd->trace_started()"  to 
"unlikely(thd->trace_started)" 

and then re-run the benchmark.

Any thoughts?

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog



___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] set_time() on slave and unversioned -> versioned replication

2019-04-04 Thread Aleksey Midenkov
On Thu, Apr 4, 2019 at 5:54 PM Aleksey Midenkov  wrote:

>
> But it can't recover correct statements order anyway. The statements came
> from many master threads to single slave thread in some arbitrary order.
> What is the point in ordering them at slave end?
>
>
This is probably incorrect. The order is predefined, so last event is most
important. We just need to overwrite history in case of conflict.

-- 
All the best,

Aleksey Midenkov
@midenok
___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] set_time() on slave and unversioned -> versioned replication

2019-04-04 Thread Aleksey Midenkov
On Thu, Apr 4, 2019 at 5:08 PM Sergei Golubchik  wrote:

> Hi, Aleksey!
>
> On Apr 04, Aleksey Midenkov wrote:
> > On Thu, Apr 4, 2019 at 2:43 PM Sergei Golubchik 
> wrote:
> > >
> > If there is an installation from unversioned 5.3 to versioned 10.3 we
> > can warn user about lost microseconds. This is minor issue since such
> > setups are rare, I guess. But for what microseconds are sent in 5.3?
>
> When it's needed for replication. Say, in
>
>   INSERT t1 VALUES (NOW(6));
>
> thd->query_start_sec_part() sets the flag.
>
> > > And I didn't want to force the master to include microseconds in
> > > every single event for every single user just in case someone would
> > > decide to do unversioned->versioned replication.
> >
> > If it's critical, this can be configured. But is it really performance
> > issue?
>
> A small performance and storage issue. At least 3 bytes per event.
>
>
But is it really an issue: do you know setups where replication
communication is a bottleneck?


> And it's a general principle - there will be definitely less than 1% of
> users, who will use this. Less than 0.1% too. Most probably, less than
> 0.01%. So, the remaining 99.99% should not pay the price for it.
>
>
Btw, it would be good to see the stats. We have some feedback plugin that
does the job, don't we?


> > > Also, I thought that processing of 100 Query_log_event's in a
> > > second is not realistic.
> >
> > Now it fails just with several events. I guess because
> > system_time.sec_part is not reset to 0 initially.
>
> You said yourself that is is reset:
>
>   start_time_sec_part= system_time.sec_part= 0;
>
> Initially it's reset in THD::THD() too:
>
>   system_time.start.val= system_time.sec= system_time.sec_part= 0;
>
>
It is synchronized with hardware clock on each set_start_time().


> > > But now I see some issues with that. One can freeze the time on the
> > > master with 'SET TIMESTAMP' and send an arbitrary number of events
> > > with the same timestamp.
> > >
> > > Or one can generate a query event that includes microseconds, to
> > > force the slave to count not from X.0, but from, say, X.98.
> > >
> > > So, a wraparound is possible and we need some fix for it.
> >
> > Looks like a lots of complications for a minor issue. Other DBMS don't
> > use microseconds for System Versioning at all. I guess user should
> > cope either with sent microseconds unconditionally (for both SBR and
> > RBR) or he should not use microseconds-precise System Versioning on
> > slave.
>
> one cannot possibly use microseconds-precise System Versioning on slave,
> if the server does not send microseconds. the slave merely tries to
> distingush between different statements, it doesn't try being precise.
>
>
But it can't recover correct statements order anyway. The statements came
from many master threads to single slave thread in some arbitrary order.
What is the point in ordering them at slave end?


> Regards,
> Sergei
> Chief Architect MariaDB
> and secur...@mariadb.org
>


-- 
All the best,

Aleksey Midenkov
@midenok
___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] set_time() on slave and unversioned -> versioned replication

2019-04-04 Thread Sergei Golubchik
Hi, Aleksey!

On Apr 04, Aleksey Midenkov wrote:
> On Thu, Apr 4, 2019 at 2:43 PM Sergei Golubchik  wrote:
> >
> If there is an installation from unversioned 5.3 to versioned 10.3 we
> can warn user about lost microseconds. This is minor issue since such
> setups are rare, I guess. But for what microseconds are sent in 5.3?

When it's needed for replication. Say, in

  INSERT t1 VALUES (NOW(6));

thd->query_start_sec_part() sets the flag.

> > And I didn't want to force the master to include microseconds in
> > every single event for every single user just in case someone would
> > decide to do unversioned->versioned replication.
> 
> If it's critical, this can be configured. But is it really performance
> issue?

A small performance and storage issue. At least 3 bytes per event.

And it's a general principle - there will be definitely less than 1% of
users, who will use this. Less than 0.1% too. Most probably, less than
0.01%. So, the remaining 99.99% should not pay the price for it.

> > Also, I thought that processing of 100 Query_log_event's in a
> > second is not realistic.
> 
> Now it fails just with several events. I guess because
> system_time.sec_part is not reset to 0 initially.

You said yourself that is is reset:

  start_time_sec_part= system_time.sec_part= 0;

Initially it's reset in THD::THD() too:

  system_time.start.val= system_time.sec= system_time.sec_part= 0;

> > But now I see some issues with that. One can freeze the time on the
> > master with 'SET TIMESTAMP' and send an arbitrary number of events
> > with the same timestamp.
> >
> > Or one can generate a query event that includes microseconds, to
> > force the slave to count not from X.0, but from, say, X.98.
> >
> > So, a wraparound is possible and we need some fix for it.
> 
> Looks like a lots of complications for a minor issue. Other DBMS don't
> use microseconds for System Versioning at all. I guess user should
> cope either with sent microseconds unconditionally (for both SBR and
> RBR) or he should not use microseconds-precise System Versioning on
> slave.

one cannot possibly use microseconds-precise System Versioning on slave,
if the server does not send microseconds. the slave merely tries to
distingush between different statements, it doesn't try being precise.

Regards,
Sergei
Chief Architect MariaDB
and secur...@mariadb.org

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] [Commits] a1929d001c7: MDEV-18741: Optimizer trace: multi-part key ranges are printed incorrectly

2019-04-04 Thread Sergey Petrunia
Hi Varun,

In general, the patch moves in the right direction, but it is not yet there. 

Consider this example:

create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (
  a int not null,
  b int not null,
  c int not null,
  d int not null,
  key(a,b,c)
);

insert into t1 select a,a, a,a from ten;

set optimizer_trace=1;
explain select * from t1 where a between 1 and 4 and b < 50;

mysql> select
-> JSON_DETAILED(JSON_EXTRACT(trace, '$**.analyzing_range_alternatives'))
-> from INFORMATION_SCHEMA.OPTIMIZER_TRACE;

{
"range_scan_alternatives": 
[

{
"index": "a",
"ranges": 
[
"(a) < (4,50)"
],
"rowid_ordered": false,
"using_mrr": false,
"index_only": false,
"rows": 1,
"cost": 2.5002,
"chosen": true
}
],
"analyzing_roworder_intersect": 
{
"cause": "too few roworder scans"
},
"analyzing_index_merge_union": 
[
]
}
] |

while print_quick produces:

T@18   : | | | | | | | | | | >print_quick
quick range select, key a, length: 8
  1 <= X < 4/50
other_keys: 0x0:
T@18   : | | | | | | | | | |  revision-id: a1929d001c7b25e26d178834a196020bade819b8 
> (mariadb-10.3.6-318-ga1929d001c7)

Any idea why does the commit trigger show "mariadb-10.3.." ? The patch is
against 10.4, isnt it?

> parent(s): 0dc442ac61ac7ff1a1d6bd7d6090c0f2251fb558
> author: Varun Gupta
> committer: Varun Gupta
> timestamp: 2019-04-03 13:10:07 +0530
> message:
> 
> MDEV-18741: Optimizer trace: multi-part key ranges are printed incorrectly
> 
> Changed the function append_range_all_keyparts to use sel_arg_range_seq_init 
> / sel_arg_range_seq_next to produce ranges.
> Also adjusted to print format for the ranges, now the ranges are printed as:
> (keypart1_min, keypart2_min,..)  OP (keypart1_name,keypart2_name, ..) OP 
> (keypart1_max,keypart2_max, ..)
> 
> Also added more tests for range and index merge access for optimizer trace

...
> diff --git a/sql/opt_range.cc b/sql/opt_range.cc
> index 5ab3d70214d..f53efc55158 100644
> --- a/sql/opt_range.cc
> +++ b/sql/opt_range.cc
> @@ -431,16 +431,18 @@ static int and_range_trees(RANGE_OPT_PARAM *param,
>  static bool remove_nonrange_trees(RANGE_OPT_PARAM *param, SEL_TREE *tree);
>  
>  static void print_key_value(String *out, const KEY_PART_INFO *key_part,
> -const uchar *key);
> +const uchar* key, uint length);
> +static void print_keyparts_name(String *out, const KEY_PART_INFO *key_part,
> + uint n_keypart, key_part_map keypart_map);
Please fix indentation ^

>  
>  static void append_range_all_keyparts(Json_writer_array *range_trace,
> -  String *range_string,
> -  String *range_so_far, const SEL_ARG 
> *keypart,
> +  PARAM param, uint idx,
> +  SEL_ARG *keypart,
>const KEY_PART_INFO *key_parts);
>  
>  static
> -void append_range(String *out, const KEY_PART_INFO *key_parts,
> -  const uchar *min_key, const uchar *max_key, const uint 
> flag);
> +void append_range(String *out, const KEY_PART_INFO *key_part,
> +  KEY_MULTI_RANGE *range, uint n_key_parts);
Please fix indentation ^
>  
>  
>  /*
> @@ -2273,10 +2275,7 @@ void TRP_RANGE::trace_basic_info(const PARAM *param,
>// TRP_RANGE should not be created if there are no range intervals
>DBUG_ASSERT(key);
>  
> -  String range_info;
> -  range_info.length(0);
> -  range_info.set_charset(system_charset_info);
> -  append_range_all_keyparts(_range, NULL, _info, key, key_part);
> +  append_range_all_keyparts(_range, *param, key_idx, key, key_part);
>  }
>  
>  
> @@ -2489,10 +2488,8 @@ void TRP_GROUP_MIN_MAX::trace_basic_info(const PARAM 
> *param,
>// can have group quick without ranges
>if (index_tree)
>{
> -String range_info;
> -range_info.set_charset(system_charset_info);
> -append_range_all_keyparts(_range, NULL, _info, index_tree,
> -  key_part);
> +append_range_all_keyparts(_range, *param, param_idx,
> +  index_tree, key_part);
>}
>  }
>  
> @@ -6398,20 +6395,9 @@ void TRP_ROR_INTERSECT::trace_basic_info(const PARAM 
> *param,
>  trace_isect_idx.add("rows", (*cur_scan)->records);
>  
>  Json_writer_array trace_range(thd, "ranges");
> -for (const SEL_ARG *current= (*cur_scan)->sel_arg->first(); current;
> - current= current->next)
> -{
> -  String range_info;
> -  range_info.set_charset(system_charset_info);
> -  

Re: [Maria-developers] set_time() on slave and unversioned -> versioned replication

2019-04-04 Thread Aleksey Midenkov
On Thu, Apr 4, 2019 at 2:43 PM Sergei Golubchik  wrote:

> Hi, Aleksey!
>
> On Apr 04, Aleksey Midenkov wrote:
> > Hello, Sergei!
> >
> > In unversioned -> versioned scenario in the code below it first gets to
> Set
> > time 4, creates some records (on slave) and then seconds on slave
> increased
> > (X+1) while on master seconds are not yet increased (X). Then we get to
> Set
> > time 3 and reset time on slave to X.0 therefore we get back in time and
> all
> > stored records with timestamp X.n will be in future. 'n' came from
> > ++system_time.sec_part in Set time 4.
>
> That's not how it is supposed to work.
> As long as the master sends events with seconds=X,
> the slave will generate microsecons, X.0, X.1, X.2, etc.
>
> When the master sends an event with a new timestamp, Y,
> the slave goes back to Y.0, and continues with Y.1, etc.


> > Why did you decided to use such logic of getting seconds from master
> > and microseconds from slave? Since microseconds sooner or later reset
> > to 0 it's not better than just assigning some random number. What
> > sending microseconds from master conditionally only is good for?
>
> Because the master was sending microseconds conditionally, since 5.3.
> The slave had to cope with that somehow anyway.
>
>
If there is an installation from unversioned 5.3 to versioned 10.3 we can
warn user about lost microseconds. This is minor issue since such setups
are rare, I guess. But for what microseconds are sent in 5.3?


> And I didn't want to force the master to include microseconds in every
> single event for every single user just in case someone would decide to
> do unversioned->versioned replication.
>

If it's critical, this can be configured. But is it really performance
issue?


>
> Also, I thought that processing of 100 Query_log_event's in a second
> is not realistic.
>

Now it fails just with several events. I guess because system_time.sec_part
is not reset to 0 initially.


>
> But now I see some issues with that. One can freeze the time on the
> master with 'SET TIMESTAMP' and send an arbitrary number of events with
> the same timestamp.
>
> Or one can generate a query event that includes microseconds, to force
> the slave to count not from X.0, but from, say, X.98.
>
> So, a wraparound is possible and we need some fix for it.
>

Looks like a lots of complications for a minor issue. Other DBMS don't use
microseconds for System Versioning at all. I guess user should cope either
with sent microseconds unconditionally (for both SBR and RBR) or he should
not use microseconds-precise System Versioning on slave.


>
> Regards,
> Sergei
> Chief Architect MariaDB
> and secur...@mariadb.org
>


-- 
All the best,

Aleksey Midenkov
@midenok
___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] set_time() on slave and unversioned -> versioned replication

2019-04-04 Thread Sergei Golubchik
Hi, Aleksey!

On Apr 04, Aleksey Midenkov wrote:
> Hello, Sergei!
> 
> In unversioned -> versioned scenario in the code below it first gets to Set
> time 4, creates some records (on slave) and then seconds on slave increased
> (X+1) while on master seconds are not yet increased (X). Then we get to Set
> time 3 and reset time on slave to X.0 therefore we get back in time and all
> stored records with timestamp X.n will be in future. 'n' came from
> ++system_time.sec_part in Set time 4.

That's not how it is supposed to work.
As long as the master sends events with seconds=X,
the slave will generate microsecons, X.0, X.1, X.2, etc.

When the master sends an event with a new timestamp, Y,
the slave goes back to Y.0, and continues with Y.1, etc.

> Why did you decided to use such logic of getting seconds from master
> and microseconds from slave? Since microseconds sooner or later reset
> to 0 it's not better than just assigning some random number. What
> sending microseconds from master conditionally only is good for?

Because the master was sending microseconds conditionally, since 5.3.
The slave had to cope with that somehow anyway.

And I didn't want to force the master to include microseconds in every
single event for every single user just in case someone would decide to
do unversioned->versioned replication.

Also, I thought that processing of 100 Query_log_event's in a second
is not realistic.

But now I see some issues with that. One can freeze the time on the
master with 'SET TIMESTAMP' and send an arbitrary number of events with
the same timestamp.

Or one can generate a query event that includes microseconds, to force
the slave to count not from X.0, but from, say, X.98.

So, a wraparound is possible and we need some fix for it.

Regards,
Sergei
Chief Architect MariaDB
and secur...@mariadb.org

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


[Maria-developers] set_time() on slave and unversioned -> versioned replication

2019-04-04 Thread Aleksey Midenkov
Hello, Sergei!

In unversioned -> versioned scenario in the code below it first gets to Set
time 4, creates some records (on slave) and then seconds on slave increased
(X+1) while on master seconds are not yet increased (X). Then we get to Set
time 3 and reset time on slave to X.0 therefore we get back in time and all
stored records with timestamp X.n will be in future. 'n' came from
++system_time.sec_part in Set time 4.

Why did you decided to use such logic of getting seconds from master and
microseconds from slave? Since microseconds sooner or later reset to 0 it's
not better than just assigning some random number. What sending
microseconds from master conditionally only is good for?

And for RBR since you don't send microseconds at all I see no good solution
for this.

  inline void set_time(my_time_t t, ulong sec_part)
  {
if (opt_secure_timestamp > (slave_thread ? SECTIME_REPL :
SECTIME_SUPER))
{
  set_time(); // note that BINLOG itself requires SUPER
  DBUG_EXECUTE("time", print_start_time("Set time 1"););
}
else
{
  if (sec_part <= TIME_MAX_SECOND_PART)
  {
start_time= system_time.sec= t;
start_time_sec_part= system_time.sec_part= sec_part;
DBUG_EXECUTE("time", print_start_time("Set time 2"););
  }
  else if (t != system_time.sec)
  {
DBUG_EXECUTE("time", print_system_time("System time"););;
start_time= system_time.sec= t;
start_time_sec_part= system_time.sec_part= 0;
DBUG_EXECUTE("time", print_start_time("Set time 3"););
  }
  else
  {
start_time= t;
start_time_sec_part= ++system_time.sec_part;
DBUG_EXECUTE("time", print_start_time("Set time 4"););
  }
  user_time.val= hrtime_from_time(start_time) + start_time_sec_part;
  PSI_CALL_set_thread_start_time(start_time);
  start_utime= utime_after_lock= microsecond_interval_timer();
}
  }

-- 
All the best,

Aleksey Midenkov
@midenok
___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp