Re: [Maria-developers] commit_checkpoint_request() vs. thd_get_durability_property() (in relation to MDEV-11937)

2017-08-08 Thread Kristian Nielsen
Marko Mäkelä  writes:

> Thank you for your analysis. I suggest that we remove the unused
> declarations and code as soon as possible. Would you do that already under
> MDEV-11937?

I started on this, but then noticed that several other storage engines
reference the APIs also, and thought I should not remove it without
discussion first.

Another example is the added argument binlog_group_flush to the flush_logs
handlerton method. This is the binlog-based group prepare, which is not in
MariaDB. The InnoDB code seems to always pass binlog_group_flush as true,
isn't this wrong? It will skip log sync in srv_flush_log_at_trx_commit==0,
not sure what bad consequences that will have though, if any.

My opinion would be that storage engines should be adapted to the mariadb
APIs before being pushed into any main tree. It really should not take much
effort to do so, apart from taking the time to understand the code properly
first. For InnoDB, it should be already done, so new code just needs to be
adapted to the different APIs.

(BTW, I hope it is clear that I am not blaming any individual bug on any
individual engineer here. I know full well how MariaDB development happens
and what policies lead to these consequences).

 - Kristian.

> I think that future code merges from MySQL should be based on individual
> commits rather than snapshots. For merging the changes from 5.7.14 to
> 5.7.18, I already did this, and I rewrote or omitted some commits; see
> MDEV-11751.
>
> One particular problem with snapshots is that changes or additions to test
> files were often omitted. When merging changes commit by commit ("git
> format-patch" followed by "git am"), some manual work is needed to import
> the affected test files. The Oracle policy should be to only withhold test
> files for security bugs; we should take full advantage of the public tests.
>
> Best regards,
>
> Marko
>
> On Mon, Aug 7, 2017 at 2:25 PM, Jan Lindström 
> wrote:
>
>> Hi,
>>
>> On Mon, Aug 7, 2017 at 2:09 PM, Kristian Nielsen > > wrote:
>>
>>>
>>> The problem is that somehow the thd_get_durability_property() function was
>>> introduced into MariaDB code, but it is completely non-functional. So now
>>> there is code in InnoDB, TokuDB and RocksDB that calls this function and
>>> does not work correctly. This lead to performance regression due to extra
>>> fsync() calls.
>>>
>>
>> This function was introduced while I was doing InnoDB 5.7.9 merge, this
>> was huge
>> merge but in my understanding all server changes i.e. also storage API
>> changes were
>> reviewed. But, as patch was significant there was error. That function
>> should not be introduced.
>>
>>
>>>
>>>
>>> Or was the intention to eventually replace the whole MariaDB binlog group
>>> commit implementation with the MySQL one, to make MariaDB less divergent?
>>> This would require a number of changes to MariaDB binlog and replication.
>>>
>>
>> No this was not the intention.
>>
>> R: Jan
>>
>> ___
>> Mailing list: https://launchpad.net/~maria-developers
>> Post to : maria-developers@lists.launchpad.net
>> Unsubscribe : https://launchpad.net/~maria-developers
>> More help   : https://help.launchpad.net/ListHelp
>>
>>

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] commit_checkpoint_request() vs. thd_get_durability_property() (in relation to MDEV-11937)

2017-08-07 Thread Marko Mäkelä
Kristian,

Thank you for your analysis. I suggest that we remove the unused
declarations and code as soon as possible. Would you do that already under
MDEV-11937?

I think that future code merges from MySQL should be based on individual
commits rather than snapshots. For merging the changes from 5.7.14 to
5.7.18, I already did this, and I rewrote or omitted some commits; see
MDEV-11751.

One particular problem with snapshots is that changes or additions to test
files were often omitted. When merging changes commit by commit ("git
format-patch" followed by "git am"), some manual work is needed to import
the affected test files. The Oracle policy should be to only withhold test
files for security bugs; we should take full advantage of the public tests.

Best regards,

Marko

On Mon, Aug 7, 2017 at 2:25 PM, Jan Lindström 
wrote:

> Hi,
>
> On Mon, Aug 7, 2017 at 2:09 PM, Kristian Nielsen  > wrote:
>
>>
>> The problem is that somehow the thd_get_durability_property() function was
>> introduced into MariaDB code, but it is completely non-functional. So now
>> there is code in InnoDB, TokuDB and RocksDB that calls this function and
>> does not work correctly. This lead to performance regression due to extra
>> fsync() calls.
>>
>
> This function was introduced while I was doing InnoDB 5.7.9 merge, this
> was huge
> merge but in my understanding all server changes i.e. also storage API
> changes were
> reviewed. But, as patch was significant there was error. That function
> should not be introduced.
>
>
>>
>>
>> Or was the intention to eventually replace the whole MariaDB binlog group
>> commit implementation with the MySQL one, to make MariaDB less divergent?
>> This would require a number of changes to MariaDB binlog and replication.
>>
>
> No this was not the intention.
>
> R: Jan
>
> ___
> Mailing list: https://launchpad.net/~maria-developers
> Post to : maria-developers@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~maria-developers
> More help   : https://help.launchpad.net/ListHelp
>
>


-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation
___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] commit_checkpoint_request() vs. thd_get_durability_property() (in relation to MDEV-11937)

2017-08-07 Thread Jan Lindström
Hi,

On Mon, Aug 7, 2017 at 2:09 PM, Kristian Nielsen 
wrote:

>
> The problem is that somehow the thd_get_durability_property() function was
> introduced into MariaDB code, but it is completely non-functional. So now
> there is code in InnoDB, TokuDB and RocksDB that calls this function and
> does not work correctly. This lead to performance regression due to extra
> fsync() calls.
>

This function was introduced while I was doing InnoDB 5.7.9 merge, this was
huge
merge but in my understanding all server changes i.e. also storage API
changes were
reviewed. But, as patch was significant there was error. That function
should not be introduced.


>
>
> Or was the intention to eventually replace the whole MariaDB binlog group
> commit implementation with the MySQL one, to make MariaDB less divergent?
> This would require a number of changes to MariaDB binlog and replication.
>

No this was not the intention.

R: Jan
___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


[Maria-developers] commit_checkpoint_request() vs. thd_get_durability_property() (in relation to MDEV-11937)

2017-08-07 Thread Kristian Nielsen
Monty asked me to fix MDEV-11937. This particular one is a performance
regression in InnoDB commit. But there is a wider problem that I thought I
should explain, so it can be perhaps avoided in the future.

MariaDB and MySQL use different mechanisms for storage engines to avoid
having to fsync during commit when binlog is enabled. In MariaDB, storage
engines implement the commit_checkpoint_request() handlerton method. In
MySQL, storage engines call thd_get_durability_property() to check if they
can avoid fsync.

The problem is that somehow the thd_get_durability_property() function was
introduced into MariaDB code, but it is completely non-functional. So now
there is code in InnoDB, TokuDB and RocksDB that calls this function and
does not work correctly. This lead to performance regression due to extra
fsync() calls.

This seems to me a serious problem. Now new code can be merged and compile
fine, where in reality it is wrong. There really should not be two separate
and different mechanisms for the same thing, and certainly not with one of
them non-functional.

The "expected" approach would be to remove thd_get_durability_property() and
update storage engines to use the corresponding MariaDB APIs
(commit_ordered() and commit_checkpoint_request()). This should not be hard.
A simple commit_checkpoint_request() implementation can just fsync all
transactions immediately (similar to what MySQL does). A more detailed
implementation can avoid any extra fsyncs, and just asynchroneously notify
the upper layer with commit_checkpoint_notify_ha() later when such fsync
happens normally (this is what InnoDB does). See comments in handler.h for
details.

Or was the intention to eventually replace the whole MariaDB binlog group
commit implementation with the MySQL one, to make MariaDB less divergent?
This would require a number of changes to MariaDB binlog and replication.

The binlog recovery code should be replaced (MySQL does not have the ability
to recover from more than one binlog). The binlog group commit code must be
replaced as well, and the commit_ordered() mechanism removed. I think this
would also require a re-design of the MariaDB in-order parallel replication.
MySQL has some optional mechanism for in-order, but my understanding is that
it is not sufficient to do optimistic parallel replication. I am not
intimately familiar with the MySQL code though.

I hope this helps. The wider problem behind MDEV-11937 is one of policy more
than one of code bugs, so not too much else I can do to address it.

---

Incidentally, I noticed some code in InnoDB trx0trx.cc:

/* We set the HA_IGNORE_DURABILITY during prepare phase of
binlog group commit to not flush redo log for every transaction
here. So that we can flush prepared records of transactions to
redo log in a group right before writing them to binary log
during flush stage of binlog group commit. */

So the idea is to do group prepare with the same group of transactions that
will later group commit to the binlog. In MariaDB, this concept does not
exist. Storage engine prepares are allowed to run in parallel and in any
order compared to binlog commit. So the InnoDB group prepare can include
more transactions than participate in the binlog commit (but also less, of
course).

IIRC, the important thing is to ensure that all transactions are durably
prepared in storage engines before being written to the binlog. In MariaDB,
there is MYSQL_LOG_BIN::group_commit_queue that holds the list of
transactions to be group committed to the binlog.

A similar mechanism in MariaDB might use the group_commit_queue as the set
of transactions to send to storage engines for group prepare. But if we wait
for a prepare fsync() after building this list, some transactions that could
prepare during this fsync may be unnecesarily delayed to the next binlog
commit.

The MySQL 5.7 code grabs the list of transactions to binlog group commit,
and then flushes the log in _all_ storage engines, unconditionally. That
just seems horrybly wrong, but I can't see any other way to read the code
(from MYSQL_BIN_LOG::process_flush_stage_queue()):

  ha_flush_logs(NULL, true);

It even does so while holding LOCK_log :-( I guess the MySQL idea is that
there is only one storage engine anyway, InnoDB.

 - Kristian.

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp