Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-10-24 Thread andrei . elkin
Kristian, hello.

The patch is polished to address your comments and Ian's editorial work.
I apologize for the possibly inconvenient delay with the final version. It's here:

https://github.com/MariaDB/server/pull/460/commits/56b000b2e7d9c4dec61429f0ff7affe9c75409ca

as well as appended to the end of this mail.

Cheers,

Andrei


andrei.el...@pp.inet.fi writes:

> Kristian, salve.
>
> Thanks for checking the patch so promptly!
>
>> andrei.el...@pp.inet.fi writes:
>>
>>> The patch is ready for review and can be located on bb-10.1-andrei,
>>> https://github.com/MariaDB/server/pull/460
>>
>> I think the patch is ok, it looks of good quality and well thought out.
>>
>> A few comments/suggestions:
>>
>>
>> 1. In drop_domain(), the use of the condition if (strlen(errbuf)) would be
>> clearer if it was a separate flag. As the code is now, it implicitly
>> requires errbuf to be initialised to the empty string by caller, which is
>> needlessly errorprone.
>
> You have a point. No need to involve the caller in this case.
>
>>
>> (In general, I think using the errmsg also as a flag is best avoided, but I
>> realise this is partly inherited also from existing code, as a work-around
>> for C not allowing to return multiple values easily. But inside
>> drop_domain() it is easy to use a separate flag).
>>
>>
>> 2. I would re-consider if this warning is useful:
>>
>>   sprintf(errbuf,
>>   "missing gtids from '%u-%u' domain-server pair "
>>   "which is referred in Gtid list describing earlier binlog state; "
>>   "ignore it if the domain was already explicitly deleted",
>>   glev->list[l].domain_id, glev->list[l].server_id);
>>
>> Since, as you write in the patch, with this feature it will be a normal
>> state of affairs that domains can be removed from the binlog state.
>
> There still exists a possibility (C) of "manual" and "malign" composition of
> binlog files, which I am trying to take control over.
> If you think it's too paranoid I won't object :-).
>
>>
>>
>> 3. I am not sure I understand the purpose of this warning:
>>
>>   sprintf(errbuf,
>>   "having gtid '%u-%u-%llu' which is lesser than "
>>   "'%u-%u-%llu' of Gtid list describing earlier binlog state; "
>>   "possibly the binlog state was affected by smaller sequence number "
>>   "gtid injection (manually or via replication)",
>>   rb_state_gtid->domain_id, rb_state_gtid->server_id,
>>   rb_state_gtid->seq_no, glev->list[l].domain_id,
>>   glev->list[l].server_id, glev->list[l].seq_no);
>>   push_warning_printf(current_thd, Sql_condition::WARN_LEVEL_WARN,
>>   ER_BINLOG_CANT_DELETE_GTID_DOMAIN,
>>   "the current gtid binlog state is incompatible to "
>>   "a former one %s", errbuf);
>>
>
>> The ER_BINLOG_CANT_DELETE_GTID_DOMAIN says "Could not delete gtid
>> domain".
>> But if I understand correctly, this warning is actually unrelated to the
>> domains being listed in DELETE DOMAIN_ID (and in fact such domains can be
>> deleted successfully despite this message).
>
> A glitch, right.
>
>
>>
>> Having a warning here might be ok (this condition would be considered a
>> corruption of the binary log, out-of-order within the same server-id is not
>> valid). But it might be confusing to users the way it is done here?
>
> I'm correcting the head part of the warning message (naturally the same
> applies to the first one, "missing gtids from '%u-%u' domain-server
> pair", should you agree to keep it).
>
>
>>
>>
>> 4. Also consider asking Ian Gilfillan (who did a lot of documentation on
>> MariaDB) for help in clarifying the different error and warning messages in
>> the patch. Eg. "is lesser than" is not correct English (like you, I also am
>> not a native English speaker).
>
> I will try to get Ian's input into the new patch.
>
>>
>>
>> Thanks for the patch, this is something that has been requested a number of
>> times. You might consider taking over this task from Jira, which (if I
>> understand the description correctly) you have basically solved (if with a
>> different/better syntax):
>>
>>   https://jira.mariadb.org/browse/MDEV-9241
>>
>>  - Kristian.
>
> Above all it's a good piece of collective work which I enjoy!
>
> Cheers,
>
> Andrei

diff --git a/mysql-test/include/show_gtid_list.inc b/mysql-test/include/show_gtid_list.inc
new file mode 100644
index 000..96f813f180c
--- /dev/null
+++ b/mysql-test/include/show_gtid_list.inc
@@ -0,0 +1,15 @@
+#  Purpose 
+#
+# Extract Gtid_list info from SHOW BINLOG EVENTS output masking
+# non-deterministic fields.
+#
+#  Usage 
+#
+# [--let $binlog_file=filename
+#
+if ($binlog_file)
+{
+  --let $_in_binlog_file=in '$binlog_file'
+}
+--replace_column 2 # 5 #
+--eval show binlog events $_in_binlog_file limit 1,1
diff --git 
a/mysql-test/suite/binlog/r/binlog_flush_binlogs_delete_domain.result 

Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-10-05 Thread andrei . elkin
Kristian, salve.

Thanks for checking the patch so promptly!

> andrei.el...@pp.inet.fi writes:
>
>> The patch is ready for review and can be located on bb-10.1-andrei,
>> https://github.com/MariaDB/server/pull/460
>
> I think the patch is ok, it looks of good quality and well thought out.
>
> A few comments/suggestions:
>
>
> 1. In drop_domain(), the use of the condition if (strlen(errbuf)) would be
> clearer if it was a separate flag. As the code is now, it implicitly
> requires errbuf to be initialised to the empty string by caller, which is
> needlessly errorprone.

You have a point. No need to involve the caller in this case.
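
A minimal sketch of the separate-flag idea from point 1, not the code from the patch. The function name check_domains() and its parameters are hypothetical. The error is reported through the return value and the message buffer is written only on failure, so the caller no longer has to pre-initialise errbuf to an empty string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical stand-in for the check inside drop_domain(): returns true on
// error and fills errbuf only in that case. The return value, not
// strlen(errbuf), is the error flag, so errbuf may be uninitialised on entry.
bool check_domains(const unsigned *ids, std::size_t n, unsigned active_id,
                   char *errbuf, std::size_t errbuf_size)
{
  for (std::size_t i= 0; i < n; i++)
  {
    if (ids[i] == active_id)
    {
      std::snprintf(errbuf, errbuf_size,
                    "domain %u is still present in the binary logs", ids[i]);
      return true;                      // explicit error flag
    }
  }
  return false;                         // errbuf left untouched
}
```

With this shape the caller tests the returned flag and reads errbuf only when the flag is set.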

>
> (In general, I think using the errmsg also as a flag is best avoided, but I
> realise this is partly inherited also from existing code, as a work-around
> for C not allowing to return multiple values easily. But inside
> drop_domain() it is easy to use a separate flag).
>
>
> 2. I would re-consider if this warning is useful:
>
>   sprintf(errbuf,
>   "missing gtids from '%u-%u' domain-server pair "
>   "which is referred in Gtid list describing earlier binlog state; "
>   "ignore it if the domain was already explicitly deleted",
>   glev->list[l].domain_id, glev->list[l].server_id);
>
> Since, as you write in the patch, with this feature it will be a normal
> state of affairs that domains can be removed from the binlog state.

There still exists a possibility (C) of "manual" and "malign" composition of
binlog files, which I am trying to take control over.
If you think it's too paranoid I won't object :-).

>
>
> 3. I am not sure I understand the purpose of this warning:
>
>   sprintf(errbuf,
>   "having gtid '%u-%u-%llu' which is lesser than "
>   "'%u-%u-%llu' of Gtid list describing earlier binlog state; "
>   "possibly the binlog state was affected by smaller sequence number "
>   "gtid injection (manually or via replication)",
>   rb_state_gtid->domain_id, rb_state_gtid->server_id,
>   rb_state_gtid->seq_no, glev->list[l].domain_id,
>   glev->list[l].server_id, glev->list[l].seq_no);
>   push_warning_printf(current_thd, Sql_condition::WARN_LEVEL_WARN,
>   ER_BINLOG_CANT_DELETE_GTID_DOMAIN,
>   "the current gtid binlog state is incompatible to "
>   "a former one %s", errbuf);
>

> The ER_BINLOG_CANT_DELETE_GTID_DOMAIN says "Could not delete gtid
> domain".
> But if I understand correctly, this warning is actually unrelated to the
> domains being listed in DELETE DOMAIN_ID (and in fact such domains can be
> deleted successfully despite this message).

A glitch, right.


>
> Having a warning here might be ok (this condition would be considered a
> corruption of the binary log, out-of-order within the same server-id is not
> valid). But it might be confusing to users the way it is done here?

I'm correcting the head part of the warning message (naturally the same
applies to the first one, "missing gtids from '%u-%u' domain-server
pair", should you agree to keep it).
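
To make the condition behind the second warning concrete, here is a small sketch under stated assumptions. The struct Gtid and the function is_behind_earlier_state() are hypothetical names, not the types used in the patch. It flags a binlog-state GTID whose sequence number is smaller than the one recorded for the same domain-server pair in an earlier Gtid list.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain struct mirroring the domain-server-seq_no triple.
struct Gtid
{
  uint32_t domain_id;
  uint32_t server_id;
  uint64_t seq_no;
};

// True when `current` (from the binlog state) is behind `earlier` (from the
// Gtid list of an older binlog file) for the same domain-server pair.
bool is_behind_earlier_state(const Gtid &current, const Gtid &earlier)
{
  return current.domain_id == earlier.domain_id &&
         current.server_id == earlier.server_id &&
         current.seq_no < earlier.seq_no;
}
```

Entries for different domain-server pairs never trigger the check, which matches the point that the warning is unrelated to the domains being deleted.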


>
>
> 4. Also consider asking Ian Gilfillan (who did a lot of documentation on
> MariaDB) for help in clarifying the different error and warning messages in
> the patch. Eg. "is lesser than" is not correct English (like you, I also am
> not a native English speaker).

I will try to get Ian's input into the new patch.

>
>
> Thanks for the patch, this is something that has been requested a number of
> times. You might consider taking over this task from Jira, which (if I
> understand the description correctly) you have basically solved (if with a
> different/better syntax):
>
>   https://jira.mariadb.org/browse/MDEV-9241
>
>  - Kristian.

Above all it's a good piece of collective work which I enjoy!

Cheers,

Andrei

>
>
>> From 3e9d06db84ab8cd761717fcb5ca4a05dfed70da0 Mon Sep 17 00:00:00 2001
>> From: Andrei Elkin 
>> Date: Fri, 29 Sep 2017 21:56:59 +0300
>> Subject: [PATCH] MDEV-12012/MDEV-11969 Can't remove GTIDs for a stale GTID
>>  Domain ID
>> 
>> As reported in MDEV-11969 "there's no way to ditch knowledge" about some
>> domain that is no longer updated on a server. Besides being of annoyance to
>> clutter output in DBA console stale domains can prevent the slave
>> to connect the master as MDEV-12012 witnesses.
>> What domain is obsolete must be evaluated by the user (DBA) according
>> to whether the domain info is still relevant and will the domain ever
>> receive any update.
>> 
>> This patch introduces a method to discard obsolete gtid domains from
>> the server binlog state. The removal requires no event group from such
>> domain present in existing binlog files though. If there are any the
>> containing logs must be first PURGEd in order for
>> 
>>   FLUSH BINARY LOGS DELETE_DOMAIN_ID=(list-of-domains)
>> 
>> succeed. Otherwise the command returns an error.
>> 
>> The list of obsolete domains can be computed through
>> 

Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-10-04 Thread andrei . elkin
Kristian, hello.

The patch is ready for review and can be located on bb-10.1-andrei,
https://github.com/MariaDB/server/pull/460
(Ignore https://github.com/MariaDB/server/pull/459 which specified 10.2
base by mistake)

In case you are unable to review it, I'll find a replacement, no worries.

Have a good time.

Andrei

> Kristian, thanks for more remarks!
>
>>>> If you “forget" the domain on the upstream server what happens if there
>>>> are downstream slaves?  I think you’ll break replication if they disconnect
>>>> from this box and try to reconnect. Their GTID information will no
>>>> longer match.
>>>> IMO and if I’ve understood correctly this is broken.
>>
>> It should not break replication. It is allowed for a slave with GTID
>> position 0-1-100,10-2-200 to connect to a master that has nothing in
>> domain
>> 10, this is normal.
>
> To me in a sense this is "implicit" IGNORE_DOMAIN_IDS on domains that
> master
> does not have.
>
>>
>> I am not sure what the use-case of replicating DELETE DOMAIN to a
>> slave
>> would be. Domain deletion does not have a point-in-time property
>> like normal
>> transactions, so it does not help to have it replicated inline in
>> the event
>> stream. If it has an effect on a slave, this effect occurs only when
>> the
>> slave is restarted/reconnected.
>
> The use-case must've been the suspected loss of connectivity by
> slaves.
>
>>
>>>> I really think there’s a need to indicate what domains should be
>>>> forgotten/ignored
>>
>> If CHANGE MASTER ... IGNORE_DOMAIN_IDS is fixed to also ignore the
>> extra
>> domains on master upon connect, it is probably a better way to
>> ignore
>> domains in many cases. It is persisted (in the slave's master.info),
>> and it
>> can be set individually for each slave, which is more flexible (what
>> if one
>> slave needs to ignore a domain but another slave needs to replicate
>> it?).
>>
>>>>KN> The procedure to fix it will then be:
>>>>> 
>>>>> 1. FLUSH BINARY LOGS, note the new GTID position.
>>>>> 
>>>>> 2. Ensure that all slaves are past the problematic point with
>>>>> MASTER_GTID_WAIT(). After this, the old erroneous binlog files
>>>>> are no longer needed.
>>>>> 3. PURGE BINARY LOGS to remove the erroneous logs.
>>>>> 
>>>>> 4. FLUSH BINARY LOG DELETE DOMAIN d
>>
>> So this was what I suggested at some point related to
>> MDEV-12012. But
>> probably this is not the best suggestion, as I realised later.
>>
>> 1. In MDEV-12012, two independent masters were originally using the
>> same
>> domain id, so their history looks diverged in terms of GTID. This
>> can be
>> fixed by injecting a dummy transaction to make them up-to-date with
>> one
>> another in that domain.
>> Deleting (possibly valuable) part of the history is
>> not needed.
>>
>> 2. Another case, a slave needs to ignore the part of the history on
>> a master
>> connected with some domain. IGNORE_DOMAIN_IDS, once fixed, can do
>> this,
>> again there is no need to delete possibly valuable history on the
>> master.
>>
>
> Right. The feature we've been discussing solely deals with p.3.
>
>> 3. At some point, a domain that was unused for long may no longer
>> appear
>> anywhere, _except_ in gtid_binlog_state and gtid_slave_pos. This may
>> eventually clutter the output and be an annoyance. The original idea
>> with
>> FLUSH BINARY LOGS DELETE DOMAIN was to allow to fix this annoyance
>> by
>> removing such domains from gtid_binlog_state once they are no longer
>> needed
>> anywhere.
>>
>> I am not sure my original suggestion of using PURGE LOGS was ever a
>> good
>> idea, or is ever needed.
>
> I think it remains as optional which I wrote in my reply last night.
>
> Cheers,
>
> Andrei
>
>>
>>  - Kristian.
>

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-29 Thread andrei . elkin
Kristian, thanks for more remarks!

>>> If you “forget" the domain on the upstream server what happens if
>>> there
>>> are downstream slaves?  I think you’ll break replication if they
>>> disconnect
>>> from this box and try to reconnect. Their GTID information will no
>>> longer match.
>>> IMO and if I’ve understood correctly this is broken.
>
> It should not break replication. It is allowed for a slave with GTID
> position 0-1-100,10-2-200 to connect to a master that has nothing in domain
> 10, this is normal.

To me in a sense this is "implicit" IGNORE_DOMAIN_IDS on domains that master
does not have.

>
> I am not sure what the use-case of replicating DELETE DOMAIN to a slave
> would be. Domain deletion does not have a point-in-time property like normal
> transactions, so it does not help to have it replicated inline in the event
> stream. If it has an effect on a slave, this effect occurs only when the
> slave is restarted/reconnected.

The use-case must've been the suspected loss of connectivity by slaves.

>
>>> I really think there’s a need to indicate what domains should be
>>> forgotten/ignored
>
> If CHANGE MASTER ... IGNORE_DOMAIN_IDS is fixed to also ignore the extra
> domains on master upon connect, it is probably a better way to ignore
> domains in many cases. It is persisted (in the slave's master.info), and it
> can be set individually for each slave, which is more flexible (what if one
> slave needs to ignore a domain but another slave needs to replicate
> it?).
>
>>>KN> The procedure to fix it will then be:
>>>> 
>>>> 1. FLUSH BINARY LOGS, note the new GTID position.
>>>> 
>>>> 2. Ensure that all slaves are past the problematic point with
>>>> MASTER_GTID_WAIT(). After this, the old erroneous binlog files
>>>> are no longer needed.
>>>> 3. PURGE BINARY LOGS to remove the erroneous logs.
>>>> 
>>>> 4. FLUSH BINARY LOG DELETE DOMAIN d
>
> So this was what I suggested at some point related to MDEV-12012. But
> probably this is not the best suggestion, as I realised later.
>
> 1. In MDEV-12012, two independent masters were originally using the same
> domain id, so their history looks diverged in terms of GTID. This can be
> fixed by injecting a dummy transaction to make them up-to-date with one
> another in that domain.
> Deleting (possibly valuable) part of the history is
> not needed.
>
> 2. Another case, a slave needs to ignore the part of the history on a master
> connected with some domain. IGNORE_DOMAIN_IDS, once fixed, can do this,
> again there is no need to delete possibly valuable history on the master.
>

Right. The feature we've been discussing solely deals with p.3.

> 3. At some point, a domain that was unused for long may no longer appear
> anywhere, _except_ in gtid_binlog_state and gtid_slave_pos. This may
> eventually clutter the output and be an annoyance. The original idea with
> FLUSH BINARY LOGS DELETE DOMAIN was to allow to fix this annoyance by
> removing such domains from gtid_binlog_state once they are no longer needed
> anywhere.
>
> I am not sure my original suggestion of using PURGE LOGS was ever a good
> idea, or is ever needed.

I think it remains as optional which I wrote in my reply last night.

Cheers,

Andrei

>
>  - Kristian.



Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-29 Thread Kristian Nielsen

>> If you “forget" the domain on the upstream server what happens if
>> there
>> are downstream slaves?  I think you’ll break replication if they
>> disconnect
>> from this box and try to reconnect. Their GTID information will no
>> longer match.
>> IMO and if I’ve understood correctly this is broken.

It should not break replication. It is allowed for a slave with GTID
position 0-1-100,10-2-200 to connect to a master that has nothing in domain
10, this is normal.
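
A rough illustration of the situation described above, with hypothetical helper names (domains_of() and domains_unknown_to_master() are not server functions). It lists the domains that appear in a slave's GTID position but not in the master's binlog state. Per the explanation above, such domains do not prevent the slave from connecting.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Extract the domain ids from a GTID position such as "0-1-100,10-2-200".
// Each element has the form domain-server-seq_no; malformed input is not
// handled in this sketch.
std::vector<uint32_t> domains_of(const std::string &gtid_pos)
{
  std::vector<uint32_t> domains;
  std::istringstream in(gtid_pos);
  std::string gtid;
  while (std::getline(in, gtid, ','))
    domains.push_back(
      static_cast<uint32_t>(std::stoul(gtid.substr(0, gtid.find('-')))));
  return domains;
}

// Domains the slave has a position in but the master's binlog state lacks.
std::vector<uint32_t>
domains_unknown_to_master(const std::string &slave_pos,
                          const std::set<uint32_t> &master_domains)
{
  std::vector<uint32_t> missing;
  for (uint32_t d : domains_of(slave_pos))
    if (master_domains.count(d) == 0)
      missing.push_back(d);
  return missing;
}
```

For the position 0-1-100,10-2-200 and a master that only knows domain 0, the helper reports domain 10 as the one the master has nothing in.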

I am not sure what the use-case of replicating DELETE DOMAIN to a slave
would be. Domain deletion does not have a point-in-time property like normal
transactions, so it does not help to have it replicated inline in the event
stream. If it has an effect on a slave, this effect occurs only when the
slave is restarted/reconnected.

>> I really think there’s a need to indicate what domains should be 
>> forgotten/ignored

If CHANGE MASTER ... IGNORE_DOMAIN_IDS is fixed to also ignore the extra
domains on master upon connect, it is probably a better way to ignore
domains in many cases. It is persisted (in the slave's master.info), and it
can be set individually for each slave, which is more flexible (what if one
slave needs to ignore a domain but another slave needs to replicate it?).

>>KN> The procedure to fix it will then be:
>>> 
>>> 1. FLUSH BINARY LOGS, note the new GTID position.
>>> 
>>> 2. Ensure that all slaves are past the problematic point with
>>> MASTER_GTID_WAIT(). After this, the old erroneous binlog files
>>> are no longer needed.
>>> 3. PURGE BINARY LOGS to remove the erroneous logs.
>>> 
>>> 4. FLUSH BINARY LOG DELETE DOMAIN d

So this was what I suggested at some point related to MDEV-12012. But
probably this is not the best suggestion, as I realised later.

1. In MDEV-12012, two independent masters were originally using the same
domain id, so their history looks diverged in terms of GTID. This can be
fixed by injecting a dummy transaction to make them up-to-date with one
another in that domain. Deleting (possibly valuable) part of the history is
not needed.

2. Another case, a slave needs to ignore the part of the history on a master
connected with some domain. IGNORE_DOMAIN_IDS, once fixed, can do this,
again there is no need to delete possibly valuable history on the master.

3. At some point, a domain that was unused for long may no longer appear
anywhere, _except_ in gtid_binlog_state and gtid_slave_pos. This may
eventually clutter the output and be an annoyance. The original idea with
FLUSH BINARY LOGS DELETE DOMAIN was to allow to fix this annoyance by
removing such domains from gtid_binlog_state once they are no longer needed
anywhere.

I am not sure my original suggestion of using PURGE LOGS was ever a good
idea, or is ever needed.

 - Kristian.



Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-27 Thread andrei . elkin
Kristian, Simon, hello.

The replication side of FLUSH BINARY LOGS DELETE_DOMAIN_ID is actually bound
to another requirement specified in earlier mails.
The command is successful only *after* the user has run

  PURGE BINARY LOGS to 'the-first-log-free-of-old-domains'

which is not replicated. Therefore, in order to propagate
the delete-domain instruction, the feature would have to
include a search for the first satisfactory log and a purge up to it.

Earlier I fancied it would be syntactically

  PURGE [NO_WRITE_TO_BINLOG|LOCAL] BINARY LOGS DELETE_DOMAIN_ID=(list-of-domain-ids)

with the FLUSH-like final piece of logic.
Unlike the plain PURGE this one can be replicated.

If we go this way, I also feel that replication should not be the default
behaviour. An opposite to NO_WRITE_TO_BINLOG|LOCAL would then have to be
introduced into the parser; say DO_WRITE_TO_BINLOG|REPLICATE?

Now I am wondering what colleagues would say.

Cheers,

Andrei



>> Hi Andrei,
>>
>>> On 25 Sep 2017, at 20:17, andrei.el...@pp.inet.fi wrote:
>>
>> ...
>>> the "vanilla" FLUSH-LOGS is not binlogged by decision commented in
>>> reload_acl_and_cache():
>>
>> In the normal case I think it makes sense to not trigger another
>> flush
>> and to not binlog the command.
>>
>>>  if (options & REFRESH_BINARY_LOG)
>>>  {
>>>    /*
>>>      Writing this command to the binlog may result in infinite loops
>>>      when doing mysqlbinlog|mysql, and anyway it does not really make
>>>      sense to log it automatically (would cause more trouble to users
>>>      than it would help them)
>>>    */
>>>    tmp_write_to_binlog= 0;
>>>    ...
>>> 
>>> I read them in a way how
>> ??
>
> Well, I should've spent one 1 (really) to understand them properly.
> Apparently
>mysqlbinlog --read-from-remote-server
> is meant and it's clear how the pipe is nirvana-like endless, indeed
> :-).
>
>>
>> However for the FLUSH LOGS DELETE DOMAIN …. I’m not so sure.
>
> You are directing to the point. Unlike the plain FLUSH BINARY LOGS
> the DELETE-DOMAIN one does not have to rotate the logs in case
> the domains have been already deleted or missed altogether.
> Therefore the above pipe is not dangerous.
>
>>
>> If you “forget" the domain on the upstream server what happens if
>> there
>> are downstream slaves?  I think you’ll break replication if they
>> disconnect
>> from this box and try to reconnect. Their GTID information will no
>> longer match.
>> IMO and if I’ve understood correctly this is broken.
>>
>
> Proper usage of FLUSH-DELETE-DOMAIN was thought to require
> slaves replication position be past the deleted domain. This implies
> a sort of
>   client> SELECT MASTER_GTID_WAIT()
> on each slave before FLUSH-DELETE-DOMAIN can be run on the master.
>
>> Please do not expect the DBA to fix this manually. I have lots of
>> places of multi-tier hierarchies
>> and I do not want to touch anything downstream of a master I push
>> changes into.
>>
>> “It should just work”.
>
> This is understood. Now that the new FLUSH-DELETE-DOMAIN is loop-free
> I don't have
> any doubt on its replication anymore.
>
>>
>> If FLUSH LOGS should not be binlogged for the reasons stated do not
>> overload this
>> command with something which behaves differently. Use a different
>> command,
>> which you can BINLOG and which won’t trigger confusion.
>
> Well, even a new command would rotate binlog so being loop-prone. (But
> we don't have to make any 2nd FLUSH-DELETE-DOMAIN to rotate, as said
> above).
>
>>
>> The signal in the binlogs of the fact you’re removing “old domains”
>> is
>> _needed_ by downstream
>> slaves to ensure that they “lose” these domains at the same point in
>> time binlog-wise and thus keep
>> in sync. That’s important.
>>
>> Simon
>
> Thanks for your response.
> I am going to exempt FLUSH-BINARY-LOGS from replication ban
> when it's run with the new DELETE_DOMAIN_ID=(list-of-domain-ids)
> option^\footnote{akin to {DO,IGNORE}_DOMAIN_IDS of CHANGE-MASTER}.
> Ineffective DELETE_DOMAIN_ID (e.g for a domain that is not in the gtid
> binlog state) won't cause rotation (the plain FLUSH-BINARY-LOGS part
> is
> not run).
>
> Existing NO_WRITE_TO_BINLOG|LOCAL options will *actually* control
> FLUSH-DELETE-DOMAIN replication.
>
>
> I hope this is satisfactory to everybody now.
>
> Cheers,
>
> Andrei



Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-27 Thread andrei . elkin
Simon, hello.

> Hi Andrei,
>
>> On 25 Sep 2017, at 20:17, andrei.el...@pp.inet.fi wrote:
>
> ...
>> the "vanilla" FLUSH-LOGS is not binlogged by decision commented in
>> reload_acl_and_cache():
>
> In the normal case I think it makes sense to not trigger another flush
> and to not binlog the command.
>
>>  if (options & REFRESH_BINARY_LOG)
>>  {
>>    /*
>>      Writing this command to the binlog may result in infinite loops
>>      when doing mysqlbinlog|mysql, and anyway it does not really make
>>      sense to log it automatically (would cause more trouble to users
>>      than it would help them)
>>    */
>>    tmp_write_to_binlog= 0;
>>    ...
>> 
>> I read them in a way how
> ??

Well, I should've spent one 1 (really) to understand them properly.
Apparently
   mysqlbinlog --read-from-remote-server
is meant and it's clear how the pipe is nirvana-like endless, indeed :-).

>
> However for the FLUSH LOGS DELETE DOMAIN …. I’m not so sure.

You are getting to the point. Unlike the plain FLUSH BINARY LOGS,
the DELETE-DOMAIN one does not have to rotate the logs in case
the domains have already been deleted or were missing altogether.
Therefore the above pipe is not dangerous.

>
> If you “forget" the domain on the upstream server what happens if
> there
> are downstream slaves?  I think you’ll break replication if they
> disconnect
> from this box and try to reconnect. Their GTID information will no
> longer match.
> IMO and if I’ve understood correctly this is broken.
>

Proper usage of FLUSH-DELETE-DOMAIN was thought to require that each
slave's replication position be past the deleted domain. This implies
a sort of
  client> SELECT MASTER_GTID_WAIT()
on each slave before FLUSH-DELETE-DOMAIN can be run on the master. 

> Please do not expect the DBA to fix this manually. I have lots of
> places of multi-tier hierarchies
> and I do not want to touch anything downstream of a master I push
> changes into.
>
> “It should just work”.

This is understood. Now that the new FLUSH-DELETE-DOMAIN is loop-free I don't
have any doubt about its replication anymore.

>
> If FLUSH LOGS should not be binlogged for the reasons stated do not
> overload this
> command with something which behaves differently. Use a different
> command,
> which you can BINLOG and which won’t trigger confusion.

Well, even a new command would rotate the binlog and so be loop-prone. (But
we don't have to make a second FLUSH-DELETE-DOMAIN rotate, as said above).

>
> The signal in the binlogs of the fact you’re removing “old domains” is
> _needed_ by downstream
> slaves to ensure that they “lose” these domains at the same point in
> time binlog-wise and thus keep
> in sync. That’s important.
>
> Simon

Thanks for your response.
I am going to exempt FLUSH-BINARY-LOGS from the replication ban
when it's run with the new DELETE_DOMAIN_ID=(list-of-domain-ids)
option^\footnote{akin to {DO,IGNORE}_DOMAIN_IDS of CHANGE-MASTER}.
An ineffective DELETE_DOMAIN_ID (e.g. for a domain that is not in the gtid
binlog state) won't cause rotation (the plain FLUSH-BINARY-LOGS part is
not run).

Existing NO_WRITE_TO_BINLOG|LOCAL options will *actually* control
FLUSH-DELETE-DOMAIN replication.
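
The rotation rule above can be sketched as follows. This is only an illustration under assumptions: effective_deletions() and needs_rotation() are hypothetical names, and the binlog state is modelled as a plain set of domain ids.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Return the requested domains that are actually present in the binlog
// state. An empty result means the command is ineffective.
std::vector<uint32_t>
effective_deletions(const std::vector<uint32_t> &requested,
                    const std::set<uint32_t> &binlog_state_domains)
{
  std::vector<uint32_t> effective;
  for (uint32_t d : requested)
    if (binlog_state_domains.count(d) != 0)
      effective.push_back(d);
  return effective;
}

// Rotation (the plain FLUSH BINARY LOGS part) is needed only when at least
// one requested domain is really in the binlog state.
bool needs_rotation(const std::vector<uint32_t> &requested,
                    const std::set<uint32_t> &binlog_state_domains)
{
  return !effective_deletions(requested, binlog_state_domains).empty();
}
```

In this model, replaying the same command after the domains are gone finds nothing to delete and does not rotate, which is what makes the mysqlbinlog|mysql pipe loop-free.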


I hope this is satisfactory to everybody now.

Cheers,

Andrei



Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-25 Thread andrei . elkin
Hello.

I've completed the patch, which passes a few tests.

However I had to make one concession with regard to replication.
Actually..

>>> So I see the DELETE DOMAIN (MariaDB) or “remove old UUID” (MySQL)
>>> type request
>>> to be one that means the master will only pretend that it can serve
>>> or knows about
>>> the remaining domains or UUIDs and if the slaves are sufficiently
>>> up to date they
>>> really don’t care as their vision will be similar.  Such a command
>>> would be replicated,
>>> right? It has to be for the slaves to change “their view” at the
>>> same moment
>>> in replication (not necessarily time) as the master.
>>
>> Hm, good point about whether it will be replicated.
>>
>> FLUSH LOGS is replicated by default with an option not to, so a
>> DELETE
>> DOMAIN would be also, I suppose. This makes it seem even more

the "vanilla" FLUSH-LOGS is not binlogged by decision commented in
reload_acl_and_cache():

  if (options & REFRESH_BINARY_LOG)
  {
    /*
      Writing this command to the binlog may result in infinite loops
      when doing mysqlbinlog|mysql, and anyway it does not really make
      sense to log it automatically (would cause more trouble to users
      than it would help them)
    */
    tmp_write_to_binlog= 0;
    ...

I read them in a way how

>> dangerous,

Kristian estimates.


>> frankly. Imagine an active domain being deleted by mistake
>
> So the point is to have a slave that is not affected and can rectify
> (e.g with fail-over to it as promoted Master)?
>
>>, now the mistake
>> immediately propagates to all servers in the replication topology, ouch.
>>
>> Maybe there should be an option, for example
>>
>>   FLUSH BINARY LOGS DELETE DOMAIN 10 NOCHECK
>>
>> or
>>
>>   FLUSH BINARY LOGS DELETE DOMAIN 10 ALLOW ACTIVE
>
> Something like this and also the choice between 'NOCHECK' and 'ALLOW
> ACTIVE' would be mandatory, that is no replication default for 'DELETE
> DOMAIN'.
> So the user first weighs how much risky it would be replicate.

I would like to step back from this. The new option does not change
the nature of
   FLUSH BINARY LOGS
so the comments about the threat remain valid.
Also, considering that this whole measure makes sense only on a master,
rushing it through on a slave before its promotion to master does not
seem necessary.

So I would take the simpler, originally considered no-replication route.

In case the replication requirement receives more support, we might
consider turning the feature's syntax into something different \footnote{%
Consider an "exotic" form of it

  SET @@global.gtid_binlog_state = "-domain_1,domain_2,..."

where '-' hints at decrements}.
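
Purely as an illustration of that "exotic" form, and not something the patch implements, a value with a leading '-' could be split into the domain ids to remove roughly like this. The function parse_domain_decrement() is a hypothetical name, and the sketch assumes numeric domain ids.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Parse the hypothetical "-d1,d2,..." value: a leading '-' marks the list as
// domains to remove. Returns false (and leaves `out` empty) when the value
// does not start with '-', i.e. it is not a decrement request.
bool parse_domain_decrement(const std::string &value,
                            std::vector<uint32_t> &out)
{
  out.clear();
  if (value.empty() || value[0] != '-')
    return false;
  std::istringstream in(value.substr(1));
  std::string item;
  while (std::getline(in, item, ','))
    out.push_back(static_cast<uint32_t>(std::stoul(item)));
  return true;
}
```

An ordinary state value such as 0-1-100 does not start with '-', so it would not be mistaken for a decrement request.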

Yet I think we'll stay with FLUSH BINARY LOGS; despite some trying
I could not find anything better (I did like the SET, though :-)).


Cheers,

Andrei



Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-21 Thread andrei . elkin
Simon, Kristian, salute.

> Simon, thanks for your detailed answer.
>
> I see your point on having access to powerful tools when they are needed,
> even when such tools can be dangerous when used incorrectly. It reminds me
> of the old "goto considered harmful" which I never agreed with.
>
> It occurs to me that there are actually implicitly two distinct features
> being discussed here.
>
> One is: Forget this domain_id, it has been unused since forever, but do
> check that it actually _is_ unused to avoid mistakes (the check would be
> that the domain is already absent from all available binlog files). This is
> the one I originally had in mind.
>
> Another is: There is this domain_id in the old binlog files, it is causing
> problems, we need to recover and we know what we are doing. I think this is
> the one you have in mind in what you write, and it seems very valid as well.
>
> It helped me to think of them explicitly as two distinct features.
>
> Also, Andrei's suggestion to fix IGNORE_DOMAIN_IDS to be able to connect to
> a master with that domain and completely ignore it seems useful for some of
> the scenarios you mention.

And this would be for a Master-Slave pair where the Master knows more domains
than the Slave.
In the reverse scenario (mdev-12012 is a sort of that), when the Slave knows
more, we've also learned (thanks to Kristian, see the latest mdev-12012
comments) another method to ignore the *discrepancy*: faking the "problematic"
master's binlog state with a dummy gtid group. After that is done, the Slave
should be able to connect to such a Master.

This observation must be a relief, as we don't have to consider keeping the old
logs with the discarded domain's events. The new FLUSH-delete-domain is to be
run at the user's convenience.

>
>> Imagine a replication chain of M[aster] ―> S[lave]1, S[lave]2, A[ggregate]1
>> and A[ggregate]1 ―> A[ggregate]2 , A[ggregate]3, ….
>
>> If M dies and say A1 happens to be more up to date than S1, S2 then we may
>> want to promote A1 to be the new master, and move S1, S2 under A1, move A2
>> under A1 (but promote as the aggregate writeable master), and move A3 under
>> A2. This would not be the “desired” setup as probably we’d end up throwing
>> away all the aggregate data on A1.
>
> Right, I see. Throwing away table data needs matching editing of the binlog
> history to give a consistent replication state. And indeed in a failover
> scenario, waiting for logs to be purged/purgeable does not seem appropriate.
>
>> In this specific case it may be you really do want to hide the 2 sets
>> of domains and only show one
>> to the S1, S2 boxes, but maintain 2 domains on A2, A3.
>
> Agree. So a fixed IGNORE_DOMAIN_IDS would seem helpful here.

True.

>
>> It depends but in my opinion in most cases letting replication flow is more
>> important than having 100% master and slave consistency. The longer the
>> slave is stopped the more differences there are.
>>
>> And when you get in a situation like this you’re very tempted to go back to
>> binlog file plus position, to scan the bin logs with tools like mysqlbinlog
>> and do it the old way like we used to do years ago.  This is tedious and
>> error prone but if you’re careful it works fine. The whole idea of GTID is
>> to avoid the DBA ever having to do this…
>
> Right. Though once multiple domains are involved, the binlog is effectively
> multiple streams, and using the old-style single file/offset position may be
> tricky.
>
> But if IGNORE_DOMAIN_IDS works for master connection as well, then the slave
> has the ability to say exactly which domains it wants to see, and exactly
> where in each of those domains it wants to start (gtid_slave_pos), so that
> should be quite flexible.
>
> When I designed GTID I actually had this very much in mind, to allow GTID to
> be a full replacement for the old style of replication and to allow to do
> what is needed to solve the problem at hand. For example, this is why the
> code tries so hard to deal with out-of-order GTID sequence numbers (as
> opposed to just refusing to ever operate with those).
>
> On the other hand, it was also a goal to be much more consistent and strict
> and try to prevent silent failures and inconsistencies. These two goals tend
> to get in conflicts in some areas. Hence for example the
> gtid_strict_mode.

I can only add that the master side (think of the fan-related metaphor) is
always better off being strict.

>
> There are still a few features that were never implemented but should have
> been (like DELETE DOMAIN and binlog indexes for example), and it is surely
> not perfect.
>
>> So I see the DELETE DOMAIN (MariaDB) or “remove old UUID” (MySQL) type
>> request to be one that means the master will only pretend that it can serve
>> or knows about the remaining domains or UUIDs and if the slaves are
>> sufficiently up to date they really don’t care as their vision will be
>> similar.  Such a command would be replicated,
>> right? It has to be for the slaves to change “their 

Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-21 Thread Kristian Nielsen
Simon, thanks for your detailed answer.

I see your point on having access to powerful tools when they are needed,
even when such tools can be dangerous when used incorrectly. It reminds me
of the old "goto considered harmful" which I never agreed with.

It occurs to me that there are actually implicitly two distinct features
being discussed here.

One is: Forget this domain_id, it has been unused since forever, but do
check that it actually _is_ unused to avoid mistakes (the check would be
that the domain is already absent from all available binlog files). This is
the one I originally had in mind.

Another is: There is this domain_id in the old binlog files, it is causing
problems, we need to recover and we know what we are doing. I think this is
the one you have in mind in what you write, and it seems very valid as well.

It helped me to think of them explicitly as two distinct features.
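
The two features can be contrasted in a small Python sketch. This is only a toy model, not server code: GTIDs are plain (domain_id, server_id, seq_no) tuples, a binlog file is a list of them, and the names `delete_domain`, `domain_in_binlogs` and the `force` flag are hypothetical, introduced purely for illustration.

```python
from typing import Dict, List, Tuple

# Toy GTID: (domain_id, server_id, seq_no). Assumption for illustration only.
Gtid = Tuple[int, int, int]


def domain_in_binlogs(binlog_files: List[List[Gtid]], domain_id: int) -> bool:
    """Return True if any available binlog file still holds a GTID of the domain."""
    return any(gtid[0] == domain_id for f in binlog_files for gtid in f)


def delete_domain(state: Dict[int, Gtid], binlog_files: List[List[Gtid]],
                  domain_id: int, force: bool = False) -> Dict[int, Gtid]:
    """Return a copy of the binlog state without domain_id.

    force=False models the first feature: refuse when the domain is still
    present in some available binlog file.
    force=True models the second feature: the operator knows what they are
    doing and the check is skipped.
    """
    if not force and domain_in_binlogs(binlog_files, domain_id):
        raise ValueError(f"domain {domain_id} is still present in binlog files")
    return {d: g for d, g in state.items() if d != domain_id}
```

With `force=False` a mistaken delete of a still-present domain is rejected, while `force=True` corresponds to the recovery case where the check would only be in the way.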

Also, Andrei's suggestion to fix IGNORE_DOMAIN_IDS to be able to connect to
a master with that domain and completely ignore it seems useful for some of
the scenarios you mention.

> Imagine a replication chain of M[aster] —> S[lave]1, S[lave]2, A[ggregate]1
> and A[ggregate]1 —> A[ggregate]2 , A[ggregate]3, ….

> If M dies and say A1 happens to be more up to date than S1, S2 then we may
> want to promote A1 to be the new master, and move S1, S2 under A1, move A2
> under A1 (but promote as the aggregate writeable master), and move A3 under
> A2. This would not be the “desired” setup as probably we’d end up throwing
> away all the aggregate data on A1.

Right, I see. Throwing away table data needs matching editing of the binlog
history to give a consistent replication state. And indeed in a failover
scenario, waiting for logs to be purged/purgeable does not seem appropriate.

> In this specific case it may be you really do want to hide the 2 sets
> of domains and only show one
> to the S1, S2 boxes, but maintain 2 domains on A2, A3.

Agree. So a fixed IGNORE_DOMAIN_IDS would seem helpful here.

> It depends but in my opinion in most cases letting replication flow is more
> important than having 100% master and slave consistency. The longer the
> slave is stopped the more differences there are.
>
> And when you get in a situation like this you’re very tempted to go back to
> binlog file plus position, to scan the bin logs with tools like mysqlbinlog
> and do it the old way like we used to do years ago.  This is tedious and error
> prone but if you’re careful it works fine. The whole idea of GTID is to avoid
> the DBA ever having to do this…

Right. Though once multiple domains are involved, the binlog is effectively
multiple streams, and using the old-style single file/offset position may be
tricky.

But if IGNORE_DOMAIN_IDS works for master connection as well, then the slave
has the ability to say exactly which domains it wants to see, and exactly
where in each of those domains it wants to start (gtid_slave_pos), so that
should be quite flexible.
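
As a rough illustration of that flexibility, here is a Python sketch of a slave selecting what it wants from a multi-domain stream. It is a simplification under stated assumptions: GTIDs are (domain_id, server_id, seq_no) tuples, seq_no is taken to increase within each domain, and `events_for_slave`, `ignore_domains` and `start_pos` are hypothetical names, not the server's API. The real server locates the start position inside the binlog rather than comparing sequence numbers.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Toy GTID: (domain_id, server_id, seq_no). Assumption for illustration only.
Gtid = Tuple[int, int, int]


def events_for_slave(binlog: Iterable[Gtid], ignore_domains: Set[int],
                     start_pos: Dict[int, int]) -> Iterator[Gtid]:
    """Yield the GTIDs a slave would receive from a multi-domain binlog.

    ignore_domains plays the role of IGNORE_DOMAIN_IDS; start_pos maps a
    domain_id to the last seq_no the slave already applied (a stand-in
    for gtid_slave_pos).
    """
    for domain_id, server_id, seq_no in binlog:
        if domain_id in ignore_domains:
            continue  # the slave asked not to see this domain at all
        if seq_no <= start_pos.get(domain_id, 0):
            continue  # already applied in this domain
        yield (domain_id, server_id, seq_no)
```

For example, with domain 10 ignored and domain 0 already applied up to seq_no 1, only the later domain 0 events are delivered.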

When I designed GTID I actually had this very much in mind, to allow GTID to
be a full replacement for the old style of replication and to allow to do
what is needed to solve the problem at hand. For example, this is why the
code tries so hard to deal with out-of-order GTID sequence numbers (as
opposed to just refusing to ever operate with those).

On the other hand, it was also a goal to be much more consistent and strict
and try to prevent silent failures and inconsistencies. These two goals tend
to get in conflicts in some areas. Hence for example the gtid_strict_mode.
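
The tension between the two goals can be shown with a toy Python model. It is a sketch only: `apply_gtid` and its `strict` flag are hypothetical names, and the model merely mimics the idea behind gtid_strict_mode (reject an out-of-order sequence number instead of tolerating it), not the server's actual bookkeeping.

```python
from typing import Dict, Tuple

# Toy GTID: (domain_id, server_id, seq_no). Assumption for illustration only.
Gtid = Tuple[int, int, int]


def apply_gtid(last_seq: Dict[int, int], gtid: Gtid, strict: bool) -> bool:
    """Record a GTID and report whether it arrived in order.

    In strict mode an out-of-order seq_no raises an error; otherwise it is
    accepted and simply recorded as the latest seen in its domain.
    """
    domain_id, _server_id, seq_no = gtid
    in_order = seq_no > last_seq.get(domain_id, 0)
    if strict and not in_order:
        raise ValueError(
            f"out-of-order seq_no {seq_no} in domain {domain_id}")
    last_seq[domain_id] = seq_no
    return in_order
```

The lenient branch lets replication keep flowing at the cost of a possibly inconsistent history; the strict branch stops at the first sign of trouble.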

There are still a few features that were never implemented but should have
been (like DELETE DOMAIN and binlog indexes for example), and it is surely
not perfect.

> So I see the DELETE DOMAIN (MariaDB) or “remove old UUID” (MySQL) type request
> to be one that means the master will only pretend that it can serve or knows
> about the remaining domains or UUIDs and if the slaves are sufficiently up to
> date they really don’t care as their vision will be similar.  Such a command
> would be replicated, right? It has to be for the slaves to change “their view”
> at the same moment in replication (not necessarily time) as the master.

Hm, good point about whether it will be replicated.

FLUSH LOGS is replicated by default with an option not to, so a DELETE
DOMAIN would be also, I suppose. This makes it seem even more dangerous,
frankly. Imagine an active domain being deleted by mistake, now the mistake
immediately propagates to all servers in the replication topology, ouch.

Maybe there should be an option, for example

  FLUSH BINARY LOGS DELETE DOMAIN 10 NOCHECK

or

  FLUSH BINARY LOGS DELETE DOMAIN 10 ALLOW ACTIVE

or something.
Note that the effect of deleting a domain is basically to add at the head of
the binlog a mark that says the domain never existed. All of the old binlog
is unchanged. So the command does not really immediately affect running
replication, only new slave re-connections.
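
A minimal Python sketch of that effect, assuming a toy representation where each binlog file is a dict with a `gtid_list` header and an `events` list. The function name `flush_delete_domain` is hypothetical; the point is only that a new file is started whose header omits the domain, while the older files are left untouched.

```python
from typing import Dict, List, Tuple

# Toy GTID: (domain_id, server_id, seq_no). Assumption for illustration only.
Gtid = Tuple[int, int, int]


def flush_delete_domain(binlogs: List[dict], state: Dict[int, Gtid],
                        domain_id: int) -> Dict[int, Gtid]:
    """Rotate to a new binlog file whose header no longer mentions domain_id.

    Existing files in `binlogs` are not modified; only the header of the
    newly appended file reflects the deletion.
    """
    new_state = {d: g for d, g in state.items() if d != domain_id}
    binlogs.append({"gtid_list": sorted(new_state.values()), "events": []})
    return new_state
```

A slave that reconnects is positioned against the newest header, which is why only new connections notice the change.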


Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-20 Thread Simon Mudd
Hi Kristian,

Sorry for the late response.

> On 12 Sep 2017, at 10:21, Kristian Nielsen  wrote:
> 
> Simon Mudd  writes:
> 
>> ids. Obviously once all appropriate bin logs have been purged
>> (naturally by other means) then no special processing will be needed.
> 
> Right. Hence my original idea (which was unfortunately never implemented so
> far). If at some point a domain has been unused for so long that all GTIDs
> in that domain are gone, it is relatively safe to pretend that the domain
> never existed.
> 
> I would like to understand if you can think of significant use cases where
> the DBA needs to have active binlog files in the master containing some
> domain, while simultaneously pretending that this domain never existed.

The only case I can think of would be this:

Imagine a replication chain of M[aster] —> S[lave]1, S[lave]2, A[ggregate]1
and A[ggregate]1 —> A[ggregate]2 , A[ggregate]3, ….

A1 has all the data of M but also extra databases where aggregate information
is made of the original data in M/S1/…  Additionally it might make sense that
writes to A1 use a different domain id.

If M dies and say A1 happens to be more up to date than S1, S2 then we may want
to promote A1 to be the new master, and move S1, S2 under A1, move A2 under A1
(but promote as the aggregate writeable master), and move A3 under A2. This
would not be the “desired” setup as probably we’d end up throwing away all the
aggregate data on A1.

The topology would end up as:

A1 [ignore any aggregate database/data] -> S1, S2, A2
and A2 —> A3

In this specific case it may be you really do want to hide the 2 sets of
domains and only show one to the S1, S2 boxes, but maintain 2 domains on A2, A3.

I have such a setup but have never had to handle such a failure scenario and it
may be that there are better ways to handle this but one thing I’m sure about:
I wouldn’t want to wipe out the binlogs especially on A1 as they hold valuable
information which may not be stored anywhere else.

Does that answer your question?

> Or if it is more of a general concern, and the inconvenience for users to
> have to save old binlogs somewhere else than the master's data directory and
> binlog index (SHOW BINARY LOGS).

If you have to “save binlogs somewhere” you’re doing stuff manually. If you
manage a lot of servers that’s really undesirable.

...
> 
> I understand the desire to not delete binlog files.
> 
> The problem is: If you want to have GTIDs with some domain in your active
> binlog files, _and_ you also want to pretend that this domain never existed,
> what does it mean? What is the semantics? It creates a lot of complexities
> for defining the semantics, for documenting it, for the users to understand
> it, and for the code to implement it correctly.

Yes. However, it also reflects realities of a slave connecting and trying to
get back in sync to a master which may have “more information” (in terms of
GTID info) than the slave but which is unable to “serve that information” to
the slave.

I’ve forgotten now exactly how this is handled by MariaDB but know that in
a situation like that with MySQL GTID the sync just won’t work as you can’t
afaik tell the slave to “not worry about “uuid:X” or “domain:Y” but just try to
sync the rest”.  And in MySQL GTID at least injecting millions of empty events
to get the slave’s GTID state in line with the missing UUID would be well ...
rather stupid.

In the perfect world this situation never happens. In the real world it
does and the DBA often has to just live with inconsistencies between data
stored on a master and a slave (which he can fix later) but *let replication
flow*.

It depends but in my opinion in most cases letting replication flow is more
important than having 100% master and slave consistency. The longer the
slave is stopped the more differences there are.

And when you get in a situation like this you’re very tempted to go back to
binlog file plus position, to scan the bin logs with tools like mysqlbinlog
and do it the old way like we used to do years ago.  This is tedious and error
prone but if you’re careful it works fine. The whole idea of GTID is to avoid
the DBA ever having to do this…

> So basically, I do not understand what is the intended meaning of FLUSH
> BINARY LOGS DELETE DOMAIN d _and_ at the same time keeping GTIDs with domain
> d around in active binlog files? In what respects is the domain deleted, and
> in what respects not?

expire_logs_days allows you to keep bin logs for a time you’re comfortable with.
So you can restore from an old system and then roll forward and do point in time
recovery or simply catch up with the master. Normally slaves are _well_ ahead of
the oldest bin logs.

You have a copy of the bin logs on the master and if log_slave_updates is
enabled an “equivalent backup” on a number of slaves. You normally don’t expect
to use these files, but they’re there if you need them. If you wipe 

Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-13 Thread andrei . elkin
Hello, Simon, Kristian.

(The mail was meant to be sent out yesterday, but it got stuck in
my outgoing box).

> Simon Mudd  writes:
>
>> ids. Obviously once all appropriate bin logs have been purged
>> (naturally by other means) then no special processing will be needed.
>
> Right. Hence my original idea (which was unfortunately never implemented so
> far). If at some point a domain has been unused for so long that all GTIDs
> in that domain are gone, it is relatively safe to pretend that the domain
> never existed.
>
> I would like to understand if you can think of significant use cases where
> the DBA needs to have active binlog files in the master containing some
> domain, while simultaneously pretending that this domain never existed.
>
> Or if it is more of a general concern, and the inconvenience for users to
> have to save old binlogs somewhere else than the master's data directory and
> binlog index (SHOW BINARY LOGS).
>
>> removing old binary logs should _not_ IMO be done as a way of
>> forgetting the past obsolete domains.
>> BINLOGS are important so throwing them away is an issue. I think
>> that somehow the code needs to be aware of the cut-off point and when
>> the “stale domain ids” are removed.)

Simon, initially I thought of masking out the problematic domain so that
the most recent binlog file would not have it in its Gtid_list header.
Yet I've given up that idea, having agreed that a strict setup on the
master weighs much more.

>
> I understand the desire to not delete binlog files.

And the mdev-12012 use case might not even require conducting this
purging/flush-delete-domain procedure if IGNORE_DOMAIN_IDS (for the zero
id domain in question) would do. MDEV-9108 seems to be in the way,
though, but conceptually the DBA may have a way to keep the binlog
files even when they contain a problematic domain.

>
> The problem is: If you want to have GTIDs with some domain in your active
> binlog files, _and_ you also want to pretend that this domain never existed,
> what does it mean? What is the semantics? It creates a lot of complexities
> for defining the semantics, for documenting it, for the users to understand
> it, and for the code to implement it correctly.
>
> So basically, I do not understand what is the intended meaning of FLUSH
> BINARY LOGS DELETE DOMAIN d _and_ at the same time keeping GTIDs with domain
> d around in active binlog files? In what respects is the domain deleted, and
> in what respects not?
>
> For the master, the binlog files are mainly used to stream to connecting
> slaves. Deleting a domain means replacing the conceptual binlog history with
> one in which that domain never existed. So that domain will be ignored in a
> connecting slave's position, assuming it is served by another multi-source
> master. If a new GTID in that domain appears later, it will be considered
> the very first GTID ever in that domain.
>
> So consider what happens if there are nevertheless GTIDs in that domain
> deeper in the binlog:
>
> 1. An already connected slave may be happily replicating those GTIDs. If
> that slave reconnects (temporary network error for example), it will instead
> fail with unknown GTID, or perhaps just start silently ignoring all further
> GTIDs in that domain. This kind of unpredictable behaviour seems bad.
>
> 2. Suppose a slave connects with a position without the deleted domain. The
> master starts reading the binlog from some point. What happens if a GTID is
> encountered that contains the deleted domain? The slave will start
> replicating that domain from some arbitrary point that depends on where it
> happened to be in other domains at the last disconnect. This also seems
> undesirable.
>
> There may be other scenarios that I did not think about.
>
>> DBAs do not like to remove bin logs “early" as unless you keep a copy
>> somewhere you may lose valuable information,
>> for recovery, for backups etc. Not everyone will be making automatic
>> copies (as MySQL does not provide an automatic way to do this)
>
> Understood. Maybe what is needed is a PURGE BINARY LOGS that removes the
> entries from the binlog index (SHOW BINARY LOGS), but leaves the files in
> the file system for the convenience of the sysadmin? (Well, you can just
> hand-edit binlog.index, but that requires master restart I think).

Like I said above, a filtering solution could be helpful.

>
>> The other comment I see mentioned here was “make sure all slaves are
>> up to date”. That’s going to be hard. The master can only be
>> aware of “connected slaves” and if you have intermediate masters, or a
>
> Indeed, the master cannot ensure this. The idea is that the DBA, who decides
> to delete a domain, must understand that this should not be done if any
> slave still needs GTIDs from that domain. This is similar to configuring
> normal binlog purge, where the DBA needs to ensure that binlogs are kept
> long enough for the needs of the slowest slave.
>
>> FWIW expiring old domains is 

Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-12 Thread Kristian Nielsen
Simon Mudd  writes:

> ids. Obviously once all appropriate bin logs have been purged
> (naturally by other means) then no special processing will be needed.

Right. Hence my original idea (which was unfortunately never implemented so
far). If at some point a domain has been unused for so long that all GTIDs
in that domain are gone, it is relatively safe to pretend that the domain
never existed.

I would like to understand if you can think of significant use cases where
the DBA needs to have active binlog files in the master containing some
domain, while simultaneously pretending that this domain never existed.

Or if it is more of a general concern, and the inconvenience for users to
have to save old binlogs somewhere else than the master's data directory and
binlog index (SHOW BINARY LOGS).

> removing old binary logs should _not_ IMO be done as a way of
> forgetting the past obsolete domains.
> BINLOGS are important so throwing them away is an issue. I think that somehow
> the code needs to be aware of the cut-off point and when the “stale domain
> ids” are removed.)

I understand the desire to not delete binlog files.

The problem is: If you want to have GTIDs with some domain in your active
binlog files, _and_ you also want to pretend that this domain never existed,
what does it mean? What is the semantics? It creates a lot of complexities
for defining the semantics, for documenting it, for the users to understand
it, and for the code to implement it correctly.

So basically, I do not understand what is the intended meaning of FLUSH
BINARY LOGS DELETE DOMAIN d _and_ at the same time keeping GTIDs with domain
d around in active binlog files? In what respects is the domain deleted, and
in what respects not?

For the master, the binlog files are mainly used to stream to connecting
slaves. Deleting a domain means replacing the conceptual binlog history with
one in which that domain never existed. So that domain will be ignored in a
connecting slave's position, assuming it is served by another multi-source
master. If a new GTID in that domain appears later, it will be considered
the very first GTID ever in that domain.

So consider what happens if there are nevertheless GTIDs in that domain deeper
in the binlog:

1. An already connected slave may be happily replicating those GTIDs. If
that slave reconnects (temporary network error for example), it will instead
fail with unknown GTID, or perhaps just start silently ignoring all further
GTIDs in that domain. This kind of unpredictable behaviour seems bad.

2. Suppose a slave connects with a position without the deleted domain. The
master starts reading the binlog from some point. What happens if a GTID is
encountered that contains the deleted domain? The slave will start
replicating that domain from some arbitrary point that depends on where it
happened to be in other domains at the last disconnect. This also seems
undesirable.

There may be other scenarios that I did not think about.

> DBAs do not like to remove bin logs “early" as unless you keep a copy
> somewhere you may lose valuable information,
> for recovery, for backups etc. Not everyone will be making automatic
> copies (as MySQL does not provide an automatic way to do this)

Understood. Maybe what is needed is a PURGE BINARY LOGS that removes the
entries from the binlog index (SHOW BINARY LOGS), but leaves the files in
the file system for the convenience of the sysadmin? (Well, you can just
hand-edit binlog.index, but that requires master restart I think).

> The other comment I see mentioned here was “make sure all slaves are
> up to date”. That’s going to be hard. The master can only be
> aware of “connected slaves” and if you have intermediate masters, or a

Indeed, the master cannot ensure this. The idea is that the DBA, who decides
to delete a domain, must understand that this should not be done if any
slave still needs GTIDs from that domain. This is similar to configuring
normal binlog purge, where the DBA needs to ensure that binlogs are kept
long enough for the needs of the slowest slave.
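
The kind of check the DBA has to do by hand can be sketched in Python. This is an assumption-laden illustration: `slaves_behind` is a hypothetical helper, slave positions are given as a plain mapping from slave name to {domain_id: seq_no}, and it can only cover slaves the DBA knows about and has collected positions from.

```python
from typing import Dict, List


def slaves_behind(domain_id: int, last_seq_no: int,
                  slave_positions: Dict[str, Dict[int, int]]) -> List[str]:
    """Return the names of slaves that have not yet reached last_seq_no.

    A slave with no recorded position in the domain is conservatively
    reported as behind, since it may still need those GTIDs.
    """
    return sorted(
        name for name, pos in slave_positions.items()
        if pos.get(domain_id, 0) < last_seq_no
    )
```

The domain would only be deleted once this list is empty for every slave in the topology, including stopped and downstream ones whose positions had to be gathered separately.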

> FWIW expiring old domains is good to do. There’s a similar FR for

> completely different the problem space is the same. Coming up with a
> solution which is simple to use and understand and also
> avoids where that’s possible making mistakes which may break
> replication is good. So thanks for looking at this.

Indeed. And the input from people like you with strong operational
experience is very valuable to end up with a good solution, hence my request
for additional input.

 - Kristian.



Re: [Maria-developers] [External] Obsolete GTID domain delete on master (MDEV-12012, MDEV-11969)

2017-09-11 Thread Simon Mudd
Hi all,

> On 8 Sep 2017, at 16:48, andrei.el...@pp.inet.fi wrote:
> 
> Kristian, hello.
> 
> Now to the implementation matter,
> 
>> The procedure to fix it will then be:
>> 
>> 1. FLUSH BINARY LOGS, note the new GTID position.
>> 
>> 2. Ensure that all slaves are past the problematic point with
>> MASTER_GTID_WAIT(). After this, the old erroneous binlog files
>> are no longer needed.
>> 
>> 3. PURGE BINARY LOGS to remove the erroneous logs.
>> 
>> 4. FLUSH BINARY LOGS DELETE DOMAIN d
>> 
> 
> I think we could optimize the list. How about
> 
> 1. Take note of @@global.gtid_binlog_state
> 2. Ensure that all slaves are past the last event of the domain 'd' being deleted
> 3. PURGE BINARY LOGS DELETE DOMAIN 'd'
> 
>  The effect of the last step would include purging all the binary log
>  files plus a planned implicit FLUSH LOGS discarding 'd' from the newly
>  emerged binlog.

removing old binary logs should _not_ IMO be done as a way of forgetting the
past obsolete domains.
BINLOGS are important so throwing them away is an issue. I think that somehow
the code needs to be aware of the cut-off point and when the “stale domain ids”
are removed.)

DBAs do not like to remove bin logs “early” as unless you keep a copy somewhere
you may lose valuable information, for recovery, for backups etc. Not everyone
will be making automatic copies (as MySQL does not provide an automatic way to
do this) so in theory you have just one copy. Throwing these away is a really
bad idea if it’s part of the solution of forgetting about “some of the past”.

Please consider the operational point of view and make MariaDB aware of the
past and aware that it can ignore/forget these domain ids. Obviously once all
appropriate bin logs have been purged (naturally by other means) then no
special processing will be needed.

The other comment I see mentioned here was “make sure all slaves are up to
date”. That’s going to be hard. The master can only be aware of “connected
slaves” and if you have intermediate masters, or a stopped slave then it won’t
be aware of these servers. That may be obvious but there’s always the situation
that “stopped slaves” or “downstream slaves (of an intermediate master)” are
still lagging. Of course catching and checking that is going to be hard, so
please make the comments explicit if really all you are going to do is to check
“connected slaves”, as MariaDB is never going to be aware of servers not
connected directly to the master. If the required pre-condition to trigger the
“obsolete old domains” is that a DBA needs to be “aware” then make this
requirement clear so that people reading the documentation understand what’s
needed and what MariaDB expects to see etc.

FWIW expiring old domains is good to do.  There’s a similar FR for Oracle’s
MySQL and while the GTID implementations are completely different the problem
space is the same. Coming up with a solution which is simple to use and
understand and also avoids, where that’s possible, making mistakes which may
break replication is good.  So thanks for looking at this.

Just a thought.

Simon


