Re: [DISCUSSION] Maintenance Mode feature

2020-09-29 Thread Nikolay Izhikov
+1 to making the new master key name an explicit parameter.

> On Sep 29, 2020, at 16:35, Sergey Chugunov wrote:
> 
> Hello Nikolay,
> 
>> AFAIKU There is third use-case for this mode.
> 
> Sorry for the late reply.
> 
> I took a look at the code and maintenance mode indeed looks a good match
> for changing master key situation.
> 
> I want to clarify only one thing. In current implementation we pass new
> master key name via system property. Do you think of getting rid of this
> property and passing new master key name to encryption manager with
> maintenance parameters? In terms of original IEP it is parameters passed
> with MaintenanceRecord.
> 
> --
> Thanks!
> 
> On Mon, Sep 21, 2020 at 3:20 PM Nikolay Izhikov  wrote:
> 
>> Hello, Sergey.
>> 
>>> At the moment I'm aware about two use cases for this feature: corrupted
>> PDS cleanup and defragmentation.
>> 
>> AFAIKU There is third use-case for this mode.
>> 
>> Change encryption master key in case node was down during cluster master
>> key change.
>> In this case, node can’t join to the cluster, because it’s master key
>> differs from the cluster.
>> To recover node Ignite should locally change master key before join.
>> 
>> Please, take a look into source code [1]
>> 
>> [1]
>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
>> 
>>> On Sep 21, 2020, at 14:37, Sergey Chugunov wrote:
>>> 
>>> Ivan,
>>> 
>>> Sorry for some confusion, MM indeed is not a normal mode. What I was
>> trying
>>> to say is that when in MM node still starts and allows the user to
>> perform
>>> actions with it like sending commands via control utility/JMX APIs or
>>> reading metrics.
>>> 
>>> This is the key point: although the node is not in the cluster but it is
>>> still alive can be monitored and supports management to do maintenance.
>>> 
>>> From  the code complexity perspective I'm trying to design the feature in
>>> such a way that all maintenance code is as encapsulated as possible and
>>> avoids massive interventions into main workflows of components.
>>> At the moment I'm aware about two use cases for this feature: corrupted
>> PDS
>>> cleanup and defragmentation. As far as I know it won't bring too much
>>> complexity in both cases.
>>> 
>>> I cannot say for other components but I believe it will be possible to
>>> integrate MM feature into their workflow as well with reasonable amount
>> of
>>> refactoring.
>>> 
>>> Does it make sense to you?
>>> 
>>> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin wrote:
>>>
>>>> Sergey,
>>>>
>>>> Thank you for your answer!
>>>>
>>>> Might be I am looking at the subject from a different angle.
>>>>
>>>>> I think of a node in MM as an almost normal one
>>>> I cannot think of such a mode as a normal one, because it apparently
>>>> does not perform usual cluster node functions. It is not a part of a
>>>> cluster, caches data is not available, Discovery and Communication are
>>>> not needed.
>>>>
>>>> I fear that with "node started in a special mode" approach we will get
>>>> an additional flag in the code making the code more complex and
>>>> fragile. Should not I worry about it?
>>>>
>>>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov:
>>>>> Vladislav, Ivan,
>>>>>
>>>>> Thank you for your questions and suggestions. Let me answer them.
>>>>>
>>>>> Vladislav,
>>>>>
>>>>> If I understood you correctly, you're talking about a node performing some
>>>>> automatic actions to fix the problem and then join the cluster as usual.
>>>>>
>>>>> However the original ticket [1] where we faced the need for Maintenance
>>>>> Mode is about exactly the opposite: avoid doing automatic actions and give
>>>>> a user the ability to decide what to do.
>>>>>
>>>>> Also the idea of Maintenance Mode is that the node is able to accept
>>>>> commands, expose metrics and so on, thus we need all components to be
>>>>> initialized (some of them may be partially initialized due to their own
>>>>> maintenance).
>>>>> To achieve that we need to go through a full cycle of node initialization
>>>>> including discovery initialization. When discovery is initialized (in
>>>>> special isolated mode) I don't think it is easy to switch back to normal
>>>>> operations without a restart.
>>>>>
>>>>> Ivan,
>>>>>
>>>>> I think of a node in MM as an almost normal one (maybe with some components
>>>>> skipped some steps of their initialization). Commands are accepted,
>>>>> appropriate metrics are exposed e.g. through JMX API and so on.
>>>>>
>>>>> So as I see it we'll have special commands for control.{sh|bat} CLI
>>>>> allowing user to see reasons why node switched to maintenance mode and/or
>>>>> trigger actions to fix the problem (I'm still thinking about proper design
>>>>> of these actions though).
>>>>>
>>>>> Of course the user should also be able to fix the problem manually e.g. by
>>>>> manually deleting

Re: [DISCUSSION] Maintenance Mode feature

2020-09-29 Thread Sergey Chugunov
Hello Nikolay,

> AFAIKU There is third use-case for this mode.

Sorry for the late reply.

I took a look at the code, and maintenance mode indeed looks like a good
match for the master key change situation.

I want to clarify only one thing. In the current implementation we pass the
new master key name via a system property. Are you thinking of getting rid of
this property and passing the new master key name to the encryption manager
with the maintenance parameters? In terms of the original IEP, these are the
parameters passed with MaintenanceRecord.
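
To make the question concrete, here is a minimal sketch of the two options. It is only an illustration: `MaintenanceRecordSketch`, `MasterKeyNameSource`, the property name `example.master.key.name` and the parameter name `masterKeyName` are hypothetical placeholders, not the actual Ignite classes, property or IEP API. Only the idea of parameters travelling with a MaintenanceRecord comes from the IEP.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for the IEP's MaintenanceRecord: an id plus free-form parameters. */
final class MaintenanceRecordSketch {
    private final String id;
    private final Map<String, String> params;

    MaintenanceRecordSketch(String id, Map<String, String> params) {
        this.id = id;
        // Defensive copy so later changes to the caller's map do not leak in.
        this.params = Collections.unmodifiableMap(new HashMap<>(params));
    }

    String id() {
        return id;
    }

    Optional<String> param(String name) {
        return Optional.ofNullable(params.get(name));
    }
}

final class MasterKeyNameSource {
    /** Placeholder property name, NOT the real Ignite system property. */
    static final String SYS_PROP = "example.master.key.name";

    /** Placeholder parameter name inside the maintenance record. */
    static final String REC_PARAM = "masterKeyName";

    private MasterKeyNameSource() {
    }

    /** Option A (current approach): implicit input read from a JVM system property. */
    static Optional<String> fromSystemProperty() {
        return Optional.ofNullable(System.getProperty(SYS_PROP));
    }

    /** Option B (proposed): explicit input read from the maintenance record parameters. */
    static Optional<String> fromRecord(MaintenanceRecordSketch rec) {
        return rec.param(REC_PARAM);
    }
}
```

With option B the key name is visible to whoever inspects the record, while option A depends on how the JVM was launched.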

--
Thanks!

On Mon, Sep 21, 2020 at 3:20 PM Nikolay Izhikov  wrote:

> Hello, Sergey.
>
> > At the moment I'm aware about two use cases for this feature: corrupted
> PDS cleanup and defragmentation.
>
> AFAIKU There is third use-case for this mode.
>
> Change encryption master key in case node was down during cluster master
> key change.
> In this case, node can’t join to the cluster, because it’s master key
> differs from the cluster.
> To recover node Ignite should locally change master key before join.
>
> Please, take a look into source code [1]
>
> [1]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
>
> > On Sep 21, 2020, at 14:37, Sergey Chugunov wrote:
> >
> > Ivan,
> >
> > Sorry for some confusion, MM indeed is not a normal mode. What I was
> trying
> > to say is that when in MM node still starts and allows the user to
> perform
> > actions with it like sending commands via control utility/JMX APIs or
> > reading metrics.
> >
> > This is the key point: although the node is not in the cluster but it is
> > still alive can be monitored and supports management to do maintenance.
> >
> > From  the code complexity perspective I'm trying to design the feature in
> > such a way that all maintenance code is as encapsulated as possible and
> > avoids massive interventions into main workflows of components.
> > At the moment I'm aware about two use cases for this feature: corrupted
> PDS
> > cleanup and defragmentation. As far as I know it won't bring too much
> > complexity in both cases.
> >
> > I cannot say for other components but I believe it will be possible to
> > integrate MM feature into their workflow as well with reasonable amount
> of
> > refactoring.
> >
> > Does it make sense to you?
> >
> > On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin 
> wrote:
> >
> >> Sergey,
> >>
> >> Thank you for your answer!
> >>
> >> Might be I am looking at the subject from a different angle.
> >>
> >>> I think of a node in MM as an almost normal one
> >> I cannot think of such a mode as a normal one, because it apparently
> >> does not perform usual cluster node functions. It is not a part of a
> >> cluster, caches data is not available, Discovery and Communication are
> >> not needed.
> >>
> >> I fear that with "node started in a special mode" approach we will get
> >> an additional flag in the code making the code more complex and
> >> fragile. Should not I worry about it?
> >>
> >> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov:
> >>> Vladislav, Ivan,
> >>>
> >>> Thank you for your questions and suggestions. Let me answer them.
> >>>
> >>> Vladislav,
> >>>
> >>> If I understood you correctly, you're talking about a node performing
> >> some
> >>> automatic actions to fix the problem and then join the cluster as
> usual.
> >>>
> >>> However the original ticket [1] where we faced the need for Maintenance
> >>> Mode is about exactly the opposite: avoid doing automatic actions and
> >> give
> >>> a user the ability to decide what to do.
> >>>
> >>> Also the idea of Maintenance Mode is that the node is able to accept
> >>> commands, expose metrics and so on, thus we need all components to be
> >>> initialized (some of them may be partially initialized due to their own
> >>> maintenance).
> >>> To achieve that we need to go through a full cycle of node
> initialization
> >>> including discovery initialization. When discovery is initialized (in
> >>> special isolated mode) I don't think it is easy to switch back to
> normal
> >>> operations without a restart.
> >>>
> >>> Ivan,
> >>>
> >>> I think of a node in MM as an almost normal one (maybe with some
> >> components
> >>> skipped some steps of their initialization). Commands are accepted,
> >>> appropriate metrics are exposed e.g. through JMX API and so on.
> >>>
> >>> So as I see it we'll have special commands for control.{sh|bat} CLI
> >>> allowing user to see reasons why node switched to maintenance mode
> and/or
> >>> trigger actions to fix the problem (I'm still thinking about proper
> >> design
> >>> of these actions though).
> >>>
> >>> Of course the user should also be able to fix the problem manually e.g.
> >> by
> >>> manually deleting corrupted PDS files when node is down. Ideally
> >>> Maintenance Mode should be smart enough to figure that out and switch
> to
> >>> normal operations without a restart but I'm not sure if it is possible
> >>> 

Re: [DISCUSSION] Maintenance Mode feature

2020-09-23 Thread Sergey Chugunov
Ivan,

If you come up with any ideas that may make this feature better, don't
hesitate to share them!

Thank you!

On Tue, Sep 22, 2020 at 11:27 AM Ivan Pavlukhin  wrote:

> Sergey,
>
> Thank you for your answer. While I am not happy with the proposed
> approach but things never were easy. Unfortunately I cannot suggest
> 100% better approaches so far. So, I should trust your vision.
>
> 2020-09-22 10:29 GMT+03:00, Sergey Chugunov :
> > Ivan,
> >
> > Checkpointer in Maintenance Mode is started and allows normal operations
> as
> > it may be needed for defragmentation and possibly other cases.
> >
> > Discovery is started with a special implementation of SPI that doesn't
> make
> > attempts to seek and/or connect to the rest of the cluster. From that
> > perspective node in MM is totally isolated.
> >
> > Communication is started as usual but I believe it doesn't matter as
> > discovery no other nodes are observed in topology and connection attempt
> > should not happen. But it may make sense to implement isolated version of
> > communication SPI as well to have 100% guarantee that no communication
> with
> > other nodes will happen.
> >
> > It is important to note that GridRestProcessor is started normally as we
> > need it to connect to the node via control utility.
> >
> > On Mon, Sep 21, 2020 at 7:04 PM Ivan Pavlukhin 
> wrote:
> >
> >> Sergey,
> >>
> >> > From  the code complexity perspective I'm trying to design the feature
> >> in such a way that all maintenance code is as encapsulated as possible
> >> and
> >> avoids massive interventions into main workflows of components.
> >>
> >> Could please briefly tell what means do you use to achieve
> >> encapsulation? Are Discovery, Communication, Checkpointer and other
> >> components started in a maintenance mode in current design?
> >>
> >> 2020-09-21 15:19 GMT+03:00, Nikolay Izhikov :
> >> > Hello, Sergey.
> >> >
> >> >> At the moment I'm aware about two use cases for this feature:
> >> >> corrupted
> >> >> PDS cleanup and defragmentation.
> >> >
> >> > AFAIKU There is third use-case for this mode.
> >> >
> >> > Change encryption master key in case node was down during cluster
> >> > master
> >> key
> >> > change.
> >> > In this case, node can’t join to the cluster, because it’s master key
> >> > differs from the cluster.
> >> > To recover node Ignite should locally change master key before join.
> >> >
> >> > Please, take a look into source code [1]
> >> >
> >> > [1]
> >> >
> >>
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
> >> >
> >> >> On Sep 21, 2020, at 14:37, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
> >> >>
> >> >> Ivan,
> >> >>
> >> >> Sorry for some confusion, MM indeed is not a normal mode. What I was
> >> >> trying
> >> >> to say is that when in MM node still starts and allows the user to
> >> >> perform
> >> >> actions with it like sending commands via control utility/JMX APIs or
> >> >> reading metrics.
> >> >>
> >> >> This is the key point: although the node is not in the cluster but it
> >> >> is
> >> >> still alive can be monitored and supports management to do
> >> >> maintenance.
> >> >>
> >> >> From  the code complexity perspective I'm trying to design the
> feature
> >> in
> >> >> such a way that all maintenance code is as encapsulated as possible
> >> >> and
> >> >> avoids massive interventions into main workflows of components.
> >> >> At the moment I'm aware about two use cases for this feature:
> >> >> corrupted
> >> >> PDS
> >> >> cleanup and defragmentation. As far as I know it won't bring too much
> >> >> complexity in both cases.
> >> >>
> >> >> I cannot say for other components but I believe it will be possible
> to
> >> >> integrate MM feature into their workflow as well with reasonable
> >> >> amount
> >> >> of
> >> >> refactoring.
> >> >>
> >> >> Does it make sense to you?
> >> >>
> >> >> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin 
> >> >> wrote:
> >> >>
> >> >>> Sergey,
> >> >>>
> >> >>> Thank you for your answer!
> >> >>>
> >> >>> Might be I am looking at the subject from a different angle.
> >> >>>
> >> >>>> I think of a node in MM as an almost normal one
> >> >>> I cannot think of such a mode as a normal one, because it apparently
> >> >>> does not perform usual cluster node functions. It is not a part of a
> >> >>> cluster, caches data is not available, Discovery and Communication
> >> >>> are
> >> >>> not needed.
> >> >>>
> >> >>> I fear that with "node started in a special mode" approach we will
> >> >>> get
> >> >>> an additional flag in the code making the code more complex and
> >> >>> fragile. Should not I worry about it?
> >> >>>
> >> >>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov:
> >> >>>> Vladislav, Ivan,
> >> >>>>
> >> >>>> Thank you for your questions and suggestions. Let me answer them.
> >> >>>>
> >> >>>> Vladislav,
> >> >>>>
> >> >>>> If I understood

Re: [DISCUSSION] Maintenance Mode feature

2020-09-22 Thread Ivan Pavlukhin
Sergey,

Thank you for your answer. While I am not happy with the proposed
approach, things were never easy. Unfortunately, I cannot suggest
100% better approaches so far, so I should trust your vision.

2020-09-22 10:29 GMT+03:00, Sergey Chugunov :
> Ivan,
>
> Checkpointer in Maintenance Mode is started and allows normal operations as
> it may be needed for defragmentation and possibly other cases.
>
> Discovery is started with a special implementation of SPI that doesn't make
> attempts to seek and/or connect to the rest of the cluster. From that
> perspective node in MM is totally isolated.
>
> Communication is started as usual but I believe it doesn't matter as
> discovery no other nodes are observed in topology and connection attempt
> should not happen. But it may make sense to implement isolated version of
> communication SPI as well to have 100% guarantee that no communication with
> other nodes will happen.
>
> It is important to note that GridRestProcessor is started normally as we
> need it to connect to the node via control utility.
>
> On Mon, Sep 21, 2020 at 7:04 PM Ivan Pavlukhin  wrote:
>
>> Sergey,
>>
>> > From  the code complexity perspective I'm trying to design the feature
>> in such a way that all maintenance code is as encapsulated as possible
>> and
>> avoids massive interventions into main workflows of components.
>>
>> Could please briefly tell what means do you use to achieve
>> encapsulation? Are Discovery, Communication, Checkpointer and other
>> components started in a maintenance mode in current design?
>>
>> 2020-09-21 15:19 GMT+03:00, Nikolay Izhikov :
>> > Hello, Sergey.
>> >
>> >> At the moment I'm aware about two use cases for this feature:
>> >> corrupted
>> >> PDS cleanup and defragmentation.
>> >
>> > AFAIKU There is third use-case for this mode.
>> >
>> > Change encryption master key in case node was down during cluster
>> > master
>> key
>> > change.
>> > In this case, node can’t join to the cluster, because it’s master key
>> > differs from the cluster.
>> > To recover node Ignite should locally change master key before join.
>> >
>> > Please, take a look into source code [1]
>> >
>> > [1]
>> >
>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
>> >
>> >> On Sep 21, 2020, at 14:37, Sergey Chugunov wrote:
>> >>
>> >> Ivan,
>> >>
>> >> Sorry for some confusion, MM indeed is not a normal mode. What I was
>> >> trying
>> >> to say is that when in MM node still starts and allows the user to
>> >> perform
>> >> actions with it like sending commands via control utility/JMX APIs or
>> >> reading metrics.
>> >>
>> >> This is the key point: although the node is not in the cluster but it
>> >> is
>> >> still alive can be monitored and supports management to do
>> >> maintenance.
>> >>
>> >> From  the code complexity perspective I'm trying to design the feature
>> in
>> >> such a way that all maintenance code is as encapsulated as possible
>> >> and
>> >> avoids massive interventions into main workflows of components.
>> >> At the moment I'm aware about two use cases for this feature:
>> >> corrupted
>> >> PDS
>> >> cleanup and defragmentation. As far as I know it won't bring too much
>> >> complexity in both cases.
>> >>
>> >> I cannot say for other components but I believe it will be possible to
>> >> integrate MM feature into their workflow as well with reasonable
>> >> amount
>> >> of
>> >> refactoring.
>> >>
>> >> Does it make sense to you?
>> >>
>> >> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin 
>> >> wrote:
>> >>
>> >>> Sergey,
>> >>>
>> >>> Thank you for your answer!
>> >>>
>> >>> Might be I am looking at the subject from a different angle.
>> >>>
>> >>>> I think of a node in MM as an almost normal one
>> >>> I cannot think of such a mode as a normal one, because it apparently
>> >>> does not perform usual cluster node functions. It is not a part of a
>> >>> cluster, caches data is not available, Discovery and Communication
>> >>> are
>> >>> not needed.
>> >>>
>> >>> I fear that with "node started in a special mode" approach we will
>> >>> get
>> >>> an additional flag in the code making the code more complex and
>> >>> fragile. Should not I worry about it?
>> >>>
>> >>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov:
>> >>>> Vladislav, Ivan,
>> >>>>
>> >>>> Thank you for your questions and suggestions. Let me answer them.
>> >>>>
>> >>>> Vladislav,
>> >>>>
>> >>>> If I understood you correctly, you're talking about a node performing some
>> >>>> automatic actions to fix the problem and then join the cluster as usual.
>> >>>>
>> >>>> However the original ticket [1] where we faced the need for Maintenance
>> >>>> Mode is about exactly the opposite: avoid doing automatic actions and give
>> >>>> a user the ability to decide what to do.
>> >>>>
>> >>>> Also the idea of Maintenance Mode is that the node is

Re: [DISCUSSION] Maintenance Mode feature

2020-09-22 Thread Sergey Chugunov
Ivan,

Checkpointer in Maintenance Mode is started and allows normal operations, as
it may be needed for defragmentation and possibly other cases.

Discovery is started with a special implementation of the SPI that doesn't
make attempts to seek and/or connect to the rest of the cluster. From that
perspective, a node in MM is totally isolated.

Communication is started as usual, but I believe it doesn't matter: with
isolated discovery, no other nodes are observed in the topology, so connection
attempts should not happen. But it may make sense to implement an isolated
version of the communication SPI as well, to have a 100% guarantee that no
communication with other nodes will happen.

It is important to note that GridRestProcessor is started normally, as we
need it to connect to the node via the control utility.
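
As a rough illustration of the isolated discovery idea (and of how the mode check can stay in one place instead of spreading flags through the code), here is a minimal sketch. The interface and class names are hypothetical and drastically simplified; the real Ignite DiscoverySpi contract is much larger and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, heavily simplified discovery contract (not Ignite's DiscoverySpi). */
interface DiscoverySketch {
    /** Returns identifiers of the remote nodes visible to this node. */
    List<String> remoteNodes();

    /** Attempts to join the cluster; returns true if the node became part of it. */
    boolean joinCluster();
}

/** Normal-mode stand-in: pretends that all configured peers are found and joined. */
final class ClusterDiscoverySketch implements DiscoverySketch {
    private final List<String> peers;

    ClusterDiscoverySketch(List<String> peers) {
        this.peers = Collections.unmodifiableList(new ArrayList<>(peers));
    }

    @Override public List<String> remoteNodes() {
        return peers;
    }

    @Override public boolean joinCluster() {
        return true;
    }
}

/** Maintenance-mode stand-in: never searches for or connects to other nodes. */
final class IsolatedDiscoverySketch implements DiscoverySketch {
    @Override public List<String> remoteNodes() {
        return Collections.emptyList();
    }

    @Override public boolean joinCluster() {
        return false;
    }
}

final class DiscoverySelector {
    private DiscoverySelector() {
    }

    /** The only place that looks at the mode; other code just uses DiscoverySketch. */
    static DiscoverySketch select(boolean maintenanceMode, List<String> peers) {
        return maintenanceMode ? new IsolatedDiscoverySketch() : new ClusterDiscoverySketch(peers);
    }
}
```

An isolated communication SPI, if implemented, could follow the same pattern.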

On Mon, Sep 21, 2020 at 7:04 PM Ivan Pavlukhin  wrote:

> Sergey,
>
> > From  the code complexity perspective I'm trying to design the feature
> in such a way that all maintenance code is as encapsulated as possible and
> avoids massive interventions into main workflows of components.
>
> Could please briefly tell what means do you use to achieve
> encapsulation? Are Discovery, Communication, Checkpointer and other
> components started in a maintenance mode in current design?
>
> 2020-09-21 15:19 GMT+03:00, Nikolay Izhikov :
> > Hello, Sergey.
> >
> >> At the moment I'm aware about two use cases for this feature: corrupted
> >> PDS cleanup and defragmentation.
> >
> > AFAIKU There is third use-case for this mode.
> >
> > Change encryption master key in case node was down during cluster master
> key
> > change.
> > In this case, node can’t join to the cluster, because it’s master key
> > differs from the cluster.
> > To recover node Ignite should locally change master key before join.
> >
> > Please, take a look into source code [1]
> >
> > [1]
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
> >
> >> On Sep 21, 2020, at 14:37, Sergey Chugunov wrote:
> >>
> >> Ivan,
> >>
> >> Sorry for some confusion, MM indeed is not a normal mode. What I was
> >> trying
> >> to say is that when in MM node still starts and allows the user to
> >> perform
> >> actions with it like sending commands via control utility/JMX APIs or
> >> reading metrics.
> >>
> >> This is the key point: although the node is not in the cluster but it is
> >> still alive can be monitored and supports management to do maintenance.
> >>
> >> From  the code complexity perspective I'm trying to design the feature
> in
> >> such a way that all maintenance code is as encapsulated as possible and
> >> avoids massive interventions into main workflows of components.
> >> At the moment I'm aware about two use cases for this feature: corrupted
> >> PDS
> >> cleanup and defragmentation. As far as I know it won't bring too much
> >> complexity in both cases.
> >>
> >> I cannot say for other components but I believe it will be possible to
> >> integrate MM feature into their workflow as well with reasonable amount
> >> of
> >> refactoring.
> >>
> >> Does it make sense to you?
> >>
> >> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin 
> >> wrote:
> >>
> >>> Sergey,
> >>>
> >>> Thank you for your answer!
> >>>
> >>> Might be I am looking at the subject from a different angle.
> >>>
> >>>> I think of a node in MM as an almost normal one
> >>> I cannot think of such a mode as a normal one, because it apparently
> >>> does not perform usual cluster node functions. It is not a part of a
> >>> cluster, caches data is not available, Discovery and Communication are
> >>> not needed.
> >>>
> >>> I fear that with "node started in a special mode" approach we will get
> >>> an additional flag in the code making the code more complex and
> >>> fragile. Should not I worry about it?
> >>>
> >>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov:
> >>>> Vladislav, Ivan,
> >>>>
> >>>> Thank you for your questions and suggestions. Let me answer them.
> >>>>
> >>>> Vladislav,
> >>>>
> >>>> If I understood you correctly, you're talking about a node performing some
> >>>> automatic actions to fix the problem and then join the cluster as usual.
> >>>>
> >>>> However the original ticket [1] where we faced the need for Maintenance
> >>>> Mode is about exactly the opposite: avoid doing automatic actions and give
> >>>> a user the ability to decide what to do.
> >>>>
> >>>> Also the idea of Maintenance Mode is that the node is able to accept
> >>>> commands, expose metrics and so on, thus we need all components to be
> >>>> initialized (some of them may be partially initialized due to their own
> >>>> maintenance).
> >>>> To achieve that we need to go through a full cycle of node initialization
> >>>> including discovery initialization. When discovery is initialized (in
> >>>> special isolated mode) I don't think it is easy to switch back to normal
> >>>> operations

Re: [DISCUSSION] Maintenance Mode feature

2020-09-21 Thread Ivan Pavlukhin
Sergey,

> From  the code complexity perspective I'm trying to design the feature in 
> such a way that all maintenance code is as encapsulated as possible and 
> avoids massive interventions into main workflows of components.

Could you please briefly tell what means you use to achieve
encapsulation? Are Discovery, Communication, Checkpointer and other
components started in maintenance mode in the current design?

2020-09-21 15:19 GMT+03:00, Nikolay Izhikov :
> Hello, Sergey.
>
>> At the moment I'm aware about two use cases for this feature: corrupted
>> PDS cleanup and defragmentation.
>
> AFAIKU There is third use-case for this mode.
>
> Change encryption master key in case node was down during cluster master key
> change.
> In this case, node can’t join to the cluster, because it’s master key
> differs from the cluster.
> To recover node Ignite should locally change master key before join.
>
> Please, take a look into source code [1]
>
> [1]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
>
>> On Sep 21, 2020, at 14:37, Sergey Chugunov wrote:
>>
>> Ivan,
>>
>> Sorry for some confusion, MM indeed is not a normal mode. What I was
>> trying
>> to say is that when in MM node still starts and allows the user to
>> perform
>> actions with it like sending commands via control utility/JMX APIs or
>> reading metrics.
>>
>> This is the key point: although the node is not in the cluster but it is
>> still alive can be monitored and supports management to do maintenance.
>>
>> From  the code complexity perspective I'm trying to design the feature in
>> such a way that all maintenance code is as encapsulated as possible and
>> avoids massive interventions into main workflows of components.
>> At the moment I'm aware about two use cases for this feature: corrupted
>> PDS
>> cleanup and defragmentation. As far as I know it won't bring too much
>> complexity in both cases.
>>
>> I cannot say for other components but I believe it will be possible to
>> integrate MM feature into their workflow as well with reasonable amount
>> of
>> refactoring.
>>
>> Does it make sense to you?
>>
>> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin 
>> wrote:
>>
>>> Sergey,
>>>
>>> Thank you for your answer!
>>>
>>> Might be I am looking at the subject from a different angle.
>>>
>>>> I think of a node in MM as an almost normal one
>>> I cannot think of such a mode as a normal one, because it apparently
>>> does not perform usual cluster node functions. It is not a part of a
>>> cluster, caches data is not available, Discovery and Communication are
>>> not needed.
>>>
>>> I fear that with "node started in a special mode" approach we will get
>>> an additional flag in the code making the code more complex and
>>> fragile. Should not I worry about it?
>>>
>>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov:
>>>> Vladislav, Ivan,
>>>>
>>>> Thank you for your questions and suggestions. Let me answer them.
>>>>
>>>> Vladislav,
>>>>
>>>> If I understood you correctly, you're talking about a node performing some
>>>> automatic actions to fix the problem and then join the cluster as usual.
>>>>
>>>> However the original ticket [1] where we faced the need for Maintenance
>>>> Mode is about exactly the opposite: avoid doing automatic actions and give
>>>> a user the ability to decide what to do.
>>>>
>>>> Also the idea of Maintenance Mode is that the node is able to accept
>>>> commands, expose metrics and so on, thus we need all components to be
>>>> initialized (some of them may be partially initialized due to their own
>>>> maintenance).
>>>> To achieve that we need to go through a full cycle of node initialization
>>>> including discovery initialization. When discovery is initialized (in
>>>> special isolated mode) I don't think it is easy to switch back to normal
>>>> operations without a restart.
>>>>
>>>> Ivan,
>>>>
>>>> I think of a node in MM as an almost normal one (maybe with some components
>>>> skipped some steps of their initialization). Commands are accepted,
>>>> appropriate metrics are exposed e.g. through JMX API and so on.
>>>>
>>>> So as I see it we'll have special commands for control.{sh|bat} CLI
>>>> allowing user to see reasons why node switched to maintenance mode and/or
>>>> trigger actions to fix the problem (I'm still thinking about proper design
>>>> of these actions though).
>>>>
>>>> Of course the user should also be able to fix the problem manually e.g. by
>>>> manually deleting corrupted PDS files when node is down. Ideally
>>>> Maintenance Mode should be smart enough to figure that out and switch to
>>>> normal operations without a restart but I'm not sure if it is possible
>>>> without invasive changes of our components' lifecycle.
>>>> So I believe this model (node truly started in Maintenance Mode and new
>>>> commands in control.{sh|bat}) is a good fit for our

Re: [DISCUSSION] Maintenance Mode feature

2020-09-21 Thread Nikolay Izhikov
Hello, Sergey.

> At the moment I'm aware about two use cases for this feature: corrupted PDS 
> cleanup and defragmentation. 

AFAIU, there is a third use case for this mode:

changing the encryption master key when a node was down during a cluster
master key change.
In this case, the node can't join the cluster because its master key differs
from the cluster's.
To recover the node, Ignite should change the master key locally before the join.

Please take a look at the source code [1].

[1] 
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
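
The scenario can be modelled with a small sketch. This is only an assumption-laden illustration of the flow described above (mismatch blocks the join, a local key change unblocks it). `NodeKeyStateSketch` and its methods are hypothetical and do not reproduce the logic in GridEncryptionManager.

```java
import java.util.Objects;

/** Hypothetical model of a node's local encryption state; not Ignite's GridEncryptionManager. */
final class NodeKeyStateSketch {
    private String masterKeyName;

    NodeKeyStateSketch(String masterKeyName) {
        this.masterKeyName = Objects.requireNonNull(masterKeyName, "masterKeyName");
    }

    /** A node may join only if its master key name matches the cluster's. */
    boolean canJoin(String clusterMasterKeyName) {
        return masterKeyName.equals(clusterMasterKeyName);
    }

    /** Local key change performed before joining, e.g. while the node is in maintenance mode. */
    void changeMasterKeyLocally(String newMasterKeyName) {
        this.masterKeyName = Objects.requireNonNull(newMasterKeyName, "newMasterKeyName");
    }
}
```

A node that was down during the cluster-wide change would fail `canJoin` with the new cluster key name until `changeMasterKeyLocally` is called with that name.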

> On Sep 21, 2020, at 14:37, Sergey Chugunov wrote:
> 
> Ivan,
> 
> Sorry for some confusion, MM indeed is not a normal mode. What I was trying
> to say is that when in MM node still starts and allows the user to perform
> actions with it like sending commands via control utility/JMX APIs or
> reading metrics.
> 
> This is the key point: although the node is not in the cluster but it is
> still alive can be monitored and supports management to do maintenance.
> 
> From  the code complexity perspective I'm trying to design the feature in
> such a way that all maintenance code is as encapsulated as possible and
> avoids massive interventions into main workflows of components.
> At the moment I'm aware about two use cases for this feature: corrupted PDS
> cleanup and defragmentation. As far as I know it won't bring too much
> complexity in both cases.
> 
> I cannot say for other components but I believe it will be possible to
> integrate MM feature into their workflow as well with reasonable amount of
> refactoring.
> 
> Does it make sense to you?
> 
> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin  wrote:
> 
>> Sergey,
>> 
>> Thank you for your answer!
>> 
>> Might be I am looking at the subject from a different angle.
>> 
>>> I think of a node in MM as an almost normal one
>> I cannot think of such a mode as a normal one, because it apparently
>> does not perform usual cluster node functions. It is not a part of a
>> cluster, caches data is not available, Discovery and Communication are
>> not needed.
>> 
>> I fear that with "node started in a special mode" approach we will get
>> an additional flag in the code making the code more complex and
>> fragile. Should not I worry about it?
>> 
>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov :
>>> Vladislav, Ivan,
>>> 
>>> Thank you for your questions and suggestions. Let me answer them.
>>> 
>>> Vladislav,
>>> 
>>> If I understood you correctly, you're talking about a node performing
>> some
>>> automatic actions to fix the problem and then join the cluster as usual.
>>> 
>>> However the original ticket [1] where we faced the need for Maintenance
>>> Mode is about exactly the opposite: avoid doing automatic actions and
>> give
>>> a user the ability to decide what to do.
>>> 
>>> Also the idea of Maintenance Mode is that the node is able to accept
>>> commands, expose metrics and so on, thus we need all components to be
>>> initialized (some of them may be partially initialized due to their own
>>> maintenance).
>>> To achieve that we need to go through a full cycle of node initialization
>>> including discovery initialization. When discovery is initialized (in
>>> special isolated mode) I don't think it is easy to switch back to normal
>>> operations without a restart.
>>> 
>>> Ivan,
>>> 
>>> I think of a node in MM as an almost normal one (maybe with some
>> components
>>> skipped some steps of their initialization). Commands are accepted,
>>> appropriate metrics are exposed e.g. through JMX API and so on.
>>> 
>>> So as I see it we'll have special commands for control.{sh|bat} CLI
>>> allowing user to see reasons why node switched to maintenance mode and/or
>>> trigger actions to fix the problem (I'm still thinking about proper
>> design
>>> of these actions though).
>>> 
>>> Of course the user should also be able to fix the problem manually e.g.
>> by
>>> manually deleting corrupted PDS files when node is down. Ideally
>>> Maintenance Mode should be smart enough to figure that out and switch to
>>> normal operations without a restart but I'm not sure if it is possible
>>> without invasive changes of our components' lifecycle.
>>> So I believe this model (node truly started in Maintenance Mode and new
>>> commands in control.{sh|bat}) is a good fit for our current APIs and ways
>>> to interact with the node.
>>> 
>>> Does it sound reasonable to you?
>>> 
>>> Thank you!
>>> 
>>> [1] https://issues.apache.org/jira/browse/IGNITE-13366
>>> 
>>> On Tue, Sep 1, 2020 at 2:07 PM Ivan Pavlukhin 
>> wrote:
>>> 
 Sergey,
 
 Actually, I missed the point that the discussed mode affects a single
 node but not a whole cluster. Perhaps I mixed terms "mode" and
 "state".
 
 My next thoughts about maintenance routines are about special
 utilities. As far as I remember MySQL provides a bunch of scripts for
 various 

Re: [DISCUSSION] Maintenance Mode feature

2020-09-21 Thread Sergey Chugunov
Ivan,

Sorry for the confusion; MM is indeed not a normal mode. What I was trying
to say is that in MM the node still starts and allows the user to perform
actions on it, such as sending commands via the control utility or JMX APIs, or
reading metrics.

This is the key point: although the node is not in the cluster, it is
still alive, can be monitored, and supports the management operations needed
for maintenance.

From the code complexity perspective, I'm trying to design the feature so
that all maintenance code is as encapsulated as possible and
avoids massive interventions into the main workflows of components.
At the moment I'm aware of two use cases for this feature: corrupted PDS
cleanup and defragmentation. As far as I know, it won't add much
complexity in either case.
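
One way to picture the encapsulation described here is a single registry that components hand their maintenance work to, so that the startup path contains only one maintenance-related branch. The sketch below is an illustration under that assumption, not the design from the ticket: `MaintenanceCallback`, `MaintenanceRegistry` and `NodeStartup` are hypothetical names and are not Ignite classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback a component supplies for its own maintenance task. */
interface MaintenanceCallback {
    /** Runs the component-specific work and returns a short status line. */
    String run();
}

/** Hypothetical registry; the only maintenance-aware object the startup path sees. */
final class MaintenanceRegistry {
    private final Map<String, MaintenanceCallback> callbacks = new LinkedHashMap<>();

    /** Called by a component (e.g. storage) that needs maintenance on this start. */
    void register(String task, MaintenanceCallback cb) {
        callbacks.put(task, cb);
    }

    boolean maintenanceRequired() {
        return !callbacks.isEmpty();
    }

    /** Runs every registered callback in registration order. */
    List<String> runAll() {
        List<String> results = new ArrayList<>();
        for (MaintenanceCallback cb : callbacks.values())
            results.add(cb.run());
        return results;
    }
}

/** Hypothetical startup path with a single maintenance branch. */
final class NodeStartup {
    static String start(MaintenanceRegistry registry) {
        if (registry.maintenanceRequired())
            return "MAINTENANCE: " + String.join(", ", registry.runAll());
        return "JOINED";
    }
}
```

With this shape, adding a new use case means registering one more callback; the `start` method itself does not change.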

I cannot speak for other components, but I believe it will be possible to
integrate the MM feature into their workflows as well with a reasonable amount of
refactoring.

Does it make sense to you?


Re: [DISCUSSION] Maintenance Mode feature

2020-09-05 Thread Ivan Pavlukhin
Sergey,

Thank you for your answer!

Perhaps I am looking at the subject from a different angle.

> I think of a node in MM as an almost normal one
I cannot think of such a mode as a normal one, because it apparently
does not perform the usual cluster node functions. It is not part of a
cluster, cache data is not available, and Discovery and Communication are
not needed.

I fear that with the "node started in a special mode" approach we will get
an additional flag in the code, making the code more complex and
fragile. Shouldn't I worry about it?


Re: [DISCUSSION] Maintenance Mode feature

2020-09-02 Thread Sergey Chugunov
Vladislav, Ivan,

Thank you for your questions and suggestions. Let me answer them.

Vladislav,

If I understood you correctly, you're talking about a node performing some
automatic actions to fix the problem and then joining the cluster as usual.

However, the original ticket [1] where we faced the need for Maintenance
Mode is about exactly the opposite: avoiding automatic actions and giving
the user the ability to decide what to do.

Also, the idea of Maintenance Mode is that the node is able to accept
commands, expose metrics and so on, so we need all components to be
initialized (some of them may be only partially initialized because of their
own maintenance).
To achieve that we need to go through a full cycle of node initialization,
including discovery initialization. Once discovery is initialized (in a
special isolated mode), I don't think it is easy to switch back to normal
operations without a restart.

Ivan,

I think of a node in MM as an almost normal one (maybe with some components
having skipped some steps of their initialization). Commands are accepted,
appropriate metrics are exposed, e.g. through the JMX API, and so on.

So as I see it, we'll have special commands for the control.{sh|bat} CLI
allowing the user to see the reasons why the node switched to maintenance mode
and/or to trigger actions to fix the problem (I'm still thinking about the
proper design of these actions, though).

Of course, the user should also be able to fix the problem manually, e.g. by
deleting corrupted PDS files while the node is down. Ideally,
Maintenance Mode should be smart enough to figure that out and switch to
normal operations without a restart, but I'm not sure this is possible
without invasive changes to our components' lifecycle.
So I believe this model (a node truly started in Maintenance Mode plus new
commands in control.{sh|bat}) is a good fit for our current APIs and ways
of interacting with the node.
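
The "see the reasons and trigger actions" idea can be illustrated with a small model in which each maintenance task carries a human-readable reason and a set of named actions that a command handler looks up on request. This is only a sketch under assumptions: `MaintenanceTask` and the action names are hypothetical, and no actual control.{sh|bat} command syntax is implied, since the message says the design of these actions is still open.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical view of one reason the node is in maintenance mode. */
final class MaintenanceTask {
    /** Shown to the user when they ask why the node is in maintenance mode. */
    final String reason;

    private final Map<String, Supplier<String>> actions = new LinkedHashMap<>();

    MaintenanceTask(String reason) {
        this.reason = reason;
    }

    /** Registers an action the user may trigger; returns this for chaining. */
    MaintenanceTask action(String name, Supplier<String> body) {
        actions.put(name, body);
        return this;
    }

    /** Names a CLI or JMX front end could list for the user. */
    List<String> actionNames() {
        return new ArrayList<>(actions.keySet());
    }

    /** Nothing runs automatically; an action runs only when requested by name. */
    String trigger(String name) {
        Supplier<String> body = actions.get(name);
        if (body == null)
            throw new IllegalArgumentException("Unknown action: " + name);
        return body.get();
    }
}
```

Keeping the actions behind explicit names matches the stated goal of the ticket: the node reports what is wrong and the user decides what to do.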

Does it sound reasonable to you?

Thank you!

[1] https://issues.apache.org/jira/browse/IGNITE-13366


Re: [DISCUSSION] Maintenance Mode feature

2020-09-01 Thread Ivan Pavlukhin
Sergey,

Actually, I missed the point that the discussed mode affects a single
node rather than the whole cluster. Perhaps I mixed up the terms "mode" and
"state".

My next thoughts about maintenance routines concern special
utilities. As far as I remember, MySQL provides a bunch of scripts for
various maintenance purposes. What user interface is assumed for executing
maintenance tasks? And what do we mean by "starting" a node
in maintenance mode? Can we run some routines without "starting" it
(e.g. trying to recover or clean up the PDS)?


Re: [DISCUSSION] Maintenance Mode feature

2020-08-31 Thread Vladislav Pyatkov
Hi Sergey.

As I understand it, switching to or from MM is possible only through a manual
restart of the node.
But in your example these look like technical actions that are possible only
in that case.
Do you plan to give the user a way to make a
decision without manual intervention?

For example: start the node, manually agree to an option, and then have the node
automatically resolve the conflict and return to the topology as a stable node.


Re: [DISCUSSION] Maintenance Mode feature

2020-08-31 Thread Sergey Chugunov
Hello Ivan,

Thank you for raising a good question; I didn't think of Maintenance Mode
from that perspective.

In short, Maintenance Mode isn't related to the Cluster State concept.
According to the Javadoc of the ClusterState enum [1], it is solely
about cache operations and to some extent doesn't affect other components
of an Ignite node.
From the API perspective, putting the methods that manage Cluster State into
the IgniteCluster interface doesn't look ideal to me, but it is what it is.

On the other hand, Maintenance Mode, as I see it, will be managed through
different APIs than ClusterState, and this difference will definitely be
reflected in the documentation of the feature.

An Ignite node is a complex set of many components interacting with each
other; they may have different lifecycles and states, and the states of
different components cannot be reduced to the lowest common denominator.
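
The distinction between a cluster-wide state and a per-node mode can be made concrete with a toy model. In the sketch below, `ClusterStateSketch` only mirrors the three values of Ignite's `ClusterState` enum so that the example is self-contained, while `NodeMode` and `NodeView` are hypothetical names invented for illustration; none of this is how Ignite implements either concept.

```java
/** Mirrors the three values of Ignite's ClusterState enum; redeclared for the sketch. */
enum ClusterStateSketch { INACTIVE, ACTIVE, ACTIVE_READ_ONLY }

/** Hypothetical per-node mode; not an Ignite type. */
enum NodeMode { NORMAL, MAINTENANCE }

/** Hypothetical view of one node combining the two independent dimensions. */
final class NodeView {
    private final NodeMode mode;

    NodeView(NodeMode mode) {
        this.mode = mode;
    }

    /** Cache reads need a node that is in the cluster and a cluster that is not INACTIVE. */
    boolean servesCacheReads(ClusterStateSketch state) {
        return mode == NodeMode.NORMAL && state != ClusterStateSketch.INACTIVE;
    }

    /** Cache writes additionally need the read-write ACTIVE state. */
    boolean servesCacheWrites(ClusterStateSketch state) {
        return mode == NodeMode.NORMAL && state == ClusterStateSketch.ACTIVE;
    }
}
```

A node in maintenance mode serves no cache operations whatever the cluster state is, which is why the two concepts need separate APIs rather than a fourth cluster state.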

However, if you have an idea of a better name for the feature that lets the
user distinguish it more easily from other similar features, please share it
with us. Personally, I very much welcome any suggestions that make the design
more intuitive and easy to use.

Thanks!

[1]
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java

On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin  wrote:

> Hi Sergey,
>
> Thank you for bringing attention to this important subject!
>
> My note here is about one more cluster mode. As far as I know,
> we currently already have 3 modes (inactive, read-only, read-write),
> and the subject is about one more. At first glance it could be
> hard for a user to understand and use all the modes properly. Do we really
> need the whole spectrum? Could we simplify things somehow?
>
> 2020-08-27 15:59 GMT+03:00, Sergey Chugunov :
> > Hello Nikolay,
> >
> > Created one, available at link [1].
> >
> > Initially there was an intention to develop it under IEP-47 [2], and there
> > is even a separate section for Maintenance Mode there.
> > But it looks like this feature is useful in more cases and deserves its
> > own IEP.
> >
> > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
> > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> >
> > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov 
> > wrote:
> >
> >> Hello, Sergey!
> >>
> >> Thanks for the proposal.
> >> Let's have an IEP for this feature.
> >>
> >> > On Aug 27, 2020, at 10:25, Sergey Chugunov wrote:
> >> >
> >> > Hello Igniters,
> >> >
> >> > I want to start a discussion about a new supporting feature that could
> >> > be very useful in many scenarios where persistent storage is involved:
> >> > Maintenance Mode.
> >> >
> >> > *Summary*
> >> > Maintenance Mode (MM for short) is a special state of an Ignite node in
> >> > which the node neither serves user requests nor joins the cluster, but
> >> > waits for user commands or performs automatic actions for maintenance
> >> > purposes.
> >> >
> >> > *Motivation*
> >> > There are situations when a node cannot participate in regular
> >> > operations but at the same time should not be shut down.
> >> >
> >> > One example is ticket [1], where I developed the first draft of
> >> > Maintenance Mode.
> >> > Here we get into a situation where a node has a potentially corrupted
> >> > PDS and thus cannot proceed with the restore routine and join the
> >> > cluster as usual.
> >> > At the same time, the node should neither fail nor be stopped for manual
> >> > cleanup.
> >> > Manual cleanup is not always an option (e.g. restricted access to the
> >> > file system); in managed environments a failed node will be restarted
> >> > automatically, so the user won't have time to perform the necessary
> >> > operations.
> >> > Thus the node needs to function in a special mode that allows the user
> >> > to connect to it and perform the necessary actions.
> >> >
> >> > Another example is described in IEP-47 [2], where defragmentation is
> >> > being developed. A node defragmenting its PDS should not join the
> >> > cluster until the process is finished, so it needs to enter Maintenance
> >> > Mode as well.
> >> >
> >> > *Suggested design*
> >> > I suggest MM to work as follows:
> >> > 1. A node enters MM if special markers are found on disk. These markers,
> >> > called Maintenance Records, could be created automatically (e.g. when
> >> > the storage component detects corrupted storage) or by user request
> >> > (when the user requests defragmentation of some caches). So entering MM
> >> > requires a node restart.
> >> > 2. A node started in MM doesn't join the cluster but finishes the
> >> > startup routine, so it is able to receive commands and provide metrics
> >> > to the user.
> >> > 3. When all necessary maintenance operations are finished, the
> >> > Maintenance Records for these operations are deleted from disk and the
> >> > node is restarted again to enter normal service.
> >> >
> >> > *Example*
> >> > To put it into a context let's consider an example of how 

Re: [DISCUSSION] Maintenance Mode feature

2020-08-31 Thread Ivan Pavlukhin
Hi Sergey,

Thank you for bringing attention to that important subject!

My note here is about one more cluster mode. As far as I know,
we currently already have 3 modes (inactive, read-only, read-write),
and the subject is about one more. At first glance it could be
hard for a user to understand and use all the modes properly. Do we really
need the whole spectrum? Could we simplify things somehow?



-- 

Best regards,
Ivan Pavlukhin


Re: [DISCUSSION] Maintenance Mode feature

2020-08-27 Thread Sergey Chugunov
Hello Nikolay,

Created one; it is available at link [1].

Initially there was an intention to develop it under IEP-47 [2] and there
is even a separate section for Maintenance Mode there.
But it looks like this feature is useful in more cases and deserves its own
IEP.

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
[2]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation



Re: [DISCUSSION] Maintenance Mode feature

2020-08-27 Thread Nikolay Izhikov
Hello, Sergey!

Thanks for the proposal.
Let's have an IEP for this feature.

> On Aug 27, 2020, at 10:25, Sergey Chugunov wrote:
> 
> Hello Igniters,
> 
> I want to start a discussion about a new supporting feature that could be
> very useful in many scenarios where persistent storage is involved:
> Maintenance Mode.
> 
> *Summary*
> Maintenance Mode (MM for short) is a special state of an Ignite node in
> which the node neither serves user requests nor joins the cluster, but
> waits for user commands or performs automatic maintenance actions.
> 
> *Motivation*
> There are situations when a node cannot participate in regular operations
> but at the same time should not be shut down.
> 
> One example is ticket [1], where I developed the first draft of
> Maintenance Mode.
> Here a node has a potentially corrupted PDS and thus cannot proceed with
> the restore routine and join the cluster as usual.
> At the same time the node should neither fail nor be stopped for manual
> cleanup. Manual cleanup is not always an option (e.g. restricted access to
> the file system), and in managed environments a failed node is restarted
> automatically, so the user has no time to perform the necessary operations.
> Thus the node needs to run in a special mode that allows the user to
> connect to it and perform the necessary actions.
> 
> Another example is described in IEP-47 [2], where defragmentation is being
> developed. A node defragmenting its PDS should not join the cluster until
> the process is finished, so it needs to enter Maintenance Mode as well.
> 
> *Suggested design*
> I suggest that MM work as follows:
> 1. A node enters MM if special markers are found on disk. These markers,
> called Maintenance Records, can be created automatically (e.g. when a
> storage component detects corrupted storage) or by user request (e.g. when
> the user requests defragmentation of some caches). Entering MM therefore
> requires a node restart.
> 2. A node started in MM doesn't join the cluster but finishes the startup
> routine, so it is able to receive commands and provide metrics to the user.
> 3. When all necessary maintenance operations are finished, the Maintenance
> Records for these operations are deleted from disk and the node is
> restarted again to enter normal service.
> 
> *Example*
> To put it into context, let's consider an example of how I see the MM
> workflow in case of PDS corruption.
> 
>   1. The node has failed in the middle of a checkpoint while WAL was
>   disabled for a particular cache -> data files of the cache are
>   potentially corrupted.
>   2. On the next startup the node detects this situation, creates a
>   Maintenance Record on disk and shuts down.
>   3. On the following startup the node sees the Maintenance Record, enters
>   Maintenance Mode and waits for the user to take specific actions: clean
>   the potentially corrupted PDS.
>   4. When the user has taken the necessary actions, he/she removes the
>   Maintenance Record using the Maintenance Mode API exposed via the
>   control.{sh|bat} script or JMX.
>   5. On the next startup the node returns to normal operation, as the
>   maintenance reason has been fixed.
> 
> 
> I prepared a PR [3] for ticket [1] with a draft implementation. It is not
> ready to be merged to the master branch but is already fully functional and
> can be reviewed.
> 
> Hope you'll share your feedback on the feature and/or any thoughts on
> implementation.
> 
> Thank you!
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-13366
> [2]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> [3] https://github.com/apache/ignite/pull/8189
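
For readers following the thread, the workflow quoted above can be illustrated with a minimal sketch. It assumes that Maintenance Records are plain marker files in the node's work directory and that the startup decision depends only on whether any such file exists. The class name, method names and the ".mntc" suffix are hypothetical; none of them is taken from PR [3] or from the Ignite code base.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the Maintenance Record workflow; not Ignite's actual API. */
class MaintenanceModeSketch {
    /** Assumed suffix of the marker files that stand in for Maintenance Records. */
    static final String RECORD_SUFFIX = ".mntc";

    enum StartupMode { NORMAL, MAINTENANCE }

    private final Path workDir;

    MaintenanceModeSketch(Path workDir) {
        this.workDir = workDir;
    }

    /** Creates a marker on disk, e.g. when corrupted storage is detected. It only takes effect on the next start. */
    void registerRecord(String id) {
        try {
            Files.write(workDir.resolve(id + RECORD_SUFFIX), new byte[0]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Removes a marker once the user (or an automatic action) has fixed the maintenance reason. */
    void removeRecord(String id) {
        try {
            Files.deleteIfExists(workDir.resolve(id + RECORD_SUFFIX));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists the ids of all markers currently present in the work directory, sorted by name. */
    List<String> activeRecords() {
        try (Stream<Path> files = Files.list(workDir)) {
            return files.map(p -> p.getFileName().toString())
                .filter(n -> n.endsWith(RECORD_SUFFIX))
                .map(n -> n.substring(0, n.length() - RECORD_SUFFIX.length()))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Startup decision: any marker on disk means the node must not join the cluster. */
    StartupMode onStart() {
        return activeRecords().isEmpty() ? StartupMode.NORMAL : StartupMode.MAINTENANCE;
    }

    public static void main(String[] args) throws IOException {
        MaintenanceModeSketch node = new MaintenanceModeSketch(Files.createTempDirectory("mm-sketch"));

        System.out.println(node.onStart());   // clean directory: NORMAL
        node.registerRecord("corrupted-pds"); // step 2 of the example: record created
        System.out.println(node.onStart());   // step 3: MAINTENANCE
        node.removeRecord("corrupted-pds");   // step 4: user removes the record
        System.out.println(node.onStart());   // step 5: NORMAL
    }
}
```

Keeping the decision purely on-disk is what makes the restart in each step necessary: the mode is chosen once at startup and does not change while the node is running.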